
Memory in the Age of AI Agents: A Survey on Forms, Functions and Dynamics

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan

Abstract

Memory has emerged as, and will continue to remain, a core capability of foundation-model-based agents. It underpins long-horizon reasoning, continual adaptation, and effective interaction with complex environments. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, assumptions, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity and dynamics of contemporary agent memory systems. This survey aims to provide an up-to-date and comprehensive landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval-augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we move beyond coarse temporal categorizations and propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time as agents interact with their environments. To support empirical research and practical development, we compile a comprehensive summary of representative benchmarks and open-source memory frameworks.
Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including automation-oriented memory design, the deep integration of reinforcement learning with memory systems, multimodal memory, shared memory for multi-agent systems, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

Introduction

The past two years have witnessed the overwhelming evolution of increasingly capable large language models (LLMs) into powerful AI agents (Matarazzo and Torlone, 2025; Minaee et al., 2025; Luo et al., 2025a). These foundation-model-powered agents have demonstrated remarkable progress across diverse domains such as deep research (Xu and Peng, 2025; Zhang et al., 2025p), software engineering (Wang et al., 2024i), and scientific discovery (Wei et al., 2025c), continuously advancing the trajectory toward artificial general intelligence (AGI) (Fang et al., 2025a; Durante et al., 2024). Although early conceptions of 'agents' were highly heterogeneous, a growing consensus has since emerged within the community: beyond a pure LLM backbone, an agent is typically equipped with capabilities such as reasoning, planning, perception, memory, and tool-use. Some of these abilities, such as reasoning and tool-use, have been largely internalized within model parameters through reinforcement learning (Wang et al., 2025m; Qu et al., 2025b), while others still depend heavily on external agentic scaffolds. Together, these components transform LLMs from static conditional generators into learnable policies that can interact with diverse external environments and adaptively evolve over time (Zhang et al., 2025f; Liu et al., 2025a).

Among these agentic faculties, memory stands out as a cornerstone, explicitly enabling the transformation of static LLMs, whose parameters cannot be rapidly updated, into adaptive agents capable of continual adaptation through environmental interaction (Zhang et al., 2025s; Wu et al., 2025g). From an application perspective, numerous domains demand agents with proactive memory management rather than ephemeral, forgetful behaviors: personalized chatbots (Chhikara et al., 2025; Li et al., 2025b), recommender systems (Liu et al., 2025c), social simulations (Park et al., 2023; Yang et al., 2025), and financial investigations (Zhang et al., 2024) all rely on the agent's ability to process, store, and manage historical information. From a developmental standpoint, one of the defining aspirations of AGI research is to endow agents with the capacity for continual evolution through environment interactions (Hendrycks et al., 2025), a capability fundamentally grounded in agent memory.

Agent Memory Needs A New Taxonomy Given the growing significance and community attention surrounding agent memory systems, it has become both timely and necessary to provide an updated perspective on contemporary agent memory research. The motivation for a new taxonomy and survey is twofold: ❶ Limitations of Existing Taxonomies: While several recent surveys have provided valuable and comprehensive overviews of agent memory (Zhang et al., 2025s; Wu et al., 2025g), their taxonomies were developed prior to a number of rapid methodological advances and therefore do not fully reflect the current breadth and complexity of the research landscape. For example, emerging directions in 2025, such as memory frameworks that distill reusable tools from past experiences (Qiu et al., 2025a,c; Zhao et al., 2025c), or memory-augmented test-time scaling methods (Zhang et al., 2025g; Suzgun et al., 2025), remain underrepresented in earlier classification schemes. ❷ Conceptual Fragmentation: With the explosive growth of memory-related studies, the concept itself has become increasingly expansive and fragmented. Researchers often find that papers claiming to study 'agent memory' differ drastically in implementation, objectives, and underlying assumptions. The proliferation of diverse terminologies (declarative, episodic, semantic, parametric memory, etc.) further obscures conceptual clarity, highlighting the urgent need for a coherent taxonomy that can unify these emerging concepts.

Therefore, this paper seeks to establish a systematic framework that reconciles existing definitions, bridges emerging trends, and elucidates the foundational principles of memory in agentic systems. Specifically, this survey aims to address the following key questions:


At a conceptual level, agent memory and retrieval-augmented generation (RAG) exhibit substantial overlap: both construct, organize, and leverage auxiliary information stores to extend the capabilities of LLMs and agents beyond their native parametric knowledge. For instance, structured representations such as knowledge graphs and indexing strategies appear in both communities' methods, and recent developments in agentic RAG demonstrate how autonomous retrieval mechanisms can interact with dynamic databases in ways reminiscent of agent memory architectures (Singh et al., 2025). Indeed, the engineering stacks underlying many RAG and agent memory systems share common building blocks, including vector indices, semantic search, and context expansion modules.

Despite these technological convergences, the two paradigms have historically been distinguished by the contexts in which they are applied. Classical RAG techniques primarily augment an LLM with access to static knowledge sources, whether flat document stores, structured knowledge bases, or large corpora externally indexed to support retrieval on demand (Zhang et al., 2025q; Han et al., 2025b). These systems are designed to ground generation in up-to-date facts, mitigate hallucinations, and improve accuracy in knowledge-intensive tasks, but they generally do not maintain an internal, evolving memory of past interactions. In contrast, agent memory systems are instantiated within an agent's ongoing interaction with an environment, continuously incorporating new information generated by the agent's own actions and environmental feedback into a persistent memory base (Wang et al., 2024m; Zhao et al., 2024; Sun et al., 2025e).

In early formulations the distinction between RAG and agent memory was relatively clear: RAG retrieved from externally maintained knowledge for a single task invocation, whereas agent memory evolved over multi-turn, multi-task interaction. However, this boundary has become increasingly blurred as retrieval systems themselves become more dynamic. For example, certain retrieval tasks continuously update relevant context during iterative querying (e.g., multi-hop QA settings where related context is progressively added). Interestingly, systems such as HippoRAG/HippoRAG2 (Gutierrez et al., 2024; Gutiérrez et al., 2025) have been interpreted by both the RAG and memory communities as addressing long-term memory challenges for LLMs. Consequently, a more practical (though not perfectly separable) distinction lies in the task domain. RAG is predominantly applied to augment LLMs with large, externally sourced context for individual inference tasks, exemplified by classical multi-hop and knowledge-intensive benchmarks such as HotpotQA (Yang et al., 2018), 2WikiMQA (Ho et al., 2020), and MuSiQue (Trivedi et al., 2022). By contrast, agent memory systems are typically evaluated in settings requiring sustained multi-turn interaction, temporal dependency, or environment-driven adaptation. Representative benchmarks include long-context dialogue evaluations such as LoCoMo (Maharana et al., 2024) and LongMemEval (Wu et al., 2025a), complex problem-solving and deep-research benchmarks such as GAIA (Mialon et al., 2023), XBench (Chen et al., 2025c), and BrowseComp (Wei et al., 2025b), code-centric agentic tasks such as SWE-bench Verified (Jimenez et al., 2024), as well as lifelong learning benchmarks such as StreamBench (Wu et al., 2024a). We provide a comprehensive summary of memory-related benchmarks in Section 6.1.

Nevertheless, even this domain-based distinction contains substantial gray areas. Many works self-described as agent memory systems are evaluated under long-document question-answering tasks such as HotpotQA (Wang et al., 2025g,p), while numerous papers foregrounded as RAG systems in fact implement forms of agentic self-improvement, continually distilling and refining knowledge or skills over time. As a result, titles, methodologies, and empirical evaluations frequently blur the conceptual boundary between the two paradigms. To further clarify these relationships, the following three paragraphs draw upon established taxonomies of RAG from Mei et al. (2025), namely modular RAG, graph RAG, and agentic RAG, and examine how the core techniques associated with each lineage manifest within both RAG and agent memory systems.

Modular RAG Modular RAG refers to architectures in which the retrieval pipeline is decomposed into clearly specified components, such as indexing, candidate retrieval, reranking, filtering, and context assembly, that operate in a largely static and pipeline-like fashion (Singh et al., 2025). These systems treat retrieval as a well-engineered, modular subsystem external to the LLM, designed primarily for injecting relevant knowledge into the model's context window during inference. Within the agent memory perspective, the corresponding techniques typically appear in the retrieval stage, where memory access is realized through vector search, semantic similarity matching, or rule-based filtering, as seen in popular agent memory frameworks like Memary (Memary, 2025), MemOS (Li et al., 2025l), and Mem0 (Chhikara et al., 2025).
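The retrieval-stage techniques shared by modular RAG pipelines and token-level memory frameworks reduce to a similarity search over stored entries. Below is a minimal, self-contained sketch using a toy bag-of-words "embedding"; production systems such as Mem0 or MemOS use learned dense embeddings and approximate nearest-neighbor indices, so every name and entry here is purely illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, memory: list[str], k: int = 2) -> list[str]:
    # Rank stored memory entries by similarity to the query and return top-k.
    q = embed(query)
    ranked = sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]

memory = [
    "user prefers vegetarian recipes",
    "meeting with Alice scheduled on Friday",
    "user is allergic to peanuts",
]
print(retrieve("what food does the user like", memory, k=1))
```

The same scoring loop serves both paradigms; what differs is whether the entries come from a static corpus (RAG) or from the agent's own accumulated interactions (memory).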

Graph RAG Graph RAG systems structure the knowledge base as a graph, ranging from knowledge graphs to concept graphs or document-entity relations, and leverage graph traversal or graph-based ranking algorithms to retrieve context (Peng et al., 2024). This representation enables multi-hop relational reasoning, which has proven effective for knowledge-intensive tasks (Edge et al., 2025; Han et al., 2025b; Dong et al., 2025a). In the context of agent memory, graph-structured memory arises naturally when agents accumulate relational insights over time, such as linking concepts, tracking dependencies among subtasks, or recording causal relations inferred through interaction. Several well-established practices include Mem0ᵍ (Chhikara et al., 2025), A-MEM (Xu et al., 2025c), Zep (Rasmussen et al., 2025), and G-memory (Zhang et al., 2025c). Notably, graph-based agent memory systems may construct, extend, or reorganize their internal graphs throughout the agent's operation. Consequently, graph-based retrieval forms the structural backbone for both paradigms, but only agent memory treats the graph as a living, evolving representation of experience. We provide further analysis on graph-based memory forms in Section 3.1.2 and also refer the readers to a relevant survey (Liu et al., 2025h).
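A minimal sketch of graph-structured memory, assuming a simple (head, relation, tail) edge store and breadth-first traversal for multi-hop retrieval. Real systems such as Zep or A-MEM use far richer schemas and ranking; every identifier below is illustrative.

```python
from collections import defaultdict, deque

class GraphMemory:
    """Minimal graph-structured memory: entities as nodes, relations as edges."""
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, node)]

    def add(self, head, relation, tail):
        # The agent can extend the graph at any point during operation.
        self.edges[head].append((relation, tail))

    def neighborhood(self, start, hops=2):
        # Breadth-first traversal: collect facts within `hops` of the seed entity,
        # enabling multi-hop retrieval such as transitive task dependencies.
        seen, frontier, facts = {start}, deque([(start, 0)]), []
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for rel, nxt in self.edges[node]:
                facts.append((node, rel, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return facts

g = GraphMemory()
g.add("task:deploy", "depends_on", "task:build")
g.add("task:build", "depends_on", "task:test")
print(g.neighborhood("task:deploy", hops=2))
```

A two-hop query from `task:deploy` surfaces the indirect dependency on `task:test`, the kind of relational inference flat vector search cannot express.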

Agentic RAG Agentic RAG integrates retrieval into an autonomous decision-making loop, where an LLM agent actively controls when, how, and what to retrieve (Singh et al., 2025; Sun et al., 2025e). These systems often employ iterative querying, multi-step planning, or self-directed search procedures, enabling the agent to refine its information needs through deliberate reasoning, as implemented in PlanRAG (Lee et al., 2024b) and Self-RAG (Asai et al., 2023). For a more detailed understanding of agentic RAG, we refer the readers to Singh et al. (2025). From the agent memory perspective, agentic RAG occupies the closest conceptual space: both systems involve autonomous interaction with an external information store, both support multi-step refinement, and both may incorporate retrieved insights into subsequent reasoning. The key distinction is that classical agentic RAG typically operates over an external and often task-specific database, whereas agent memory maintains an internal, persistent, and self-evolving memory base that accumulates knowledge across tasks (Yan et al., 2025b; Xu et al., 2025c).

Query Construction

After initiating the retrieval process, the next challenge lies in transforming the raw query into an effective retrieval signal aligned with the memory index. Query construction acts as the translation layer between the user's surface utterance and the memory's latent storage. Traditional approaches typically perform retrieval directly based on the user query, which is simple but fails to align the query semantics with those of the memory index. To bridge this gap, agentic memory systems proactively perform query decomposition or query rewriting, generating intermediate retrieval signals that better match the latent structure of the memory.

Query Decomposition This approach breaks down a complex query into simpler sub-queries, allowing the system to retrieve more fine-grained and relevant information. Such decomposition alleviates the one-shot retrieval bottleneck by enabling modular retrieval and reasoning over intermediate results. For instance, Visconde (Pereira et al., 2023) and ChemAgent (Tang et al., 2025c) employ LLMs to decompose the original question into sub-problems, retrieve candidate results for each from the memory, and finally aggregate them into a coherent answer. However, these methods lack global planning. To address this issue, PRIME (Tran et al., 2025) and MA-RAG (Nguyen et al., 2025) introduce a Planner Agent, inspired by the ReAct (Yao et al., 2023b) paradigm, that first formulates a global retrieval plan before decomposing it into sub-queries. Yet, these approaches mainly rely on problem-driven decomposition and thus cannot explicitly identify what specific knowledge the model is missing. To make sub-queries more targeted, Agent KB (Tang et al., 2025d) adopts a two-stage retrieval process in which a teacher model observes the student model's failures and generates fine-grained sub-queries accordingly. This targeted decomposition improves retrieval precision and reduces irrelevant results, particularly in knowledge-intensive tasks.
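The decompose-retrieve-aggregate pattern described above can be sketched as follows. The LLM decomposer and the retriever are stubbed out with trivial rules (splitting on "and", keyword lookup) purely to make the control flow concrete; nothing here reproduces any cited system's actual prompts or retrievers.

```python
def decompose(query: str) -> list[str]:
    # Stand-in for an LLM decomposer: split a compound question into sub-queries.
    return [q.strip() for q in query.split(" and ")]

def retrieve_one(sub_query: str, memory: dict[str, str]) -> str:
    # Toy retriever: keyword lookup; real systems use semantic search.
    for key, fact in memory.items():
        if key in sub_query:
            return fact
    return ""

def answer(query: str, memory: dict[str, str]) -> list[str]:
    # Retrieve per sub-query, then aggregate the intermediate results.
    return [retrieve_one(sq, memory) for sq in decompose(query)]

memory = {
    "capital": "Paris is the capital of France",
    "population": "France has ~68 million people",
}
print(answer("what is the capital and what is the population", memory))
```

Each sub-query hits a different memory entry, illustrating how decomposition sidesteps the one-shot retrieval bottleneck.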

Query Rewriting Instead of decomposing, this strategy rewrites the original query or generates a hypothetical document to refine its semantics before retrieval. Such rewriting mitigates the mismatch between user intent and the memory index. HyDE (Gao et al., 2023b), for example, instructs the LLM to generate a hypothetical document in a zero-shot manner and performs retrieval using its semantic embedding. The generated document encapsulates the desired semantics, effectively bridging the gap between the user query and the target memory. MemoRAG (Qian et al., 2025) extends this idea by incorporating global memory into hypothetical document generation. It first compresses the global memory and then generates a draft answer conditioned on both the query and the compressed memory; this draft is then used as a rewritten query. Since the draft has access to the global memory context, it captures user intent more faithfully and uncovers implicit information needs. Similarly, MemGuide (Du et al., 2025b) leverages the dialogue context to prompt an LLM to produce a concise, command-like phrase that serves as a high-level intent description for retrieval. Beyond directly prompting an LLM to rewrite the query, Rewrite-Retrieve-Read (Ma et al., 2023b) trains a small language model as a dedicated rewriter through reinforcement learning, while ToC (Kim et al., 2023a) employs a Tree of Clarifications to progressively refine and specify the user's retrieval objective.
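A HyDE-style rewrite can be sketched in a few lines. The hypothetical-document generator below is a fixed-string stub standing in for a zero-shot LLM call, and retrieval uses naive token overlap instead of semantic embeddings; both are illustrative assumptions, not HyDE's actual implementation.

```python
def hypothetical_document(query: str) -> str:
    # Stand-in for a zero-shot LLM generation (HyDE-style): expand the terse
    # query into the kind of passage an answer might look like.
    return f"{query} the user previously stated a preference about this topic"

def overlap_score(a: str, b: str) -> int:
    # Toy similarity: shared-token count; real systems embed and compare vectors.
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query: str, memory: list[str]) -> str:
    # Retrieve with the expanded pseudo-document rather than the raw query.
    pseudo = hypothetical_document(query)
    return max(memory, key=lambda m: overlap_score(pseudo, m))

memory = [
    "the user stated a preference for window seats",
    "flight AB123 departs at 09:40",
]
print(retrieve("seat choice", memory))
```

The raw query "seat choice" shares no tokens with either entry; the expanded pseudo-document does, which is exactly the intent-index gap rewriting is meant to bridge.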

Summary These two paradigms, decomposition and rewriting, are not mutually exclusive. Auto-RAG (Kim et al., 2024a) integrates both by evaluating HyDE and Visconde under identical retrieval conditions and then selecting the strategy that performs best for the given task. The findings of this work demonstrate that the quality of the memory-retrieval query has a substantial impact on reasoning performance. In contrast to earlier research, which primarily focused on designing sophisticated memory architectures, recent studies (Yan et al., 2025b) place increasing emphasis on the retrieval construction process, shifting the role of memory toward serving retrieval. The choice of what to retrieve with is, unsurprisingly, a critical component of this process.

Outline of the Survey

Preliminaries: Formalizing Agents and Memory

LLM agents increasingly serve as the decision-making core of interactive systems that operate over time, manipulate external tools, and coordinate with humans or other agents. To study memory in such settings, we begin by formalizing LLM-based agent systems in a manner that encompasses both single-agent and multi-agent configurations. We then formalize the memory system coupled to the agent's decision process through read/write interactions, enabling a unified treatment of memory phenomena that arise both within a task (inside-trial / short-term memory) and across tasks (cross-trial / long-term memory).

LLM-based Agent Systems

Agents and Environment Let $\mathcal{I} = \{1, \dots, N\}$ denote the index set of agents, where $N = 1$ corresponds to the single-agent case (e.g., ReAct), and $N > 1$ represents multi-agent settings such as debate (Li et al., 2024c) or planner-executor architectures (Wan et al., 2025). The environment is characterized by a state space $\mathcal{S}$. At each time step $t$, the environment evolves according to a controlled stochastic transition model

$$
s_{t+1} \sim \mathcal{P}\big(\cdot \mid s_t, a_t\big),
$$

where $a_t$ denotes the action executed at time $t$. In multi-agent systems, this abstraction allows for either sequential decision-making (where a single agent acts at each step) or implicit coordination through environment-mediated effects. Each agent $i \in \mathcal{I}$ receives an observation

$$
o^i_t = \mathcal{O}_i\big(s_t, h^i_t, Q\big),
$$

where $h^i_t$ denotes the portion of the interaction history visible to agent $i$. This history may include previous messages, intermediate tool outputs, partial reasoning traces, shared workspace states, or other agents' contributions, depending on the system design. $Q$ denotes the task specification, such as a user instruction, goal description, or external constraints, which is treated as fixed within a task unless otherwise specified.

Action Space A distinguishing feature of LLM-based agents is the heterogeneity of their action space. Rather than restricting actions to plain text generation, agents may operate over a multimodal and semantically structured action space, including:

· Natural-language generation, such as producing intermediate reasoning, explanations, responses, or instructions (Li et al., 2023b; Wu et al., 2024b; Hong et al., 2024; Qian et al., 2024).
· Tool invocation actions, which call external APIs, search engines, calculators, databases, simulators, or code execution environments (Qin et al., 2025; Li et al., 2025h; Zhou et al., 2023c, 2024c).
· Planning actions, which explicitly output task decompositions, execution plans, or subgoal specifications to guide later behavior (CAMEL-AI, 2025; Liu et al., 2025g; Pan et al., 2024).
· Environment-control actions, where the agent directly manipulates the external environment (e.g., navigation in embodied settings (Shridhar et al., 2021; Wang et al., 2022a), editing a software repository (Jimenez et al., 2024; Aleithan et al., 2024), or modifying a shared memory buffer).
· Communication actions, enabling collaboration or negotiation with other agents through structured messages (Marro et al., 2024).

These actions, though diverse in semantics, are unified by the fact that they are produced through an autoregressive LLM backbone conditioned on a contextual input. Formally, each agent $i$ follows a policy

$$
a^i_t \sim \pi_i\big(\cdot \mid o^i_t, h^i_t, m^i_t, Q\big),
$$

where $m^i_t$ is a memory-derived signal defined in Section 2.2. The policy may internally generate multi-step reasoning chains, latent deliberation, or scratchpad computations prior to emitting an executable action; such internal processes are abstracted away and not explicitly modeled.

Interaction Process and Trajectories A full execution of the system induces a trajectory

$$
\tau = \big(s_0, o_0, a_0, s_1, o_1, a_1, \dots, s_T\big),
$$

where $T$ is determined by task termination conditions or system-specific stopping criteria. At each step, the trajectory reflects the interleaving of (i) environment observation, (ii) optional memory retrieval, (iii) LLM-based computation, and (iv) action execution that drives the next state transition.

This formulation captures a broad class of agentic systems, ranging from a single agent solving reasoning tasks with tool augmentation to teams of role-specialized agents collaboratively developing software (Qian et al., 2024; Wang et al., 2025l) or conducting scientific inquiry (Weng et al., 2025). We next formalize the memory systems that integrate into this agent loop.
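The agent loop just formalized, observe, optionally retrieve, act, and write back to memory, can be rendered as a short sketch. The environment, policy, and memory classes below are toy stand-ins used only to make the interleaving explicit; none of them corresponds to a specific system.

```python
class ToyEnv:
    """Counts up to a goal; a stand-in for a real interactive environment."""
    def reset(self):
        self.n = 0
        return self.n

    def step(self, action):
        self.n += action
        return self.n, self.n >= 3  # (observation, done)

class ToyMemory:
    """Logs (observation, action) pairs; retrieval returns the latest entry."""
    def __init__(self):
        self.log = []

    def retrieve(self, obs):
        return self.log[-1:]

    def form(self, obs, action):
        self.log.append((obs, action))

def run_episode(env, policy, memory, max_steps=10):
    # Interleave: observe -> retrieve memory -> act -> write back to memory.
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        signal = memory.retrieve(obs)   # (ii) optional memory retrieval
        action = policy(obs, signal)    # (iii) LLM-based computation in real agents
        obs, done = env.step(action)    # (iv) action drives the state transition
        memory.form(obs, action)        # memory formation after each step
        trajectory.append(action)
        if done:
            break
    return trajectory

traj = run_episode(ToyEnv(), lambda obs, signal: 1, ToyMemory())
print(traj)
```

Swapping the schedule of `retrieve` and `form` calls (every step, at boundaries, never) reproduces the range of memory behaviors discussed later in this section.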


This section articulates key positions and emerging frontiers in the design of memory systems for LLM-based agents. Moving beyond descriptive surveys of existing methods, we focus on paradigm-level shifts that redefine how memory is constructed, managed, and optimized in long-horizon agentic settings. Specifically, we examine the transition from retrieval-centric to generative memory, from manually engineered to autonomously managed memory systems, and from heuristic pipelines to reinforcement learning-driven memory control. We further discuss how these shifts intersect with multimodal reasoning, multi-agent collaboration, and trustworthiness, outlining open challenges and research directions that are likely to shape the next generation of agent memory architectures.


As shown above, a large body of work has focused on agent memory, clearly demonstrating that memory mechanisms are essential for agent systems (Zhang et al., 2025s). The choice of memory type in an agent system reflects how designers expect the agent to behave in a given task. Designers are not simply asking the agent to remember certain information; they are also implicitly expressing how they want that information to shape the agent's behavior. Choosing the right type of memory for a task is therefore far more than a simple combinatorial decision.

In this section, we start from the features of each memory type and discuss which tasks and scenarios they are best suited for in an ideal setting, as shown in Figure 5. We hope this discussion can offer useful ideas and guidance for making practical choices. The examples illustrate only one possible form of memory in these idealized settings and do not imply that other memory types lack unique advantages in the same scenarios.

Token-level Memory Token-level memory remains symbolic, addressable, and transparent, making it particularly well suited for scenarios where explicit reasoning, controllability, and accountability are essential. This type of memory excels in real-time, high-frequency update settings, where an agent must continuously track and revise information, and where the knowledge itself exhibits a clear structure that can be explicitly modeled. Its externalizability allows memory to be easily inspected, audited, transferred, or revised, making it especially suitable for domains requiring precise add/delete/update operations. The high level of interpretability further ensures that an agent's decision process can be traced back to concrete memory units, a crucial property in high-stakes applications. Moreover, token-level memory provides long-term stability and avoids catastrophic forgetting, enabling agents to accumulate reliable knowledge over extended time horizons. Another practical advantage is that token-level memory is often implemented as a plug-and-play module, allowing it to be readily integrated with the latest closed-source or open-source foundation models without modifying their internal parameters.
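The precise add/delete/update operations and inspectability attributed to token-level memory above can be made concrete with a minimal key-value sketch. The dotted key-naming scheme and entry contents are illustrative assumptions.

```python
class TokenMemory:
    """Token-level memory: discrete, addressable text units with exact edits."""
    def __init__(self):
        self.store = {}

    def add(self, key, text):
        self.store[key] = text

    def update(self, key, text):
        # Precise in-place revision; no retraining of model parameters needed.
        self.store[key] = text

    def delete(self, key):
        self.store.pop(key, None)

    def inspect(self):
        # The whole memory is externalizable and auditable at any time.
        return dict(self.store)

mem = TokenMemory()
mem.add("user.diet", "vegetarian")
mem.update("user.diet", "vegan")   # exact, traceable revision
mem.delete("user.tmp")             # deleting a missing key is a no-op
print(mem.inspect())
```

Because every unit is symbolic and addressable, an agent's decision can be traced back to the exact entry that informed it, which is the accountability property the paragraph emphasizes.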


Agent Memory Systems

While an LLM-based agent interacts with an environment, its instantaneous observation $o^i_t$ is often insufficient for effective decision-making. Agents therefore rely on additional information derived from prior interactions, both within the current task and across previously completed tasks. We formalize this capability through a unified agent memory system, represented as an evolving memory state

$$
M_t \in \mathcal{M},
$$

where $\mathcal{M}$ denotes the space of admissible memory configurations. No specific internal structure is imposed on $M_t$; it may take the form of a text buffer, key-value store, vector database, graph structure, or any hybrid representation. At the beginning of a task, $M_t$ may already contain information distilled from prior trajectories (cross-trial memory). During task execution, new information accumulates and functions as short-term, task-specific memory. Both roles are supported within a single memory container, with temporal distinctions emerging from usage patterns rather than architectural separation.
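The claim that one container can serve both roles can be illustrated directly: entries carry a task identifier, and the short-term versus cross-trial distinction is a read-time filter rather than a separate module. All field and task names below are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    task_id: str   # which task produced this entry
    content: str

@dataclass
class MemoryState:
    """One container; temporal roles emerge from how it is queried."""
    entries: list = field(default_factory=list)

    def write(self, task_id, content):
        self.entries.append(Entry(task_id, content))

    def read(self, task_id, cross_trial=False):
        # Same store, two views: current-task (short-term) vs. prior-task
        # (cross-trial) entries, selected purely by a usage-time filter.
        if cross_trial:
            return [e.content for e in self.entries if e.task_id != task_id]
        return [e.content for e in self.entries if e.task_id == task_id]

M = MemoryState()
M.write("trip-planning", "user books aisle seats")
M.write("coding", "repo uses pytest")
print(M.read("coding"))
print(M.read("coding", cross_trial=True))
```

No architectural boundary separates the two memories; only the query's scope differs, mirroring the formalization's single state $M_t$.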

Memory Lifecycle: Formation, Evolution, and Retrieval The dynamics of the memory system are characterized by three conceptual operators.

Memory Formation At time step $t$, the agent produces informational artifacts $\phi_t$, which may include tool outputs, reasoning traces, partial plans, self-evaluations, or environmental feedback. A formation operator

$$
\Delta M_t = \mathcal{F}(\phi_t)
$$

selectively transforms these artifacts into memory candidates, extracting information with potential future utility rather than storing the entire interaction history verbatim.
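A minimal sketch of the formation operator F: filter the step's artifacts down to candidates with likely future utility instead of logging everything verbatim. The marker-based utility heuristic below is an illustrative assumption; real systems typically use an LLM to judge and rewrite candidates.

```python
def formation(artifacts: list[str]) -> list[str]:
    # Formation operator F: keep only artifacts with likely future utility.
    # Heuristic (illustrative): retain errors, conclusions, and feedback,
    # drop transient scratch work and raw tool chatter.
    keep_markers = ("error:", "conclusion:", "feedback:")
    return [a for a in artifacts if a.lower().startswith(keep_markers)]

artifacts = [
    "tool output: 200 OK",
    "error: API rate limit exceeded",
    "scratch: trying another endpoint",
    "conclusion: batch requests to stay under the limit",
]
print(formation(artifacts))
```

Only the reusable lessons survive as memory candidates; the verbatim interaction log is discarded, as the formalization prescribes.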

Memory Evolution Formed memory candidates are integrated into the existing memory base through an evolution operator

$$
M_{t+1} = \mathcal{E}\big(M_t, \Delta M_t\big),
$$

which may consolidate redundant entries (Zhao et al., 2024), resolve conflicts (Rasmussen et al., 2025; Li et al., 2025l), discard low-utility information (Wang et al., 2025r), or restructure memory for efficient retrieval. The resulting memory state persists across subsequent decision steps and tasks.
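The evolution operator E can likewise be sketched: merge candidates into the base, resolve conflicts by recency, and prune by a utility score under a capacity budget. The subject/utility schema is an illustrative assumption, not a schema from any cited system.

```python
def evolve(memory: dict, candidates: list[dict], max_size: int = 3) -> dict:
    # Evolution operator E: integrate candidates into the existing base.
    merged = dict(memory)
    for c in candidates:
        # Conflict resolution (illustrative): entries are keyed by "subject",
        # and the newest entry for a subject overrides older ones.
        merged[c["subject"]] = c
    # Discard lowest-utility entries when over the capacity budget.
    ranked = sorted(merged.values(), key=lambda e: e["utility"], reverse=True)
    return {e["subject"]: e for e in ranked[:max_size]}

memory = {"diet": {"subject": "diet", "fact": "vegetarian", "utility": 0.9}}
candidates = [
    {"subject": "diet", "fact": "vegan", "utility": 0.95},  # conflicts, overrides
    {"subject": "tz",   "fact": "UTC+8", "utility": 0.4},
]
new_memory = evolve(memory, candidates)
print(new_memory["diet"]["fact"])
```

Consolidation, conflict resolution, and forgetting are all expressed as transformations of the same persistent state, matching the operator view above.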

Memory Retrieval When selecting an action, agent $i$ retrieves a context-dependent memory signal

$$
m^i_t = \mathcal{R}\big(M_t, o^i_t, h^i_t, Q\big),
$$

where $\mathcal{R}$ denotes a retrieval operator that constructs a task-aware query and returns relevant memory content. The retrieved signal $m^i_t$ is formatted for direct consumption by the LLM policy, for example as a sequence of textual snippets or a structured summary.

Temporal Roles Within the Agent Loop Although memory is represented as a unified state $M_t$, the three lifecycle operators (formation $\mathcal{F}$, evolution $\mathcal{E}$, and retrieval $\mathcal{R}$) need not be invoked at every time step. Instead, different memory effects arise from distinct temporal invocation patterns. For instance, some systems perform retrieval only once at task initialization,

$$
m^i_0 = \mathcal{R}\big(M_0, Q\big), \qquad m^i_t = \bot \quad \text{for } t > 0,
$$

where $\bot$ denotes the null retrieval strategy. Others may retrieve memory intermittently or continuously based on contextual triggers. Similarly, memory formation may range from minimal accumulation of raw observations,

$$
\Delta M_t = \mathcal{F}(\phi_t) = \phi_t,
$$

to sophisticated extraction and refinement of reusable patterns or abstractions. Thus, inside a task, short-term memory effects may arise from lightweight logging, as in Yao et al. (2023b) and Chen et al. (2023a), or from more elaborate iterative refinement (Hu et al., 2025a); across tasks, long-term memory may be updated episodically at task boundaries or continuously throughout operation. Short-term and long-term memory phenomena therefore emerge not from discrete architectural modules but from the temporal patterns with which formation, evolution, and retrieval are engaged.
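These temporal invocation patterns can be made concrete by parameterizing a single retrieval operator with a schedule; here `None` plays the role of the null signal ⊥, and the schedule names and toy retriever are illustrative assumptions.

```python
def retrieval_signal(t, schedule, R, memory, query):
    # One retrieval operator R, invoked under different temporal schedules.
    if schedule == "at_init":
        return R(memory, query) if t == 0 else None  # None stands in for ⊥
    if schedule == "every_step":
        return R(memory, query)
    raise ValueError(f"unknown schedule: {schedule}")

# Toy retrieval operator: substring match over stored entries.
R = lambda memory, query: [m for m in memory if query in m]
memory = ["prefers concise answers", "concise summaries worked well"]

# Under "at_init", memory is consulted once and null thereafter.
signals = [retrieval_signal(t, "at_init", R, memory, "concise") for t in range(3)]
print(signals)
```

The same operator under `"every_step"` would return a non-null signal at every step; the memory *effect* changes even though the memory *state* and operator do not, which is the section's central point.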

Memory-Agent Coupling The interaction between memory and the agent's decision process is similarly flexible. In general, the agent policy is written as

$$
a^i_t \sim \pi_i\big(\cdot \mid o^i_t, h^i_t, m^i_t, Q\big),
$$

where the retrieved memory signal $m^i_t$ may be present or absent depending on the retrieval schedule. When retrieval is disabled at a given step, $m^i_t$ can be treated as a distinguished null input.

Consequently, the overall agent loop consists of observing the environment, optionally retrieving memory, computing an action, receiving feedback, and optionally updating memory through formation and evolution. Different agent implementations instantiate different subsets of these operations at different temporal frequencies, giving rise to memory systems that range from passive buffers to actively evolving knowledge bases.


Memory-oriented benchmarks focus primarily on how well an agent can construct, maintain, and exploit an explicit memory of past interactions or world facts. These tasks typically probe the retention and retrieval of information across multi-turn dialogues, user-specific sessions, or long synthetic narratives, sometimes including multimodal signals.

A consolidated overview of these benchmarks, including their memory focus, environment type, modality, and evaluation scale, is provided in Table 8, which serves as a structured reference for comparing their design objectives and evaluation settings. Representative examples such as MemBench (Tan et al., 2025a), LoCoMo (Maharana et al., 2024), WebChoreArena (Miyai et al., 2025), MT-Mind2Web (Deng et al., 2024), PersonaMem (Jiang et al., 2025a), PerLTQA (Du et al., 2024), MPR (Zhang et al., 2025v), PrefEval (Zhao et al., 2025d), LOCCO (Jia et al., 2025), StoryBench (Wan and Ma, 2025), Madial-Bench (He et al., 2025), DialSim (Zheng et al., 2025b), LongBench (Bai et al., 2024), LongBench v2 (Bai et al., 2025), RULER (Hsieh et al., 2024), BABILong (Kuratov et al., 2024), MM-Needle (Wang et al., 2025e), and HaluMem (Chen et al., 2025a) stress user modeling, preference tracking, and conversation-level consistency, often under simulated settings where ground-truth memories can be precisely controlled.

Lifelong-learning benchmarks extend beyond isolated memory retrieval to examine how agents continually acquire, consolidate, and update knowledge over long horizons and evolving task distributions. Benchmarks such as LongMemEval (Wu et al., 2025a), MemoryBank (Zhong et al., 2024), MemoryBench (Ai et al., 2025), LifelongAgentBench (Zheng et al., 2025b), and StreamBench (Wu et al., 2024a) are designed around sequences of tasks or episodes in which new information gradually arrives and earlier information may become obsolete or conflicting. These setups emphasize phenomena like catastrophic forgetting, forward and backward transfer, and test-time adaptation, making them suitable for studying how memory mechanisms interact with continual-learning objectives. In many cases, performance is tracked not only on the current task but also on previously seen tasks or conversations, thereby quantifying how well the agent preserves useful knowledge while adapting to new users, domains, or interaction patterns.

Table 8 Overview of benchmarks relevant to LLM agent memory, long-term, lifelong learning, and self-evolving evaluation. The table covers two categories of benchmarks: (i) benchmarks explicitly designed for memory-, lifelong learning-, or self-evolving agent evaluation, and (ii) other agent-oriented benchmarks that implicitly stress long-horizon memory through sequential, multi-step, or multi-task interactions. Fac. and Exp. indicate whether a benchmark evaluates factual memory or experiential (interaction-derived) memory, respectively. MM. denotes the presence of multimodal inputs, while Env. indicates whether the benchmark is conducted in a simulated or real environment. Feature summarizes the primary capability under evaluation, and Scale reports the approximate benchmark size in terms of samples (s.) or tasks (t.). PDDL denotes commonly used PDDL-based planning subsets.

Self-evolving-agent benchmarks go a step further by treating the agent as an open-ended system that can iteratively refine its own memory, skills, and strategies through interaction. Here, the focus is not only on storing and recalling information, but also on meta-level behaviors such as self-reflection, memory editing, tool-augmented storage, and policy improvement over multiple episodes or games. Benchmarks like MemoryAgentBench (Hu et al., 2025c), Evo-Memory (Wei et al., 2025e), and other multi-episode or mission-style environments can be instantiated in a self-evolving setting by allowing the agent to accumulate trajectories, synthesize higher-level abstractions, and adjust its behavior in future runs based on its own past performance. When viewed through this lens, these benchmarks provide a testbed for evaluating whether an agent can autonomously bootstrap more capable behaviors over time, turning static tasks into arenas for long-term adaptation, strategy refinement, and genuinely self-improving memory use.


Memory forgetting refers to the deliberate removal of outdated, redundant, or low-value information to free capacity and maintain focus on salient knowledge. Unlike update mechanisms, which resolve conflicts between memories, forgetting prioritizes eliminating outdated information to ensure efficiency and relevance. Over time, unbounded memory accumulation leads to increased noise, retrieval delays, and interference from outdated knowledge. Controlled forgetting helps mitigate overload and maintain cognitive focus. Yet, overly aggressive pruning risks erasing rare but essential knowledge, harming reasoning continuity in long-term contexts.

Forgetting mechanisms can be categorized into Time-based Forgetting, Frequency-based Forgetting, and Importance-driven Forgetting, corresponding respectively to creation time, retrieval activity, and integrated semantic valuation.

Time-based Forgetting Time-driven forgetting considers only the creation time of memories, gradually decaying their strength over time to emulate human memory fading. MemGPT (Packer et al., 2023a) evicts the earliest messages upon context overflow. Xu et al. (2025c) and Wang et al. (2025n) employ stochastic token replacement, with a replacement ratio of K/N, to simulate exponential forgetting in human cognition, discarding the oldest entries once the pool exceeds capacity. Unlike explicit deletion of old memories, MAICC (Jiang et al., 2025c) implements soft forgetting by gradually decaying the weights of memories over time. This process mirrors natural forgetting, ensuring continuous adaptation without historical overload.
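
As an illustration of time-based decay, a soft-forgetting rule in the spirit of these methods (but not any specific system's implementation) can be sketched as:

```python
import math

# Illustrative soft forgetting: memory strength decays exponentially with age.
# The decay rate and threshold are arbitrary choices for this sketch.
DECAY_RATE = 0.5
FORGET_THRESHOLD = 0.1

def strength(created_at, now, decay_rate=DECAY_RATE):
    """Exponentially decayed strength of a memory created at `created_at`."""
    return math.exp(-decay_rate * (now - created_at))

def prune(memories, now):
    """Drop memories whose decayed strength falls below the threshold."""
    return [m for m in memories if strength(m["t"], now) >= FORGET_THRESHOLD]

memories = [{"t": 0, "text": "old fact"}, {"t": 9, "text": "recent fact"}]
kept = prune(memories, now=10)  # only the recent memory survives
```

Soft variants, as in MAICC, would keep the decayed weights and downrank rather than delete, which is the same scoring function without the pruning step.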

Frequency-based Forgetting Frequency-driven forgetting prioritizes memory based on retrieval behavior, retaining frequently accessed entries while discarding inactive ones. XMem (Cheng and Schwing, 2022) employs a least-frequently-used (LFU) policy to remove low-frequency entries; KARMA (Wang et al., 2025r) uses counting Bloom filters to track access frequency; MemOS (Li et al., 2025l) applies a least-recently-used (LRU) strategy, removing long-unused items while archiving highly active ones. This ensures efficient retrieval and storage equilibrium. Distinguishing creation time from retrieval frequency yields two largely orthogonal axes: time-based decay captures natural temporal aging, while frequency-based forgetting reflects usage dynamics; together they maintain system efficiency and recency.
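
An LRU eviction policy of the kind described above can be sketched with an ordered map; this is a generic illustration, not the implementation of any cited system:

```python
from collections import OrderedDict

# Generic LRU memory store: the least-recently-used entry is evicted first.
class LRUMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> memory content, oldest first

    def access(self, key):
        # A hit refreshes recency by moving the entry to the end.
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

    def add(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

mem = LRUMemory(capacity=2)
mem.add("a", "fact A")
mem.add("b", "fact B")
mem.access("a")         # refresh "a" so "b" becomes the LRU entry
mem.add("c", "fact C")  # exceeds capacity, evicting "b"
```

An LFU variant would track hit counts per key and evict the minimum-count entry instead of the oldest-access one.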

Importance-driven Forgetting Importance-driven forgetting integrates temporal, frequency, and semantic signals to retain high-value knowledge while pruning redundancy. Early works such as Zhong et al. (2024) and Chen et al. (2025e) quantified importance via composite scores combining temporal decay and access frequency, achieving numeric-based selective forgetting. Later methods evolved toward semantic-level evaluation: VLN (Song et al., 2025b) pools semantically redundant memories via similarity clustering, while Livia (Xi and Wang, 2025) incorporates emotional salience and contextual relevance to model emotion-driven selective forgetting. As LLMs develop increasingly powerful judgment capabilities, TiM (Liu et al., 2023a) and MemTool (Lumer et al., 2025) leverage LLMs to assess memory importance and explicitly prune or forget less important memories. This shift reflects a transition from static numeric scoring to semantic intelligence. Agents can now perform conscious forgetting and selectively retain memories most pertinent to the task context, semantics, and affective cues.
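
A composite importance score of the numeric kind used in these early works might combine recency, access frequency, and an externally supplied relevance estimate; the weights and functional forms below are purely illustrative:

```python
import math

# Illustrative composite importance score: temporal decay, access frequency,
# and a semantic-relevance term, combined with arbitrary example weights.
def importance(age, access_count, relevance,
               w_time=0.3, w_freq=0.3, w_sem=0.4, decay=0.1):
    recency = math.exp(-decay * age)                 # in (0, 1]
    frequency = access_count / (access_count + 1.0)  # saturating in [0, 1)
    return w_time * recency + w_freq * frequency + w_sem * relevance

# A fresh, frequently used, highly relevant memory outranks a stale one.
fresh = importance(age=1, access_count=10, relevance=0.9)
stale = importance(age=50, access_count=1, relevance=0.2)
```

Semantic-level methods replace the hand-set `relevance` term with an LLM or embedding-based judgment, which is where the transition from numeric scoring to semantic intelligence occurs.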

Summary Time-based decay reflects the natural temporal fading of memory, frequency-based forgetting ensures efficient access to frequently used memories, and importance-driven forgetting introduces semantic discernment. These three forgetting mechanisms jointly govern how agentic memory remains timely, efficiently accessible, and semantically relevant. However, heuristic forgetting mechanisms like LRU may eliminate long-tail knowledge that is seldom accessed but essential for correct decision-making. Therefore, when storage cost is not a critical constraint, many memory systems avoid deleting memories outright.

The above discussion describes how prior work manually designs memory architectures at different stages, enabling agents' memory contents to evolve online. More recently, MemEvolve (Zhang et al., 2025h) proposes a meta-evolutionary framework that jointly evolves both agents' experiential knowledge and their underlying memory architecture, allowing the memory framework itself to continuously learn and adapt.

Comparing Agent Memory with Other Key Concepts

Despite the growing interest in agentic systems endowed with memory, the community's understanding of what constitutes agent memory remains fragmented. In practice, researchers and practitioners often conflate agent memory with related constructs such as LLM memory (Wu et al., 2025g), retrieval-augmented generation (RAG) (Gao et al., 2024), and context engineering (Mei et al., 2025). Although these concepts are intrinsically connected by their involvement in how information is managed and utilized in LLM-driven systems, they differ in scope, temporal characteristics, and functional roles.

These overlapping yet distinct notions have led to ambiguity in the literature and practice. To clarify these distinctions and situate agent memory within this broader landscape, we examine how agent memory relates to, and diverges from, LLM memory, RAG, and context engineering in the subsections that follow. Figure 2 visually illustrates the commonalities and distinctions among these fields through a Venn diagram.

Figure 2 Conceptual comparison of Agent Memory with LLM Memory, RAG, and Context Engineering. The diagram illustrates shared technical implementations (e.g., KV reuse, graph retrieval) while highlighting fundamental distinctions: unlike the architectural optimizations of LLM Memory, the static knowledge access of RAG, or the transient resource management of Context Engineering, Agent Memory is uniquely characterized by its focus on maintaining a persistent and self-evolving cognitive state that integrates factual knowledge and experience. The listed categories and examples are illustrative rather than strictly parallel, serving as representative reference points to clarify conceptual relationships rather than to define a rigid taxonomy.


Agent Memory vs. LLM Memory

At a high level, agent memory almost fully subsumes what has traditionally been referred to as LLM memory. Since 2023, many works describing themselves as 'LLM memory mechanisms' (Zhong et al., 2024; Packer et al., 2023a; Wang et al., 2023b) are more appropriately interpreted, under contemporary terminology, as early instances of agent memory. This reinterpretation arises from the historical ambiguity surrounding the very notion of an 'LLM agent.' During 2023-2024, the community had no stable or coherent definition: in some cases, prompting an LLM to call a calculator already sufficed to qualify the system as an agent (Wu et al., 2024c); in other cases, agency required substantially richer capabilities such as explicit planning, tool use, memory, and reflective reasoning (Ruan et al., 2023). Only recently has a more unified and structured definition begun to emerge (e.g., LLM-based agent = LLM + reasoning + planning + memory + tool use + self-improvement + multi-turn interaction + perception, as discussed by Zhang et al. (2025f)), though even this formulation is not universally applicable. Against this historical backdrop, early systems such as MemoryBank (Zhong et al., 2024) and MemGPT (Packer et al., 2023a) framed their contributions as providing LLM memory. Yet what they fundamentally addressed were classical agentic challenges, for example enabling an LLM-based conversational agent to track user preferences, maintain dialogue-state information, and accumulate experience across multi-turn interactions. Under a modern and more mature understanding of agency, such systems are naturally categorized as instances of agent memory.

That said, the subsumption is not absolute. A distinct line of research genuinely concerns LLM-internal memory: managing the transformer's key-value (KV) cache, designing long-context processing mechanisms, or modifying model architectures (e.g., RWKV (Peng et al., 2023), Mamba (Gu and Dao, 2024; Lieber et al., 2024), diffusion-based LMs (Nie et al., 2025)) to better retain information as sequence length grows. These works focus on intrinsic model dynamics and typically address tasks that do not require agentic behavior, and thus should be considered outside the scope of agent memory.

Overlap Within our taxonomy, the majority of what has historically been called 'LLM memory' corresponds to forms of agent memory. Techniques such as few-shot prompting (Prabhumoye et al., 2022; Ma et al., 2023a) can be viewed as a form of long-term memory, where past exemplars or distilled task summaries serve as reusable knowledge incorporated through retrieval or context injection. Self-reflection and iterative refinement methods (Madaan et al., 2023; Mousavi et al., 2023; Han et al., 2025c) naturally align with short-term, inside-trial memory, as the agent repeatedly leverages intermediate reasoning traces or outcomes from prior attempts within the same task. Even KV compression and context-window management (Yoon et al., 2024; Jiang et al., 2023), when used to preserve salient information across the course of a single task, function as short-term memory mechanisms in an agentic sense. These techniques all support the agent's ability to accumulate, transform, and reuse information throughout a task's execution.

Distinctions In contrast, memory mechanisms that intervene directly in the model's internal state (architectural modifications for longer effective context, cache rewriting strategies, recurrent-state persistence, attention-sparsity mechanisms, or externalized KV-store expansions) are more appropriately classified as LLM memory rather than agent memory. Their goal is to expand or reorganize the representational capacity of the underlying model, not to furnish a decision-making agent with an evolving external memory base. They do not typically support cross-task persistence, environment-driven adaptation, or deliberate memory operations (e.g., formation, evolution, retrieval), and therefore lie outside the operational scope of agent memory as defined in this survey.


The past two years have witnessed the overwhelming evolution of increasingly capable large language models (LLMs) into powerful AI agents (Matarazzo and Torlone, 2025; Minaee et al., 2025; Luo et al., 2025a). These foundation-model-powered agents have demonstrated remarkable progress across diverse domains such as deep research (Xu and Peng, 2025; Zhang et al., 2025p), software engineering (Wang et al., 2024i), and scientific discovery (Wei et al., 2025c), continuously advancing the trajectory toward artificial general intelligence (AGI) (Fang et al., 2025a; Durante et al., 2024). Although early conceptions of 'agents' were highly heterogeneous, a growing consensus has since emerged within the community: beyond a pure LLM backbone, an agent is typically equipped with capabilities such as reasoning, planning, perception, memory, and tool use. Some of these abilities, such as reasoning and tool use, have been largely internalized within model parameters through reinforcement learning (Wang et al., 2025m; Qu et al., 2025b), while others still depend heavily on external agentic scaffolds. Together, these components transform LLMs from static conditional generators into learnable policies that can interact with diverse external environments and adaptively evolve over time (Zhang et al., 2025f; Liu et al., 2025a).

Among these agentic faculties, memory stands out as a cornerstone, explicitly enabling the transformation of static LLMs, whose parameters cannot be rapidly updated, into adaptive agents capable of continual adaptation through environmental interaction (Zhang et al., 2025s; Wu et al., 2025g). From an application perspective, numerous domains demand agents with proactive memory management rather than ephemeral, forgetful behaviors: personalized chatbots (Chhikara et al., 2025; Li et al., 2025b), recommender systems (Liu et al., 2025c), social simulations (Park et al., 2023; Yang et al., 2025), and financial investigations (Zhang et al., 2024) all rely on the agent's ability to process, store, and manage historical information. From a developmental standpoint, one of the defining aspirations of AGI research is to endow agents with the capacity for continual evolution through environment interactions (Hendrycks et al., 2025), a capability fundamentally grounded in agent memory.

Agent Memory Needs A New Taxonomy Given the growing significance and community attention surrounding agent memory systems, it has become both timely and necessary to provide an updated perspective on contemporary agent memory research. The motivation for a new taxonomy and survey is twofold: ❶ Limitations of Existing Taxonomies: While several recent surveys have provided valuable and comprehensive overviews of agent memory (Zhang et al., 2025s; Wu et al., 2025g), their taxonomies were developed prior to a number of rapid methodological advances and therefore do not fully reflect the current breadth and complexity of the research landscape. For example, emerging directions in 2025, such as memory frameworks that distill reusable tools from past experiences (Qiu et al., 2025a,c; Zhao et al., 2025c), or memory-augmented test-time scaling methods (Zhang et al., 2025g; Suzgun et al., 2025), remain underrepresented in earlier classification schemes. ❷ Conceptual Fragmentation: With the explosive growth of memory-related studies, the concept itself has become increasingly expansive and fragmented. Researchers often find that papers claiming to study 'agent memory' differ drastically in implementation, objectives, and underlying assumptions. The proliferation of diverse terminologies (declarative, episodic, semantic, parametric memory, etc.) further obscures conceptual clarity, highlighting the urgent need for a coherent taxonomy that can unify these emerging concepts.

Therefore, this paper seeks to establish a systematic framework that reconciles existing definitions, bridges emerging trends, and elucidates the foundational principles of memory in agentic systems. Specifically, this survey aims to address the following key questions:

Agent Memory vs. RAG

At a conceptual level, agent memory and retrieval-augmented generation (RAG) exhibit substantial overlap: both construct, organize, and leverage auxiliary information stores to extend the capabilities of LLMs and agents beyond their native parametric knowledge. For instance, structured representations such as knowledge graphs and indexing strategies appear in both communities' methods, and recent developments in agentic RAG demonstrate how autonomous retrieval mechanisms can interact with dynamic databases in ways reminiscent of agent memory architectures (Singh et al., 2025). Indeed, the engineering stacks underlying many RAG and agent memory systems share common building blocks, including vector indices, semantic search, and context expansion modules.

Despite these technological convergences, the two paradigms have historically been distinguished by the contexts in which they are applied. Classical RAG techniques primarily augment an LLM with access to static knowledge sources, whether flat document stores, structured knowledge bases, or large corpora externally indexed to support retrieval on demand (Zhang et al., 2025q; Han et al., 2025b). These systems are designed to ground generation in up-to-date facts, mitigate hallucinations, and improve accuracy in knowledge-intensive tasks, but they generally do not maintain an internal, evolving memory of past interactions. In contrast, agent memory systems are instantiated within an agent's ongoing interaction with an environment, continuously incorporating new information generated by the agent's own actions and environmental feedback into a persistent memory base (Wang et al., 2024m; Zhao et al., 2024; Sun et al., 2025e).

In early formulations, the distinction between RAG and agent memory was relatively clear: RAG retrieved from externally maintained knowledge for a single task invocation, whereas agent memory evolved over multi-turn, multi-task interaction. However, this boundary has become increasingly blurred as retrieval systems themselves become more dynamic. For example, certain retrieval tasks continuously update relevant context during iterative querying (e.g., multi-hop QA settings where related context is progressively added). Interestingly, systems such as HippoRAG/HippoRAG2 (Gutierrez et al., 2024; Gutiérrez et al., 2025) have been interpreted by both the RAG and memory communities as addressing long-term memory challenges for LLMs. Consequently, a more practical (though not perfectly separable) distinction lies in the task domain. RAG is predominantly applied to augment LLMs with large, externally sourced context for individual inference tasks, exemplified by classical multi-hop and knowledge-intensive benchmarks such as HotpotQA (Yang et al., 2018), 2WikiMQA (Ho et al., 2020), and MuSiQue (Trivedi et al., 2022). By contrast, agent memory systems are typically evaluated in settings requiring sustained multi-turn interaction, temporal dependency, or environment-driven adaptation. Representative benchmarks include long-context dialogue evaluations such as LoCoMo (Maharana et al., 2024) and LongMemEval (Wu et al., 2025a), complex problem-solving and deep-research benchmarks such as GAIA (Mialon et al., 2023), XBench (Chen et al., 2025c), and BrowseComp (Wei et al., 2025b), code-centric agentic tasks such as SWE-bench Verified (Jimenez et al., 2024), as well as lifelong learning benchmarks such as StreamBench (Wu et al., 2024a). We provide a comprehensive summary of memory-related benchmarks in Section 6.1.

Nevertheless, even this domain-based distinction contains substantial gray areas. Many works self-described as agent memory systems are evaluated under long-document question-answering tasks such as HotpotQA (Wang et al., 2025g,p), while numerous papers foregrounded as RAG systems in fact implement forms of agentic self-improvement, continually distilling and refining knowledge or skills over time. As a result, titles, methodologies, and empirical evaluations frequently blur the conceptual boundary between the two paradigms. To further clarify these relationships, the following three paragraphs draw upon the established taxonomy of RAG from Mei et al. (2025), covering modular RAG, graph RAG, and agentic RAG, and examine how the core techniques associated with each lineage manifest within both RAG and agent memory systems.

Modular RAG Modular RAG refers to architectures in which the retrieval pipeline is decomposed into clearly specified components, such as indexing, candidate retrieval, reranking, filtering, and context assembly, that operate in a largely static and pipeline-like fashion (Singh et al., 2025). These systems treat retrieval as a well-engineered, modular subsystem external to the LLM, designed primarily for injecting relevant knowledge into the model's context window during inference. Within the agent memory perspective, the corresponding techniques typically appear in the retrieval stage, where memory access is realized through vector search, semantic similarity matching, or rule-based filtering, as seen in popular agent memory frameworks like Memary (Memary, 2025), MemOS (Li et al., 2025l), and Mem0 (Chhikara et al., 2025).
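
At their core, the retrieval-stage techniques shared by both paradigms reduce to nearest-neighbor search over embeddings; a minimal cosine-similarity sketch, with toy hand-crafted vectors standing in for learned embeddings:

```python
import math

# Minimal vector retrieval: rank stored entries by cosine similarity to a
# query embedding. Toy hand-crafted vectors stand in for learned embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=1):
    """Return the top_k stored texts ranked by similarity to query_vec."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]),
                    reverse=True)
    return [e["text"] for e in ranked[:top_k]]

index = [
    {"text": "user prefers tea",     "vec": [1.0, 0.0, 0.1]},
    {"text": "meeting moved to 3pm", "vec": [0.0, 1.0, 0.2]},
]
hits = retrieve([0.9, 0.1, 0.0], index, top_k=1)
```

Production systems replace the linear scan with an approximate nearest-neighbor index, but the interface, a query embedding in and ranked entries out, is the same whether the store holds documents (RAG) or agent experiences (memory).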

Graph RAG Graph RAG systems structure the knowledge base as a graph, ranging from knowledge graphs to concept graphs or document-entity relations, and leverage graph traversal or graph-based ranking algorithms to retrieve context (Peng et al., 2024). This representation enables multi-hop relational reasoning, which has proven effective for knowledge-intensive tasks (Edge et al., 2025; Han et al., 2025b; Dong et al., 2025a). In the context of agent memory, graph-structured memory arises naturally when agents accumulate relational insights over time, such as linking concepts, tracking dependencies among subtasks, or recording causal relations inferred through interaction. Several well-established practices include Mem0g (Chhikara et al., 2025), A-MEM (Xu et al., 2025c), Zep (Rasmussen et al., 2025), and G-memory (Zhang et al., 2025c). Notably, graph-based agent memory systems may construct, extend, or reorganize their internal graphs throughout the agent's operation. Consequently, graph-based retrieval forms the structural backbone for both paradigms, but only agent memory treats the graph as a living, evolving representation of experience. We provide further analysis on graph-based memory forms in Section 3.1.2 and also refer the readers to a relevant survey (Liu et al., 2025h).
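
Multi-hop traversal over such a graph-structured memory can be sketched as a breadth-first expansion from a seed entity; the toy graph and relation names below are illustrative:

```python
from collections import deque

# Toy graph-structured memory: entities as nodes, relations as labeled edges.
# Multi-hop retrieval expands outward from a seed entity via BFS.
graph = {
    "task_A": [("depends_on", "task_B")],
    "task_B": [("uses_tool", "browser")],
    "browser": [],
}

def multi_hop(seed, graph, max_hops=2):
    """Collect all entities reachable from `seed` within `max_hops` edges."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

reachable = multi_hop("task_A", graph, max_hops=2)
```

In an agent memory setting, the graph itself would be mutated as new relations are observed, whereas a classical graph-RAG index is built once over a static corpus.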

Agentic RAG Agentic RAG integrates retrieval into an autonomous decision-making loop, where an LLM agent actively controls when, how, and what to retrieve (Singh et al., 2025; Sun et al., 2025e). These systems often employ iterative querying, multi-step planning, or self-directed search procedures, enabling the agent to refine its information needs through deliberate reasoning, as implemented in PlanRAG (Lee et al., 2024b) and Self-RAG (Asai et al., 2023). For a more detailed understanding of agentic RAG, we refer the readers to Singh et al. (2025). From the agent memory perspective, agentic RAG occupies the closest conceptual space: both systems involve autonomous interaction with an external information store, both support multi-step refinement, and both may incorporate retrieved insights into subsequent reasoning. The key distinction is that classical agentic RAG typically operates over an external and often task-specific database, whereas agent memory maintains an internal, persistent, and self-evolving memory base that accumulates knowledge across tasks (Yan et al., 2025b; Xu et al., 2025c).

Modular RAG

At a conceptual level, agent memory and retrieval-augmented generation (RAG) exhibit substantial overlap: both systems construct, organize, and leverage auxiliary information stores to extend the capabilities of LLM/agents beyond their native parametric knowledge. For instance, structured representations such as knowledge graphs and indexing strategies appear in both communities' methods, and recent developments in agentic RAG demonstrate how autonomous retrieval mechanisms can interact with dynamic databases in ways reminiscent of agent memory architectures (Singh et al., 2025). Indeed, the engineering stacks underlying many RAG and agent memory systems share common building blocks, including vector indices, semantic search, and context expansion modules.

Despite these technological convergences, the two paradigms have historically been distinguished by the contexts in which they are applied. Classical RAG techniques primarily augment an LLM with access to static knowledge sources , whether flat document stores, structured knowledge bases, or large corpora externally indexed to support retrieval on demand (Zhang et al., 2025q; Han et al., 2025b). These systems are designed to ground generation in up-to-date facts, mitigate hallucinations, and improve accuracy in knowledge-intensive tasks, but they generally do not maintain an internal, evolving memory of past interactions. In contrast, agent memory systems are instantiated within an agent's ongoing interaction with an environment , continuously incorporating new information generated by the agent's own actions and environmental feedback into a persistent memory base (Wang et al., 2024m; Zhao et al., 2024; Sun et al., 2025e).

In early formulations the distinction between RAG and agent memory was relatively clear: RAG retrieved from externally maintained knowledge for a single task invocation, whereas agent memory evolved over multi-turn, multi-task interaction. However, this boundary has become increasingly blurred as retrieval systems themselves become more dynamic. For example, certain retrieval tasks continuously update relevant context during iterative querying (e.g., multi-hop QA settings where related context is progressively added). Interestingly, systems such as HippoRAG/HippoRAG2 (Gutierrez et al., 2024; Gutiérrez et al., 2025) have been interpreted by both RAG and memory communities as addressing long-term memory challenges for LLMs. Consequently, a more practical (though not perfectly separable) distinction lies in the task domain . RAG is predominantly applied to augment LLMs with large, externally sourced context for individual inference tasks, exemplified by classical multi-hop and knowledge-intensive benchmarks such as HotpotQA (Yang et al., 2018), 2WikiMQA (Ho et al., 2020), and MuSiQue (Trivedi et al., 2022). By contrast, agent memory systems are typically evaluated in settings requiring sustained multi-turn interaction, temporal dependency, or environment-driven adaptation. Representative benchmarks include long-context dialogue evaluations such as LoCoMo (Maharana et al., 2024) and LongMemEval (Wu et al., 2025a), complex problem-solving and deep-research benchmarks such as GAIA (Mialon et al., 2023), XBench (Chen et al., 2025c), and BrowseComp (Wei et al., 2025b), code-centric

agentic tasks such as SWE-bench Verified (Jimenez et al., 2024), as well as lifelong learning benchmarks such as StreamBench (Wu et al., 2024a). We provide a comprehensive summary of memory-related benchmarks in Section 6.1.

Nevertheless, even this domain-based distinction contains substantial gray areas. Many works self-described as agent memory systems are evaluated under long-document question-answering tasks such as HotpotQA (Wang et al., 2025g,p), while numerous papers foregrounded as RAG systems in fact implement forms of agentic selfimprovement, continually distilling and refining knowledge or skills over time. As a result, titles, methodologies, and empirical evaluations frequently blur the conceptual boundary between the two paradigms. To further clarify these relationships, the following three paragraphs draw upon established taxonomies of RAG from (Mei et al., 2025): modular RAG , graph RAG , and agentic RAG , and examine how the core techniques associated with each lineage manifest within both RAG and agent memory systems.


At a conceptual level, agent memory and retrieval-augmented generation (RAG) exhibit substantial overlap: both construct, organize, and leverage auxiliary information stores to extend the capabilities of LLMs and agents beyond their native parametric knowledge. For instance, structured representations such as knowledge graphs and indexing strategies appear in both communities' methods, and recent developments in agentic RAG demonstrate how autonomous retrieval mechanisms can interact with dynamic databases in ways reminiscent of agent memory architectures (Singh et al., 2025). Indeed, the engineering stacks underlying many RAG and agent memory systems share common building blocks, including vector indices, semantic search, and context expansion modules.

Despite these technological convergences, the two paradigms have historically been distinguished by the contexts in which they are applied. Classical RAG techniques primarily augment an LLM with access to static knowledge sources, whether flat document stores, structured knowledge bases, or large corpora externally indexed to support retrieval on demand (Zhang et al., 2025q; Han et al., 2025b). These systems are designed to ground generation in up-to-date facts, mitigate hallucinations, and improve accuracy in knowledge-intensive tasks, but they generally do not maintain an internal, evolving memory of past interactions. In contrast, agent memory systems are instantiated within an agent's ongoing interaction with an environment, continuously incorporating new information generated by the agent's own actions and environmental feedback into a persistent memory base (Wang et al., 2024m; Zhao et al., 2024; Sun et al., 2025e).

In early formulations the distinction between RAG and agent memory was relatively clear: RAG retrieved from externally maintained knowledge for a single task invocation, whereas agent memory evolved over multi-turn, multi-task interaction. However, this boundary has become increasingly blurred as retrieval systems themselves become more dynamic. For example, certain retrieval tasks continuously update relevant context during iterative querying (e.g., multi-hop QA settings where related context is progressively added). Interestingly, systems such as HippoRAG/HippoRAG2 (Gutierrez et al., 2024; Gutiérrez et al., 2025) have been interpreted by both the RAG and memory communities as addressing long-term memory challenges for LLMs. Consequently, a more practical (though not perfectly separable) distinction lies in the task domain. RAG is predominantly applied to augment LLMs with large, externally sourced context for individual inference tasks, exemplified by classical multi-hop and knowledge-intensive benchmarks such as HotpotQA (Yang et al., 2018), 2WikiMQA (Ho et al., 2020), and MuSiQue (Trivedi et al., 2022). By contrast, agent memory systems are typically evaluated in settings requiring sustained multi-turn interaction, temporal dependency, or environment-driven adaptation. Representative benchmarks include long-context dialogue evaluations such as LoCoMo (Maharana et al., 2024) and LongMemEval (Wu et al., 2025a), complex problem-solving and deep-research benchmarks such as GAIA (Mialon et al., 2023), XBench (Chen et al., 2025c), and BrowseComp (Wei et al., 2025b), code-centric agentic tasks such as SWE-bench Verified (Jimenez et al., 2024), as well as lifelong learning benchmarks such as StreamBench (Wu et al., 2024a). We provide a comprehensive summary of memory-related benchmarks in Section 6.1.

Nevertheless, even this domain-based distinction contains substantial gray areas. Many works self-described as agent memory systems are evaluated on long-document question-answering tasks such as HotpotQA (Wang et al., 2025g,p), while numerous papers foregrounded as RAG systems in fact implement forms of agentic self-improvement, continually distilling and refining knowledge or skills over time. As a result, titles, methodologies, and empirical evaluations frequently blur the conceptual boundary between the two paradigms. To further clarify these relationships, the following three paragraphs draw upon the established RAG taxonomies of Mei et al. (2025), namely modular RAG, graph RAG, and agentic RAG, and examine how the core techniques associated with each lineage manifest within both RAG and agent memory systems.

Modular RAG Modular RAG refers to architectures in which the retrieval pipeline is decomposed into clearly specified components, such as indexing, candidate retrieval, reranking, filtering, and context assembly, that operate in a largely static and pipeline-like fashion (Singh et al., 2025). These systems treat retrieval as a well-engineered, modular subsystem external to the LLM, designed primarily for injecting relevant knowledge into the model's context window during inference. Within the agent memory perspective, the corresponding techniques typically appear in the retrieval stage, where memory access is realized through vector search, semantic similarity matching, or rule-based filtering, as seen in popular agent memory frameworks like Memary (Memary, 2025), MemOS (Li et al., 2025l), and Mem0 (Chhikara et al., 2025).
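As a concrete, deliberately toy illustration of this pipeline view, the sketch below wires the stages together in sequence. The bag-of-words cosine similarity and all function names are our simplifications, standing in for the learned embedding models and production vector stores used by frameworks such as Mem0 or MemOS.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Indexing step: a toy bag-of-words 'embedding'."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, memory: list, k: int = 2, min_score: float = 0.1) -> str:
    q = embed(query)
    # Candidate retrieval: score every memory entry against the query.
    scored = [(cosine(q, embed(m)), m) for m in memory]
    # Reranking: order by similarity; filtering: drop low-scoring candidates.
    top = [m for s, m in sorted(scored, reverse=True) if s >= min_score][:k]
    # Context assembly: join the survivors into a prompt-ready block.
    return "\n".join(top)

memory = [
    "user prefers vegetarian restaurants",
    "user lives in Berlin",
    "the agent failed to book a flight last week",
]
print(retrieve("which restaurants does the user prefer", memory))
```

Each stage is a swappable component, which is exactly the modular property the paragraph describes: the toy `embed` could be replaced by a neural encoder and `retrieve` by an approximate nearest-neighbor index without touching the rest of the pipeline.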

Graph RAG Graph RAG systems structure the knowledge base as a graph, ranging from knowledge graphs to concept graphs or document-entity relations, and leverage graph traversal or graph-based ranking algorithms to retrieve context (Peng et al., 2024). This representation enables multi-hop relational reasoning, which has proven effective for knowledge-intensive tasks (Edge et al., 2025; Han et al., 2025b; Dong et al., 2025a). In the context of agent memory, graph-structured memory arises naturally when agents accumulate relational insights over time, such as linking concepts, tracking dependencies among subtasks, or recording causal relations inferred through interaction. Several well-established practices include Mem0g (Chhikara et al., 2025), A-MEM (Xu et al., 2025c), Zep (Rasmussen et al., 2025), and G-memory (Zhang et al., 2025c). Notably, graph-based agent memory systems may construct, extend, or reorganize their internal graphs throughout the agent's operation. Consequently, graph-based retrieval forms the structural backbone for both paradigms, but only agent memory treats the graph as a living, evolving representation of experience. We provide further analysis on graph-based memory forms in Section 3.1.2 and also refer the readers to a relevant survey (Liu et al., 2025h).
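A minimal sketch of this "living graph" idea, under heavy simplification: the nodes and relation labels below are invented examples, and the bounded breadth-first `recall` stands in for the graph traversal and ranking algorithms (e.g., personalized PageRank) that real systems employ.

```python
from collections import defaultdict, deque

class GraphMemory:
    """Nodes are concepts the agent has encountered; edges are relations
    it inferred during interaction."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, node), ...]

    def link(self, src: str, relation: str, dst: str) -> None:
        """The agent extends its graph as it acts: a 'living' representation."""
        self.edges[src].append((relation, dst))

    def recall(self, seed: str, hops: int = 2) -> list:
        """Return all (src, relation, dst) triples within `hops` of the seed."""
        seen, out = {seed}, []
        frontier = deque([(seed, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for rel, nxt in self.edges[node]:
                out.append((node, rel, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return out

mem = GraphMemory()
mem.link("book flight", "requires", "payment method")
mem.link("payment method", "failed with", "expired card")
mem.link("expired card", "lesson", "verify card before checkout")

# Three hops from the task node already surfaces the lesson learned,
# the multi-hop recall that flat key-value memory cannot express.
print(mem.recall("book flight", hops=3))
```

The point of the sketch is the contrast with flat retrieval: the lesson node is reachable from the task node only through the causal chain, which a similarity search over isolated entries would likely miss.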

Agentic RAG Agentic RAG integrates retrieval into an autonomous decision-making loop, where an LLM agent actively controls when, how, and what to retrieve (Singh et al., 2025; Sun et al., 2025e). These systems often employ iterative querying, multi-step planning, or self-directed search procedures, enabling the agent to refine its information needs through deliberate reasoning, as implemented in PlanRAG (Lee et al., 2024b) and Self-RAG (Asai et al., 2023). For a more detailed understanding of agentic RAG, we refer the readers to Singh et al. (2025). From the agent memory perspective, agentic RAG occupies the closest conceptual space: both systems involve autonomous interaction with an external information store, both support multi-step refinement, and both may incorporate retrieved insights into subsequent reasoning. The key distinction is that classical agentic RAG typically operates over an external and often task-specific database, whereas agent memory maintains an internal, persistent, and self-evolving memory base that accumulates knowledge across tasks (Yan et al., 2025b; Xu et al., 2025c).
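To make the control-loop distinction tangible, the toy sketch below places retrieval inside a decision loop. The rule-based pivot on a "see:" marker is our stand-in for the LLM reasoning step with which systems like Self-RAG or PlanRAG decide when and what to retrieve next; the corpus and queries are invented.

```python
from typing import Optional

def search(query: str, corpus: dict) -> Optional[str]:
    """Stand-in retriever: exact-key lookup over a toy corpus."""
    return corpus.get(query)

def agentic_answer(question: str, corpus: dict, max_steps: int = 3) -> list:
    """Iterative querying: the controller refines its information need
    step by step instead of issuing one fixed retrieval call."""
    notes, query = [], question
    for _ in range(max_steps):
        hit = search(query, corpus)
        if hit:
            notes.append(hit)
        # 'Reasoning' stub: if the note names a follow-up entity, pivot to it;
        # a real agent would let the LLM emit the next query (or decide to stop).
        if hit and "see:" in hit:
            query = hit.split("see:")[1].strip()
        else:
            break
    return notes

corpus = {
    "who wrote the memo": "the memo was written by the ops team, see: ops team",
    "ops team": "the ops team is led by Dana",
}
# Multi-step refinement: the first hit triggers a second, narrower query.
print(agentic_answer("who wrote the memo", corpus))
```

Swapping the static `corpus` for a persistent store that the loop also writes back into is, in essence, the move from agentic RAG to agent memory described above.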

Agent Memory vs. Context Engineering

The relationship between agent memory and context engineering is best understood as an intersection of distinct operational paradigms rather than a hierarchical subsumption. Context engineering is a systematic design methodology that treats the context window as a constrained computational resource. It rigorously optimizes the information payload, including instructions, knowledge, state, and memory, to mitigate the asymmetry between massive input capacity and the model's generation capability (Mei et al., 2025). While agent memory focuses on the cognitive modeling of a persistent entity with an evolving identity, context engineering operates under a resource management paradigm. From the perspective of context engineering, agent memory is merely one variable within the context assembly function that requires efficient scheduling to maximize inference efficacy. Conversely, from the perspective of an agent, context engineering serves as the implementation layer that ensures cognitive continuity remains within the physical limits of the underlying model.
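The "context assembly function" view can be caricatured as a scheduling problem. In the hypothetical sketch below (names and priorities are ours), memory is just one prioritized payload competing with instructions and state for a fixed token budget; whitespace splitting stands in for a real tokenizer.

```python
def assemble_context(parts: list, budget: int) -> str:
    """parts: (priority, text) pairs; lower number = more important.
    Greedily pack by priority under a crude whitespace-token budget."""
    chosen, used = [], 0
    for priority, text in sorted(parts, key=lambda p: p[0]):
        cost = len(text.split())  # crude token count
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return "\n".join(chosen)

parts = [
    (0, "SYSTEM: you are a travel assistant"),
    (1, "STATE: user is mid-booking for Berlin"),
    (2, "MEMORY: user prefers aisle seats and vegetarian meals"),
    (3, "MEMORY: full transcript of last month's unrelated chat " + "x " * 50),
]
ctx = assemble_context(parts, budget=30)
# The bulky low-priority memory is dropped; everything else fits.
print(ctx)
```

From the context-engineering side this is resource allocation; from the agent side, deciding which memory payloads earn a high priority is precisely where a memory system's retrieval and consolidation logic plugs in.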

Overlap The two fields converge significantly in the technical realization of working memory during long-horizon interactions and often employ functionally identical mechanisms to address the constraints imposed by a finite context window (Hu et al., 2025a; Zhang et al., 2025r; Kang et al., 2025c; Yu et al., 2025a). Both paradigms rely on advanced information compression (Zhou et al., 2025b; Wu et al., 2025f), organization (Xu et al., 2025c; Zhang et al., 2025c; Anokhin et al., 2024), and selection (Zhang et al., 2025r) techniques to preserve operational continuity over extended interaction sequences. For example, token pruning and importance-based selection methods (Jiang et al., 2023; Li et al., 2023c) that are central to context engineering frameworks play a fundamental role in agentic memory systems by filtering noise and retaining salient information. Similarly, the rolling summary technique serves as a shared foundational primitive, functioning simultaneously as a buffer management strategy and a transient episodic memory mechanism (Yu et al., 2025a; Lu et al., 2025b). In practice, the boundary between engineering the context and maintaining an agent's short-term memory effectively dissolves in these scenarios, as both rely on the same underlying summarization, dynamic information retrieval, and recursive state updates (Tang et al., 2025b; Yoon et al., 2024).
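A minimal sketch of the rolling-summary primitive described above, with naive truncation standing in for the LLM summarization call; class names and limits are illustrative only.

```python
class RollingBuffer:
    """When the turn buffer overflows, older turns are folded into a running
    summary: simultaneously buffer management and transient episodic memory."""

    def __init__(self, max_turns: int = 3):
        self.max_turns = max_turns
        self.summary = ""   # compressed past (transient episodic memory)
        self.turns = []     # verbatim recent context

    def _summarize(self, text: str) -> str:
        # An LLM call in a real system; keep the first 5 words per folded turn.
        return " ".join(text.split()[:5]) + " ..."

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        while len(self.turns) > self.max_turns:
            oldest = self.turns.pop(0)
            self.summary = (self.summary + " " + self._summarize(oldest)).strip()

    def context(self) -> str:
        head = "[summary] " + self.summary + "\n" if self.summary else ""
        return head + "\n".join(self.turns)

buf = RollingBuffer(max_turns=2)
for t in ["user: plan a trip to Kyoto in spring please",
          "agent: suggested early April for cherry blossoms",
          "user: book a ryokan near the river"]:
    buf.add(t)
print(buf.context())
```

Note how the same `add` call implements both readings at once: from the context-engineering view it keeps the prompt within budget, and from the memory view it forms a (lossy) episodic record of what fell out of the window.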

Distinctions The distinction becomes most pronounced when moving beyond short-term text processing to the broader scope of long-lived agents. Context engineering primarily addresses the structural organization of the interaction interface between LLMs and their operational environment. This includes optimizing tool-integrated reasoning and selection pipelines (Qin et al., 2024a; Schick et al., 2023; Jia and Li, 2025) and standardizing communication protocols, such as MCP (Qiu et al., 2025c). These methods focus on ensuring that instructions, tool calls, and intermediate states are correctly formatted, efficiently scheduled, and executable within the constraints of the context window. As such, context engineering operates at the level of resource allocation and interface correctness, emphasizing syntactic validity and execution efficiency.

In contrast, agent memory defines a substantially broader cognitive scope. Beyond transient context assembly, it encompasses the persistent storage of factual knowledge (Zhong et al., 2024), the accumulation and evolution of experiential traces (Zhao et al., 2024; Tang et al., 2025d; Zhang et al., 2025d), and, in some cases, the internalization of memory into model parameters (Wang et al., 2025o). Rather than managing how information is presented to the model at inference time, agent memory governs what the agent knows, what it has experienced, and how these elements evolve over time. This includes consolidating repeated interactions into knowledge (Tan et al., 2025c), abstracting procedural knowledge from past successes and failures (Ouyang et al., 2025), and maintaining a coherent identity across tasks and episodes (Wang et al., 2024f).

From this perspective, context engineering constructs the external scaffolding that enables perception and action under resource constraints, whereas agent memory constitutes the internal substrate that supports learning, adaptation, and autonomy. The former optimizes the momentary interface between the agent and the model, while the latter sustains a persistent cognitive state that extends beyond any single context window.


The past two years have witnessed the rapid evolution of increasingly capable large language models (LLMs) into powerful AI agents (Matarazzo and Torlone, 2025; Minaee et al., 2025; Luo et al., 2025a). These foundation-model-powered agents have demonstrated remarkable progress across diverse domains such as deep research (Xu and Peng, 2025; Zhang et al., 2025p), software engineering (Wang et al., 2024i), and scientific discovery (Wei et al., 2025c), continuously advancing the trajectory toward artificial general intelligence (AGI) (Fang et al., 2025a; Durante et al., 2024). Although early conceptions of 'agents' were highly heterogeneous, a growing consensus has since emerged within the community: beyond a pure LLM backbone, an agent is typically equipped with capabilities such as reasoning, planning, perception, memory, and tool-use. Some of these abilities, such as reasoning and tool-use, have been largely internalized within model parameters through reinforcement learning (Wang et al., 2025m; Qu et al., 2025b), while others still depend heavily on external agentic scaffolds. Together, these components transform LLMs from static conditional generators into learnable policies that can interact with diverse external environments and adaptively evolve over time (Zhang et al., 2025f; Liu et al., 2025a).

Among these agentic faculties, memory stands out as a cornerstone, explicitly enabling the transformation of static LLMs, whose parameters cannot be rapidly updated, into adaptive agents capable of continual adaptation through environmental interaction (Zhang et al., 2025s; Wu et al., 2025g). From an application perspective, numerous domains demand agents with proactive memory management rather than ephemeral, forgetful behaviors: personalized chatbots (Chhikara et al., 2025; Li et al., 2025b), recommender systems (Liu et al., 2025c), social simulations (Park et al., 2023; Yang et al., 2025), and financial investigations (Zhang et al., 2024) all rely on the agent's ability to process, store, and manage historical information. From a developmental standpoint, one of the defining aspirations of AGI research is to endow agents with the capacity for continual evolution through environment interactions (Hendrycks et al., 2025), a capability fundamentally grounded in agent memory.

Agent Memory Needs A New Taxonomy Given the growing significance and community attention surrounding agent memory systems, it has become both timely and necessary to provide an updated perspective on contemporary agent memory research. The motivation for a new taxonomy and survey is twofold: ❶ Limitations of Existing Taxonomies: While several recent surveys have provided valuable and comprehensive overviews of agent memory (Zhang et al., 2025s; Wu et al., 2025g), their taxonomies were developed prior to a number of rapid methodological advances and therefore do not fully reflect the current breadth and complexity of the research landscape. For example, emerging directions in 2025, such as memory frameworks that distill reusable tools from past experiences (Qiu et al., 2025a,c; Zhao et al., 2025c), or memory-augmented test-time scaling methods (Zhang et al., 2025g; Suzgun et al., 2025), remain underrepresented in earlier classification schemes. ❷ Conceptual Fragmentation: With the explosive growth of memory-related studies, the concept itself has become increasingly expansive and fragmented. Researchers often find that papers claiming to study 'agent memory' differ drastically in implementation, objectives, and underlying assumptions. The proliferation of diverse terminologies (declarative, episodic, semantic, parametric memory, etc.) further obscures conceptual clarity, highlighting the urgent need for a coherent taxonomy that can unify these emerging concepts.

Therefore, this paper seeks to establish a systematic framework that reconciles existing definitions, bridges emerging trends, and elucidates the foundational principles of memory in agentic systems. Specifically, this survey aims to address the following key questions:

Form: What Carries Memory?

As a starting point for organizing prior work, we begin by examining the most fundamental representational units out of which agent memory can be constructed. We first ask: what architectural or representational forms can agent memory take?

Across diverse agent systems, memory is not realized through a single, unified structure. Instead, different task settings call for different storage forms, each with its own structural properties. These architectures endow memory with distinct capabilities, shaping how an agent accumulates information over interactions and maintains behavioral consistency. They ultimately enable memory to fulfill its intended roles across varied task scenarios.

Based on where memory resides and in what form it is represented, we organize these memories into three categories: token-level memory, parametric memory, and latent memory.

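The three forms can be caricatured as different data carriers. In the hypothetical sketch below (class names are ours, not a standard API), token-level memory holds promptable text, latent memory holds opaque vectors, and parametric memory holds weight deltas that are merged into the model itself rather than placed in the context.

```python
from dataclasses import dataclass, field

@dataclass
class TokenMemory:
    # Human-readable entries that can be inserted directly into a prompt.
    entries: list = field(default_factory=list)

@dataclass
class LatentMemory:
    # Non-verbal hidden-state vectors, consumed by the model, not the user.
    vectors: list = field(default_factory=list)

@dataclass
class ParametricMemory:
    # Memory that lives inside the model: per-layer weight updates.
    weight_deltas: dict = field(default_factory=dict)

    def apply(self, weights: dict) -> dict:
        """Internalize memory by merging deltas into model weights."""
        return {
            k: [w + d for w, d in
                zip(weights[k], self.weight_deltas.get(k, [0.0] * len(weights[k])))]
            for k in weights
        }

pm = ParametricMemory(weight_deltas={"layer0": [0.5, -0.25]})
print(pm.apply({"layer0": [1.0, 1.0], "layer1": [0.5]}))
# → {'layer0': [1.5, 0.75], 'layer1': [0.5]}
```

The contrast in access paths is the point: token memory is read by retrieval into the context window, latent memory by feeding vectors into the forward pass, and parametric memory only by running the updated model.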


1 Introduction
2 Preliminaries: Formalizing Agents and Memory
  2.1 LLM-based Agent Systems
  2.2 Agent Memory Systems
  2.3 Comparing Agent Memory with Other Key Concepts
    2.3.1 Agent Memory vs. LLM Memory
    2.3.2 Agent Memory vs. RAG
    2.3.3 Agent Memory vs. Context Engineering
3 Form: What Carries Memory?
  3.1 Token-level Memory
    3.1.1 Flat Memory (1D)
    3.1.2 Planar Memory (2D)
    3.1.3 Hierarchical Memory (3D)
  3.2 Parametric Memory
    3.2.1 Internal Parametric Memory
    3.2.2 External Parametric Memory
  3.3 Latent Memory
    3.3.1 Generate
    3.3.2 Reuse
    3.3.3 Transform
  3.4 Adaptation
4 Functions: Why Agents Need Memory?
  4.1 Factual Memory
    4.1.1 User Factual Memory
    4.1.2 Environment Factual Memory
  4.2 Experiential Memory
    4.2.1 Case-based Memory
    4.2.2 Strategy-based Memory
    4.2.3 Skill-based Memory
    4.2.4 Hybrid Memory
  4.3 Working Memory
    4.3.1 Single-turn Working Memory
    4.3.2 Multi-turn Working Memory
5 Dynamics: How Memory Operates and Evolves?
  5.1 Memory Formation
    5.1.1 Semantic Summarization
    5.1.2 Knowledge Distillation
    5.1.3 Structured Construction
    5.1.4 Latent Representation
    5.1.5 Parametric Internalization
  5.2 Memory Evolution
    5.2.1 Consolidation
    5.2.2 Updating
    5.2.3 Forgetting
  5.3 Memory Retrieval
    5.3.1 Retrieval Timing and Intent
    5.3.2 Query Construction
    5.3.3 Retrieval Strategies
    5.3.4 Post-Retrieval Processing
6 Resources and Frameworks
  6.1 Benchmarks and Datasets
    6.1.1 Benchmarks for Memory / Lifelong / Self-Evolving Agents
    6.1.2 Other Related Benchmarks
  6.2 Open-Source Frameworks
7 Positions and Frontiers
  7.1 Memory Retrieval vs. Memory Generation
    7.1.1 Look Back: From Memory Retrieval to Memory Generation
    7.1.2 Future Perspective
  7.2 Automated Memory Management
    7.2.1 Look Back: From Hand-crafted to Automatically Constructed Memory Systems
    7.2.2 Future Perspective
  7.3 Reinforcement Learning Meets Agent Memory
    7.3.1 Look Back: RL is Internalizing Memory Management Abilities for Agents
    7.3.2 Future Perspective
  7.4 Multimodal Memory
    7.4.1 Look Back
    7.4.2 Future Perspective
  7.5 Shared Memory in Multi-Agent Systems
    7.5.1 Look Back: From Isolated Memories to Shared Cognitive Substrates
    7.5.2 Future Perspective
  7.6 Memory for World Model
    7.6.1 Look Back
    7.6.2 Future Perspective
  7.7 Trustworthy Memory
    7.7.1 Look Back: From Trustworthy RAG to Trustworthy Memory
    7.7.2 Future Perspective
  7.8 Human-Cognitive Connections
    7.8.1 Look Back
    7.8.2 Future Perspective
| Method | Multi | Type | Memory Form | Task |
| --- | --- | --- | --- | --- |
| Flat Memory Models | | | | |
| Reflexion (Shinn et al., 2023b) | | E&W | Trajectory as short-term and feedback as long-term. | QA, Reasoning, Coding |
| Memento (Zhou et al., 2025a) | ✗ ✔ | Exp | Trajectory case (success/failure). | Reasoning, Game |
| JARVIS-1 (Wang et al., 2025q) | | Exp | Plan-environment pairs. | |
| Expel (Zhao et al., 2024) | | Exp | Insights and few-shot examples. | Reasoning |
| Buffer of Thoughts (Yang et al., 2024b) | | Exp | High-level thought-templates. | Game, Reasoning, Coding |
| SAGE (Liang et al., 2025) | | Exp | Dual-store with forgetting mechanism. | Game, Reasoning, Coding |
| ChemAgent (Tang et al., 2025c) | | Exp | Structured sub-tasks and principles. | Chemistry |
| AgentKB (Tang et al., 2025d) | | Exp | 5-tuple experience nodes. | Coding, Reasoning |
| H2R (Ye et al., 2025b) | | Exp | Planning and Execution layers. | Game, Embodied Simulation |
| AWM (Wang et al., 2024m) | | Exp | Abstracted universal workflows. | Web |
| PRINCIPLES (Kim et al., 2025a) | | Exp | Rule templates from self-play. | Emotional Companion |
| ReasoningBank (Ouyang et al., 2025) | | Exp | Transferable reasoning strategy items. | Web |
| Voyager (Wang et al., 2024b) | | Exp | Executable skill code library. | Game |
| DGM (Zhang et al., 2025i) | | Exp | Recursive self-modifiable codebase. | Coding |
| Memp (Fang et al., 2025d) | | Exp | Instructions and abstract scripts. | Embodied Simulation, Travel Planning |
| UFO2 (Zhang et al., 2025a) | | Exp | System docs and interaction records. | Windows OS |
| LEGOMem (Han et al., 2025a) | | Exp | Vectorized task trajectories. | Office |
| ToolMem (Xiao et al., 2025b) | | Exp | Tool capability. | Tool Calling |
| SCM (Wang et al., 2025a) | | Fact | Memory stream and vector database. | Long-context |
| MemoryBank (Zhong et al., 2024) | | Fact | History and user profile. | Emotional Companion |
| MPC (Lee et al., 2023) | | Fact | Persona and summary vector pool. | QA |
| RecMind (Wang et al., 2024h) | | Fact | User metadata and external knowledge. | Recommendation |
| InteRecAgent (Huang et al., 2025d) | | Fact | User profiles and candidate item. | Recommendation |
| Ego-LLaVA (Shen et al., 2024) | | Fact | Language-encoded chunk embeddings. | Multimodal QA |
| ChatHaruhi (Li et al., 2023a) | | Fact | Dialogue database from media. | Role-Playing |
| Memochat (Lu et al., 2023) | | Fact | Memos and categorized dialogue history. | Long-conv QA |
| RecursiveSum (Wang et al., 2025h) | | Fact | Recursive summaries of short dialogues. | Long-conv QA |
| MemGPT (Packer et al., 2023a) | | Fact | Virtual memory (Main/External contexts). | Long-conv QA, Doc QA |
| Method | Multi | Type | Memory Structure | Task |
| --- | --- | --- | --- | --- |
| RoleLLM (Wang et al., 2024d) | | Fact | Role-specific QA pairs. | Role-Playing |
| Think-in-memory (Liu et al., 2023a) | | Fact | Hash table of inductive thoughts. | Long-conv QA |
| PLA (Yuan et al., 2025b) | | Fact | Evolving records of history and summaries. | QA, Human Feedback |
| COMEDY (Chen et al., 2025d) | | Fact | Single-model compressed memory format. | Summary, Compression, QA |
| Memoro (Zulfikar et al., 2024) | | Fact | Speech-to-text vector embeddings. | User Study |
| Memory Sharing (Gao and Zhang, 2024a) | | Fact | Query-Response pair retrieval. | Literary Creation, Logic, Plan Generation |
| Conv Agent (Alonso et al., 2024) | | Fact | Chain-of-tables and vector entries. | QA |
| EM-LLM (Fountas et al., 2025) | | Fact | Episodic events with Bayesian boundaries. | Long-context |
| Memocrs (Xi et al., 2024a) | | Fact | User metadata and knowledge. | Recommendation |
| SECOM (Pan et al., 2025) | | Fact | Paragraph-level segmented blocks. | Long-conv QA |
| Mem0 (Chhikara et al., 2025) | | Fact | Summary and original dialogue. | Long-conv QA |
| RMM (Tan et al., 2025c) | | Fact | Reflection-organized flat entries. | Personalization |
| MEMENTO (Kwon et al., 2025) | | Fact | Interaction history entries. | Personalization |
| MemGuide (Du et al., 2025b) | | Fact | Dialogue-derived QA pairs. | Long-conv QA |
| MIRIX (Wang and Chen, 2025) | | Fact | Six optimized flat memory types. | Long-conv QA |
| SemanticAnchor (Chatterjee and Agarwal, 2025) | | Fact | Syntactic 5-tuple structure. | Long-conv QA |
| MMS (Zhang et al., 2025b) | | Fact | Dual Retrieval and Context units. | Long-conv QA |
| Memory-R1 (Yan et al., 2025c) | | Fact | RL-managed Mem0 architecture. | Long-conv QA |
| ComoRAG (Wang et al., 2025f) | | Fact | Fact/Semantic/Plot units with probes. | Narrative QA |
| Nemori (Nan et al., 2025) | | Fact | Predictive calibration store. | Long-conv QA |
| Livia (Xi and Wang, 2025) | | Fact | Pruned interaction history. | Emotional Companion |
| MOOM (Chen et al., 2025e) | ✗ ✗ | Fact | Decoupled plot and character stores. | Role-Playing |
| Mem-α (Wang et al., 2025p) | | Fact | Core, Semantic, and Episodic Mem. | Memory Management |
| Personalized Long-term Interaction (Westhäußer et al., 2025) | | Fact | Hierarchical history and summaries. | Personalization |
| LightMem (Fang et al., 2025b) | | Fact | Optimized Long/Short-term store. | Long-conv QA |
| MEXTRA (Wang et al., 2025b) | | Fact | Extracted raw dialogue data. | Privacy Attack |
| MovieChat (Song et al., 2024) | | Fact | Short-term features and long-term persistence. | Video Understanding |
| MA-LMM (He et al., 2024) | | Fact | Visual and Query memory banks. | Video Understanding |
| VideoAgent (Wang et al., 2024g) | | Fact | Temporal text descriptions and object tracking. | Video Understanding |
| Video-RAG (Luo et al., 2025b) | | Fact | Visually-aligned information. | Video Understanding |
| KARMA (Wang et al., 2025r) | ✔ | Fact | 3D scene graph and dynamic object states. | Embodied Task |
| Embodied VideoAgent (Fan et al., 2025) | ✔ | Fact | Persistent object and sensor store. | MultiModal |
| Mem2Ego (Zhang et al., 2025m) | | | Map, landmark, and visited location stores. | Embodied Navigation |
| Context-as-Memory (Yu et al., 2025b) | | Fact | Generated context frames. | Video Generation |
| RCR-Router (Liu et al., 2025d) | | Fact | Budget-aware semantic subsets. | QA |
| ELL (Cai et al., 2025a) | | Fact | Lifelong memory and skills. | Lifelong Learning |
| MemRL (Zhang et al., 2026) | | Exp | RL for memory management. | Web |
| ReMe (Cao et al., 2025b) | | Exp | Step-level experience and insight. | Web |
| MMAG (Zeppieri, 2025) | | Fact | Five interacting memory layers. | User Study |
| Hindsight (Latimer et al., 2025) | | Fact | Retains, recalls, and reflects. | Long-conv QA |
| GAM (Yan et al., 2025a) | | Fact | Simple memory but search is guided. | Long-conv QA |
| Planar Memory Models | | | | |
| D-SMART (Lei et al., 2025) | | Fact | Structured memory with reasoning trees. | Long-conv QA |
| Reflexion (Shinn et al., 2023b) | | Work | Reflective text buffer from experiences. | QA, Reasoning, Coding |
MethodMultiTypeMemory StructureTask
PREMem (Kim et al., 2025b)FactDynamic cross-session linked triples.Long-conv QA
Query Reconstruct (Xu et al., 2025b)ExpLogic graphs built from knowledge bases.KnowledgeGraph QA
KGT (Sun et al., 2024)FactKG node from query and feedback.QA
Optimus-1 (Li et al., 2024d)F&EKnowledge graph and experience pool.Game
SALI (Pan et al., 2024)ExpTopological graph with spatial nodes.Navigation
HAT (A et al., 2024)FactHierarchical aggregate tree.Long-conv QA
MemTree (Rezazadeh et al., 2025c)FactDynamic hierarchical conversation tree.Long-conv QA
TeaFarm (iunn Ong et al., 2025)FactCausal edges connecting memories.Long-conv QA
COMET (Kim et al., 2024b)FactContext-aware memory through graph.Long-conv QA
Intrinsic Memory (Yuen et al., 2025)FactPrivate internal and shared external mem.Planning
A-MEM (Xu et al., 2025c)FactCard-based connected mem.Long-conv QA
Ret-LLM (Modarressi et al., 2023)FactTriplet table and LSH vectors.QA
HuaTuo (Wang et al., 2023a)FactMedical Knowledge Graph.Medical QA
M3-Agent (Long et al., 2025)FactMultimodal nodes in graph structure.Embodied QA
EMem (Zhou and Han, 2025a)FactEvent-centric alternative with pagerank.Long-conv QA
WorldMM (Yeo et al., 2025)FactMultiple complementary memories.Video Understanding
Memoria (Sarin et al., 2025)FactKnowledge-graph profile and summary.Long-conv QA
LingoEDU (Zhou et al., 2026)FactRelation tree of Elementary Discourse Units.Long-conv QA
Hierarchical Memory Models
GraphRAG (Edge et al., 2025)FactMulti-level community graph indices.QA, Summarization
H-Mem (Sun and Zeng, 2025)FactDecoupled index layers and content layers.Long-conv QA
EMG-RAG (Wang et al., 2024l)FactThree-tiered memory graph.QA
G-Memory (Zhang et al., 2025c)ExpQuery-centric three-layer graph structure.QA, Game, Embodied Task
Zep (Rasmussen et al., 2025)FactTemporal Knowledge Graphs.Long-conv QA
SGMem (Wu et al., 2025h)FactChunk Graph and Sentence Graph.Long-conv QA
HippoRAG (Gutierrez et al., 2024)FactKnowledge with query nodes.QA
HippoRAG 2 (Gutiérrez et al., 2025)FactKG with phrase and passage.QA
AriGraph (Anokhin et al., 2024)FactSemantic and Episodic memory graph.Game
Lyfe Agents (Kaiya et al., 2023)FactWorking, Short & Long-term layers.Social Simulation
CAM (Li et al., 2025g)FactMultilayer graph with topic.Doc QA
HiAgent (Hu et al., 2025a)E&WGoal graphs with recursive cluster.Agentic Tasks
ILM-TR (Tang et al., 2024)FactHierarchical Memory tree.Long-context
CompassMem (Hu et al., 2026b)FactHierarchical event-centric Memory.QA
MAGMA (Jiang et al., 2026)FactSemantic, temporal, causal, entity graphs.Long-conv QA
EverMemOS (Hu et al., 2026a)FactReusable memories covering multi types.Long-conv QA
RGMem (Tian et al., 2025a)FactRenormalization Group-based memory.Long-conv QA
MemVerse (Liu et al., 2025e)FactMultimodal hierarchical knowledge graphs.Reasoning, QA
MethodTypeTaskOptimization
I. Internal Parametric Memory
(a) Pre-Train Phase
TNL (Qin et al., 2024b)WorkingQA, ReasoningSFT
StreamingLLM (Xiao et al., 2024)WorkingQA, ReasoningSFT
LMLM (Zhao et al., 2025b)FactualQA, Factual GenSFT
HierMemLM (Pouransari et al., 2025)FactualQA, Language ModelingSFT
Function Token (Zhang et al., 2025o)FactualLanguage ModelingPretrain
(b) Mid-Train Phase
Agent-Founder (Su et al., 2025)ExperientialTool Calling, Deep ResearchSFT
Early Experience (Zhang et al., 2025k)ExperientialTool Calling, Embodied Simulation, Reasoning, WebSFT
(c) Post-Train Phase
Character-LM (Shao et al., 2023)FactualRole PlayingSFT
CharacterGLM (Zhou et al., 2024a)FactualRole PlayingSFT
SELF-PARAM (Wang et al., 2025o)FactualQA, RecommendationKL Tuning
Room (Kim et al., 2023b)ExperientialEmbodied TaskRL
KnowledgeEditor (Cao et al., 2021)FactualQA, Fact CheckingFT
Mend (Mitchell et al., 2022)FactualQA, Fact Checking, Model EditingFT
PersonalityEdit Mao et al. (2024)FactualQA, Model EditingFT, PE
APP (Ma et al., 2024)FactualQAFT
DINM (Wang et al., 2024c)ExperientialQA, DetoxificationFT
AlphaEdit (Fang et al., 2025c)FactualQAFT
II. External Parametric Memory
(a) Adapter-based Modules
MLP-Memory (Wei et al., 2025d)FactualQA, Classification, Textual EntailmentSFT
K-Adapter (Wang et al., 2021)FactualQA, Entity Typing, ClassificationSFT
WISE (Wang et al., 2024e)FactualQA, Hallucination DetectionSFT
ELDER (Li et al., 2025d)FactualModel EditingSFT
T-Patcher (Huang et al., 2023)FactualQAFT
Sparse Memory FT (Lin et al., 2025a)FactualQASFT
Memory Decoder (Cao et al., 2025a)FactualQA, Language ModelingSFT
MemLoRA (Bini et al., 2025)FactualQASFT
(b) Auxiliary LM-based Modules
MAC (Tack et al., 2024)FactualQASFT
Retroformer (Yao et al., 2024a)ExperientialQA, Web NavigationRL
MethodFormTypeTask
I. Generate
(a) Single Modal
Gist (Mu et al., 2023)Gist TokensWorkingLong-context Compression
Taking a Deep Breath (Luo et al., 2024)Sentinel TokensWorkingLong-context QA
SoftCoT (Xu et al., 2025d)Soft TokensWorkingReasoning
CARE (Choi et al., 2025)Memory TokensWorkingQA, Fact Checking
AutoCompressor (Chevalier et al., 2023)Summary VectorsWorkingQA, Compression
MemoRAG (Qian et al., 2025)Global Semantic StatesWorkingQA, Summary
MemoryLLM (Wang et al., 2024j)Persistent TokensFactualLong-conv QA, Model Editing
M+ (Wang et al., 2025n)Cross-layer Token PoolsFactualQA
LM2 (Kang et al., 2025b)Matrix SlotsWorkingQA, Reasoning
Titans (Behrouz et al., 2025b)Neural Weights (MLP)WorkingQA, Language Modeling
MemGen (Zhang et al., 2025d)LoRA FragmentsWorking, Exp.QA, Math, Code, Embodied Task, Reasoning
EMU (Na et al., 2024)Embeddings w/ ReturnsFactualGame
TokMem (Wu et al., 2025j)Memory TokensExp.Function calling
Nested Learning (Behrouz et al., 2025a)Nested OptimizationFactualLanguage Modeling
Memoria (Park and Bak, 2024)Three memory layers with engramsFactualLanguage Modeling
(b) Multi-Modal
CoMem (Wu et al., 2025d)Multimodal EmbeddingsFactualMultimodal QA
ACM (Wu et al., 2025e)Trajectory EmbeddingsWorkingWeb
Time-VLM (Zhong et al., 2025)Patch EmbeddingsWorkingVideo Understanding
Mem Augmented RL (Mezghani et al., 2022)Novelty State EncoderWorkingVisual Navigation
MemoryVLA (Shi et al., 2025a)Perceptual StatesFactual, WorkingEmbodied Task
XMem (Cheng and Schwing, 2022)Key-Value EmbeddingsWorkingVideo Segmentation
II. Reuse
Memorizing Transformers (Wu et al., 2022)External KV CacheWorkingLanguage Modeling
SirLLM (Yao et al., 2024b)Entropy-selected KVFactualLong-conv QA
Memory 3 (Yang et al., 2024a)Critical KV PairsFactualQA
FOT (Tworkowski et al., 2023)Memory-Attention KVWorkingQA, Few-shot Learning, Language Modeling
LONGMEM (Wang et al., 2023b)Residual SideNet KVWorkingLanguage Modeling and Understanding
III. Transform
Scissorhands (Liu et al., 2023b)Pruned KVWorkingImage classification & generation
SnapKV (Li et al., 2024b)Aggregated Prefix KVWorkingLanguage Modeling
PyramidKV (Cai et al., 2024)Layer-wise BudgetWorkingLanguage Modeling
RazorAttention (Tang et al., 2025a)Compensated WindowWorkingLanguage Modeling
H2O (Zhang et al., 2023) R 3 Mem (Wang et al., 2025k)Heavy Hitter Tokens Virtual memory tokens with reversible compressionWorking WorkingQA, Language Modeling QA, Language Modeling
MethodCarrierStructureTaskOptimization
I. User Factual Memory
(a) Dialogue Coherence
MemGPT (Packer et al., 2023b)Token-level1DLong-term dialoguePE
TiM (Liu et al., 2023a)Token-level2DQAPE
MemoryBank (Zhong et al., 2024)Token-level1DEmotional CompanionPE
AI Persona (Wang et al., 2024f)Token-level1DEmotional CompanionPE
Encode-Store-Retrieve (Shen et al., 2024)Token-level1DMultimodal QAPE
Livia (Xi and Wang, 2025)Token-level1DEmotional CompanionPE
mem0 (Chhikara et al., 2025)Token-level1DLong-term dialogue, QAPE
RMM (Tan et al., 2025c)Token-level2DPersonalizationPE, RL
D-SMART (Lei et al., 2025)Token-level2DReasoningPE
Comedy (Chen et al., 2025d)Token-level1DSummary, Compression, QAPE
MEMENTO (Kwon et al., 2025)Token-level1DEmbodied, PersonalizationPE
O-Mem (Wang et al., 2025g)Token-level3DPersonalized DialoguePE
DAM-LLM (Lu and Li, 2025)Token-level1DEmotional CompanionPE
MemInsight (Salama et al., 2025)Token-level1DPersonalized DialoguePE
EMem (Zhou and Han, 2025a)Token-level1DPersonalized DialoguePE
RGMem (Tian et al., 2025a)Token-level1DLong-conv QAPE
Memoria (Sarin et al., 2025)Token-level1DLong-conv QAPE
(b) Goal Consistency
RecurrentGPT (Zhou et al., 2023b)Token-level1DLong-Context Generation, Personalized Interactive FictionPE
Memolet (Yen and Zhao, 2024)Token-level2DQA, Document ReasoningPE
MemGuide (Du et al., 2025b)Token-level1DLong-conv QAPE, SFT
SGMem (Wu et al., 2025h)Token-level2DLong-contextPE
A-Mem (Xu et al., 2025c)Token-level2DQA, ReasoningPE
M3-agent (Long et al., 2025)Token-level2DMultimodal QAPE, SFT
WorldMM (Yeo et al., 2025)Token-level1DMultimodal QAPE
EverMemOS (Hu et al., 2026a)Token-level1DLong-conv QAPE
II. Environment Factual Memory
(a) Knowledge Persistence
MemGPT (Packer et al., 2023b)Token-level1DDocument QAPE
CALYPSO (Zhu et al., 2023)Token-level1DTabletop GamingPE
AriGraph (Anokhin et al., 2024)Token-level3DGame, Multi-op QAPE
HippoRAG (Gutierrez et al., 2024)Token-level3DQAPE
WISE (Wang et al., 2024e)Parametric/Document Reasoning, QASFT
MemoryLLM (Wang et al., 2024j)Parametric/Document ReasoningSFT
Memoria (Park and Bak, 2024)Latent/Language ModelingPE
Zep (Rasmussen et al., 2025)Token-level3DDocument analysisPE
MemTree (Rezazadeh et al., 2025c)Token-level2DDocument Reasoning, DialoguePE
LMLM (Zhao et al., 2025b)Token-level1DQASFT
M+ (Wang et al., 2025n)Latent/Document Reasoning, QASFT
CAM (Li et al., 2025g)Token-level3DMulti-hop QASFT, RFT
MemAct (Zhang et al., 2025r)Token-level1DMulti-obj QARL
Mem-α (Wang et al., 2025p)Token-level1DDocument ReasoningRL
WebWeaver (Li et al., 2025m)Token-level1DDeep ResearchSFT
MemLoRA (Bini et al., 2025)Parametric/QASFT
Memory Decoder (Cao et al., 2025a)Parametric/QA, Language ModelingSFT
(b) Shared Access
GameGPT (Chen et al., 2023b)Token-level1DGame DevelopmentPE
Generative Agent (Park et al., 2023)Token-level2DSocial SimulationPE
S³ (Gao et al., 2023a)Token-level1DSocial SimulationPE
Memory Sharing (Gao and Zhang, 2024a)Token-level1DDocument ReasoningPE
MetaGPT (Hong et al., 2024)Token-level1DSoftware DevelopmentPE
G-Memory (Zhang et al., 2025e)Token-level3DQAPE
OASIS (Yang et al., 2025)Token-level, Parametric1DSocial SimulationPE
MethodCarrierFormTaskOptimization
I. Case-based Memory
Expel (Zhao et al., 2024)Token-levelSolutionReasoningPE
Synapse (Zheng et al., 2024a)Token-levelSolutionWeb Interaction, Instruction-guided Web TaskPE
Fincon (Yu et al., 2024)Token-levelSolutionFinancialPE
MapCoder (Islam et al., 2024)Token-levelSolutionCodingPE
Memento (Zhou et al., 2025a)Token-levelTrajectoryReasoningRL
COLA (Zhao et al., 2025a)Token-levelTrajectoryGUI, Web Navigation, ReasoningPE
Continuous Memory (Wu et al., 2025e)LatentTrajectoryGUISFT
JARVIS-1 (Wang et al., 2025q)Token-levelTrajectoryGame, GUI InteractionPE
MemGen (Zhang et al., 2025d)LatentTrajectoryWeb Search, Embodied Simulation, Reasoning, Math, CodeRL, SFT
Early Experience (Zhang et al., 2025k)ParametricTrajectoryEmbodied Simulation, Reasoning, Web NavigationSFT
DreamGym (Chen et al., 2025f)Token-levelTrajectoryWeb Interaction, Embodied Simulation, ShoppingRL
MemRL (Zhang et al., 2026)Token-levelTrajectoryCoding, Embodied Simulation, ReasoningRL
II. Strategy-based Memory
Reflexion (Shinn et al., 2023a)Token-levelInsightEmbodied Simulation, Reasoning, CodingPE
Buffer of Thoughts (Yang et al., 2024b)Token-levelPatternGame, Reasoning, CodingPE
AWM (Wang et al., 2024m)Token-levelWorkflowWeb Interaction, Instruction-guided Web TaskPE
RecMind (Wang et al., 2024h)Token-levelPatternRecommendationPE
H 2 R (Ye et al., 2025b)Token-levelInsightGame, Embodied SimulationPE
ReasoningBank (Ouyang et al., 2025)Token-levelInsightWeb Interaction, Instruction-guided Web TaskPE
R2D2 (Huang et al., 2025c)Token-levelInsightWeb InteractionPE
BrowserAgent (Yu et al., 2025d)Token-levelInsightGeneral QA, Web searchRL, SFT
Agent KB (Tang et al., 2025d)Token-levelWorkflowCode, ReasoningPE
ToolMem (Xiao et al., 2025b)Token-levelInsightReasoning, Image GenerationPE
PRINCIPLES (Kim et al., 2025a)Token-levelPatternEmotional CompanionPE
SE-Agent (Sun et al., 2025c)Token-levelInsightCodingPE
ACE (Zhang et al., 2025n)Token-levelInsightCoding, Tool calling, FinancialPE
Flex (Cai et al., 2025c)Token-levelInsightMath, Chemistry, BiologyPE
AgentEvolver (Zhai et al., 2025)ParametricPatternTool-augmented TaskRL
Dynamic Cheatsheet (Suzgun et al., 2025)Token-levelInsightMath, Reasoning, GamePE
Training-Free GRPO (Cai et al., 2025b)Token-levelInsightMath, Reasoning, Web SearchPE
MemEvolve (Zhang et al., 2025h)Token-levelSolution, InsightWeb Search, ReasoningPE
III. Skill-based Memory
CREATOR (Qian et al., 2023)Token-levelFunction and ScriptReasoning, MathPE
Gorilla (Patil et al., 2024)Token-levelAPITool callingSFT
ToolRerank (Zheng et al., 2024b)Token-levelAPITool callingPE
Voyager (Wang et al., 2024b)Token-levelCode SnippetGamePE
RepairAgent (Bouzenia et al., 2024)Token-levelFunction and ScriptCodingPE
COLT (Qu et al., 2024)Token-levelAPITool callingSFT
ToolLLM (Qin et al., 2024a)Token-levelAPITool CallingSFT
LEGOMem (Han et al., 2025a)Token-levelFunction and ScriptOfficePE
Darwin Gödel Machine (Zhang et al., 2025i)Token-levelCode SnippetCodePE
Huxley-Gödel Machine (Wang et al., 2025j)Token-levelCode SnippetCodePE
Memp (Fang et al., 2025d)Token-levelFunction and ScriptEmbodied Simulation, Travel PlanningPE
SkillWeaver (Zheng et al., 2025a)Token-levelFunction and ScriptWeb Interaction, Instruction-guided Web TaskPE
Alita (Qiu et al., 2025c)Token-levelMCPMath, Reasoning, VQAPE
Alita-G (Qiu et al., 2025b)Token-levelMCPMath, Reasoning, VQAPE
LearnAct (Liu et al., 2025b)Token-levelFunction and ScriptMobile GUIPE
ToolGen (Wang et al., 2025i)ParametricAPITool callingSFT
MemTool (Lumer et al., 2025)Token-levelMCPTool callingSFT
ToolRet (Shi et al., 2025c)Token-levelAPIWeb, Code, Tool RetrievalSFT
DRAFT (Qu et al., 2025a)Token-levelAPITool callingPE
ASI (Wang et al., 2025s)Token-levelFunctions and ScriptsWeb InteractionPE
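The skill-based memories above store executable artifacts — functions, scripts, APIs — for later reuse. A hypothetical Voyager-style sketch; the names, the dictionary storage format, and the unsandboxed `exec` are illustrative only (real systems verify and sandbox snippets before registering them):

```python
# Toy skill library: verified code snippets are compiled once, stored
# under a name and description, and re-executed later as callable skills.
skills = {}

def add_skill(name, description, source):
    """Compile a code snippet and register it as a reusable skill."""
    namespace = {}
    exec(source, namespace)  # real systems: sandboxed, test-verified execution
    skills[name] = {"description": description, "fn": namespace[name]}

add_skill(
    "craft_planks",
    "convert one log into four planks",
    "def craft_planks(logs):\n    return logs * 4",
)

print(skills["craft_planks"]["fn"](2))  # invoke the stored skill
```

Storing the description alongside the callable is what makes the library searchable: at decision time the agent retrieves candidate skills by matching the task against descriptions, then executes the winning snippet instead of re-deriving it.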
MethodCarrierTaskOptimization
I. Single-turn Working Memory
(a) Input Condensation
Gist (Mu et al., 2023)LatentInstruction Fine-tuningSFT
ICAE (Ge et al., 2024)LatentLanguage Modeling, Instruction Fine-tuningPretrain, LoRA
AutoCompressors (Chevalier et al., 2023)LatentLanguage ModelingSFT
LLMLingua (Jiang et al., 2023)Token-levelReasoning, Conversation, SummarizationPE
LongLLMLingua (Jiang et al., 2024)Token-levelMulti-doc QA, Long-context, Multi-hop QAPE
CompAct (Yoon et al., 2024)Token-levelDocument QASFT
HyCo2 (Liao et al., 2025a)HybridSummarization, Open-domain QA, Multi-hop QASFT
Sentence-Anchor (Tarasov et al., 2025)LatentDocument QASFT
MELODI (Chen et al., 2024c)HybridPretrainingPretrain
R3Mem (Wang et al., 2025k)LatentDocument QA, Language ModelingPEFT
(b) Observation Abstraction
Synapse (Zheng et al., 2024a)Token-levelComputer Control, Web NavigationPE
VideoAgent (Wang et al., 2024g)Token-levelLong-term Video UnderstandingPE
MA-LMM (He et al., 2024)LatentLong-term Video UnderstandingSFT
Context as Memory (Yu et al., 2025b)Token-levelLong-term Video GenerationPE
II. Multi-turn Working Memory
(c) State Consolidation
MEM1 (Zhou et al., 2025b)LatentRetrieval, Open-domain QA, ShoppingRL
MemGen (Zhang et al., 2025d)LatentReasoning, Embodied Action, Web Search, CodingRL
MemAgent (Yu et al., 2025a)Token-levelLong-term Doc. QARL
ReMemAgent (Shi et al., 2025b)Token-levelLong-term Doc. QARL
ReSum (Wu et al., 2025f)Token-levelLong-horizon Web SearchRL
MemSearcher (Yuan et al., 2025a)Token-levelMulti-hop QASFT, RL
ACON (Kang et al., 2025c)Token-levelApp use, Multi-objective QAPE
IterResearch (Chen et al., 2025b)Token-levelReasoning, Web Navigation, Long-Horizon QARL
SUPO (Lu et al., 2025a)Token-levelLong-horizon taskRL
AgentDiet (Xiao et al., 2025a)Token-levelLong-horizon taskPE
SUMER (Zheng et al., 2025c)Token-levelQARL
Sculptor (Li et al., 2025f)Token-levelMulti-Needle QAPE,RL
AgeMem (Yu et al., 2026)Token-levelQA, Embodied ActionPE,RL
(d) Hierarchical Folding
HiAgent (Hu et al., 2025a)Token-levelLong-horizon Agent TaskPE
Context-Folding (Sun et al., 2025b)Token-levelDeep Research, SWERL
AgentFold (Ye et al., 2025a)Token-levelWeb SearchSFT
DeepAgent (Li et al., 2025i)Token-levelTool Use, Shopping, ReasoningRL
(e) Cognitive Planning
SayPlan (Rana et al., 2023)Token-level3D Scene Graph, RoboticsPE
KARMA (Wang et al., 2025r)Token-levelHouseholdPE
Agent-S (Agashe et al., 2025)Token-levelComputer UsePE
PRIME (Tran et al., 2025)Token-levelMulti-hop QA, Knowledge-intensive ReasoningPE
MethodSub-TypeRepresentation FormKey Mechanism
I. Semantic Summarization
MemGPT (Packer et al., 2023a) | Incremental | Textual Summary | Merging new chunks into the working context
Mem0 (Chhikara et al., 2025) | Incremental | Textual Summary | LLM-driven summarization
Mem1 (Zhou et al., 2025b) | Incremental | Textual Summary | RL-optimized summarization (PPO)
MemAgent (Yu et al., 2025a) | Incremental | Textual Summary | RL-optimized summarization (GRPO)
MemoryBank (Zhong et al., 2024) | Partitioned | Textual Summary | Daily/Session-based segmentation
ReadAgent (Lee et al., 2024a) | Partitioned | Textual Summary | Semantic clustering before summarization
LightMem (Fang et al., 2025b) | Partitioned | Textual Summary | Topic-clustered summarization
DeepSeek-OCR (Wei et al., 2025a) | Partitioned | Visual Token Mapping | Optical 2D mapping compression
FDVS (You et al., 2024) | Partitioned | Multimodal Summary | Multi-source signal integration (Subtitle/Object)
LangRepo (Kahatapitiya et al., 2025) | Partitioned | Multimodal Summary | Hierarchical video clip aggregation
II. Knowledge Distillation
TiM (Liu et al., 2023a) | Factual | Textual Insight | Abstraction of dialogue into thoughts
RMM (Tan et al., 2025b) | Factual | Topic Insight | Abstraction of dialogue into topic-based memory
MemGuide (Du et al., 2025b) | Factual | User Intent | Capturing high-level user intent
M3-Agent (Long et al., 2025) | Factual | Text-addressable Facts | Compressing egocentric visual observations
AWM (Wang et al., 2024m) | Experiential | Workflow Patterns | Workflow extraction from success trajectories
III. Structured Construction
KGT (Sun et al., 2024) | Entity-Level | User Graph | Encoding user preferences as nodes/edges
Mem0-g (Chhikara et al., 2025) | Entity-Level | Knowledge Graph | LLM-based entity and triplet extraction
D-SMART (Lei et al., 2025) | Entity-Level | Dynamic Memory Graph | Constructing an OWL-compliant graph
GraphRAG (Edge et al., 2025) | Entity-Level | Hierarchical KG | Community detection and iterative summarization
AriGraph (Anokhin et al., 2024) | Entity-Level | Semantic+Episodic Graph | Dual-layer (Semantic nodes + Episodic links)
Zep (Rasmussen et al., 2025) | Entity-Level | Temporal KG | 3-layer graph (Episodic, Semantic, Community)
RAPTOR (Sarthi et al., 2024) | Chunk-Level | Tree Structure | Recursive GMM clustering and summarization
MemTree (Rezazadeh et al., 2025c) | Chunk-Level | Tree Structure | Bottom-up insertion and summary updates
H-MEM (Sun and Zeng, 2025) | Chunk-Level | Hierarchical JSON | Top-down 4-level hierarchy organization
A-MEM (Xu et al., 2025c) | Chunk-Level | Networked Notes | Discrete notes with semantic links
PREMem (Kim et al., 2025b) | Chunk-Level | Reasoning Patterns | Cross-session reasoning pattern clustering
CAM (Li et al., 2025g) | Chunk-Level | Hierarchical Graph | Disentangling overlapping clusters via replication
G-Memory (Zhang et al., 2025c) | Chunk-Level | Hierarchical Graph | 3-tier graph (interaction, query, insight)
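Entity-level structured construction reduces, at its core, to inserting extracted triplets into a graph and traversing it at query time. A minimal sketch with a stubbed extractor — real systems delegate triplet extraction to an LLM, and all entity names here are illustrative:

```python
from collections import defaultdict

# Hypothetical entity-level memory graph: (subject, relation, object)
# triplets stored as adjacency lists; retrieval is a one-hop lookup.
graph = defaultdict(list)

def add_triplet(subj, rel, obj):
    """Insert one extracted fact into the memory graph."""
    graph[subj].append((rel, obj))

def one_hop(subj, rel):
    """Answer a one-hop query: all objects linked to subj via rel."""
    return [o for r, o in graph[subj] if r == rel]

# In a real system these triplets would come from an LLM extractor.
add_triplet("Alice", "lives_in", "Berlin")
add_triplet("Alice", "works_at", "Acme")
add_triplet("Berlin", "located_in", "Germany")

print(one_hop("Alice", "lives_in"))  # ['Berlin']
```

Chaining such one-hop lookups (Alice → Berlin → Germany) is what gives graph memories their multi-hop reasoning advantage over flat stores, at the cost of the construction machinery the table above catalogues.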
IV. Latent Representation
MemoryLLM (Wang et al., 2024j) | Textual | Latent Vector | Self-updatable latent embeddings
M+ (Wang et al., 2025n) | Textual | Latent Vector | Cross-layer long-term memory tokens
MemGen (Zhang et al., 2025d) | Textual | Latent Token | Latent memory trigger and weaver
ESR (Shen et al., 2024) | Multimodal | Latent Vector | Video-to-Language-to-Vector encoding
CoMEM (Wu et al., 2025d) | Multimodal | Continuous Embedding | Vision-language compression via Q-Former
Mem2Ego (Zhang et al., 2025m) | Multimodal | Multimodal Embedding | Embedding landmark semantics as latent memory
KARMA (Wang et al., 2025r) | Multimodal | Multimodal Embedding | Hybrid long/short-term memory encoding
V. Parametric Internalization
MEND (Mitchell et al., 2022) | Knowledge | Gradient Decomposition | Auxiliary network for fast edits
ROME (Meng et al., 2022) | Knowledge | Model Parameters | Causal tracing and rank-one update
MEMIT (Meng et al., 2023) | Knowledge | Model Parameters | Mass-editing via residual distribution
CoLoR (Wistuba et al., 2023) | Knowledge | LoRA Parameters | Low-rank adapter training
ToolFormer (Schick et al., 2023) | Capability | Model Parameters | Supervised fine-tuning on API calls
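Rank-one editing methods such as ROME write a fact directly into a linear layer's weights. A numerical sketch of the underlying update with random stand-in tensors — causal tracing (which locates the layer) and ROME's actual covariance-weighted form are omitted, so this shows only the core linear algebra:

```python
import numpy as np

# Given a layer W, a key vector k (the prompt's lookup pattern) and a
# target value v (the new fact's representation), apply the minimal
# rank-one change so the edited layer maps k exactly to v.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # original layer weights (stand-in)
k = rng.normal(size=3)               # key vector (stand-in)
v = np.array([1.0, -2.0, 0.5, 3.0])  # desired output for k (stand-in)

# W' = W + (v - W k) k^T / (k^T k): a rank-one update
W_edit = W + np.outer(v - W @ k, k) / (k @ k)

print(np.allclose(W_edit @ k, v))  # True: the fact now lives in the weights
```

Because the correction `(v - W k) k^T / (k^T k)` vanishes on any input orthogonal to `k`, the edit perturbs behavior on unrelated keys as little as a rank-one change allows, which is the intuition behind treating parameters themselves as a memory medium.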
NameLinkFac.Exp.MM.Env.FeatureScale
Memory/Lifelong-learning/Self-evolving-oriented Benchmarks
MemBench (GitHub) | simulated | interactive scenarios | 53,000 s.
MemoryAgentBench (GitHub) | simulated | multi-turn interactions | 4 t.
LoCoMo (Website) | real | conversational memory | 300 s.
WebChoreArena (GitHub) | real | tedious web browsing | 4 t. / 532 s.
MT-Mind2Web (GitHub) | real | conversational web navigation | 720 s.
PersonaMem (Website) | simulated | dynamic user profiling | 15 t. / 180 s.
LongMemEval (GitHub) | simulated | interactive memory | 5 t. / 500 s.
PerLTQA (Website) | simulated | social personalized interactions | 8,593 s.
MemoryBank (Website) | simulated | user memory updating | 194 s.
MPR (GitHub) | simulated | user personalization | 108,000 s.
PrefEval (Website) | simulated | personal preferences | 3,000 s.
LOCCO (Website) | simulated | chronological conversations | 3,080 s.
StoryBench (Website) | mixed | interactive fiction games | 3 t.
MemoryBench (Website) | simulated | continual learning | 4 t. / ~20,000 s.
Madial-Bench (GitHub) | simulated | memory recalling | 331 s.
Evo-Memory (Website) | simulated | test-time learning | 10 t. / ~3,700 s.
LifelongAgentBench (Website) | simulated | lifelong learning | 1,396 s.
StreamBench (Website) | simulated | continuous online learning | 9,702 s.
DialSim (Website) | real | multi-dialogue understanding | ~1,300 s.
LongBench (Website) | mixed | long-context understanding | 21 t. / 4,750 s.
LongBench v2 (Website) | mixed | long-context multitasks | 20 t. / 503 s.
RULER (GitHub) | simulated | long-context retrieval | 13 t.
BABILong (GitHub) | simulated | long-context reasoning | 20 t.
MM-Needle (Website) | simulated | multimodal long-context retrieval | ~280,000 s.
HaluMem (GitHub) | simulated | memory hallucinations | 3,467 s.
HotpotQA (Website) | simulated | long-context QA | 113k s.
Other Related Benchmarks
ALFWorld (Website) | simulated | text-based embodied environment | 3,353 t.
ScienceWorld (GitHub) | simulated | interactive embodied environment | 10 t. / 30 t.
AgentGym (Website) | mixed | multiple environments | 89 t. / 20,509 s.
AgentBoard (GitHub) | mixed | multi-round interaction | 9 t. / 1,013 s.
PDDL* (Website) | simulated | strategy game | -
BabyAI (Website) | simulated | language learning | 19 t.
WebShop (Website) | simulated | e-commerce web interaction | 12,087 s.
WebArena (Website) | real | web interaction | 812 s.
MMInA (Website) | real | multihop web interaction | 1,050 s.
SWE-Bench Verified (Website) | real | code repair | 500 s.
GAIA (Website) | real | human-level deep research | 466 s.
xBench-DS (Website) | real | deep-search evaluation | 100 s.
ToolBench (GitHub) | real | API tool use | 126,486 s.
GenAI-Bench (Website) | real | visual generation evaluation | ~40,000 s.
FrameworkLinksFac.Exp.MM.StructureEvaluation
MemGPT (GitHub / Website) | hierarchical (S/LTM) | LoCoMo
Mem0 (GitHub / Website) | graph + vector | LoCoMo
Memobase (GitHub / Website) | structured profiles | LoCoMo
MIRIX (GitHub / Website) | structured memory | LoCoMo, MemoryAgentBench
MemoryOS (GitHub / Website) | hierarchical (S/M/LTM) | LoCoMo, MemoryBank
MemOS (GitHub / Website) | tree memory + memcube | LoCoMo, PrefEval, LongMemEval, PersonaMem
Zep (GitHub / Website) | temporal knowledge graph | LongMemEval
LangMem (GitHub / Website) | core API + manager | -
SuperMemory (GitHub / Website) | vector + semantic | -
Cognee (GitHub / Website) | knowledge graph | -
Memary (GitHub / Website) | stream + entity store | -
Pinecone (GitHub / Website) | vector database | -
Chroma (GitHub / Website) | vector database | -
Weaviate (GitHub / Website) | vector + graph | -
Second Me (GitHub / Website) | agent ego | -
MemU (GitHub / Website) | hierarchical layers | -
MemEngine (GitHub) | modular space | -
Memori (GitHub / Website) | memory database | -
ReMe (GitHub / Website) | memory management | -
AgentMemory (GitHub / Website) | memory management | -
MineContext (GitHub / Website) | context engineering | -
Acontext (GitHub) | context engineering + skill learning | -
PowerMem (GitHub) | oceanbase | -
ReMe (GitHub) | agentscope | BFCL, AppWorld
HindSight (GitHub) | parallel retrieval + reflection | -

1 Introduction 4
2 Preliminaries: Formalizing Agents and Memory 6
2.1 LLM-based Agent Systems 6
2.2 Agent Memory Systems 7
2.3 Comparing Agent Memory with Other Key Concepts 8
2.3.1 Agent Memory vs. LLM Memory 9
2.3.2 Agent Memory vs. RAG 10
2.3.3 Agent Memory vs. Context Engineering 11
3 Form: What Carries Memory? 12
3.1 Token-level Memory 13
3.1.1 Flat Memory (1D) 15
3.1.2 Planar Memory (2D) 20
3.1.3 Hierarchical Memory (3D) 21
3.2 Parametric Memory 22
3.2.1 Internal Parametric Memory 22
3.2.2 External Parametric Memory 24
3.3 Latent Memory 26
3.3.1 Generate 26
3.3.2 Reuse 28
3.3.3 Transform 28
3.4 Adaptation 30
4 Functions: Why Agents Need Memory? 31
4.1 Factual Memory 32
4.1.1 User Factual Memory 35
4.1.2 Environment Factual Memory 36
4.2 Experiential Memory 37
4.2.1 Case-based Memory 39
4.2.2 Strategy-based Memory 40
4.2.3 Skill-based Memory 41
4.2.4 Hybrid Memory 42
4.3 Working Memory 42
4.3.1 Single-turn Working Memory 43
4.3.2 Multi-turn Working Memory 45
5 Dynamics: How Memory Operates and Evolves? 46
5.1 Memory Formation 48
5.1.1 Semantic Summarization 48
5.1.2 Knowledge Distillation 50
5.1.3 Structured Construction 51
5.1.4 Latent Representation 53
5.1.5 Parametric Internalization 54
5.2 Memory Evolution 55
5.2.1 Consolidation 55
5.2.2 Updating 57
5.2.3 Forgetting 58
5.3 Memory Retrieval 59
5.3.1 Retrieval Timing and Intent 60
5.3.2 Query Construction 62
5.3.3 Retrieval Strategies 62
5.3.4 Post-Retrieval Processing 64
6 Resources and Frameworks 65
6.1 Benchmarks and Datasets 65
6.1.1 Benchmarks for Memory / Lifelong / Self-Evolving Agents 65
6.1.2 Other Related Benchmarks 67
6.2 Open-Source Frameworks 68
7 Positions and Frontiers 69
7.1 Memory Retrieval vs. Memory Generation 69
7.1.1 Look Back: From Memory Retrieval to Memory Generation 69
7.1.2 Future Perspective 69
7.2 Automated Memory Management 70
7.2.1 Look-Back: From Hand-crafted to Automatically Constructed Memory Systems 70
7.2.2 Future Perspective 70
7.3 Reinforcement Learning Meets Agent Memory 71
7.3.1 Look-Back: RL is Internalizing Memory Management Abilities for Agents 71
7.3.2 Future Perspective 72
7.4 Multimodal Memory 72
7.4.1 Look-Back 72
7.4.2 Future Perspective 73
7.5 Shared Memory in Multi-Agent Systems 73
7.5.1 Look-Back: From Isolated Memories to Shared Cognitive Substrates 73
7.5.2 Future Perspective 73
7.6 Memory for World Model 74
7.6.1 Look-Back 74
7.6.2 Future Perspective 74
7.7 Trustworthy Memory 75
7.7.1 Look-Back: From Trustworthy RAG to Trustworthy Memory 75
7.7.2 Future Perspective 75
7.8 Human-Cognitive Connections 76
7.8.1 Look Back 76
7.8.2 Future Perspective 76
Method | Multi | Type | Memory Form | Task
Flat Memory Models
Reflexion (Shinn et al., 2023b)E&WTrajectory as short-term and feedback as long-termQA, Reasoning, Coding
Memento (Zhou et al., 2025a)✗ ✔ExpTrajectory case (success/failure).Reasoning, Game
JARVIS-1 (Wang et al., 2025q)ExpPlan-environment pairs.
Expel (Zhao et al., 2024)ExpInsights and few-shot examples.Reasoning
Buffer of Thoughts (Yang et al., 2024b)ExpHigh-level thought-templates.Game, Reasoning, Coding
SAGE (Liang et al., 2025)ExpDual-store with forgetting mechanism.Game, Reasoning, Coding
ChemAgent (Tang et al., 2025c)ExpStructured sub-tasks and principles.Chemistry
AgentKB (Tang et al., 2025d)Exp5-tuple experience nodes.Coding, Reasoning
H2R (Ye et al., 2025b)ExpPlanning and Execution layers.Game, Embodied Simulation
AWM (Wang et al., 2024m)ExpAbstracted universal workflows.Web
PRINCIPLES (Kim et al., 2025a)ExpRule templates from self-play.Emotional Companion
ReasoningBank (Ouyang et al., 2025)ExpTransferable reasoning strategy items.Web
Voyager (Wang et al., 2024b)ExpExecutable skill code library.Game
DGM (Zhang et al., 2025i)ExpRecursive self-modifiable codebase.Coding
Memp (Fang et al., 2025d)ExpInstructions and abstract scripts.Embodied Simulation, Travel Planning
UFO2 (Zhang et al., 2025a)ExpSystem docs and interaction records.Windows OS
LEGOMem (Han et al., 2025a)ExpVectorized task trajectories.Office
ToolMem (Xiao et al., 2025b)ExpTool capability.Tool Calling
SCM (Wang et al., 2025a)FactMemory stream and vector database.Long-context
MemoryBank (Zhong et al., 2024)FactHistory and user profile.Emotional Companion
MPC (Lee et al., 2023)FactPersona and summary vector pool.QA
RecMind (Wang et al., 2024h)FactUser metadata and external knowledge.Recommendation
InteRecAgent (Huang et al., 2025d)FactUser profiles and candidate item.Recommendation
Ego-LLaVA (Shen et al., 2024)FactLanguage-encoded chunk embeddings.Multimodal QA
ChatHaruhi (Li et al., 2023a)FactDialogue database from media.Role-Playing
Memochat (Lu et al., 2023)FactMemos and categorized dialogue history.Long-conv QA
RecursiveSum (Wang et al., 2025h)FactRecursive summaries of short dialogues.Long-conv QA
MemGPT (Packer et al., 2023a)FactVirtual memory (Main/External contexts).Long-conv QA, Doc QA
RoleLLM (Wang et al., 2024d)FactRole-specific QA pairs.Role-Playing
Think-in-memory (Liu et al., 2023a)FactHash table of inductive thoughts.Long-conv QA
PLA (Yuan et al., 2025b)FactEvolving records of history and summaries.QA, Human Feedback
COMEDY (Chen et al., 2025d)FactSingle-model compressed memory format.Summary, Compression, QA
Memoro (Zulfikar et al., 2024)FactSpeech-to-text vector embeddings.User Study
Memory Sharing (Gao and Zhang, 2024a)FactQuery-Response pair retrieval.Literary Creation, Logic, Plan Generation
Conv Agent(Alonso et al., 2024)FactChain-of-tables and vector entries.QA
EM-LLM (Fountas et al., 2025)FactEpisodic events with Bayesian boundaries.Long-context
Memocrs (Xi et al., 2024a)FactUser metadata and knowledge.Recommendation
SECOM (Pan et al., 2025)FactParagraph-level segmented blocks.Long-conv QA
Mem0 (Chhikara et al., 2025)FactSummary and original dialogue.Long-conv QA
RMM (Tan et al., 2025c)FactReflection-organized flat entries.Personalization
MEMENTO (Kwon et al., 2025)FactInteraction history entries.Personalization
MemGuide (Du et al., 2025b)FactDialogue-derived QA pairs.Long-conv QA
MIRIX (Wang and Chen, 2025)FactSix optimized flat memory types.Long-conv QA
SemanticAnchor (Chatterjee and Agarwal, 2025)FactSyntactic 5-tuple structure.Long-conv QA
MMS (Zhang et al., 2025b)FactDual Retrieval and Context units.Long-conv QA
Memory-R1 (Yan et al., 2025c)FactRL-managed mem0 architecture.Long-conv QA
ComoRAG (Wang et al., 2025f)FactFact/Semantic/Plot units with probes.Narrative QA
Nemori (Nan et al., 2025)FactPredictive calibration store.Long-conv QA
Livia (Xi and Wang, 2025)FactPruned interaction history.Emotional Companion
MOOM (Chen et al., 2025e)✗ ✗FactDecoupled plot and character stores.Role-Playing
Mem-α (Wang et al., 2025p)FactCore, Semantic, and Episodic Mem.Memory Management
Personalized Long-term Interaction (Westhäußer et al., 2025)FactHierarchical history and summaries.Personalization
LightMem (Fang et al., 2025b)FactOptimized Long/Short-term store.Long-conv QA
MEXTRA (Wang et al., 2025b)FactExtracted raw dialogue data.Privacy Attack
MovieChat (Song et al., 2024)FactShort-term features and long-term persistence.Video Understanding
MA-LMM (He et al., 2024)FactVisual and Query memory banks.Video Understanding
VideoAgent (Wang et al., 2024g)FactTemporal text descriptions and object tracking.Video Understanding
Video-RAG (Luo et al., 2025b)FactVisually-aligned information.Video Understanding
KARMA (Wang et al., 2025r)✔Fact3D scene graph and dynamic object states.Embodied Task
Embodied VideoAgent (Fan et al., 2025)✔FactPersistent object and sensor store.MultiModal
Mem2Ego (Zhang et al., 2025m)FactMap, landmark, and visited location stores.Embodied Navigation
Context-as-Memory (Yu et al., 2025b)FactGenerated context frames.Video Generation
RCR-Router (Liu et al., 2025d)FactBudget-aware semantic subsets.QA
ELL (Cai et al., 2025a)FactLifelong memory and skills.Lifelong Learning
MemRL (Zhang et al., 2026)ExpRL for memory management.Web
ReMe (Cao et al., 2025b)ExpStep level experience and insight.Web
MMAG (Zeppieri, 2025)FactFive interacting memory layers.User Study
Hindsight (Latimer et al., 2025)FactRetains, recalls, and reflects.Long-conv QA
GAM (Yan et al., 2025a)FactSimple memory but search is guided.Long-conv QA
Planar Memory Models
D-SMART (Lei et al., 2025)FactStructured memory with reasoning trees.Long-conv QA
Reflexion (Shinn et al., 2023b)WorkReflective text buffer from experiences.QA, Reasoning, Coding
PREMem (Kim et al., 2025b)FactDynamic cross-session linked triples.Long-conv QA
Query Reconstruct (Xu et al., 2025b)ExpLogic graphs built from knowledge bases.KnowledgeGraph QA
KGT (Sun et al., 2024)FactKG node from query and feedback.QA
Optimus-1 (Li et al., 2024d)F&EKnowledge graph and experience pool.Game
SALI (Pan et al., 2024)ExpTopological graph with spatial nodesNavigation
HAT (A et al., 2024)FactHierarchical aggregate tree.Long-conv QA
MemTree (Rezazadeh et al., 2025c)FactDynamic hierarchical conversation tree.Long-conv QA
TeaFarm (iunn Ong et al., 2025)FactCausal edges connecting memories.Long-conv QA
COMET (Kim et al., 2024b)FactContext-aware memory through graph.Long-conv QA
Intrinsic Memory (Yuen et al., 2025)FactPrivate internal and shared external mem.Planning
A-MEM (Xu et al., 2025c)FactCard-based connected mem.Long-conv QA
Ret-LLM (Modarressi et al., 2023)FactTriplet table and LSH vectors.QA
HuaTuo (Wang et al., 2023a)FactMedical Knowledge Graph.Medical QA
M3-Agent (Long et al., 2025)FactMultimodal nodes in graph structure.Embodied QA
EMem (Zhou and Han, 2025a)FactEvent-centric alternative with pagerank.Long-conv QA
WorldMM (Yeo et al., 2025)FactMultiple complementary memories.Video Understanding
Memoria (Sarin et al., 2025)FactKnowledge-graph profile and summary.Long-conv QA
LingoEDU (Zhou et al., 2026)FactRelation tree of Elementary Discourse Units.Long-conv QA
Hierarchical Memory Models
GraphRAG (Edge et al., 2025)FactMulti-level community graph indices.QA, Summarization
H-Mem (Sun and Zeng, 2025)FactDecoupled index layers and content layers.Long-conv QA
EMG-RAG (Wang et al., 2024l)FactThree-tiered memory graph.QA
G-Memory (Zhang et al., 2025c)ExpQuery-centric three-layer graph structure.QA, Game, Embodied Task
Zep (Rasmussen et al., 2025)FactTemporal Knowledge Graphs.Long-conv QA
SGMem (Wu et al., 2025h)FactChunk Graph and Sentence Graph.Long-conv QA
HippoRAG (Gutierrez et al., 2024)FactKnowledge with query nodes.QA
HippoRAG 2 (Gutiérrez et al., 2025)FactKG with phrase and passage.QA
AriGraph (Anokhin et al., 2024)FactSemantic and Episodic memory graph.Game
Lyfe Agents (Kaiya et al., 2023)FactWorking, Short & Long-term layers.Social Simulation
CAM (Li et al., 2025g)FactMultilayer graph with topic.Doc QA
HiAgent (Hu et al., 2025a)E&WGoal graphs with recursive cluster.Agentic Tasks
ILM-TR (Tang et al., 2024)FactHierarchical Memory tree.Long-context
CompassMem (Hu et al., 2026b)FactHierarchical event-centric Memory.QA
MAGMA (Jiang et al., 2026)FactSemantic, temporal, causal, entity graphs.Long-conv QA
EverMemOS (Hu et al., 2026a)FactReusable memories covering multi types.Long-conv QA
RGMem (Tian et al., 2025a)FactRenormalization Group-based memory.Long-conv QA
MemVerse (Liu et al., 2025e)FactMultimodal hierarchical knowledge graphs.Reasoning, QA
Method | Type | Task | Optimization
I. Internal Parametric Memory
(a) Pre-Train Phase
TNL (Qin et al., 2024b)WorkingQA, ReasoningSFT
StreamingLLM (Xiao et al., 2024)WorkingQA, ReasoningSFT
LMLM (Zhao et al., 2025b)FactualQA, Factual GenSFT
HierMemLM (Pouransari et al., 2025)FactualQA, Language ModelingSFT
Function Token (Zhang et al., 2025o)FactualLanguage ModelingPretrain
(b) Mid-Train Phase
Agent-Founder (Su et al., 2025)ExperientialTool Calling, Deep ResearchSFT
Early Experience (Zhang et al., 2025k)ExperientialTool Calling, Embodied Simulation, Reasoning, WebSFT
(c) Post-Train Phase
Character-LM (Shao et al., 2023)FactualRole PlayingSFT
CharacterGLM (Zhou et al., 2024a)FactualRole PlayingSFT
SELF-PARAM (Wang et al., 2025o)FactualQA, RecommendationKL Tuning
Room (Kim et al., 2023b)ExperientialEmbodied TaskRL
KnowledgeEditor (Cao et al., 2021)FactualQA, Fact CheckingFT
Mend (Mitchell et al., 2022)FactualQA, Fact Checking, Model EditingFT
PersonalityEdit (Mao et al., 2024)FactualQA, Model EditingFT, PE
APP (Ma et al., 2024)FactualQAFT
DINM (Wang et al., 2024c)ExperientialQA, DetoxificationFT
AlphaEdit (Fang et al., 2025c)FactualQAFT
II. External Parametric Memory
(a) Adapter-based Modules
MLP-Memory (Wei et al., 2025d)FactualQA, Classification, Textual EntailmentSFT
K-Adapter (Wang et al., 2021)FactualQA, Entity Typing, ClassificationSFT
WISE (Wang et al., 2024e)FactualQA, Hallucination DetectionSFT
ELDER (Li et al., 2025d)FactualModel EditingSFT
T-Patcher (Huang et al., 2023)FactualQAFT
Sparse Memory FT (Lin et al., 2025a)FactualQASFT
Memory Decoder (Cao et al., 2025a)FactualQA, Language ModelingSFT
MemLoRA (Bini et al., 2025)FactualQASFT
(b) Auxiliary LM-based Modules
MAC (Tack et al., 2024)FactualQASFT
Retroformer (Yao et al., 2024a)ExperientialQA, Web NavigationRL
Method | Form | Type | Task
I. Generate
(a) Single Modal
Gist (Mu et al., 2023)Gist TokensWorkingLong-context Compression
Taking a Deep Breath (Luo et al., 2024)Sentinel TokensWorkingLong-context QA
SoftCoT (Xu et al., 2025d)Soft TokensWorkingReasoning
CARE (Choi et al., 2025)Memory TokensWorkingQA, Fact Checking
AutoCompressor (Chevalier et al., 2023)Summary VectorsWorkingQA, Compression
MemoRAG (Qian et al., 2025)Global Semantic StatesWorkingQA, Summary
MemoryLLM (Wang et al., 2024j)Persistent TokensFactualLong-conv QA, Model Editing
M+ (Wang et al., 2025n)Cross-layer Token PoolsFactualQA
LM2 (Kang et al., 2025b)Matrix SlotsWorkingQA, Reasoning
Titans (Behrouz et al., 2025b)Neural Weights (MLP)WorkingQA, Language Modeling
MemGen (Zhang et al., 2025d)LoRA FragmentsWorking, Exp.QA, Math, Code, Embodied Task, Reasoning
EMU (Na et al., 2024)Embeddings w/ ReturnsFactualGame
TokMem (Wu et al., 2025j)Memory TokensExp.Function calling
Nested Learning (Behrouz et al., 2025a)Nested OptimizationFactualLanguage Modeling
Memoria (Park and Bak, 2024)Three memory layers with engramsFactualLanguage Modeling
(b) Multi-Modal
CoMem (Wu et al., 2025d)Multimodal EmbeddingsFactualMultimodal QA
ACM (Wu et al., 2025e)Trajectory EmbeddingsWorkingWeb
Time-VLM (Zhong et al., 2025)Patch EmbeddingsWorkingVideo Understanding
Mem Augmented RL (Mezghani et al., 2022)Novelty State EncoderWorkingVisual Navigation
MemoryVLA (Shi et al., 2025a)Perceptual StatesFactual, WorkingEmbodied Task
XMem (Cheng and Schwing, 2022)Key-Value EmbeddingsWorkingVideo Segmentation
II. Reuse
Memorizing Transformers (Wu et al., 2022)External KV CacheWorkingLanguage Modeling
SirLLM (Yao et al., 2024b)Entropy-selected KVFactualLong-conv QA
Memory 3 (Yang et al., 2024a)Critical KV PairsFactualQA
FOT (Tworkowski et al., 2023)Memory-Attention KVWorkingQA, Few-shot Learning, Language Modeling
LONGMEM (Wang et al., 2023b)Residual SideNet KVWorkingLanguage Modeling and Understanding
III. Transform
Scissorhands (Liu et al., 2023b)Pruned KVWorkingImage classification & generation
SnapKV (Li et al., 2024b)Aggregated Prefix KVWorkingLanguage Modeling
PyramidKV (Cai et al., 2024)Layer-wise BudgetWorkingLanguage Modeling
RazorAttention (Tang et al., 2025a)Compensated WindowWorkingLanguage Modeling
H2O (Zhang et al., 2023)Heavy Hitter TokensWorkingQA, Language Modeling
R3Mem (Wang et al., 2025k)Virtual memory tokens with reversible compressionWorkingQA, Language Modeling
Method | Carrier | Structure | Task | Optimization
I. User Factual Memory
(a) Dialogue Coherence
MemGPT (Packer et al., 2023b)Token-level1DLong-term dialoguePE
TiM (Liu et al., 2023a)Token-level2DQAPE
MemoryBank (Zhong et al., 2024)Token-level1DEmotional CompanionPE
AI Persona (Wang et al., 2024f)Token-level1DEmotional CompanionPE
Encode-Store-Retrieve (Shen et al., 2024)Token-level1DMultimodal QAPE
Livia (Xi and Wang, 2025)Token-level1DEmotional CompanionPE
mem0 (Chhikara et al., 2025)Token-level1DLong-term dialogue, QAPE
RMM (Tan et al., 2025c)Token-level2DPersonalizationPE, RL
D-SMART (Lei et al., 2025)Token-level2DReasoningPE
Comedy (Chen et al., 2025d)Token-level1DSummary, Compression, QAPE
MEMENTO (Kwon et al., 2025)Token-level1DEmbodied, PersonalizationPE
O-Mem (Wang et al., 2025g)Token-level3DPersonalized DialoguePE
DAM-LLM (Lu and Li, 2025)Token-level1DEmotional CompanionPE
MemInsight (Salama et al., 2025)Token-level1DPersonalized DialoguePE
EMem (Zhou and Han, 2025a)Token-level1DPersonalized DialoguePE
RGMem (Tian et al., 2025a)Token-level1DLong-conv QAPE
Memoria (Sarin et al., 2025)Token-level1DLong-conv QAPE
(b) Goal Consistency
RecurrentGPT (Zhou et al., 2023b)Token-level1DLong-Context Generation, Personalized Interactive FictionPE
Memolet (Yen and Zhao, 2024)Token-level2DQA, Document ReasoningPE
MemGuide (Du et al., 2025b)Token-level1DLong-conv QAPE, SFT
SGMem (Wu et al., 2025h)Token-level2DLong-contextPE
A-Mem (Xu et al., 2025c)Token-level2DQA, ReasoningPE
M3-agent (Long et al., 2025)Token-level2DMultimodal QAPE, SFT
WorldMM (Yeo et al., 2025)Token-level1DMultimodal QAPE
EverMemOS (Hu et al., 2026a)Token-level1DLong-conv QAPE
II. Environment Factual Memory
(a) Knowledge Persistence
MemGPT (Packer et al., 2023b)Token-level1DDocument QAPE
CALYPSO (Zhu et al., 2023)Token-level1DTabletop GamingPE
AriGraph (Anokhin et al., 2024)Token-level3DGame, Multi-op QAPE
HippoRAG (Gutierrez et al., 2024)Token-level3DQAPE
WISE (Wang et al., 2024e)Parametric/Document Reasoning, QASFT
MemoryLLM (Wang et al., 2024j)Parametric/Document ReasoningSFT
Memoria (Park and Bak, 2024)Latent/Language ModelingPE
Zep (Rasmussen et al., 2025)Token-level3DDocument analysisPE
MemTree (Rezazadeh et al., 2025c)Token-level2DDocument Reasoning, DialoguePE
LMLM (Zhao et al., 2025b)Token-level1DQASFT
M+ (Wang et al., 2025n)Latent/Document Reasoning, QASFT
CAM (Li et al., 2025g)Token-level3DMulti-hop QASFT, RFT
MemAct (Zhang et al., 2025r)Token-level1DMulti-obj QARL
Mem-α (Wang et al., 2025p)Token-level1DDocument ReasoningRL
WebWeaver (Li et al., 2025m)Token-level1DDeep ResearchSFT
MemLoRA (Bini et al., 2025)Parametric/QASFT
Memory Decoder (Cao et al., 2025a)Parametric/QA, Language ModelingSFT
(b) Shared Access
GameGPT (Chen et al., 2023b)Token-level1DGame DevelopmentPE
Generative Agent (Park et al., 2023)Token-level2DSocial SimulationPE
S³ (Gao et al., 2023a)Token-level1DSocial SimulationPE
Memory Sharing (Gao and Zhang, 2024a)Token-level1DDocument ReasoningPE
MetaGPT (Hong et al., 2024)Token-level1DSoftware DevelopmentPE
G-Memory (Zhang et al., 2025e)Token-level3DQAPE
OASIS (Yang et al., 2025)Token-level, Parametric1DSocial SimulationPE
Method | Carrier | Form | Task | Optimization
I. Case-based Memory
Expel (Zhao et al., 2024)Token-levelSolutionReasoningPE
Synapse (Zheng et al., 2024a)Token-levelSolutionWeb Interaction, Instruction-guided Web TaskPE
Fincon (Yu et al., 2024)Token-levelSolutionFinancialPE
MapCoder (Islam et al., 2024)Token-levelSolutionCodingPE
Memento (Zhou et al., 2025a)Token-levelTrajectoryReasoningRL
COLA (Zhao et al., 2025a)Token-levelTrajectoryGUI, Web Navigation, ReasoningPE
Continuous Memory (Wu et al., 2025e)LatentTrajectoryGUISFT
JARVIS-1 (Wang et al., 2025q)Token-levelTrajectoryGame, GUI InteractionPE
MemGen (Zhang et al., 2025d)LatentTrajectoryWeb Search, Embodied Simulation, Reasoning, Math, CodeRL, SFT
Early Experience (Zhang et al., 2025k)ParametricTrajectoryEmbodied Simulation, Reasoning, Web NavigationSFT
DreamGym (Chen et al., 2025f)Token-levelTrajectoryWeb Interaction, Embodied Simulation, ShoppingRL
MemRL (Zhang et al., 2026)Token-levelTrajectoryCoding, Embodied Simulation, ReasoningRL
II. Strategy-based Memory
Reflexion (Shinn et al., 2023a)Token-levelInsightEmbodied Simulation, Reasoning, CodingPE
Buffer of Thoughts (Yang et al., 2024b)Token-levelPatternGame, Reasoning, CodingPE
AWM (Wang et al., 2024m)Token-levelWorkflowWeb Interaction, Instruction-guided Web TaskPE
RecMind (Wang et al., 2024h)Token-levelPatternRecommendationPE
H2R (Ye et al., 2025b)Token-levelInsightGame, Embodied SimulationPE
ReasoningBank (Ouyang et al., 2025)Token-levelInsightWeb Interaction, Instruction-guided Web TaskPE
R2D2 (Huang et al., 2025c)Token-levelInsightWeb InteractionPE
BrowserAgent (Yu et al., 2025d)Token-levelInsightGeneral QA, Web searchRL, SFT
Agent KB (Tang et al., 2025d)Token-levelWorkflowCode, ReasoningPE
ToolMem (Xiao et al., 2025b)Token-levelInsightReasoning, Image GenerationPE
PRINCIPLES (Kim et al., 2025a)Token-levelPatternEmotional CompanionPE
SE-Agent (Sun et al., 2025c)Token-levelInsightCodingPE
ACE (Zhang et al., 2025n)Token-levelInsightCoding, Tool calling, FinancialPE
Flex (Cai et al., 2025c)Token-levelInsightMath, Chemistry, BiologyPE
AgentEvolver (Zhai et al., 2025)ParametricPatternTool-augmented TaskRL
Dynamic Cheatsheet (Suzgun et al., 2025)Token-levelInsightMath, Reasoning, GamePE
Training-Free GRPO (Cai et al., 2025b)Token-levelInsightMath, Reasoning, Web SearchPE
MemEvolve (Zhang et al., 2025h)Token-levelSolution,InsightWeb Search, ReasoningPE
III. Skill-based Memory
CREATOR (Qian et al., 2023)Token-levelFunction and ScriptReasoning, MathPE
Gorilla (Patil et al., 2024)Token-levelAPITool callingSFT
ToolRerank (Zheng et al., 2024b)Token-levelAPITool callingPE
Voyager (Wang et al., 2024b)Token-levelCode SnippetGamePE
RepairAgent (Bouzenia et al., 2024)Token-levelFunction and ScriptCodingPE
COLT (Qu et al., 2024)Token-levelAPITool callingSFT
ToolLLM (Qin et al., 2024a)Token-levelAPITool CallingSFT
LEGOMem (Han et al., 2025a)Token-levelFunction and ScriptOfficePE
Darwin Gödel Machine (Zhang et al., 2025i)Token-levelCode SnippetCodePE
Huxley-Gödel Machine (Wang et al., 2025j)Token-levelCode SnippetCodePE
Memp (Fang et al., 2025d)Token-levelFunction and ScriptEmbodied Simulation, Travel PlanningPE
SkillWeaver (Zheng et al., 2025a)Token-levelFunction and ScriptWeb Interaction, Instruction-guided Web TaskPE
Alita (Qiu et al., 2025c)Token-levelMCPMath, Reasoning, VQAPE
Alita-G (Qiu et al., 2025b)Token-levelMCPMath, Reasoning, VQAPE
LearnAct (Liu et al., 2025b)Token-levelFunction and ScriptMobile GUIPE
ToolGen (Wang et al., 2025i)ParametricAPITool callingSFT
MemTool (Lumer et al., 2025)Token-levelMCPTool callingSFT
ToolRet (Shi et al., 2025c)Token-levelAPIWeb, Code, Tool RetrievalSFT
DRAFT (Qu et al., 2025a)Token-levelAPITool callingPE
ASI (Wang et al., 2025s)Token-levelFunctions and ScriptsWeb InteractionPE
Method | Carrier | Task | Optimization
I. Single-turn Working Memory
(a) Input Condensation
Gist (Mu et al., 2023)LatentInstruction Fine-tuningSFT
ICAE (Ge et al., 2024)LatentLanguage Modeling, Instruction Fine-tuningPretrain, LoRA
AutoCompressors (Chevalier et al., 2023)LatentLanguage ModelingSFT
LLMLingua (Jiang et al., 2023)Token-levelReasoning, Conversation, SummarizationPE
LongLLMLingua (Jiang et al., 2024)Token-levelMulti-doc QA, Long-context, Multi-hop QAPE
CompAct (Yoon et al., 2024)Token-levelDocument QASFT
HyCo2 (Liao et al., 2025a)HybridSummarization, Open-domain QA, Multi-hop QASFT
Sentence-Anchor (Tarasov et al., 2025)LatentDocument QASFT
MELODI (Chen et al., 2024c)HybridPretrainingPretrain
R3Mem (Wang et al., 2025k)LatentDocument QA, Language ModelingPEFT
(b) Observation Abstraction
Synapse (Zheng et al., 2024a)Token-levelComputer Control, Web NavigationPE
VideoAgent (Wang et al., 2024g)Token-levelLong-term Video UnderstandingPE
MA-LMM (He et al., 2024)LatentLong-term Video UnderstandingSFT
Context as Memory (Yu et al., 2025b)Token-levelLong-term Video GenerationPE
II. Multi-turn Working Memory
(c) State Consolidation
MEM1 (Zhou et al., 2025b)LatentRetrieval, Open-domain QA, ShoppingRL
MemGen (Zhang et al., 2025d)LatentReasoning, Embodied Action, Web Search, CodingRL
MemAgent (Yu et al., 2025a)Token-levelLong-term Doc. QARL
ReMemAgent (Shi et al., 2025b)Token-levelLong-term Doc. QARL
ReSum (Wu et al., 2025f)Token-levelLong-horizon Web SearchRL
MemSearcher (Yuan et al., 2025a)Token-levelMulti-hop QASFT, RL
ACON (Kang et al., 2025c)Token-levelApp use, Multi-objective QAPE
IterResearch (Chen et al., 2025b)Token-levelReasoning, Web Navigation, Long-Horizon QARL
SUPO (Lu et al., 2025a)Token-levelLong-horizon taskRL
AgentDiet (Xiao et al., 2025a)Token-levelLong-horizon taskPE
SUMER (Zheng et al., 2025c)Token-levelQARL
Sculptor (Li et al., 2025f)Token-levelMulti-Needle QAPE,RL
AgeMem (Yu et al., 2026)Token-levelQA, Embodied ActionPE,RL
(d) Hierarchical Folding
HiAgent (Hu et al., 2025a)Token-levelLong-horizon Agent TaskPE
Context-Folding (Sun et al., 2025b)Token-levelDeep Research, SWERL
AgentFold (Ye et al., 2025a)Token-levelWeb SearchSFT
DeepAgent (Li et al., 2025i)Token-levelTool Use, Shopping, ReasoningRL
(e) Cognitive Planning
SayPlan (Rana et al., 2023)Token-level3D Scene Graph, RoboticsPE
KARMA (Wang et al., 2025r)Token-levelHouseholdPE
Agent-S (Agashe et al., 2025)Token-levelComputer UsePE
PRIME (Tran et al., 2025)Token-levelMulti-hop QA, Knowledge-intensive ReasoningPE
Method | Sub-Type | Representation Form | Key Mechanism
I. Semantic Summarization

MemGPT (Packer et al., 2023a) | Incremental | Textual Summary | Merging new chunks into the working context
Mem0 (Chhikara et al., 2025) | Incremental | Textual Summary | LLM-driven summarization
MEM1 (Zhou et al., 2025b) | Incremental | Textual Summary | RL-optimized summarization (PPO)
MemAgent (Yu et al., 2025a) | Incremental | Textual Summary | RL-optimized summarization (GRPO)
MemoryBank (Zhong et al., 2024) | Partitioned | Textual Summary | Daily/session-based segmentation
ReadAgent (Lee et al., 2024a) | Partitioned | Textual Summary | Semantic clustering before summarization
LightMem (Fang et al., 2025b) | Partitioned | Textual Summary | Topic-clustered summarization
DeepSeek-OCR (Wei et al., 2025a) | Partitioned | Visual Token Mapping | Optical 2D mapping compression
FDVS (You et al., 2024) | Partitioned | Multimodal Summary | Multi-source signal integration (subtitle/object)
LangRepo (Kahatapitiya et al., 2025) | Partitioned | Multimodal Summary | Hierarchical video clip aggregation

II. Knowledge Distillation

TiM (Liu et al., 2023a) | Factual | Textual Insight | Abstraction of dialogue into thoughts
RMM (Tan et al., 2025b) | Factual | Topic Insight | Abstraction of dialogue into topic-based memory
MemGuide (Du et al., 2025b) | Factual | User Intent | Capturing high-level user intent
M3-Agent (Long et al., 2025) | Factual | Text-addressable Facts | Compressing egocentric visual observations
AWM (Wang et al., 2024m) | Experiential | Workflow Patterns | Workflow extraction from success trajectories

III. Structured Construction

KGT (Sun et al., 2024) | Entity-Level | User Graph | Encoding user preferences as nodes/edges
Mem0g (Chhikara et al., 2025) | Entity-Level | Knowledge Graph | LLM-based entity and triplet extraction
D-SMART (Lei et al., 2025) | Entity-Level | Dynamic Memory Graph | Constructing an OWL-compliant graph
GraphRAG (Edge et al., 2025) | Entity-Level | Hierarchical KG | Community detection and iterative summarization
AriGraph (Anokhin et al., 2024) | Entity-Level | Semantic+Episodic Graph | Dual-layer (semantic nodes + episodic links)
Zep (Rasmussen et al., 2025) | Entity-Level | Temporal KG | 3-layer graph (episodic, semantic, community)
RAPTOR (Sarthi et al., 2024) | Chunk-Level | Tree Structure | Recursive GMM clustering and summarization
MemTree (Rezazadeh et al., 2025c) | Chunk-Level | Tree Structure | Bottom-up insertion and summary updates
H-MEM (Sun and Zeng, 2025) | Chunk-Level | Hierarchical JSON | Top-down 4-level hierarchy organization
A-MEM (Xu et al., 2025c) | Chunk-Level | Networked Notes | Discrete notes with semantic links
PREMem (Kim et al., 2025b) | Chunk-Level | Reasoning Patterns | Cross-session reasoning pattern clustering
CAM (Li et al., 2025g) | Chunk-Level | Hierarchical Graph | Disentangling overlapping clusters via replication
G-Memory (Zhang et al., 2025c) | Chunk-Level | Hierarchical Graph | 3-tier graph (interaction, query, insight)
IV. Latent Representation

MemoryLLM (Wang et al., 2024j) | Textual | Latent Vector | Self-updatable latent embeddings
M+ (Wang et al., 2025n) | Textual | Latent Vector | Cross-layer long-term memory tokens
MemGen (Zhang et al., 2025d) | Textual | Latent Token | Latent memory trigger and weaver
ESR (Shen et al., 2024) | Multimodal | Latent Vector | Video-to-Language-to-Vector encoding
CoMEM (Wu et al., 2025d) | Multimodal | Continuous Embedding | Vision-language compression via Q-Former
Mem2Ego (Zhang et al., 2025m) | Multimodal | Multimodal Embedding | Embedding landmark semantics as latent memory
KARMA (Wang et al., 2025r) | Multimodal | Multimodal Embedding | Hybrid long/short-term memory encoding

V. Parametric Internalization

MEND (Mitchell et al., 2022) | Knowledge | Gradient Decomposition | Auxiliary network for fast edits
ROME (Meng et al., 2022) | Knowledge | Model Parameters | Causal tracing and rank-one update
MEMIT (Meng et al., 2023) | Knowledge | Model Parameters | Mass-editing via residual distribution
CoLoR (Wistuba et al., 2023) | Knowledge | LoRA Parameters | Low-rank adapter training
ToolFormer (Schick et al., 2023) | Capability | Model Parameters | Supervised fine-tuning on API calls
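The formation strategies tabulated above share a common skeleton: observe, merge into a bounded store, re-compress. Below is a minimal, hypothetical sketch of the incremental semantic summarization pattern; `summarize()` is a truncation stub standing in for the LLM summarization call real systems make, and `IncrementalMemory` is an illustrative name, not any framework's API.

```python
# Minimal sketch of incremental semantic summarization: each new observation
# is merged into a rolling summary that is re-compressed under a fixed budget.
from dataclasses import dataclass, field


def summarize(text: str, budget: int) -> str:
    """Stub summarizer: truncation stands in for an LLM summarization call."""
    return text if len(text) <= budget else text[: budget - 3] + "..."


@dataclass
class IncrementalMemory:
    budget: int = 200                            # max chars kept in the summary
    summary: str = ""                            # compressed long-term record
    archive: list = field(default_factory=list)  # raw turns, kept for audit

    def observe(self, turn: str) -> None:
        # Formation step: merge the new chunk into the running summary,
        # re-compressing whenever the merged text exceeds the budget.
        self.archive.append(turn)
        merged = (self.summary + " " + turn).strip()
        self.summary = summarize(merged, self.budget)


mem = IncrementalMemory(budget=40)
for t in ["User likes jazz.", "User lives in Oslo.", "User owns two cats."]:
    mem.observe(t)
assert len(mem.summary) <= 40  # bounded memory regardless of history length
assert len(mem.archive) == 3   # raw history remains available for retrieval
```

The same loop generalizes to the partitioned variants by segmenting `archive` (e.g. per session) before summarizing each segment.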
Name | Link | Env. | Feature | Scale

Memory/Lifelong-learning/Self-evolving-oriented Benchmarks

MemBench | GitHub | simulated | interactive scenarios | 53,000 s.
MemoryAgentBench | GitHub | simulated | multi-turn interactions | 4 t.
LoCoMo | Website | real | conversational memory | 300 s.
WebChoreArena | GitHub | real | tedious web browsing | 4 t./532 s.
MT-Mind2Web | GitHub | real | conversational web navigation | 720 s.
PersonaMem | Website | simulated | dynamic user profiling | 15 t./180 s.
LongMemEval | GitHub | simulated | interactive memory | 5 t./500 s.
PerLTQA | Website | simulated | social personalized interactions | 8,593 s.
MemoryBank | Website | simulated | user memory updating | 194 s.
MPR | GitHub | simulated | user personalization | 108,000 s.
PrefEval | Website | simulated | personal preferences | 3,000 s.
LOCCO | Website | simulated | chronological conversations | 3,080 s.
StoryBench | Website | mixed | interactive fiction games | 3 t.
MemoryBench | Website | simulated | continual learning | 4 t./~20,000 s.
Madial-Bench | GitHub | simulated | memory recalling | 331 s.
Evo-Memory | Website | simulated | test-time learning | 10 t./~3,700 s.
LifelongAgentBench | Website | simulated | lifelong learning | 1,396 s.
StreamBench | Website | simulated | continuous online learning | 9,702 s.
DialSim | Website | real | multi-dialogue understanding | ~1,300 s.
LongBench | Website | mixed | long-context understanding | 21 t./4,750 s.
LongBench v2 | Website | mixed | long-context multitasks | 20 t./503 s.
RULER | GitHub | simulated | long-context retrieval | 13 t.
BABILong | GitHub | simulated | long-context reasoning | 20 t.
MM-Needle | Website | simulated | multimodal long-context retrieval | ~280,000 s.
HaluMem | GitHub | simulated | memory hallucinations | 3,467 s.
HotpotQA | Website | simulated | long-context QA | 113,000 s.
Other Related Benchmarks

ALFWorld | Website | simulated | text-based embodied environment | 3,353 t.
ScienceWorld | GitHub | simulated | interactive embodied environment | 10 t./30 t.
AgentGym | Website | mixed | multiple environments | 89 t./20,509 s.
AgentBoard | GitHub | mixed | multi-round interaction | 9 t./1,013 s.
PDDL* | Website | simulated | strategy game | -
BabyAI | Website | simulated | language learning | 19 t.
WebShop | Website | simulated | e-commerce web interaction | 12,087 s.
WebArena | Website | real | web interaction | 812 s.
MMInA | Website | real | multihop web interaction | 1,050 s.
SWE-Bench Verified | Website | real | code repair | 500 s.
GAIA | Website | real | human-level deep research | 466 s.
xBench-DS | Website | real | deep-search evaluation | 100 s.
ToolBench | GitHub | real | API tool use | 126,486 s.
GenAI-Bench | Website | real | visual generation evaluation | ~40,000 s.
Framework | Links | Structure | Evaluation
MemGPT | GitHub / Website | hierarchical (S/LTM) | LoCoMo
Mem0 | GitHub / Website | graph + vector | LoCoMo
Memobase | GitHub / Website | structured profiles | LoCoMo
MIRIX | GitHub / Website | structured memory | LoCoMo, MemoryAgentBench
MemoryOS | GitHub / Website | hierarchical (S/M/LTM) | LoCoMo, MemoryBank
MemOS | GitHub / Website | tree memory + memcube | LoCoMo, PrefEval, LongMemEval, PersonaMem
Zep | GitHub / Website | temporal knowledge graph | LongMemEval
LangMem | GitHub / Website | core API + manager | -
SuperMemory | GitHub / Website | vector + semantic | -
Cognee | GitHub / Website | knowledge graph | -
Memary | GitHub / Website | stream + entity store | -
Pinecone | GitHub / Website | vector database | -
Chroma | GitHub / Website | vector database | -
Weaviate | GitHub / Website | vector + graph | -
Second Me | GitHub / Website | agent ego | -
MemU | GitHub / Website | hierarchical layers | -
MemEngine | GitHub | modular space | -
Memori | GitHub / Website | memory database | -
ReMe | GitHub / Website | memory management | -
AgentMemory | GitHub / Website | memory management | -
MineContext | GitHub / Website | context engineering | -
Acontext | GitHub | context engineering + skill learning | -
PowerMem | GitHub | OceanBase | -
ReMe | GitHub | AgentScope | BFCL, AppWorld
HindSight | GitHub | parallel retrieval + reflection | -
Discussion

This survey has examined agent memory as a foundational component of modern LLM-based agentic systems. By framing existing research through the unified lenses of forms, functions, and dynamics, we have clarified the conceptual landscape of agent memory and situated it within the broader evolution of agentic intelligence. On the level of forms, we identify three principal realizations: token-level, parametric, and latent memory, each of which has undergone distinct and rapid advances in recent years, reflecting fundamentally different trade-offs in representation, adaptability, and integration with agent policies. On the level of functions, we move beyond the coarse long-term versus short-term dichotomy prevalent in prior surveys, and instead propose a more fine-grained and encompassing taxonomy that distinguishes factual, experiential, and working memory according to their roles in knowledge retention, capability accumulation, and task-level reasoning. Together, these perspectives reveal that memory is not merely an auxiliary storage mechanism, but an essential substrate through which agents achieve temporal coherence, continual adaptation, and long-horizon competence.

Beyond organizing prior work, we have identified key challenges and emerging directions that point toward the next stage of agent memory research. In particular, the increasing integration of reinforcement learning, the rise of multimodal and multi-agent settings, and the shift from retrieval-centric to generative memory paradigms suggest a future in which memory systems become fully learnable, adaptive, and self-organizing. Such systems hold the potential to transform large language models from powerful but static generators into agents capable of sustained interaction, self-improvement, and principled reasoning over time.

We hope this survey provides a coherent foundation for future research and serves as a reference for both researchers and practitioners. As agentic systems continue to mature, the design of memory will remain a central and open problem, one that is likely to play a decisive role in the development of robust, general, and enduring artificial intelligence.

In contrast to methods that generate new latent representations, another line of work directly reuses the model's internal activations, primarily the key-value (KV) cache, as latent memory. These approaches do not transform (i.e., modify or compress) the stored KV pairs; instead, they treat the raw activations from forward passes as reusable memory entries. The main challenge is to determine which KV pairs to keep, how to index them, and how to retrieve them efficiently under long-context or continual-processing demands.

From a cognitive perspective, Gershman et al. (2025) provide conceptual grounding by framing biological memory as a key-value system, where keys function as retrieval addresses and values encode stored content, an abstraction closely aligned with KV-based memory in modern LLMs. Memorizing Transformers (Wu et al., 2022) explicitly store past KV pairs and retrieve them via k-nearest-neighbor search during inference. FOT (Tworkowski et al., 2023) extends this line of work by introducing memory-attention layers that perform KNN-based retrieval over additional KV memories during inference. LONGMEM (Wang et al., 2023b) similarly augments long-range retrieval, employing a lightweight residual SideNet that treats historical KV embeddings as a persistent memory store. These systems demonstrate how retrieval-aware organization of latent KV states can substantially enhance access to distant information.
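The retrieval mechanism shared by these systems can be sketched in a few lines. The following is a minimal toy illustration (the class name and dimensions are ours, not from any of the cited papers): raw key/value activations are appended verbatim to an external store and later retrieved by nearest-neighbor search over the keys, with no compression applied.

```python
import numpy as np

class KVCacheMemory:
    """Toy sketch of reuse-type latent memory: raw key/value activations
    from past forward passes are stored verbatim and retrieved by
    nearest-neighbor search over the keys."""

    def __init__(self, d_head: int):
        self.keys = np.empty((0, d_head))    # stored key activations
        self.values = np.empty((0, d_head))  # stored value activations

    def write(self, k: np.ndarray, v: np.ndarray) -> None:
        # Append raw KV pairs without any transformation or compression.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def retrieve(self, query: np.ndarray, top_k: int = 4):
        # Score every stored key against the query (dot-product
        # similarity), then return the top-k KV pairs so they can be
        # attended over alongside the local context.
        scores = self.keys @ query
        idx = np.argsort(scores)[::-1][:top_k]
        return self.keys[idx], self.values[idx]

mem = KVCacheMemory(d_head=8)
rng = np.random.default_rng(0)
for _ in range(100):                      # simulate 100 past positions
    mem.write(rng.normal(size=(1, 8)), rng.normal(size=(1, 8)))
q = rng.normal(size=8)
k_top, v_top = mem.retrieve(q, top_k=4)
print(k_top.shape, v_top.shape)           # (4, 8) (4, 8)
```

Real systems replace the brute-force scoring with approximate nearest-neighbor indexes, which is exactly where the indexing-strategy challenge noted above arises.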

Discussion Reuse-type latent memory methods highlight the effectiveness of directly leveraging the model's own internal activations as memory, showing that carefully curated KV representations can serve as a powerful and efficient substrate for long-range retrieval and reasoning.

Their greatest strength lies in preserving the full fidelity of the model's internal activations, ensuring that no information is lost through pruning or compression. This makes them conceptually simple, easy to integrate into existing forms, and highly faithful to the model's original computation. However, raw KV caches grow rapidly with context length, which increases memory consumption and can make retrieval less efficient. The effectiveness of reuse therefore depends heavily on indexing strategies.

A major line of work builds memory by generating new latent representations rather than reusing or transforming existing activations. In this paradigm, the model or an auxiliary encoder creates compact continuous states. These states may appear as special tokens in the sequence or as standalone vectors. They summarize the essential information from long contexts, task trajectories, or multimodal inputs. The generated latent summaries are then stored, inserted, or used as conditions for later reasoning or decision-making. This enables the system to operate beyond its native context length, maintain task-specific intermediate states, and retain knowledge across episodes without revisiting the original input. Although the concrete forms vary across studies, the underlying idea remains consistent. Memory is explicitly produced through learned encoding or compression, and the resulting latent states serve as reusable memory units that support future inference.

This design choice may also raise potential ambiguity with parametric memory, particularly since many methods rely on separately trained models to generate latent representations. In this chapter, however, our classification is grounded in the form of memory rather than the learning mechanism. Crucially, although these approaches generate memory through learned encoding, the produced latent representations are explicitly instantiated and reused as independent memory units, rather than being directly embedded into the model's parameters or forward-pass activations. We will return to this distinction when discussing individual methods in detail.

Single Modal In the single-modal setting, a major group of methods focuses on long-context processing and language modeling, where models generate a small set of internal representations to replace long raw inputs (Mu et al., 2023; Luo et al., 2024; Xu et al., 2025d; Chevalier et al., 2023; Qian et al., 2025; Wang et al., 2024j, 2025n). A typical strategy is to compress long sequences into a few internal tokens or continuous vectors that can be reused during later inference. For example, Gist (Mu et al., 2023) trains a language model to produce a set of gist tokens after processing a long prompt. Luo et al. (2024) introduce a special sentinel token at each chunk boundary and encourage the model to aggregate local semantics into that token. SoftCoT (Xu et al., 2025d) follows a similar direction by generating instance-specific soft tokens from the last hidden state.
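The shared compression pattern can be illustrated with a deliberately crude stand-in: chunk a long sequence of hidden states and pool each chunk into one summary vector. The function below is our own toy construction, not any cited method; the cited works instead *learn* this mapping end-to-end so the latents preserve task-relevant information.

```python
import numpy as np

def compress_to_latents(hidden: np.ndarray, n_latents: int) -> np.ndarray:
    """Crude stand-in for learned gist/summary tokens: split a long
    sequence of hidden states into n_latents chunks and mean-pool each
    chunk into a single reusable latent vector."""
    chunks = np.array_split(hidden, n_latents, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

# 1,024 token states of width 16 compressed into 4 latent vectors that
# can be cached and reused instead of re-reading the full context.
hidden_states = np.random.default_rng(1).normal(size=(1024, 16))
latents = compress_to_latents(hidden_states, n_latents=4)
print(latents.shape)  # (4, 16)
```

The 256x reduction in stored states is what makes the latents cheap to insert as soft prompts or memory tokens in later inference passes.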

Table 3 Taxonomy of latent memory methods. We categorize existing works based on the origin of the latent state: Generate synthesizes memory via auxiliary modules, Reuse propagates internal computational states, and Transform compresses, modifies, or restructures existing latent states. Methods are compared across three technical dimensions: (1) Form specifies the specific data type of the latent memory, (2) Type defines the nature of the recorded content (e.g., Working, Factual, and Experiential), and (3) Task denotes the target downstream application.

CARE (Choi et al., 2025) further extends the latent tokens by training a context assessor that compresses retrieved RAG documents into compact memory tokens.

Work such as AutoCompressor (Chevalier et al., 2023) and MemoRAG (Qian et al., 2025) emphasizes vectorized or standalone latent representations. AutoCompressor (Chevalier et al., 2023) encodes entire long documents into a small number of summary vectors serving as soft prompts, while MemoRAG (Qian et al., 2025) uses an LLM to produce compact hidden-state memories capturing global semantic structure. These approaches not only abstract away from raw text but also transform retrieved or contextualized information into new latent memory units optimized for reuse. To support more persistent memory, MemoryLLM (Wang et al., 2024j) embeds a set of dedicated memory tokens within the model's latent space. M+ (Wang et al., 2025n) extends this idea into a cross-layer long-term memory architecture. LM2 (Kang et al., 2025b) follows a related but structurally distinct direction by introducing matrix-shaped latent memory slots into every layer.

A different branch of work internalizes the generation of latent memory within the model's parameter dynamics. Although these works rely on parameterized modules, their operational memory units remain latent representations, placing them firmly within this category. Titans (Behrouz et al., 2025b) compresses long-range information into an online-updated MLP weight, producing latent vectors during inference. MemGen (Zhang et al., 2025d) dynamically generates latent memory during decoding: two LoRA adapters determine where to insert memory fragments and what latent content to insert. EMU (Na et al., 2024) trains a state encoder to produce latent embeddings annotated with returns and desirability.

Multi Modal In multimodal settings, generative latent memory extends to images, audio, and video, encoding them as compact latent representations. CoMem (Wu et al., 2025d) uses a VLM to compress multimodal knowledge into a set of embeddings that act as plug-and-play memory. Similarly, Wu et al. (2025e) compresses entire GUI interaction trajectories into fixed-length embeddings and injects them into the VLM input space. For temporal modeling, Time-VLM (Zhong et al., 2025) divides video or interaction streams into patches and generates a latent embedding for each patch.

In vision-based navigation, Mezghani et al. (2022) learn a state encoder that maps visual observations into a latent space and constructs an episodic memory containing only novel observations. MemoryVLA (Shi et al., 2025a) maintains a Perceptual-Cognitive Memory Bank that stores both perceptual details and high-level semantics as transformer hidden states. In long-video object segmentation, XMem (Cheng and Schwing, 2022) encodes each frame into key-value latent embeddings and organizes them into a multi-stage memory comprising perceptual, working, and long-term components.

Discussion These single-modal and multimodal approaches share the same fundamental principle: first generate compact latent representations, then maintain and retrieve them as memory entries. The model can actively construct highly information-dense representations tailored to the task, capturing key dynamics, long-range dependencies, or cross-modal relations with minimal storage cost. It also avoids repeatedly processing the full context, enabling more efficient reasoning across extended interactions.

However, the drawbacks are equally evident. The generation process itself may introduce information loss or bias, and the states can drift or accumulate errors over multiple read-write cycles. Moreover, training a dedicated module to generate latent representations introduces additional computational overhead, data requirements, and engineering complexity.

Hybrid

Advanced agent architectures increasingly adopt a hybrid design that integrates multiple forms of experiential memory to balance grounded evidence with generalizable logic. By maintaining a spectrum of knowledge spanning raw episodes, distilled rules, and executable skills, these systems dynamically select the most appropriate memory format, ensuring both retrieval precision and broad generalization across contexts.

A prominent direction involves coupling case-based and strategy-based memories to facilitate complementary reasoning. For example, ExpeL (Zhao et al., 2024) synergizes concrete trajectories with abstract textual insights, allowing agents to recall specific solutions while applying general heuristics. Agent KB (Tang et al., 2025d) employs a hierarchical structure where high-level workflows guide planning and specific solution paths provide execution details. Similarly, R2D2 (Huang et al., 2025c) integrates a replay buffer of historical traces with a reflective mechanism that refines decision strategies from past errors, effectively bridging case retrieval and strategic abstraction. Complementing these, Dynamic Cheatsheet (Suzgun et al., 2025) prevents redundant computation by storing accumulated strategies and problem-solving insights for immediate reuse at inference time.

Furthermore, recent frameworks strive to unify the lifecycle of memory, incorporating skill-based components or establishing comprehensive cognitive architectures (Sun et al., 2025a; Cai et al., 2025a). In scientific reasoning, ChemAgent (Tang et al., 2025c) constructs a self-updating library that pairs execution cases with decomposable skill modules, enabling the model to refine its chemical reasoning through accumulated experience. Taking a holistic approach, LARP (Yan et al., 2023) establishes a cognitive architecture for open-world games that harmonizes semantic memory for world knowledge, episodic memory for interaction cases, and procedural memory for learnable skills, ensuring consistent role-playing and robust decision-making. Finally, evolutionary systems like G-Memory (Zhang et al., 2025c) and Memp (Fang et al., 2025d) implement dynamic transitions, where repeated successful cases are gradually compiled into efficient skills, automating the shift from heavy retrieval to rapid execution. A recent effort, MemVerse (Liu et al., 2025e), combines both parametric memory and token-level procedural memory.
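The case-to-skill transition described above can be sketched as follows. This is a minimal illustration under our own assumptions (the class, threshold, and task names are all hypothetical, not drawn from any cited framework): raw episodes and distilled rules serve retrieval, and a task type that succeeds repeatedly is "compiled" into a directly executable skill.

```python
from collections import Counter

class HybridMemory:
    """Sketch of hybrid experiential memory: raw episodes for grounded
    recall, distilled rules for general guidance, and compiled skills
    for tasks that have been solved repeatedly."""

    def __init__(self, compile_threshold: int = 3):
        self.episodes = []            # (task_type, trajectory) cases
        self.rules = []               # distilled textual strategies
        self.skills = {}              # task_type -> executable routine
        self.compile_threshold = compile_threshold

    def record_success(self, task_type, trajectory):
        self.episodes.append((task_type, trajectory))
        counts = Counter(t for t, _ in self.episodes)
        # Repeated successes on the same task type get compiled into a
        # skill, shifting from heavy retrieval to rapid execution.
        if counts[task_type] >= self.compile_threshold:
            self.skills[task_type] = lambda: trajectory

    def recall(self, task_type):
        if task_type in self.skills:          # fastest path: run skill
            return ("skill", self.skills[task_type]())
        cases = [tr for t, tr in self.episodes if t == task_type]
        return ("cases+rules", cases, list(self.rules))

mem = HybridMemory()
mem.rules.append("verify login state before submitting a form")
for _ in range(3):
    mem.record_success("web_login", ["open page", "fill form", "submit"])
print(mem.recall("web_login")[0])  # skill
```

Before the third success, `recall` would instead return the stored cases plus the distilled rule, illustrating the fallback spectrum from grounded evidence to generalizable logic.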



Parametric Memory

In contrast to token-level memory, which stores information as visible and editable discrete units, parametric memory stores information directly in the model's parameters. In this section, we examine methods that embed memory into learnable parameter spaces, allowing the model to internalize and recall information without referring to external storage.

Based on where the memory is stored relative to the core model parameters, we distinguish two primary forms of parametric memory:

Internal Parametric Memory

Internal parametric memory injects domain knowledge, personalized knowledge, or priors required by downstream tasks directly into the model. We also regard enhancing the model's long-context capability as a form of prior injection.

Table 2 Taxonomy of parametric memory methods. We categorize existing works based on the storage location relative to the core model: Internal Parametric Memory embeds knowledge directly into the original weights, while External Parametric Memory isolates information within auxiliary parameter sets. Based on the training phase, we further subdivide the works within each category. Methods are compared across three technical dimensions: (1) Type defines the nature of the memory, (2) Task specifies the target downstream application, and (3) Optimization denotes the optimization strategy, such as SFT, FT (fine-tuning), and PE (prompt engineering).

Memory can be injected during the pre-training, continued pre-training, mid-training, or post-training phase. Memory stored in internal parameters adds no extra parameters or additional modules to the model.

Pre-Train Some works introduce memory mechanisms during the pre-training phase, aiming to address the issue that long-tail world knowledge is difficult to compress into the limited model parameters. LMLM (Zhao et al., 2025b) and HierMemLM (Pouransari et al., 2025) store the memory for knowledge retrieval in the model during the pre-training phase, while storing the knowledge itself in an external knowledge base. Some works also optimize the computational efficiency of attention to enhance long-window memory capability (Xiao et al., 2024; Qin et al., 2024b,c; Dao, 2024; Shah et al., 2024).

Mid-Train During the continued pre-training phase, some works incorporate generalizable experience from downstream tasks. For instance, Su et al. (2025) and Zhang et al. (2025k) integrate agent experience. Some works improve the long-window performance or efficiency of LLMs during the mid-training phase, enabling the model to maintain more short-term memory with longer windows in memory-aided tasks (Zaheer et al., 2020; Chen et al., 2024a).

Post-Train Other works incorporate memory during the post-training phase to adapt to downstream tasks. Some works enable LLMs to memorize personalized user history or styles, while others allow LLMs to learn from the successes or failures of past similar task executions. Character-LM (Shao et al., 2023) and CharacterGLM (Zhou et al., 2024a) fine-tune the LLM to embody different characters. During the post-training phase, SELF-PARAM (Wang et al., 2025o) injects additional knowledge through KL-divergence distillation without requiring extra parameters. Room (Kim et al., 2023b) stores knowledge externally while saving experience internally. KnowledgeEditor (Cao et al., 2021) modifies internal parameters, aiming to alter only the knowledge that requires editing. MEND (Mitchell et al., 2022) achieves fast knowledge editing by using small networks to modify the gradients of large models. PersonalityEdit (Mao et al., 2024) proposes an LLM personality-editing dataset based on personality theories in psychology. APP (Ma et al., 2024) employs multiple training objectives to ensure that adjacent knowledge is minimally disturbed during knowledge editing. DINM (Wang et al., 2024c) proposes a model-editing method that enables the model to learn to reject dangerous requests without affecting its normal functions.

Discussion The advantages of internal parameters lie in their simple structure, which does not add extra inference overhead or deployment costs to the vanilla model. Their drawback is the difficulty in updating internal parameters: storing new memory requires retraining, which is costly and prone to forgetting old memory. Therefore, internal parameter memory is more suitable for large-scale storage of domain knowledge or task priors, rather than short segments of personalized memory or working memory.
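The update-cost and forgetting trade-off can be made concrete with a toy linear associative memory (our own illustration, not a model of any cited method): a weight matrix stores key-to-value associations via gradient descent, and writing a new batch of memories requires retraining, which measurably degrades recall of the old ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = np.zeros((d, d))  # "model parameters" acting as associative memory

def train(W, keys, vals, steps=200, lr=0.1):
    # Gradient descent on ||W k - v||^2 for each pair: memories are
    # written only by updating the parameters themselves.
    for _ in range(steps):
        for k, v in zip(keys, vals):
            W += lr * np.outer(v - W @ k, k)
    return W

def recall_error(W, keys, vals):
    return float(np.mean([np.linalg.norm(W @ k - v)
                          for k, v in zip(keys, vals)]))

old_k = rng.normal(size=(8, d)) / np.sqrt(d)
old_v = rng.normal(size=(8, d))
new_k = rng.normal(size=(8, d)) / np.sqrt(d)
new_v = rng.normal(size=(8, d))

W = train(W, old_k, old_v)
err_before = recall_error(W, old_k, old_v)
W = train(W, new_k, new_v)        # storing new memories = retraining...
err_after = recall_error(W, old_k, old_v)
print(err_before < err_after)     # ...which degrades old-memory recall
```

The interference shown here is a miniature of catastrophic forgetting, which is why internal parametric memory suits bulk domain knowledge better than frequently updated personal or working memory.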

As shown above, such a large body of work has focused on agent memory, clearly demonstrating that memory mechanisms are essential for agent systems (Zhang et al., 2025s). The choice of memory type in an agent system reflects how designers expect the agent to behave in a given task. Designers are not simply asking the agent to remember certain information, but also implicitly expressing how they want that information to shape the agent's behavior. Therefore, choosing the right type of memory for a task is far more than a simple combinatorial choice.

In this section, we start from the features of each memory type and discuss which tasks and scenarios they are best suited for in an ideal setting, as shown in Figure 5. We hope this discussion can offer useful ideas and guidance for making practical choices. The examples illustrate only one possible form of memory in these idealized settings and do not imply that other memory types lack unique advantages in the same scenarios.

Token-level Memory Token-level memory remains symbolic , addressable , and transparent , making it particularly well suited for scenarios where explicit reasoning, controllability, and accountability are essential. This type of memory excels in real-time, high-frequency update settings, where an agent must continuously track and revise information, and where the knowledge itself exhibits a clear structure that can be explicitly modeled. Its externalizability allows memory to be easily inspected, audited, transferred, or revised, making it especially suitable for domains requiring precise add/delete/update operations. The high level of interpretability further ensures that an agent's decision process can be traced back to concrete memory units, a crucial property in high-stakes applications. Moreover, token-level memory provides long-term stability and avoids catastrophic forgetting, enabling agents to accumulate reliable knowledge over extended time horizons. Another practical advantage is that token-level memory is often implemented as a plug-and-play module, allowing it to be readily integrated with the latest closed-source or open-source foundation models without modifying their internal parameters.
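These properties, being symbolic, addressable, and supporting precise add/delete/update operations, can be shown with a minimal sketch. The class and its naive keyword retrieval are purely illustrative (embedding-based retrieval would be used in practice); the point is that every entry stays inspectable and editable without touching model parameters.

```python
class TokenLevelMemory:
    """Minimal sketch of token-level memory: entries are symbolic,
    addressable text units that can be inspected, updated, or deleted,
    independent of any model parameters."""

    def __init__(self):
        self._store = {}   # id -> text; fully transparent and auditable

    def add(self, mem_id: str, text: str) -> None:
        self._store[mem_id] = text

    def update(self, mem_id: str, text: str) -> None:
        if mem_id not in self._store:
            raise KeyError(mem_id)
        self._store[mem_id] = text

    def delete(self, mem_id: str) -> None:
        self._store.pop(mem_id, None)

    def retrieve(self, query: str):
        # Naive keyword overlap stands in for embedding retrieval.
        words = set(query.lower().split())
        return [t for t in self._store.values()
                if words & set(t.lower().split())]

mem = TokenLevelMemory()
mem.add("u1", "user prefers vegetarian restaurants")
mem.update("u1", "user prefers vegan restaurants")   # precise revision
mem.delete("nonexistent")                            # safe no-op
print(mem.retrieve("recommend a vegan restaurant"))
```

Because the store is just addressable text, it plugs into any closed- or open-source backbone as an external module, which is the plug-and-play property noted above.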

The past two years have witnessed the overwhelming evolution of increasingly capable large language models (LLMs) into powerful AI agents (Matarazzo and Torlone, 2025; Minaee et al., 2025; Luo et al., 2025a). These foundation-model-powered agents have demonstrated remarkable progress across diverse domains such as deep research (Xu and Peng, 2025; Zhang et al., 2025p), software engineering (Wang et al., 2024i), and scientific discovery (Wei et al., 2025c), continuously advancing the trajectory toward artificial general intelligence (AGI) (Fang et al., 2025a; Durante et al., 2024). Although early conceptions of 'agents' were highly heterogeneous, a growing consensus has since emerged within the community: beyond a pure LLM backbone, an agent is typically equipped with capabilities such as reasoning, planning, perception, memory, and tool-use. Some of these abilities, such as reasoning and tool-use, have been largely internalized within model parameters through reinforcement learning (Wang et al., 2025m; Qu et al., 2025b), while some still depend heavily on external agentic scaffolds. Together, these components transform LLMs from static conditional generators into learnable policies that can interact with diverse external environments and adaptively evolve over time (Zhang et al., 2025f; Liu et al., 2025a).

Among these agentic faculties, memory stands out as a cornerstone, explicitly enabling the transformation of static LLMs, whose parameters cannot be rapidly updated, into adaptive agents capable of continual adaptation through environmental interaction (Zhang et al., 2025s; Wu et al., 2025g). From an application perspective, numerous domains demand agents with proactive memory management rather than ephemeral, forgetful behaviors: personalized chatbots (Chhikara et al., 2025; Li et al., 2025b), recommender systems (Liu et al., 2025c), social simulations (Park et al., 2023; Yang et al., 2025), and financial investigations (Zhang et al., 2024) all rely on the agent's ability to process, store, and manage historical information. From a developmental standpoint, one of the defining aspirations of AGI research is to endow agents with the capacity for continual evolution through environment interactions (Hendrycks et al., 2025), a capability fundamentally grounded in agent memory.

Agent Memory Needs A New Taxonomy Given the growing significance and community attention surrounding agent memory systems, it has become both timely and necessary to provide an updated perspective on contemporary agent memory research. The motivation for a new taxonomy and survey is twofold: ❶ Limitations of Existing Taxonomies: While several recent surveys have provided valuable and comprehensive overviews of agent memory (Zhang et al., 2025s; Wu et al., 2025g), their taxonomies were developed prior to a number of rapid methodological advances and therefore do not fully reflect the current breadth and complexity of the research landscape. For example, emerging directions in 2025, such as memory frameworks that distill reusable tools from past experiences (Qiu et al., 2025a,c; Zhao et al., 2025c), or memory-augmented test-time scaling methods (Zhang et al., 2025g; Suzgun et al., 2025), remain underrepresented in earlier classification schemes. ❷ Conceptual Fragmentation: With the explosive growth of memory-related studies, the concept itself has become increasingly expansive and fragmented. Researchers often find that papers claiming to study 'agent memory' differ drastically in implementation, objectives, and underlying assumptions. The proliferation of diverse terminologies (declarative, episodic, semantic, parametric memory, etc.) further obscures conceptual clarity, highlighting the urgent need for a coherent taxonomy that can unify these emerging concepts.

Therefore, this paper seeks to establish a systematic framework that reconciles existing definitions, bridges emerging trends, and elucidates the foundational principles of memory in agentic systems. Specifically, this survey aims to address the following key questions:

Initial retrieval often returns fragments that are redundant, noisy, or semantically inconsistent. Directly injecting these results into the prompt can lead to excessively long contexts, conflicting information, and reasoning distracted by irrelevant content. Post-retrieval processing, therefore, becomes essential for ensuring prompt quality. Its goal is to distill the retrieved results into a concise, accurate, and semantically coherent context. In practice, two components are central: (1) Re-ranking and Filtering: performing fine-grained relevance estimation to remove irrelevant or outdated memories and reorder the remaining fragments, thereby reducing noise and redundancy. (2) Aggregation and Compression: integrating the retrieved memories with the original query, eliminating duplication, merging semantically similar information, and reconstructing a compact and coherent final context.

Re-ranking and Filtering To maintain a concise and coherent context, initial retrieval results are re-ranked and filtered to remove low-relevance items. Early approaches rely on heuristic criteria for evaluating semantic consistency. For example, Semantic Anchoring (Chatterjee and Agarwal, 2025) integrates vector similarity with entity- and discourse-level alignment, whereas RCR-Router (Liu et al., 2025d) combines multiple handcrafted signals, including role relevance, task-stage priority, and recency. These methods, however, often require extensive hyperparameter tuning to balance heterogeneous importance scores. To alleviate this burden, learn-to-memorize (Zhang et al., 2025u) formulates score aggregation as a reinforcement-learning problem, enabling the model to learn optimal weights over retrieval signals. While these techniques primarily optimize semantic coherence, scenarios demanding strict temporal reasoning require additional constraints: Rasmussen et al. (2025) and Tan et al. (2025b) filter memories based on their timestamps and validity windows to satisfy complex temporal dependencies.
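To make the heuristic multi-signal approach concrete, a RCR-Router-flavored re-ranker can be sketched as a weighted score over handcrafted signals. The signal names, weights, decay constant, and data layout below are illustrative assumptions of this sketch, not any paper's actual implementation.

```python
import math
import time

def rerank_memories(memories, query_embedding, weights=(0.6, 0.2, 0.2), top_k=3):
    """Re-rank memory items by a weighted sum of handcrafted signals.

    Each memory is a dict with hypothetical fields:
      'embedding'  - list[float], same dimension as query_embedding
      'timestamp'  - unix time of creation (for the recency signal)
      'role_score' - 0..1 relevance of the item to the agent's role
    """
    w_sim, w_rec, w_role = weights
    now = time.time()

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def score(m):
        sim = cosine(m["embedding"], query_embedding)
        rec = math.exp(-(now - m["timestamp"]) / 3600.0)  # exponential decay over hours
        return w_sim * sim + w_rec * rec + w_role * m["role_score"]

    return sorted(memories, key=score, reverse=True)[:top_k]
```

As the text notes, balancing such heterogeneous signals requires tuning the weight vector by hand, which is exactly the burden that learned score aggregation aims to remove.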

With the increasing capability of LLMs, recent methods leverage their intrinsic language understanding to assess memory quality directly. Memory-R1 (Yan et al., 2025c) and Westhäußer et al. (2025) both introduce LLM-based evaluators (Answer Agents or Self-Validator Agents) that filter retrieved content before producing the final response. However, prompt-based filtering remains limited by the LLM's inherent capacity and by mismatches between prompt semantics and downstream usage. Consequently, many systems train auxiliary models to estimate memory importance more robustly (Tan et al., 2025c). Memento (Zhou et al., 2025a) uses Q-learning (Watkins and Dayan, 1992) to predict the probability that a retrieved item contributes to a correct answer, and MemGuide (Du et al., 2025b) fine-tunes LLaMA-8B (Grattafiori et al., 2024) to re-rank candidates using marginal slot-completion gain. Together, these re-ranking and filtering strategies refine retrieval results without modifying the underlying retriever, enabling compatibility with any pre-trained retrieval model while supporting task-specific optimization.

Aggregation and Compression Another approach to improving both the quality and efficiency of downstream reasoning through post-retrieval processing is aggregation and compression. This process integrates the retrieved evidence with the query to form a coherent and compact context. Unlike filtering and re-ranking, which mainly address noise and prioritization, this stage focuses on merging multiple fragmented memory items into higher-level, distilled knowledge representations, and on refining these representations when task-specific adaptation is required. ComoRAG (Wang et al., 2025f) illustrates this idea through its Integration Agent, which identifies historical signals that are semantically aligned with the query and combines them into an abstract global summary that provides broad contextual grounding. The Extractor Agent in MA-RAG (Nguyen et al., 2025) performs fine-grained content selection over the retrieved documents, retaining only the key information that is strongly relevant to the current subquery and producing concise snippets tailored to local reasoning needs.
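The merge step of aggregation, eliminating duplication and combining semantically similar fragments, can be sketched as greedy similarity-based grouping. The threshold, embedding layout, and the choice to keep the first text of each group are assumptions of this sketch, not any specific system's design.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def aggregate_memories(items, threshold=0.9):
    """Greedy de-duplication: each item joins the first existing group whose
    representative it is sufficiently similar to; otherwise it starts a new
    group. `items` is a list of (text, embedding) pairs; one text per group
    survives, so near-duplicates collapse into a single memory entry."""
    groups = []  # each group: {"rep": representative embedding, "texts": [...]}
    for text, emb in items:
        for g in groups:
            if cosine(g["rep"], emb) >= threshold:
                g["texts"].append(text)
                break
        else:
            groups.append({"rep": emb, "texts": [text]})
    return [g["texts"][0] for g in groups]
```

A production system would merge group members with an LLM summarizer rather than keeping one representative, but the grouping logic is the same.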

Furthermore, G-Memory (Zhang et al., 2025c) extends aggregation and compression to personalization in multi-agent systems. It consolidates retrieved high-level insights and sparsified trajectories, and then uses an LLM to customize these condensed experiences according to the agent's role. This process refines general knowledge into role-specific prompts that populate the agent's personalized memory.

Summary In conclusion, post-retrieval processing acts as a crucial intermediate step that transforms noisy, fragmented retrieval results into a precise and coherent context for reasoning. Through the above mechanisms, post-retrieval processing not only enhances the density and fidelity of the memories supplied to the model but also aligns the information with task requirements and agent characteristics.

Discussion

This survey has examined agent memory as a foundational component of modern LLM-based agentic systems. By framing existing research through the unified lenses of forms, functions, and dynamics, we have clarified the conceptual landscape of agent memory and situated it within the broader evolution of agentic intelligence. On the level of forms, we identify three principal realizations: token-level, parametric, and latent memory, each of which has undergone distinct and rapid advances in recent years, reflecting fundamentally different trade-offs in representation, adaptability, and integration with agent policies. On the level of functions, we move beyond the coarse long-term versus short-term dichotomy prevalent in prior surveys, and instead propose a finer-grained and more encompassing taxonomy that distinguishes factual, experiential, and working memory according to their roles in knowledge retention, capability accumulation, and task-level reasoning. Together, these perspectives reveal that memory is not merely an auxiliary storage mechanism, but an essential substrate through which agents achieve temporal coherence, continual adaptation, and long-horizon competence.

Beyond organizing prior work, we have identified key challenges and emerging directions that point toward the next stage of agent memory research. In particular, the increasing integration of reinforcement learning, the rise of multimodal and multi-agent settings, and the shift from retrieval-centric to generative memory paradigms suggest a future in which memory systems become fully learnable, adaptive, and self-organizing. Such systems hold the potential to transform large language models from powerful but static generators into agents capable of sustained interaction, self-improvement, and principled reasoning over time.

We hope this survey provides a coherent foundation for future research and serves as a reference for both researchers and practitioners. As agentic systems continue to mature, the design of memory will remain a central and open problem, one that is likely to play a decisive role in the development of robust, general, and enduring artificial intelligence.

External Parametric Memory

Storing memory as tokens outside the LLM can leave the model with an insufficient understanding of token-form memory content placed in its input window. Storing memory in the parameters of the LLM, in turn, has its own issues, such as difficulty of updating and conflicts with pre-trained knowledge. Some works adopt a compromise approach, introducing memory through external parameters without altering the original parameters of the LLM.

Adapter A common line of external parametric memory methods relies on modules that are attached to a frozen base model. MLP-Memory (Wei et al., 2025d) integrates RAG knowledge with Transformer decoders through an MLP module. K-Adapter (Wang et al., 2021) injects new knowledge by training task-specific adapter modules while keeping the original backbone unchanged, enabling continual knowledge expansion without interfering with pre-trained representations. WISE (Wang et al., 2024e) further introduces a dual-parameter memory setup, separating pre-trained knowledge from edited knowledge, together with a routing mechanism that dynamically selects which parameter memory to use at inference time, thus mitigating conflicts during lifelong editing. ELDER (Li et al., 2025d) advances this direction by maintaining multiple LoRA modules and learning a routing function that adaptively selects or blends them based on input semantics, improving robustness and scalability in long-term editing scenarios. Collectively, these methods leverage additional parameter subspaces to store and retrieve memory in a modular and reversible manner, avoiding the risks of catastrophic interference associated with directly modifying the core model weights.
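The routing idea behind WISE and ELDER, selecting or blending external parameter modules per input, can be sketched with plain vectors. The identity base model, the offset-vector adapters, and the dot-product softmax router below are all simplifying assumptions of this sketch, heavily reduced from the actual LoRA-based designs.

```python
import math

class AdapterBank:
    """Toy external parametric memory: a frozen base map plus routed adapters.

    Each adapter stores its "memory" as a small offset vector; a softmax
    router over per-adapter keys decides how to blend adapters for each input,
    so memory modules can be added or removed without touching the base."""

    def __init__(self, dim):
        self.dim = dim
        self.keys = []      # routing key per adapter
        self.offsets = []   # the offset each adapter contributes

    def add_adapter(self, key, offset):
        self.keys.append(key)
        self.offsets.append(offset)

    def forward(self, x):
        base = x  # the frozen base model stands in as the identity map
        if not self.keys:
            return base
        # softmax routing over key-input dot products
        logits = [sum(ki * xi for ki, xi in zip(k, x)) for k in self.keys]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        ws = [e / z for e in exps]
        blended = [sum(w * off[d] for w, off in zip(ws, self.offsets))
                   for d in range(self.dim)]
        return [b + o for b, o in zip(base, blended)]
```

Because each adapter is an independent module, rollback amounts to deleting its key/offset pair, which mirrors the "modular and reversible" property discussed above.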

Figure 4 Overview of Latent Memory integration in LLM agents. Unlike explicit text storage, latent memory operates within the model's internal representational space. The framework is categorized by the origin of the latent state: (a) Generate, where auxiliary models synthesize embeddings to interfere with or augment the LLM's forward pass; (b) Reuse, which directly propagates prior computational states such as KV caches or intermediate embeddings; and (c) Transform, which compresses internal states through token selection, merging, or projection to maintain efficient context.

Auxiliary LM Beyond Adapter-based storage, another line of work adopts a more architecturally decoupled form of external parametric memory, where memory is stored in a separate model or external knowledge module. MAC (Tack et al., 2024) compresses the information from a new document into a compact modulation through an amortization network, and stores it in a memory bank. Retroformer (Yao et al., 2024a) proposes a learning paradigm for memorizing the experiences of successes or failures in past task executions.

Discussion This external parametric memory approach provides a balance between adaptability and model stability. Because memory is encoded into additional parameter modules, it can be added, removed, or replaced without interfering with the base model's pre-trained representation space. This supports modular updates, task-specific personalization, and controlled rollback, while avoiding the catastrophic forgetting or global weight distortion that may occur in full model fine-tuning.

However, this approach also comes with limitations. External parameter modules must still integrate with the model's internal representation flow, meaning that their influence is indirect and mediated through the model's attention and computation pathways. As a result, the effectiveness of memory injection depends on how well the external parameters can interface with internal parametric knowledge.

As shown above, such a large body of work has focused on agent memory, clearly demonstrating that memory mechanisms are essential for agent systems (Zhang et al., 2025s). The choice of memory type in an agent system reflects how designers expect the agent to behave in a given task. Designers are not simply asking the agent to remember certain information, but also implicitly expressing how they want that information to shape the agent's behavior. Therefore, choosing the right type of memory for a task is far more than a simple combinatorial choice.

In this section, we start from the features of each memory type and discuss which tasks and scenarios they are best suited for in an ideal setting, as shown in Figure 5. We hope this discussion can offer useful ideas and guidance for making practical choices. The examples illustrate only one possible form of memory in these idealized settings and do not imply that other memory types lack unique advantages in the same scenarios.

Token-level Memory Token-level memory remains symbolic, addressable, and transparent, making it particularly well suited for scenarios where explicit reasoning, controllability, and accountability are essential. This type of memory excels in real-time, high-frequency update settings, where an agent must continuously track and revise information, and where the knowledge itself exhibits a clear structure that can be explicitly modeled. Its externalizability allows memory to be easily inspected, audited, transferred, or revised, making it especially suitable for domains requiring precise add/delete/update operations. The high level of interpretability further ensures that an agent's decision process can be traced back to concrete memory units, a crucial property in high-stakes applications. Moreover, token-level memory provides long-term stability and avoids catastrophic forgetting, enabling agents to accumulate reliable knowledge over extended time horizons. Another practical advantage is that token-level memory is often implemented as a plug-and-play module, allowing it to be readily integrated with the latest closed-source or open-source foundation models without modifying their internal parameters.
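The precise add/delete/update operations and auditability described above can be sketched as a minimal store; the interface and field names are illustrative, not drawn from any particular framework.

```python
class TokenMemory:
    """Minimal token-level memory store with explicit CRUD operations and an
    audit log, illustrating the inspectability and traceability of symbolic,
    addressable memory."""

    def __init__(self):
        self.items = {}   # key -> text; every entry is human-readable
        self.audit = []   # (operation, key) history for traceability

    def add(self, key, text):
        self.items[key] = text
        self.audit.append(("add", key))

    def update(self, key, text):
        if key not in self.items:
            raise KeyError(key)
        self.items[key] = text
        self.audit.append(("update", key))

    def delete(self, key):
        self.items.pop(key, None)
        self.audit.append(("delete", key))

    def inspect(self):
        return dict(self.items)  # the entire memory can be dumped and audited
```

Because the store lives outside the model, it can be attached to any closed- or open-source foundation model without parameter changes, which is the plug-and-play property noted above.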


Latent Memory

Generate

A major line of work builds memory by generating new latent representations rather than reusing or transforming existing activations. In this paradigm, the model or an auxiliary encoder creates compact continuous states. These states may appear as special tokens in the sequence or as standalone vectors. They summarize the essential information from long contexts, task trajectories, or multimodal inputs. The generated latent summaries are then stored, inserted, or used as conditions for later reasoning or decision-making. This enables the system to operate beyond its native context length, maintain task-specific intermediate states, and retain knowledge across episodes without revisiting the original input. Although the concrete forms vary across studies, the underlying idea remains consistent. Memory is explicitly produced through learned encoding or compression, and the resulting latent states serve as reusable memory units that support future inference.

This design choice may also raise potential ambiguity with parametric memory, particularly since many methods rely on separately trained models to generate latent representations. In this section, however, our classification is grounded in the form of memory rather than the learning mechanism. Crucially, although these approaches generate memory through learned encoding, the produced latent representations are explicitly instantiated and reused as independent memory units, rather than being directly embedded into the model's parameters or forward-pass activations. We will return to this distinction when discussing individual methods in detail.

Single Modal In the single-modal setting, a major group of methods focuses on long-context processing and language modeling, where models generate a small set of internal representations to replace long raw inputs (Mu et al., 2023; Luo et al., 2024; Xu et al., 2025d; Chevalier et al., 2023; Qian et al., 2025; Wang et al., 2024j, 2025n). A typical strategy is to compress long sequences into a few internal tokens or continuous vectors that can be reused during later inference. For example, Gist (Mu et al., 2023) trains a language model to produce a set of gist tokens after processing a long prompt. Luo et al. (2024) introduce a special sentinel token at each chunk boundary and encourage the model to aggregate local semantics into that token. SoftCoT (Xu et al., 2025d) follows a similar direction by generating instance-specific soft tokens from the last hidden state.

Table 3 Taxonomy of latent memory methods. We categorize existing works based on the origin of the latent state: Generate synthesizes memory via auxiliary modules, Reuse propagates internal computational states, and Transform compresses, modifies, or restructures existing latent states. Methods are compared across three technical dimensions: (1) Form specifies the data type of the latent memory, (2) Type defines the nature of the recorded content (e.g., Working, Factual, and Experiential), and (3) Task denotes the target downstream application.

CARE (Choi et al., 2025) further extends the latent tokens by training a context assessor that compresses retrieved RAG documents into compact memory tokens.

Work such as AutoCompressor (Chevalier et al., 2023) and MemoRAG (Qian et al., 2025) emphasizes vectorized or standalone latent representations. AutoCompressor (Chevalier et al., 2023) encodes entire long documents into a small number of summary vectors serving as soft prompts, while MemoRAG (Qian et al., 2025) uses an LLM to produce compact hidden-state memories capturing global semantic structure. These approaches not only abstract away from raw text but also transform retrieved or contextualized information into new latent memory units optimized for reuse. To support more persistent memory, MemoryLLM (Wang et al., 2024j) embeds a set of dedicated memory tokens within the model's latent space. M+ (Wang et al., 2025n) extends this idea into a cross-layer long-term memory architecture. LM2 (Kang et al., 2025b) follows a related but structurally distinct direction by introducing matrix-shaped latent memory slots into every layer.
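The compress-then-reuse pattern behind gist tokens and summary vectors can be caricatured in a few lines: split the token sequence into chunks and pool each chunk into one vector. Real systems learn this mapping end to end; the mean pooling here is a stand-in assumption used purely for illustration.

```python
def compress_to_summary_vectors(token_embeddings, num_summaries):
    """Split a long sequence of token embeddings into roughly `num_summaries`
    chunks and mean-pool each chunk into one summary vector, a toy stand-in
    for learned compression into gist tokens or summary vectors."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    chunk = max(1, (n + num_summaries - 1) // num_summaries)  # ceil division
    summaries = []
    for start in range(0, n, chunk):
        block = token_embeddings[start:start + chunk]
        summaries.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return summaries
```

The point of the sketch is the interface, not the pooling: a long sequence enters, a handful of dense vectors leave, and only those vectors need to be stored and re-injected at inference time.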

A different branch of work internalizes the generation of latent memory within the model's parameter dynamics. Although these works rely on parameterized modules, their operational memory units remain latent representations, placing them firmly within this category. Titans (Behrouz et al., 2025b) compresses long-range information into online-updated MLP weights, producing latent vectors during inference. MemGen (Zhang et al., 2025d) dynamically generates latent memory during decoding: two LoRA adapters determine where to insert memory fragments and what latent content to insert. EMU (Na et al., 2024) trains a state encoder to produce latent embeddings annotated with returns and desirability.

Multi Modal In multimodal settings, generative latent memory extends to images, audio, and video, encoding them as compact latent representations. CoMem (Wu et al., 2025d) uses a VLM to compress multimodal knowledge into a set of embeddings that act as plug-and-play memory. Similarly, Wu et al. (2025e) compresses entire GUI interaction trajectories into fixed-length embeddings and injects them into the VLM input space. For temporal modeling, Time-VLM (Zhong et al., 2025) divides video or interaction streams into patches and generates a latent embedding for each patch.

In vision-based navigation, Mezghani et al. (2022) learns a state encoder that maps visual observations into a latent space and constructs an episodic memory containing only novel observations. MemoryVLA (Shi et al., 2025a) maintains a Perceptual-Cognitive Memory Bank that stores both perceptual details and high-level semantics as transformer hidden states. In long-video object segmentation, XMem (Cheng and Schwing, 2022) encodes each frame into key-value latent embeddings and organizes them into a multi-stage memory comprising perceptual, working, and long-term components.

Discussion These single-modal and multimodal approaches share the same fundamental principle: first generate compact latent representations, then maintain and retrieve them as memory entries. The model can actively construct highly information-dense representations tailored to the task, capturing key dynamics, long-range dependencies, or cross-modal relations with minimal storage cost. It also avoids repeatedly processing the full context, enabling more efficient reasoning across extended interactions.

However, the drawbacks are equally evident. The generation process itself may introduce information loss or bias, and the states can drift or accumulate errors over multiple read-write cycles. Moreover, training a dedicated module to generate latent representations introduces additional computational overhead, data requirements, and engineering complexity.

Single-turn

Single-turn working memory addresses the challenge of processing massive immediate inputs, including long documents (Chevalier et al., 2023) and high-dimensional multimodal streams (Wang et al., 2024g), within a single forward pass. Rather than passively consuming the entire context, the objective is to actively construct a writable workspace. This involves filtering and transforming raw information to increase density and operability under fixed attention and memory budgets (Jiang et al., 2023, 2024). We categorize these mechanisms into input condensation, which reduces physical token count, and observation abstraction, which transforms data into structured semantic representations.

Input Condensation Input condensation techniques aim to preprocess the context to minimize token usage while preserving essential information (Jiang et al., 2023). These methods generally fall into three paradigms: hard, soft, and hybrid condensation (Liao et al., 2025a).

Hard condensation discretely selects tokens based on importance metrics. Approaches like LLMLingua (Jiang et al., 2023) and LongLLMLingua (Jiang et al., 2024) estimate token perplexity to discard predictable or task-irrelevant content, while CompAct (Yoon et al., 2024) adopts an iterative strategy to retain segments that maximize information gain. Although efficient, hard selection risks severing syntactic or semantic dependencies. Soft condensation encodes variable-length contexts into dense latent vectors (memory slots). Methods such as Gist (Mu et al., 2023), In-Context Autoencoder (ICAE) (Ge et al., 2024), and AutoCompressors (Chevalier et al., 2023) train models to compress prompts into valid summary tokens or distinct memory embeddings. This achieves high compression ratios but requires additional training and may obscure fine-grained details. Hybrid approaches like HyCo2 (Liao et al., 2025a) attempt to reconcile these trade-offs by combining global semantic adapters (soft) with token-level retention probabilities (hard).
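Hard condensation of the LLMLingua flavor can be sketched as scoring tokens and dropping the most predictable ones. A within-input frequency count stands in for model perplexity here, which is purely an assumption of this sketch; real systems use a small language model to estimate each token's information content.

```python
from collections import Counter

def hard_condense(tokens, keep_ratio=0.5):
    """Drop the most 'predictable' tokens, keeping the surprising ones in
    their original order. Token frequency within the input serves as a toy
    surprise surrogate; LLMLingua uses model perplexity instead."""
    counts = Counter(tokens)
    budget = max(1, int(len(tokens) * keep_ratio))
    # rarer tokens are assumed more informative; stable sort keeps ties ordered
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    kept = sorted(ranked[:budget])  # restore original token order
    return [tokens[i] for i in kept]
```

As the text notes, this kind of discrete selection is efficient but can sever syntactic dependencies, which is exactly the failure mode soft condensation avoids by compressing into dense vectors instead of deleting tokens.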

Observation Abstraction While condensation focuses on reduction, observation abstraction aims to transform raw observations into structured formats that facilitate reasoning. This mechanism maps dynamic, high-dimensional observation spaces into fixed-size memory states, preventing agents from being overwhelmed by raw data.
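One minimal form of such abstraction is novelty filtering: admit an observation into memory only if it differs enough from everything already stored, which bounds memory growth against a redundant observation stream. The cosine measure and threshold below are assumptions of this sketch, not a specific system's design.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def filter_novel(observations, threshold=0.95):
    """Keep an observation embedding only when it is sufficiently different
    from every observation already kept, mapping an unbounded observation
    stream into a compact, near-duplicate-free memory."""
    kept = []
    for obs in observations:
        if all(cosine(obs, k) < threshold for k in kept):
            kept.append(obs)
    return kept
```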

In complex interactive environments, abstraction converts verbose inputs into concise state descriptions. Synapse (Zheng et al., 2024a) rewrites unstructured HTML DOM trees into task-relevant state summaries to guide GUI automation. Similarly, in multimodal settings, processing every frame of a video stream is computationally prohibitive. Working memory mechanisms address this by extracting semantic structures: Context as Memory (Yu et al., 2025b) filters frames based on field-of-view overlap, VideoAgent (Wang et al., 2024g) converts streams into temporal event descriptions, and MA-LMM (He et al., 2024) maintains a bank of visual features. These methods effectively rewrite high-dimensional, redundant streams into low-dimensional, semantically rich representations operable within a limited context window for efficient processing.

Table 6 Taxonomy of working memory methods. We categorize approaches into Single-turn and Multi-turn settings based on interaction dynamics. Methods are compared across three technical dimensions: (1) Carrier (Section 3) identifies the storage medium, (2) Task specifies the evaluation domain or application scenario, and (3) Optimization denotes the integration strategy, where PE encompasses prompt engineering and inference-time techniques without parameter updates, distinct from gradient-based methods like SFT and RL.

Summary Single-turn working memory functions as an active compression layer that maximizes the utility of the context window for immediate reasoning. By employing input condensation and observation abstraction, these mechanisms effectively increase the information density of the operational workspace, ensuring that critical evidence is retained despite capacity constraints. However, this optimization is strictly intra-turn; it addresses the breadth and complexity of static inputs rather than the temporal continuity of dynamic interactions.


Reuse

In contrast to methods that generate new latent representations, another line of work directly reuses the model's internal activations, primarily the key-value (KV) cache, as latent memory. These approaches do not transform (modify or compress) the stored KV pairs and instead treat the raw activations from forward passes as reusable memory entries. The main challenge is to determine which KV pairs to keep, how to index them, and how to retrieve them efficiently under long-context or continual-processing demands.

From a cognitive perspective, Gershman et al. (2025) provides conceptual grounding by framing biological memory as a key-value system, where keys function as retrieval addresses and values encode stored content, an abstraction closely aligned with KV-based memory in modern LLMs. Memorizing Transformers (Wu et al., 2022) explicitly store past KV pairs and retrieve them via K-nearest-neighbor search during inference. FOT (Tworkowski et al., 2023) extends this line of work by introducing memory-attention layers that perform KNN-based retrieval over additional KV memories during inference. LONGMEM (Wang et al., 2023b) similarly augments long-range retrieval, employing a lightweight residual SideNet that treats historical KV embeddings as a persistent memory store. These systems demonstrate how retrieval-aware organization of latent KV states can substantially enhance access to distant information.
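The retrieval step in such reuse-style systems reduces to nearest-neighbor search over stored keys, returning the associated values. The brute-force search and flat (key, value) list layout below are illustrative simplifications; real systems search per attention head and rely on approximate indexes for scale.

```python
def knn_retrieve(query, kv_cache, k=2):
    """Brute-force K-nearest-neighbor lookup over a list of (key, value)
    pairs by dot-product score, mimicking retrieval over a stored KV cache.
    Keys act as retrieval addresses; values carry the stored content."""
    scored = []
    for key, value in kv_cache:
        score = sum(q * ki for q, ki in zip(query, key))
        scored.append((score, value))
    scored.sort(key=lambda sv: sv[0], reverse=True)
    return [value for _, value in scored[:k]]
```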

Discussion Reuse-type latent memory methods highlight the effectiveness of directly leveraging the model's own internal activations as memory, showing that carefully curated KV representations can serve as a powerful and efficient substrate for long-range retrieval and reasoning.

Their greatest strength lies in preserving the full fidelity of the model's internal activations, ensuring that no information is lost through pruning or compression. This makes them conceptually simple, easy to integrate into existing architectures, and highly faithful to the model's original computation. However, raw KV caches grow rapidly with context length, which increases memory consumption and can make retrieval less efficient. The effectiveness of reuse therefore depends heavily on indexing strategies.


Transform

Transform-type latent memory methods focus on modifying, compressing, or restructuring existing latent states rather than generating entirely new ones or directly reusing raw KV caches. These approaches treat KV caches and hidden activations as malleable memory units, reshaping them through selection, aggregation, or structural transformation. In doing so, they occupy a conceptual middle ground between generate-type and reuse-type memory: the model does not create fresh latent representations, but it also does more than simply replay stored KV pairs.

Figure 5 Overview of three complementary memory paradigms for LLM agents. Token-level, parametric, and latent memories differ in their representational form, update dynamics, interpretability, and efficiency, leading to distinct strengths, limitations, and application domains in long-horizon and interactive agent systems.

A major line of work focuses on compressing KV caches while preserving essential semantics. Some methods reduce memory usage by keeping only the most influential tokens. Scissorhands (Liu et al., 2023b) prunes tokens based on attention scores when cache capacity is exceeded, whereas SnapKV (Li et al., 2024b) aggregates high-importance prefix KV representations via a head-wise voting mechanism. PyramidKV (Cai et al., 2024) reallocates KV budgets across layers. SirLLM (Yao et al., 2024b) builds on this perspective by estimating token importance with a token-entropy criterion and selectively retaining only informative KV entries. Memory3 (Yang et al., 2024a) stores only the most critical attention key-value pairs, significantly shrinking storage requirements. RazorAttention (Tang et al., 2025a) introduces a more explicit compression scheme: it computes the effective attention span of each head, retains only a limited local window, and uses compensation tokens to preserve information from discarded entries. From a more efficiency-oriented perspective, H2O (Zhang et al., 2023) adopts a simpler eviction strategy, retaining only the most recent tokens together with heavy-hitter (H2) tokens, identified by their accumulated attention scores, to reduce the memory footprint.
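The common core of these eviction schemes can be sketched in a few lines: keep a recent window plus the positions with the highest accumulated attention, up to a fixed budget. The function below is our simplified illustration in the spirit of H2O, not the published algorithm; the signature, tie-breaking, and budget handling are all assumptions.

```python
import numpy as np

def evict_kv(attn_scores, recent_window, budget):
    """Heavy-hitter-style eviction sketch (illustrative, not H2O's code).
    attn_scores: (num_queries, seq_len) attention weights.
    Keeps the last `recent_window` positions plus the highest
    cumulative-attention positions, up to `budget` positions total."""
    seq_len = attn_scores.shape[1]
    recent = set(range(max(0, seq_len - recent_window), seq_len))
    cum = attn_scores.sum(axis=0)  # accumulated attention per position
    # rank non-recent positions by accumulated attention ("heavy hitters")
    candidates = [i for i in np.argsort(-cum) if i not in recent]
    heavy = candidates[: max(0, budget - len(recent))]
    return sorted(recent | set(heavy))  # indices of KV entries to retain
```

On a 6-token cache with a recent window of 2 and a budget of 4, the function keeps the two newest positions plus the two positions the queries attended to most, discarding the rest.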

Discussion These methods demonstrate how latent memory can be transformed, through selection, retrieval enhancement, or compressed re-encoding, into more effective memory representations, enabling LLMs to extend their usable context length and improve reasoning performance without relying on raw cache reuse.

Their main advantage lies in producing more compact and information-dense memory representations, which reduce storage cost and enable efficient retrieval over long contexts. By reshaping latent states, these methods allow the model to access distilled semantic signals that may be more useful than raw activations. However, transformation introduces the risk of information loss, and the compressed states can become harder to interpret or verify compared with directly reused KV caches. The additional computation required for pruning, aggregation, or re-encoding also increases system complexity.


Adaptation

As the preceding sections show, a large body of work has focused on agent memory, clearly demonstrating that memory mechanisms are essential to agent systems (Zhang et al., 2025s). The choice of memory type in an agent system reflects how designers expect the agent to behave in a given task. Designers are not simply asking the agent to remember certain information, but also implicitly expressing how they want that information to shape the agent's behavior. Choosing the right type of memory for a task is therefore far more than a simple combinatorial choice.

In this section, we start from the features of each memory type and discuss which tasks and scenarios they are best suited for in an ideal setting, as shown in Figure 5. We hope this discussion can offer useful ideas and guidance for making practical choices. The examples illustrate only one possible form of memory in these idealized settings and do not imply that other memory types lack unique advantages in the same scenarios.

Token-level Memory Token-level memory remains symbolic, addressable, and transparent, making it particularly well suited for scenarios where explicit reasoning, controllability, and accountability are essential. This type of memory excels in real-time, high-frequency update settings, where an agent must continuously track and revise information, and where the knowledge itself exhibits a clear structure that can be explicitly modeled. Its externalizability allows memory to be easily inspected, audited, transferred, or revised, making it especially suitable for domains requiring precise add/delete/update operations. The high level of interpretability further ensures that an agent's decision process can be traced back to concrete memory units, a crucial property in high-stakes applications. Moreover, token-level memory provides long-term stability and avoids catastrophic forgetting, enabling agents to accumulate reliable knowledge over extended time horizons. Another practical advantage is that token-level memory is often implemented as a plug-and-play module, allowing it to be readily integrated with the latest closed-source or open-source foundation models without modifying their internal parameters.

Parametric Memory

In contrast to token-level memory, which stores information as visible and editable discrete units, parametric memory stores information directly in the model's parameters. In this section, we examine methods that embed memory into learnable parameter spaces, allowing the model to internalize and recall information without referring to external storage.

Based on where the memory is stored relative to the core model parameters, we distinguish two primary forms of parametric memory: internal parametric memory, which is written directly into the backbone model's own weights, and external parametric memory, which resides in auxiliary parameter modules attached to the backbone.

Latent Memory

Functions: Why Agents Need Memory?

The transition from large language models as general-purpose, stateless text processors to autonomous, goal-directed agents is not merely an incremental step but a fundamental paradigm shift. This shift exposes the critical limitation of statelessness. By definition, an agent must persist, adapt, and interact coherently over time. Achieving this relies not merely on a large context window but fundamentally on the capacity for memory. This section addresses the functions, or fundamental purpose, of agent memory, prioritizing the question of why it is essential over how it is implemented. We posit that agent memory is not a monolithic component but a set of distinct functional capabilities, each serving a unique objective in enabling persistent, intelligent behavior.

To provide a systematic analysis, this section organizes the why of memory around a functional taxonomy that maps directly to an agent's core requirements. At the highest level, we distinguish between two temporal categories: long-term memory, which serves as the persistent, cross-session store for accumulated knowledge, and short-term memory, which functions as the transient, in-session workspace for active reasoning. This high-level temporal split is further resolved into three primary functional pillars, which form the structure of our analysis. An overview of this taxonomy is provided in Figure 6.

Factual Memory

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum. Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
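The episodic-to-semantic continuum described above can be sketched as a two-stage store: raw traces are logged verbatim, then a consolidation pass induces deduplicated facts. The class and the regex-based extractor below are illustrative stand-ins of ours; in the cited systems the extraction step is an LLM call, and the semantic store is a vector database or knowledge graph rather than a Python set.

```python
import re
from dataclasses import dataclass, field

@dataclass
class FactualMemory:
    """Sketch of the episodic-to-semantic processing continuum
    (structure and names are illustrative)."""
    episodic: list = field(default_factory=list)   # raw interaction traces
    semantic: set = field(default_factory=set)     # deduplicated fact tuples

    def log_turn(self, speaker, text):
        # stage 1: log concrete interaction history as an episodic trace
        self.episodic.append((speaker, text))

    def consolidate(self, extract_facts):
        # stage 2: fact induction; set membership provides deduplication
        for speaker, text in self.episodic:
            for fact in extract_facts(speaker, text):
                self.semantic.add(fact)

def naive_extractor(speaker, text):
    # stand-in for LLM-based fact induction: "I like X" -> (speaker, likes, X)
    m = re.search(r"I like (\w+)", text)
    return [(speaker, "likes", m.group(1))] if m else []
```

Logging the same preference twice still yields a single semantic fact, which is the consistency property the deduplication procedures above are meant to guarantee.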

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

User factual memory

User factual memory persists verifiable facts about a specific user across sessions and tasks, including identity, preferences, routines, historical commitments, and salient events.

Its primary function is to prevent characteristic failure modes of stateless interaction, such as coreference drift, repeated elicitation, and contradictory responses, thereby reducing interruptions to long-horizon goals (Tan et al., 2025c; Zhong et al., 2024). Engineering practice typically comprises selection and compression, structured organization, retrieval and reuse, and consistency governance, aiming to sustain long-range dialogic and behavioral coherence under bounded access cost.

Dialogue Coherence Dialogue coherence requires an agent to preserve conversational context, user-specific facts, and a stable persona over extended periods. This ensures that later turns remain sensitive to earlier disclosures and affective cues, rather than degrading into repeated clarifications or inconsistent replies. To achieve this, modern systems implement user factual memory through two complementary strategies: heuristic selection and semantic abstraction.

To navigate finite context windows efficiently, a primary strategy is to selectively retain and rank interaction histories. Rather than retaining all raw logs, systems (Xi and Wang, 2025; Zhong et al., 2024; Park et al., 2023; Lei et al., 2025) maintain structured stores of past interactions, ranking entries by metrics such as relevance, recency, importance, or distinctiveness. By filtering retrieval based on these scores, high-value items are preserved and periodically condensed into higher-level summaries, conditioning subsequent responses to maintain continuity without overwhelming the agent's working memory.
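Such ranking is typically a weighted combination of the metrics just listed. The sketch below follows the general style of the retrieval score in Park et al. (2023), but the specific weights, decay rate, and entry fields are our own illustrative choices, not any system's published parameters.

```python
def score_memory(entry, query_relevance, now, weights=(1.0, 1.0, 1.0), decay=0.995):
    """Illustrative retrieval score: weighted sum of relevance, recency,
    and importance. Field names and constants are assumptions."""
    hours_since = (now - entry["last_access"]) / 3600.0
    recency = decay ** hours_since            # exponential recency decay
    importance = entry["importance"] / 10.0   # normalize a 1-10 rating
    w_rel, w_rec, w_imp = weights
    return w_rel * query_relevance + w_rec * recency + w_imp * importance

def top_k(entries, relevances, now, k=3):
    """Rank memory entries and keep the k highest-scoring ones."""
    scored = sorted(zip(entries, relevances),
                    key=lambda pair: score_memory(pair[0], pair[1], now),
                    reverse=True)
    return [entry for entry, _ in scored[:k]]
```

A freshly accessed, query-relevant entry will outrank a highly important but stale and irrelevant one, which is exactly the filtering behavior the selection strategy above relies on.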

Beyond mere selection, advanced frameworks emphasize the transformation and abstraction of raw dialogue fragments into higher-level semantic representations. Approaches such as Think in Memory (Liu et al., 2023a) and Reflective Memory Management (Tan et al., 2025c) convert raw interaction traces into thought representations or reflections via iterative update operations. This allows the agent to query a stable semantic memory, keeping later replies topically consistent and less repetitive. Similarly, COMEDY (Chen et al., 2025d) employs a single language model to generate, compress, and reuse memory while updating compact user profiles. These methods effectively stabilize persona and preference expression over long conversational histories by decoupling memory storage from the raw token surface form.

Goal Consistency Goal consistency requires an agent to maintain and refine an explicit task representation over time. This ensures that clarifying questions, information requests, and actions remain strictly aligned with the primary objective, minimizing intent drift.

To mitigate such drift, systems utilize factual memory to dynamically track and update the task state. Approaches like RecurrentGPT (Zhou et al., 2023b), Memolet (Yen and Zhao, 2024), and MemGuide (Du et al., 2025b) retain confirmed information while highlighting unresolved elements. By guiding retrieval based on task intent, these methods help agents satisfy missing constraints and maintain focus across sessions.
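A minimal version of this task-state tracking is a slot store that separates confirmed information from unresolved elements and lets the unresolved set drive the next clarifying question. This is a generic sketch of ours, not the mechanism of RecurrentGPT, Memolet, or MemGuide specifically.

```python
class TaskState:
    """Slot-based task-state sketch for goal consistency (names illustrative):
    confirmed slots persist across turns; unresolved slots guide elicitation."""

    def __init__(self, required_slots):
        self.required = list(required_slots)
        self.confirmed = {}  # slot -> value, retained across the session

    def update(self, slot, value):
        # retain confirmed information
        self.confirmed[slot] = value

    def unresolved(self):
        # highlight elements still missing for the primary objective
        return [s for s in self.required if s not in self.confirmed]

    def next_query(self):
        missing = self.unresolved()
        return f"Please provide: {missing[0]}" if missing else None
```

Because clarifying questions are generated only from the unresolved set, every elicitation stays aligned with the primary objective, which is the intent-drift mitigation described above.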

For complex, long-horizon tasks, memory forms are often structured to facilitate localized retrieval centered on the active goal (Wu et al., 2025h). For instance, A-Mem (Xu et al., 2025c) organizes memories as an interconnected graph of linked notes, while H-Mem (Limbacher and Legenstein, 2020) employs associative mechanisms to recall prerequisite facts when subsequent steps depend on prior observations.

In embodied scenarios, factual memory grounds agent behavior in user-specific habits and environmental context. Systems such as M3-Agent (Long et al., 2025) and MEMENTO (Kwon et al., 2025) persist data on household members, object locations, and routines, reusing this information to minimize redundant exploration and repeated instructions. Similarly, Encode-Store-Retrieve (Shen et al., 2024) processes egocentric visual streams into text-addressable entries, allowing agents to answer questions based on past visual experiences without requiring user repetition.

Summary Collectively, these mechanisms transform ephemeral interaction traces into a persistent cognitive substrate. By integrating retrieval-based ranking with generative abstraction, user factual memory upgrades the system from simple similarity matching to the active maintenance of explicit goals and constraints. This foundation yields a dual benefit: it fosters a sense of familiarity and trust through long-term behavioral coherence, while simultaneously enhancing operational efficiency by increasing task success rates, reducing redundancy, and lowering error recovery overhead.



1 Introduction
2 Preliminaries: Formalizing Agents and Memory
  2.1 LLM-based Agent Systems
  2.2 Agent Memory Systems
  2.3 Comparing Agent Memory with Other Key Concepts
    2.3.1 Agent Memory vs. LLM Memory
    2.3.2 Agent Memory vs. RAG
    2.3.3 Agent Memory vs. Context Engineering
3 Form: What Carries Memory?
  3.1 Token-level Memory
    3.1.1 Flat Memory (1D)
    3.1.2 Planar Memory (2D)
    3.1.3 Hierarchical Memory (3D)
  3.2 Parametric Memory
    3.2.1 Internal Parametric Memory
    3.2.2 External Parametric Memory
  3.3 Latent Memory
    3.3.1 Generate
    3.3.2 Reuse
    3.3.3 Transform
  3.4 Adaptation
4 Functions: Why Agents Need Memory?
  4.1 Factual Memory
    4.1.1 User factual memory
    4.1.2 Environment factual memory
  4.2 Experiential Memory
    4.2.1 Case-based Memory
    4.2.2 Strategy-based Memory
    4.2.3 Skill-based Memory
    4.2.4 Hybrid Memory
  4.3 Working Memory
    4.3.1 Single-turn Working Memory
    4.3.2 Multi-turn Working Memory
5 Dynamics: How Memory Operates and Evolves?
  5.1 Memory Formation
    5.1.1 Semantic Summarization
    5.1.2 Knowledge Distillation
    5.1.3 Structured Construction
    5.1.4 Latent Representation
    5.1.5 Parametric Internalization
  5.2 Memory Evolution
    5.2.1 Consolidation
    5.2.2 Updating
    5.2.3 Forgetting
  5.3 Memory Retrieval
    5.3.1 Retrieval Timing and Intent
    5.3.2 Query Construction
    5.3.3 Retrieval Strategies
    5.3.4 Post-Retrieval Processing
6 Resources and Frameworks
  6.1 Benchmarks and Datasets
    6.1.1 Benchmarks for Memory / Lifelong / Self-Evolving Agents
    6.1.2 Other Related Benchmarks
  6.2 Open-Source Frameworks
7 Positions and Frontiers
  7.1 Memory Retrieval vs. Memory Generation
    7.1.1 Look Back: From Memory Retrieval to Memory Generation
    7.1.2 Future Perspective
  7.2 Automated Memory Management
    7.2.1 Look-Back: From Hand-crafted to Automatically Constructed Memory Systems
    7.2.2 Future Perspective
  7.3 Reinforcement Learning Meets Agent Memory
    7.3.1 Look-Back: RL is Internalizing Memory Management Abilities for Agents
    7.3.2 Future Perspective
  7.4 Multimodal Memory
    7.4.1 Look-Back
    7.4.2 Future Perspective
  7.5 Shared Memory in Multi-Agent Systems
    7.5.1 Look-Back: From Isolated Memories to Shared Cognitive Substrates
    7.5.2 Future Perspective
  7.6 Memory for World Model
    7.6.1 Look-Back
    7.6.2 Future Perspective
  7.7 Trustworthy Memory
    7.7.1 Look-Back: From Trustworthy RAG to Trustworthy Memory
    7.7.2 Future Perspective
  7.8 Human-Cognitive Connections
    7.8.1 Look Back
    7.8.2 Future Perspective
MethodMultiTypeMemory FormTask
Flat Memory ModelsFlat Memory ModelsFlat Memory ModelsFlat Memory ModelsFlat Memory Models
Reflexion (Shinn et al., 2023b)E&WTrajectory as short-term and feedback as long-termQA, Reasoning, Coding
Memento (Zhou et al., 2025a)✗ ✔ExpTrajectory case (success/failure).Reasoning Game
JARVIS-1 (Wang et al., 2025q)ExpPlan-environment pairs.
Expel (Zhao et al., 2024)ExpInsights and few-shot examples.Reasoning
Buffer of Thoughts (Yang et al., 2024b)ExpHigh-level thought-templates.Game, Reasoning, Coding
SAGE (Liang et al., 2025)ExpDual-store with forgetting mechanism.Game, Reasoning, Coding
ChemAgent (Tang et al., 2025c)ExpStructured sub-tasks and principles.Chemistry
AgentKB (Tang et al., 2025d)Exp5-tuple experience nodes.Coding, Reasoning
H 2 R (Ye et al., 2025b)ExpPlanning and Execution layers.Game, Embodied Simula- tion
AWM (Wang et al., 2024m)ExpAbstracted universal workflows.Web
PRINCIPLES (Kim et al., 2025a)ExpRule templates from self-play.Emotional Companion
ReasoningBank (Ouyang et al., 2025)ExpTransferable reasoning strategy items.Web
Voyager (Wang et al., 2024b)ExpExecutable skill code library.Game
DGM (Zhang et al., 2025i)ExpRecursive self-modifiable codebase.Coding
Memp (Fang et al., 2025d)ExpInstructions and abstract scripts.Embodied Simulation, Travel Planning
UFO2 (Zhang et al., 2025a)ExpSystem docs and interaction records.Windows OS
LEGOMem (Han et al., 2025a)ExpVectorized task trajectories.Office
ToolMem (Xiao et al., 2025b)ExpTool capability.Tool Calling
SCM (Wang et al., 2025a)FactMemory stream and vector database.Long-context
MemoryBank (Zhong et al., 2024)FactHistory and user profile.Emotional Companion
MPC (Lee et al., 2023)FactPersona and summary vector pool.QA
RecMind (Wang et al., 2024h)FactUser metadata and external knowledge.Recommendation
InteRecAgent (Huang et al., 2025d)FactUser profiles and candidate item.Recommendation
Ego-LLaVA (Shen et al., 2024)FactLanguage-encoded chunk embeddings.Multimodal QA
ChatHaruhi (Li et al., 2023a)FactDialogue database from media.Role-Playing
Memochat (Lu et al., 2023)FactMemos and categorized dialogue history.Long-conv QA
RecursiveSum (Wang et al., 2025h)FactRecursive summaries of short dialogues.Long-conv QA
MemGPT (Packer et al., 2023a)FactVirtual memory (Main/External contexts).Long-conv QA, Doc QA
MethodMultiTypeMemory StructureTask
RoleLLM (Wang et al., 2024d)FactRole-specific QA pairs.Role-Playing
Think-in-memory (Liu et al., 2023a)FactHash table of inductive thoughts.Long-conv QA
PLA (Yuan et al., 2025b)FactEvolving records of history and summaries.QA, Human Feedback
COMEDY (Chen et al., 2025d)FactSingle-model compressed memory format.Summary, Compression, QA
Memoro (Zulfikar et al., 2024)FactSpeech-to-text vector embeddings.User Study
Memory Sharing (Gao and Zhang, 2024a)FactQuery-Response pair retrieval.Literary Creation, Logic, Plan Generation
Conv Agent(Alonso et al., 2024)FactChain-of-tables and vector entries.QA
EM-LLM (Fountas et al., 2025)FactEpisodic events with Bayesian boundaries.Long-context
Memocrs (Xi et al., 2024a)FactUser metadata and knowledge.Recommendation
SECOM (Pan et al., 2025)FactParagraph-level segmented blocks.Long-conv QA
Mem0 (Chhikara et al., 2025)FactSummary and original dialogue.Long-conv QA
RMM (Tan et al., 2025c)FactReflection-organized flat entries.Personalization
MEMENTO (Kwon et al., 2025)FactInteraction history entries.Personalization
MemGuide (Du et al., 2025b)FactDialogue-derived QA pairs.Long-conv QA
MIRIX (Wang and Chen, 2025)FactSix optimized flat memory types.Long-conv QA
SemanticAnchor (Chatterjee and Agar- wal, 2025)FactSyntactic 5-tuple structure.Long-conv QA
MMS (Zhang et al., 2025b)FactDual Retrieval and Context units.Long-conv QA
Memory-R1 (Yan et al., 2025c)FactRL-managed mem0 architecture.Long-conv QA
ComoRAG (Wang et al., 2025f)FactFact/Semantic/Plot units with probes.Narrative QA
Nemori (Nan et al., 2025)FactPredictive calibration store.Long-conv QA
Livia (Xi and Wang, 2025)FactPruned interaction history.Emotional Companion
MOOM (Chen et al., 2025e)✗ ✗FactDecoupled plot and character stores.Role-Playing
Mem- α (Wang et al., 2025p)FactCore, Semantic, and Episodic Mem.Memory Management
Personalized Long term Interac- tion (Westhäußer et al., 2025)FactHierarchical history and summaries.Personalization
LightMem (Fang et al., 2025b)FactOptimized Long/Short-term store.Long-conv QA
MEXTRA (Wang et al., 2025b)FactExtracted raw dialogue data.Privacy Attack
MovieChat (Song et al., 2024)FactShort-term features and long-term persis- tence.Video Understanding
MA-LMM (He et al., 2024)FactVisual and Query memory banks.Video Understanding
VideoAgent (Wang et al., 2024g)FactTemporal text descriptions and object tracking.Video Understanding
Video-RAG (Luo et al., 2025b)FactVisually-aligned information .Video Understanding Embodied Task
KARMA (Wang et al., 2025r) Embodied VideoAgent (Fan et al.,✔ ✔Fact Fact3D scene graph and dynamic object states.MultiModal
2025)FactPersistent object and sensor store.Embodied Navigation
Mem2Ego (Zhang et al., 2025m)Map, landmark, and visited location stores.Generation
Context-as-Memory (Yu et al., 2025b)FactGenerated context frames.Video
RCR-Router (Liu et al., 2025d)FactBudget-aware semantic subsets.QA
ELL (Cai et al., 2025a)FactLiflong memory and skills.Lifelong Learning
MemRL (Zhang et al., 2026)ExpRL for memory management.Web
ReMe (Cao et al., 2025b)ExpStep level experience and insight.Web
MMAG (Zeppieri, 2025) Hindsight (Latimer et al., 2025)Fact FactFive interacting memory layers.User Study Long-conv
Retains, recalls, and reflects.QA
GAM (Yan et al., 2025a)FactSimple memory but search is guided.Long-conv QA
Planar Memory ModelsPlanar Memory ModelsPlanar Memory ModelsPlanar Memory ModelsPlanar Memory Models
Method | Type | Memory Structure | Task
D-SMART (Lei et al., 2025) | Fact | Structured memory with reasoning trees. | Long-conv QA
Reflexion (Shinn et al., 2023b) | Work | Reflective text buffer from experiences. | QA, Reasoning, Coding
PREMem (Kim et al., 2025b) | Fact | Dynamic cross-session linked triples. | Long-conv QA
Query Reconstruct (Xu et al., 2025b) | Exp | Logic graphs built from knowledge bases. | Knowledge Graph QA
KGT (Sun et al., 2024) | Fact | KG node from query and feedback. | QA
Optimus-1 (Li et al., 2024d) | F&E | Knowledge graph and experience pool. | Game
SALI (Pan et al., 2024) | Exp | Topological graph with spatial nodes. | Navigation
HAT (A et al., 2024) | Fact | Hierarchical aggregate tree. | Long-conv QA
MemTree (Rezazadeh et al., 2025c) | Fact | Dynamic hierarchical conversation tree. | Long-conv QA
TeaFarm (iunn Ong et al., 2025) | Fact | Causal edges connecting memories. | Long-conv QA
COMET (Kim et al., 2024b) | Fact | Context-aware memory through graph. | Long-conv QA
Intrinsic Memory (Yuen et al., 2025) | Fact | Private internal and shared external mem. | Planning
A-MEM (Xu et al., 2025c) | Fact | Card-based connected mem. | Long-conv QA
Ret-LLM (Modarressi et al., 2023) | Fact | Triplet table and LSH vectors. | QA
HuaTuo (Wang et al., 2023a) | Fact | Medical Knowledge Graph. | Medical QA
M3-Agent (Long et al., 2025) | Fact | Multimodal nodes in graph structure. | Embodied QA
EMem (Zhou and Han, 2025a) | Fact | Event-centric alternative with PageRank. | Long-conv QA
WorldMM (Yeo et al., 2025) | Fact | Multiple complementary memories. | Video Understanding
Memoria (Sarin et al., 2025) | Fact | Knowledge-graph profile and summary. | Long-conv QA
LingoEDU (Zhou et al., 2026) | Fact | Relation tree of Elementary Discourse Units. | Long-conv QA
Hierarchical Memory Models
GraphRAG (Edge et al., 2025) | Fact | Multi-level community graph indices. | QA, Summarization
H-Mem (Sun and Zeng, 2025) | Fact | Decoupled index layers and content layers. | Long-conv QA
EMG-RAG (Wang et al., 2024l) | Fact | Three-tiered memory graph. | QA
G-Memory (Zhang et al., 2025c) | Exp | Query-centric three-layer graph structure. | QA, Game, Embodied Task
Zep (Rasmussen et al., 2025) | Fact | Temporal Knowledge Graphs. | Long-conv QA
SGMem (Wu et al., 2025h) | Fact | Chunk Graph and Sentence Graph. | Long-conv QA
HippoRAG (Gutierrez et al., 2024) | Fact | Knowledge with query nodes. | QA
HippoRAG 2 (Gutiérrez et al., 2025) | Fact | KG with phrase and passage. | QA
AriGraph (Anokhin et al., 2024) | Fact | Semantic and Episodic memory graph. | Game
Lyfe Agents (Kaiya et al., 2023) | Fact | Working, Short & Long-term layers. | Social Simulation
CAM (Li et al., 2025g) | Fact | Multilayer graph with topic. | Doc QA
HiAgent (Hu et al., 2025a) | E&W | Goal graphs with recursive cluster. | Agentic Tasks
ILM-TR (Tang et al., 2024) | Fact | Hierarchical Memory tree. | Long-context
CompassMem (Hu et al., 2026b) | Fact | Hierarchical event-centric Memory. | QA
MAGMA (Jiang et al., 2026) | Fact | Semantic, temporal, causal, entity graphs. | Long-conv QA
EverMemOS (Hu et al., 2026a) | Fact | Reusable memories covering multi types. | Long-conv QA
RGMem (Tian et al., 2025a) | Fact | Renormalization Group-based memory. | Long-conv QA
MemVerse (Liu et al., 2025e) | Fact | Multimodal hierarchical knowledge graphs. | Reasoning, QA
Method | Type | Task | Optimization
I. Internal Parametric Memory
(a) Pre-Train Phase
TNL (Qin et al., 2024b) | Working | QA, Reasoning | SFT
StreamingLLM (Xiao et al., 2024) | Working | QA, Reasoning | SFT
LMLM (Zhao et al., 2025b) | Factual | QA, Factual Gen | SFT
HierMemLM (Pouransari et al., 2025) | Factual | QA, Language Modeling | SFT
Function Token (Zhang et al., 2025o) | Factual | Language Modeling | Pretrain
(b) Mid-Train Phase
Agent-Founder (Su et al., 2025) | Experiential | Tool Calling, Deep Research | SFT
Early Experience (Zhang et al., 2025k) | Experiential | Tool Calling, Embodied Simulation, Reasoning, Web | SFT
(c) Post-Train Phase
Character-LM (Shao et al., 2023) | Factual | Role Playing | SFT
CharacterGLM (Zhou et al., 2024a) | Factual | Role Playing | SFT
SELF-PARAM (Wang et al., 2025o) | Factual | QA, Recommendation | KL Tuning
Room (Kim et al., 2023b) | Experiential | Embodied Task | RL
KnowledgeEditor (Cao et al., 2021) | Factual | QA, Fact Checking | FT
Mend (Mitchell et al., 2022) | Factual | QA, Fact Checking, Model Editing | FT
PersonalityEdit (Mao et al., 2024) | Factual | QA, Model Editing | FT, PE
APP (Ma et al., 2024) | Factual | QA | FT
DINM (Wang et al., 2024c) | Experiential | QA, Detoxification | FT
AlphaEdit (Fang et al., 2025c) | Factual | QA | FT
II. External Parametric Memory
(a) Adapter-based Modules
MLP-Memory (Wei et al., 2025d) | Factual | QA, Classification, Textual Entailment | SFT
K-Adapter (Wang et al., 2021) | Factual | QA, Entity Typing, Classification | SFT
WISE (Wang et al., 2024e) | Factual | QA, Hallucination Detection | SFT
ELDER (Li et al., 2025d) | Factual | Model Editing | SFT
T-Patcher (Huang et al., 2023) | Factual | QA | FT
Sparse Memory FT (Lin et al., 2025a) | Factual | QA | SFT
Memory Decoder (Cao et al., 2025a) | Factual | QA, Language Modeling | SFT
MemLoRA (Bini et al., 2025) | Factual | QA | SFT
(b) Auxiliary LM-based Modules
MAC (Tack et al., 2024) | Factual | QA | SFT
Retroformer (Yao et al., 2024a) | Experiential | QA, Web Navigation | RL
Method | Form | Type | Task
I. Generate
(a) Single Modal
Gist (Mu et al., 2023) | Gist Tokens | Working | Long-context Compression
Taking a Deep Breath (Luo et al., 2024) | Sentinel Tokens | Working | Long-context QA
SoftCoT (Xu et al., 2025d) | Soft Tokens | Working | Reasoning
CARE (Choi et al., 2025) | Memory Tokens | Working | QA, Fact Checking
AutoCompressor (Chevalier et al., 2023) | Summary Vectors | Working | QA, Compression
MemoRAG (Qian et al., 2025) | Global Semantic States | Working | QA, Summary
MemoryLLM (Wang et al., 2024j) | Persistent Tokens | Factual | Long-conv QA, Model Editing
M+ (Wang et al., 2025n) | Cross-layer Token Pools | Factual | QA
LM2 (Kang et al., 2025b) | Matrix Slots | Working | QA, Reasoning
Titans (Behrouz et al., 2025b) | Neural Weights (MLP) | Working | QA, Language Modeling
MemGen (Zhang et al., 2025d) | LoRA Fragments | Working, Exp. | QA, Math, Code, Embodied Task, Reasoning
EMU (Na et al., 2024) | Embeddings w/ Returns | Factual | Game
TokMem (Wu et al., 2025j) | Memory Tokens | Exp. | Function Calling
Nested Learning (Behrouz et al., 2025a) | Nested Optimization | Factual | Language Modeling
Memoria (Park and Bak, 2024) | Three memory layers with engrams | Factual | Language Modeling
(b) Multi-Modal
CoMem (Wu et al., 2025d) | Multimodal Embeddings | Factual | Multimodal QA
ACM (Wu et al., 2025e) | Trajectory Embeddings | Working | Web
Time-VLM (Zhong et al., 2025) | Patch Embeddings | Working | Video Understanding
Mem Augmented RL (Mezghani et al., 2022) | Novelty State Encoder | Working | Visual Navigation
MemoryVLA (Shi et al., 2025a) | Perceptual States | Factual, Working | Embodied Task
XMem (Cheng and Schwing, 2022) | Key-Value Embeddings | Working | Video Segmentation
II. Reuse
Memorizing Transformers (Wu et al., 2022) | External KV Cache | Working | Language Modeling
SirLLM (Yao et al., 2024b) | Entropy-selected KV | Factual | Long-conv QA
Memory³ (Yang et al., 2024a) | Critical KV Pairs | Factual | QA
FOT (Tworkowski et al., 2023) | Memory-Attention KV | Working | QA, Few-shot Learning, Language Modeling
LONGMEM (Wang et al., 2023b) | Residual SideNet KV | Working | Language Modeling and Understanding
III. Transform
Scissorhands (Liu et al., 2023b) | Pruned KV | Working | Image classification & generation
SnapKV (Li et al., 2024b) | Aggregated Prefix KV | Working | Language Modeling
PyramidKV (Cai et al., 2024) | Layer-wise Budget | Working | Language Modeling
RazorAttention (Tang et al., 2025a) | Compensated Window | Working | Language Modeling
H2O (Zhang et al., 2023) | Heavy Hitter Tokens | Working | QA, Language Modeling
R³Mem (Wang et al., 2025k) | Virtual memory tokens with reversible compression | Working | QA, Language Modeling
Method | Carrier | Structure | Task | Optimization
I. User Factual Memory
(a) Dialogue Coherence
MemGPT (Packer et al., 2023b) | Token-level | 1D | Long-term dialogue | PE
TiM (Liu et al., 2023a) | Token-level | 2D | QA | PE
MemoryBank (Zhong et al., 2024) | Token-level | 1D | Emotional Companion | PE
AI Persona (Wang et al., 2024f) | Token-level | 1D | Emotional Companion | PE
Encode-Store-Retrieve (Shen et al., 2024) | Token-level | 1D | Multimodal QA | PE
Livia (Xi and Wang, 2025) | Token-level | 1D | Emotional Companion | PE
mem0 (Chhikara et al., 2025) | Token-level | 1D | Long-term dialogue, QA | PE
RMM (Tan et al., 2025c) | Token-level | 2D | Personalization | PE, RL
D-SMART (Lei et al., 2025) | Token-level | 2D | Reasoning | PE
Comedy (Chen et al., 2025d) | Token-level | 1D | Summary, Compression, QA | PE
MEMENTO (Kwon et al., 2025) | Token-level | 1D | Embodied, Personalization | PE
O-Mem (Wang et al., 2025g) | Token-level | 3D | Personalized Dialogue | PE
DAM-LLM (Lu and Li, 2025) | Token-level | 1D | Emotional Companion | PE
MemInsight (Salama et al., 2025) | Token-level | 1D | Personalized Dialogue | PE
EMem (Zhou and Han, 2025a) | Token-level | 1D | Personalized Dialogue | PE
RGMem (Tian et al., 2025a) | Token-level | 1D | Long-conv QA | PE
Memoria (Sarin et al., 2025) | Token-level | 1D | Long-conv QA | PE
(b) Goal Consistency
RecurrentGPT (Zhou et al., 2023b) | Token-level | 1D | Long-Context Generation, Personalized Interactive Fiction | PE
Memolet (Yen and Zhao, 2024) | Token-level | 2D | QA, Document Reasoning | PE
MemGuide (Du et al., 2025b) | Token-level | 1D | Long-conv QA | PE, SFT
SGMem (Wu et al., 2025h) | Token-level | 2D | Long-context | PE
A-Mem (Xu et al., 2025c) | Token-level | 2D | QA, Reasoning | PE
M3-agent (Long et al., 2025) | Token-level | 2D | Multimodal QA | PE, SFT
WorldMM (Yeo et al., 2025) | Token-level | 1D | Multimodal QA | PE
EverMemOS (Hu et al., 2026a) | Token-level | 1D | Long-conv QA | PE
II. Environment Factual Memory
(a) Knowledge Persistence
MemGPT (Packer et al., 2023b) | Token-level | 1D | Document QA | PE
CALYPSO (Zhu et al., 2023) | Token-level | 1D | Tabletop Gaming | PE
AriGraph (Anokhin et al., 2024) | Token-level | 3D | Game, Multi-hop QA | PE
HippoRAG (Gutierrez et al., 2024) | Token-level | 3D | QA | PE
WISE (Wang et al., 2024e) | Parametric | / | Document Reasoning, QA | SFT
MemoryLLM (Wang et al., 2024j) | Parametric | / | Document Reasoning | SFT
Memoria (Park and Bak, 2024) | Latent | / | Language Modeling | PE
Zep (Rasmussen et al., 2025) | Token-level | 3D | Document analysis | PE
MemTree (Rezazadeh et al., 2025c) | Token-level | 2D | Document Reasoning, Dialogue | PE
LMLM (Zhao et al., 2025b) | Token-level | 1D | QA | SFT
M+ (Wang et al., 2025n) | Latent | / | Document Reasoning, QA | SFT
CAM (Li et al., 2025g) | Token-level | 3D | Multi-hop QA | SFT, RFT
MemAct (Zhang et al., 2025r) | Token-level | 1D | Multi-obj QA | RL
Mem-α (Wang et al., 2025p) | Token-level | 1D | Document Reasoning | RL
WebWeaver (Li et al., 2025m) | Token-level | 1D | Deep Research | SFT
MemLoRA (Bini et al., 2025) | Parametric | / | QA | SFT
Memory Decoder (Cao et al., 2025a) | Parametric | / | QA, Language Modeling | SFT
(b) Shared Access
GameGPT (Chen et al., 2023b) | Token-level | 1D | Game Development | PE
Generative Agent (Park et al., 2023) | Token-level | 2D | Social Simulation | PE
S³ (Gao et al., 2023a) | Token-level | 1D | Social Simulation | PE
Memory Sharing (Gao and Zhang, 2024a) | Token-level | 1D | Document Reasoning | PE
MetaGPT (Hong et al., 2024) | Token-level | 1D | Software Development | PE
G-Memory (Zhang et al., 2025e) | Token-level | 3D | QA | PE
OASIS (Yang et al., 2025) | Token-level, Parametric | 1D | Social Simulation | PE
Method | Carrier | Form | Task | Optimization
I. Case-based Memory
Expel (Zhao et al., 2024) | Token-level | Solution | Reasoning | PE
Synapse (Zheng et al., 2024a) | Token-level | Solution | Web Interaction, Instruction-guided Web Task | PE
Fincon (Yu et al., 2024) | Token-level | Solution | Financial | PE
MapCoder (Islam et al., 2024) | Token-level | Solution | Coding | PE
Memento (Zhou et al., 2025a) | Token-level | Trajectory | Reasoning | RL
COLA (Zhao et al., 2025a) | Token-level | Trajectory | GUI, Web Navigation, Reasoning | PE
Continuous Memory (Wu et al., 2025e) | Latent | Trajectory | GUI | SFT
JARVIS-1 (Wang et al., 2025q) | Token-level | Trajectory | Game, GUI Interaction | PE
MemGen (Zhang et al., 2025d) | Latent | Trajectory | Web Search, Embodied Simulation, Reasoning, Math, Code | RL, SFT
Early Experience (Zhang et al., 2025k) | Parametric | Trajectory | Embodied Simulation, Reasoning, Web Navigation | SFT
DreamGym (Chen et al., 2025f) | Token-level | Trajectory | Web Interaction, Embodied Simulation, Shopping | RL
MemRL (Zhang et al., 2026) | Token-level | Trajectory | Coding, Embodied Simulation, Reasoning | RL
II. Strategy-based Memory
Reflexion (Shinn et al., 2023a) | Token-level | Insight | Embodied Simulation, Reasoning, Coding | PE
Buffer of Thoughts (Yang et al., 2024b) | Token-level | Pattern | Game, Reasoning, Coding | PE
AWM (Wang et al., 2024m) | Token-level | Workflow | Web Interaction, Instruction-guided Web Task | PE
RecMind (Wang et al., 2024h) | Token-level | Pattern | Recommendation | PE
H2R (Ye et al., 2025b) | Token-level | Insight | Game, Embodied Simulation | PE
ReasoningBank (Ouyang et al., 2025) | Token-level | Insight | Web Interaction, Instruction-guided Web Task | PE
R2D2 (Huang et al., 2025c) | Token-level | Insight | Web Interaction | PE
BrowserAgent (Yu et al., 2025d) | Token-level | Insight | General QA, Web Search | RL, SFT
Agent KB (Tang et al., 2025d) | Token-level | Workflow | Code, Reasoning | PE
ToolMem (Xiao et al., 2025b) | Token-level | Insight | Reasoning, Image Generation | PE
PRINCIPLES (Kim et al., 2025a) | Token-level | Pattern | Emotional Companion | PE
SE-Agent (Sun et al., 2025c) | Token-level | Insight | Coding | PE
ACE (Zhang et al., 2025n) | Token-level | Insight | Coding, Tool Calling, Financial | PE
Flex (Cai et al., 2025c) | Token-level | Insight | Math, Chemistry, Biology | PE
AgentEvolver (Zhai et al., 2025) | Parametric | Pattern | Tool-augmented Task | RL
Dynamic Cheatsheet (Suzgun et al., 2025) | Token-level | Insight | Math, Reasoning, Game | PE
Training-Free GRPO (Cai et al., 2025b) | Token-level | Insight | Math, Reasoning, Web Search | PE
MemEvolve (Zhang et al., 2025h) | Token-level | Solution, Insight | Web Search, Reasoning | PE
III. Skill-based Memory
CREATOR (Qian et al., 2023) | Token-level | Function and Script | Reasoning, Math | PE
Gorilla (Patil et al., 2024) | Token-level | API | Tool Calling | SFT
ToolRerank (Zheng et al., 2024b) | Token-level | API | Tool Calling | PE
Voyager (Wang et al., 2024b) | Token-level | Code Snippet | Game | PE
RepairAgent (Bouzenia et al., 2024) | Token-level | Function and Script | Coding | PE
COLT (Qu et al., 2024) | Token-level | API | Tool Calling | SFT
ToolLLM (Qin et al., 2024a) | Token-level | API | Tool Calling | SFT
LEGOMem (Han et al., 2025a) | Token-level | Function and Script | Office | PE
Darwin Gödel Machine (Zhang et al., 2025i) | Token-level | Code Snippet | Code | PE
Huxley-Gödel Machine (Wang et al., 2025j) | Token-level | Code Snippet | Code | PE
Memp (Fang et al., 2025d) | Token-level | Function and Script | Embodied Simulation, Travel Planning | PE
SkillWeaver (Zheng et al., 2025a) | Token-level | Function and Script | Web Interaction, Instruction-guided Web Task | PE
Alita (Qiu et al., 2025c) | Token-level | MCP | Math, Reasoning, VQA | PE
Alita-G (Qiu et al., 2025b) | Token-level | MCP | Math, Reasoning, VQA | PE
LearnAct (Liu et al., 2025b) | Token-level | Function and Script | Mobile GUI | PE
ToolGen (Wang et al., 2025i) | Parametric | API | Tool Calling | SFT
MemTool (Lumer et al., 2025) | Token-level | MCP | Tool Calling | SFT
ToolRet (Shi et al., 2025c) | Token-level | API | Web, Code, Tool Retrieval | SFT
DRAFT (Qu et al., 2025a) | Token-level | API | Tool Calling | PE
ASI (Wang et al., 2025s) | Token-level | Functions and Scripts | Web Interaction | PE
Method | Carrier | Task | Optimization
I. Single-turn Working Memory
(a) Input Condensation
Gist (Mu et al., 2023) | Latent | Instruction Fine-tuning | SFT
ICAE (Ge et al., 2024) | Latent | Language Modeling, Instruction Fine-tuning | Pretrain, LoRA
AutoCompressors (Chevalier et al., 2023) | Latent | Language Modeling | SFT
LLMLingua (Jiang et al., 2023) | Token-level | Reasoning, Conversation, Summarization | PE
LongLLMLingua (Jiang et al., 2024) | Token-level | Multi-doc QA, Long-context, Multi-hop QA | PE
CompAct (Yoon et al., 2024) | Token-level | Document QA | SFT
HyCo2 (Liao et al., 2025a) | Hybrid | Summarization, Open-domain QA, Multi-hop QA | SFT
Sentence-Anchor (Tarasov et al., 2025) | Latent | Document QA | SFT
MELODI (Chen et al., 2024c) | Hybrid | Pretraining | Pretrain
R³Mem (Wang et al., 2025k) | Latent | Document QA, Language Modeling | PEFT
(b) Observation Abstraction
Synapse (Zheng et al., 2024a) | Token-level | Computer Control, Web Navigation | PE
VideoAgent (Wang et al., 2024g) | Token-level | Long-term Video Understanding | PE
MA-LMM (He et al., 2024) | Latent | Long-term Video Understanding | SFT
Context as Memory (Yu et al., 2025b) | Token-level | Long-term Video Generation | PE
II. Multi-turn Working Memory
(c) State Consolidation
MEM1 (Zhou et al., 2025b) | Latent | Retrieval, Open-domain QA, Shopping | RL
MemGen (Zhang et al., 2025d) | Latent | Reasoning, Embodied Action, Web Search, Coding | RL
MemAgent (Yu et al., 2025a) | Token-level | Long-term Doc. QA | RL
ReMemAgent (Shi et al., 2025b) | Token-level | Long-term Doc. QA | RL
ReSum (Wu et al., 2025f) | Token-level | Long-horizon Web Search | RL
MemSearcher (Yuan et al., 2025a) | Token-level | Multi-hop QA | SFT, RL
ACON (Kang et al., 2025c) | Token-level | App use, Multi-objective QA | PE
IterResearch (Chen et al., 2025b) | Token-level | Reasoning, Web Navigation, Long-Horizon QA | RL
SUPO (Lu et al., 2025a) | Token-level | Long-horizon task | RL
AgentDiet (Xiao et al., 2025a) | Token-level | Long-horizon task | PE
SUMER (Zheng et al., 2025c) | Token-level | QA | RL
Sculptor (Li et al., 2025f) | Token-level | Multi-Needle QA | PE, RL
AgeMem (Yu et al., 2026) | Token-level | QA, Embodied Action | PE, RL
(d) Hierarchical Folding
HiAgent (Hu et al., 2025a) | Token-level | Long-horizon Agent Task | PE
Context-Folding (Sun et al., 2025b) | Token-level | Deep Research, SWE | RL
AgentFold (Ye et al., 2025a) | Token-level | Web Search | SFT
DeepAgent (Li et al., 2025i) | Token-level | Tool Use, Shopping, Reasoning | RL
(e) Cognitive Planning
SayPlan (Rana et al., 2023) | Token-level | 3D Scene Graph, Robotics | PE
KARMA (Wang et al., 2025r) | Token-level | Household | PE
Agent-S (Agashe et al., 2025) | Token-level | Computer Use | PE
PRIME (Tran et al., 2025) | Token-level | Multi-hop QA, Knowledge-intensive Reasoning | PE
Method | Sub-Type | Representation Form | Key Mechanism
I. Semantic Summarization
MemGPT (Packer et al., 2023a) | Incremental | Textual Summary | Merging new chunks into the working context
Mem0 (Chhikara et al., 2025) | Incremental | Textual Summary | LLM-driven summarization
Mem1 (Zhou et al., 2025b) | Incremental | Textual Summary | RL-optimized summarization (PPO)
MemAgent (Yu et al., 2025a) | Incremental | Textual Summary | RL-optimized summarization (GRPO)
MemoryBank (Zhong et al., 2024) | Partitioned | Textual Summary | Daily/Session-based segmentation
ReadAgent (Lee et al., 2024a) | Partitioned | Textual Summary | Semantic clustering before summarization
LightMem (Fang et al., 2025b) | Partitioned | Textual Summary | Topic-clustered summarization
DeepSeek-OCR (Wei et al., 2025a) | Partitioned | Visual Token Mapping | Optical 2D mapping compression
FDVS (You et al., 2024) | Partitioned | Multimodal Summary | Multi-source signal integration (Subtitle/Object)
LangRepo (Kahatapitiya et al., 2025) | Partitioned | Multimodal Summary | Hierarchical video clip aggregation
II. Knowledge Distillation
TiM (Liu et al., 2023a) | Factual | Textual Insight | Abstraction of dialogue into thoughts
RMM (Tan et al., 2025b) | Factual | Topic Insight | Abstraction of dialogue into topic-based memory
MemGuide (Du et al., 2025b) | Factual | User Intent | Capturing high-level user intent
M3-Agent (Long et al., 2025) | Factual | Text-addressable Facts | Compressing egocentric visual observations
AWM (Wang et al., 2024m) | Experiential | Workflow Patterns | Workflow extraction from success trajectories
KGT (Sun et al., 2024) | Entity-Level | User Graph | Encoding user preferences as nodes/edges
Mem0g (Chhikara et al., 2025) | Entity-Level | Knowledge Graph | LLM-based entity and triplet extraction
D-SMART (Lei et al., 2025) | Entity-Level | Dynamic Memory Graph | Constructing an OWL-compliant graph
GraphRAG (Edge et al., 2025) | Entity-Level | Hierarchical KG | Community detection and iterative summarization
AriGraph (Anokhin et al., 2024) | Entity-Level | Semantic+Episodic Graph | Dual-layer (Semantic nodes + Episodic links)
Zep (Rasmussen et al., 2025) | Entity-Level | Temporal KG | 3-layer graph (Episodic, Semantic, Community)
RAPTOR (Sarthi et al., 2024) | Chunk-Level | Tree Structure | Recursive GMM clustering and summarization
MemTree (Rezazadeh et al., 2025c) | Chunk-Level | Tree Structure | Bottom-up insertion and summary updates
H-MEM (Sun and Zeng, 2025) | Chunk-Level | Hierarchical JSON | Top-down 4-level hierarchy organization
A-MEM (Xu et al., 2025c) | Chunk-Level | Networked Notes | Discrete notes with semantic links
PREMem (Kim et al., 2025b) | Chunk-Level | Reasoning Patterns | Cross-session reasoning pattern clustering
CAM (Li et al., 2025g) | Chunk-Level | Hierarchical Graph | Disentangling overlapping clusters via replication
G-Memory (Zhang et al., 2025c) | Chunk-Level | Hierarchical Graph | 3-tier graph (interaction, query, insight)
IV. Latent Representation
MemoryLLM (Wang et al., 2024j) | Textual | Latent Vector | Self-updatable latent embeddings
M+ (Wang et al., 2025n) | Textual | Latent Vector | Cross-layer long-term memory tokens
MemGen (Zhang et al., 2025d) | Textual | Latent Token | Latent memory trigger and weaver
ESR (Shen et al., 2024) | Multimodal | Latent Vector | Video-to-Language-to-Vector encoding
CoMEM (Wu et al., 2025d) | Multimodal | Continuous Embedding | Vision-language compression via Q-Former
Mem2Ego (Zhang et al., 2025m) | Multimodal | Multimodal Embedding | Embedding landmark semantics as latent memory
KARMA (Wang et al., 2025r) | Multimodal | Multimodal Embedding | Hybrid long/short-term memory encoding
V. Memory Internalization
MEND (Mitchell et al., 2022) | Knowledge | Gradient Decomposition | Auxiliary network for fast edits
ROME (Meng et al., 2022) | Knowledge | Model Parameters | Causal tracing and rank-one update
MEMIT (Meng et al., 2023) | Knowledge | Model Parameters | Mass-editing via residual distribution
CoLoR (Wistuba et al., 2023) | Knowledge | LoRA Parameters | Low-rank adapter training
ToolFormer (Schick et al., 2023) | Capability | Model Parameters | Supervised fine-tuning on API calls
Name | Link | Env. | Feature | Scale
Memory/Lifelong-learning/Self-evolving-oriented Benchmarks
MemBench | GitHub | simulated | interactive scenarios | 53,000 s.
MemoryAgentBench | GitHub | simulated | multi-turn interactions | 4 t.
LoCoMo | Website | real | conversational memory | 300 s.
WebChoreArena | GitHub | real | tedious web browsing | 4 t. / 532 s.
MT-Mind2Web | GitHub | real | conversational web navigation | 720 s.
PersonaMem | Website | simulated | dynamic user profiling | 15 t. / 180 s.
LongMemEval | GitHub | simulated | interactive memory | 5 t. / 500 s.
PerLTQA | Website | simulated | social personalized interactions | 8,593 s.
MemoryBank | Website | simulated | user memory updating | 194 s.
MPR | GitHub | simulated | user personalization | 108,000 s.
PrefEval | Website | simulated | personal preferences | 3,000 s.
LOCCO | Website | simulated | chronological conversations | 3,080 s.
StoryBench | Website | mixed | interactive fiction games | 3 t.
MemoryBench | Website | simulated | continual learning | 4 t. / ~20,000 s.
Madial-Bench | GitHub | simulated | memory recalling | 331 s.
Evo-Memory | Website | simulated | test-time learning | 10 t. / ~3,700 s.
LifelongAgentBench | Website | simulated | lifelong learning | 1,396 s.
StreamBench | Website | simulated | continuous online learning | 9,702 s.
DialSim | Website | real | multi-dialogue understanding | ~1,300 s.
LongBench | Website | mixed | long-context understanding | 21 t. / 4,750 s.
LongBench v2 | Website | mixed | long-context multitasks | 20 t. / 503 s.
RULER | GitHub | simulated | long-context retrieval | 13 t.
BABILong | GitHub | simulated | long-context reasoning | 20 t.
MM-Needle | Website | simulated | multimodal long-context retrieval | ~280,000 s.
HaluMem | GitHub | simulated | memory hallucinations | 3,467 s.
HotpotQA | Website | simulated | long-context QA | 113k s.
Other Related Benchmarks
ALFWorld | Website | simulated | text-based embodied environment | 3,353 t.
ScienceWorld | GitHub | simulated | interactive embodied environment | 10 t. / 30 t.
AgentGym | Website | mixed | multiple environments | 89 t. / 20,509 s.
AgentBoard | GitHub | mixed | multi-round interaction | 9 t. / 1,013 s.
PDDL* | Website | simulated | strategy game | -
BabyAI | Website | simulated | language learning | 19 t.
WebShop | Website | simulated | e-commerce web interaction | 12,087 s.
WebArena | Website | real | web interaction | 812 s.
MMInA | Website | real | multihop web interaction | 1,050 s.
SWE-Bench Verified | Website | real | code repair | 500 s.
GAIA | Website | real | human-level deep research | 466 s.
xBench-DS | Website | real | deep-search evaluation | 100 s.
ToolBench | GitHub | real | API tool use | 126,486 s.
GenAI-Bench | Website | real | visual generation evaluation | ~40,000 s.
Framework | Links | Structure | Evaluation
MemGPT | GitHub / Website | hierarchical (S/LTM) | LoCoMo
Mem0 | GitHub / Website | graph + vector | LoCoMo
Memobase | GitHub / Website | structured profiles | LoCoMo
MIRIX | GitHub / Website | structured memory | LoCoMo, MemoryAgentBench
MemoryOS | GitHub / Website | hierarchical (S/M/LTM) | LoCoMo, MemoryBank
MemOS | GitHub / Website | tree memory + memcube | LoCoMo, PreFEval, LongMemEval, PersonaMem
Zep | GitHub / Website | temporal knowledge graph | LongMemEval
LangMem | GitHub / Website | core API + manager | -
SuperMemory | GitHub / Website | vector + semantic | -
Cognee | GitHub / Website | knowledge graph | -
Memary | GitHub / Website | stream + entity store | -
Pinecone | GitHub / Website | vector database | -
Chroma | GitHub / Website | vector database | -
Weaviate | GitHub / Website | vector + graph | -
Second Me | GitHub / Website | agent ego | -
MemU | GitHub / Website | hierarchical layers | -
MemEngine | GitHub | modular space | -
Memori | GitHub / Website | memory database | -
ReMe | GitHub / Website | memory management | -
AgentMemory | GitHub / Website | memory management | -
MineContext | GitHub / Website | context engineering | -
Acontext | GitHub | context engineering + skill learning | -
PowerMem | GitHub | oceanbase | -
ReMe | GitHub | agentscope | BFCL, AppWorld
HindSight | GitHub | parallel retrieval + reflection | -
Figure 1 Overview of agent memory organized by the unified taxonomy of forms (Section 3), functions (Section 4), and dynamics (Section 5). The diagram positions memory artifacts by their dominant form and primary function. It further maps representative systems into this taxonomy to provide a consolidated landscape.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum. Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
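This episodic-to-semantic continuum can be sketched in a few lines. The sketch below is a minimal, stdlib-only illustration, not any specific system's implementation; `FactualMemory` and the rule-based `naive_extract` are hypothetical names standing in for an LLM-driven fact-induction step.

```python
from dataclasses import dataclass

@dataclass
class EpisodicTrace:
    session: str
    text: str

class FactualMemory:
    """Toy episodic-to-semantic pipeline: log raw traces, then
    consolidate them into a deduplicated semantic fact base."""
    def __init__(self):
        self.episodic = []            # raw interaction history
        self.semantic = {}            # (subject, relation) -> object

    def log(self, session, text):
        self.episodic.append(EpisodicTrace(session, text))

    def consolidate(self, extract):
        # `extract` stands in for LLM-based fact induction; it maps a
        # raw trace to (subject, relation, object) triples.
        for trace in self.episodic:
            for s, r, o in extract(trace.text):
                # keyed writes deduplicate; the newest value wins,
                # acting as a crude consistency check
                self.semantic[(s, r)] = o

def naive_extract(text):
    # Hypothetical rule-based extractor used only for illustration.
    if " likes " in text:
        subj, obj = text.split(" likes ", 1)
        return [(subj.strip(), "likes", obj.strip(". "))]
    return []

mem = FactualMemory()
mem.log("s1", "Alice likes hiking.")
mem.log("s2", "Alice likes climbing.")   # newer fact supersedes the old one
mem.consolidate(naive_extract)
print(mem.semantic[("Alice", "likes")])  # climbing
```

Real systems replace `naive_extract` with LLM calls and back `semantic` with a vector database or knowledge graph, but the flow — episodic logging, abstraction, keyed consolidation — is the same.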

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Environment factual memory

Environment factual memory pertains to entities and states external to the user, encompassing long documents, codebases, tools, and interaction traces.

This memory paradigm addresses incomplete factual recall and unverifiable provenance, minimizes contradictions and redundancy in multi-agent collaboration, and stabilizes long-horizon tasks in heterogeneous environments. The central objective is to furnish an updatable, retrievable, and governable external fact layer, providing a stable reference across sessions and stages. Concretely, we categorize existing implementations along two complementary dimensions: knowledge persistence and multi-agent shared access.

Knowledge Persistence Knowledge memory refers to persistent representations of world knowledge and domain-specific knowledge that support long document analysis, factual question answering, multi-hop reasoning, and reliable retrieval of code and data resources.

In terms of knowledge organization , existing research focuses on structuring external data to enhance reasoning capabilities. For instance, HippoRAG (Gutierrez et al., 2024) utilizes knowledge graphs to facilitate evidence propagation, while MemTree (Rezazadeh et al., 2025c) employs a dynamic hierarchical structure to optimize aggregation and targeted access in growing corpora. Regarding storage form, LMLM (Zhao et al., 2025b) explicitly decouples factual knowledge from model weights by externalizing it into a database, thereby enabling direct knowledge edits and provenance verification without retraining. In narrative domains, CALYPSO (Zhu et al., 2023) distills lengthy game contexts into bite-sized prose, preserving critical story state accessibility.
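The hierarchical organization used by systems like MemTree can be illustrated with a toy bottom-up insertion routine. This is a loose sketch in the spirit of MemTree-style updates, not its actual algorithm; `MemoryNode` and the truncating `summarize` lambda are illustrative stand-ins for LLM-generated summaries.

```python
class MemoryNode:
    def __init__(self, summary=""):
        self.summary = summary
        self.children = []

def insert_chunk(root, chunk, summarize):
    """Insert a text chunk as a leaf, then refresh the parent summary
    bottom-up so higher levels stay a faithful digest of the corpus."""
    root.children.append(MemoryNode(chunk))
    root.summary = summarize([c.summary for c in root.children])

# Hypothetical summarizer: a real system would call an LLM here.
summarize = lambda parts: " / ".join(p[:20] for p in parts)

root = MemoryNode()
insert_chunk(root, "Paris is the capital of France.", summarize)
insert_chunk(root, "The Seine flows through Paris.", summarize)
print(root.summary)
```

Queries can then descend from the root summary toward the most relevant leaves, which is what makes such structures efficient for targeted access in growing corpora.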

In scenarios requiring continuous knowledge updates, parameter-centric approaches integrate persistence directly into the model architecture. Methods such as MEMORYLLM (Wang et al., 2024j), M+ (Wang et al., 2025n), and WISE (Wang et al., 2024e) incorporate trainable memory pools or side networks to absorb new information. Rather than relying solely on static external retrieval, these designs focus on the challenge of model editing, allowing agents to adapt to dynamic environments and correct obsolete facts while preserving the stability of the pre-trained backbone.

Shared Access Shared memory establishes a visible and manageable common factual foundation for multi-agent collaboration, serving to align goals, carry intermediate artifacts, and eliminate redundant work. By maintaining a centralized repository of past queries and responses, frameworks such as Memory Sharing (Gao and Zhang, 2024b) enable agents to access and build on peers' accumulated insights asynchronously. This mechanism ensures that individual agents directly benefit from collective knowledge, thereby suppressing contradictory conclusions and enhancing overall system efficiency.
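The shared-pool pattern can be sketched minimally as below; the class and method names are illustrative and not the API of Memory Sharing, MetaGPT, or any cited framework:

```python
import threading

class SharedMemoryPool:
    """A minimal shared fact layer for multi-agent collaboration.
    Agents publish findings and query peers' results; duplicate
    entries are suppressed by key to avoid redundant work."""

    def __init__(self):
        self._lock = threading.Lock()   # safe concurrent access by agents
        self._entries = {}              # key -> (author, content)

    def publish(self, key: str, author: str, content: str) -> bool:
        """Store a finding; return False if the key already exists."""
        with self._lock:
            if key in self._entries:
                return False            # another agent already did this
            self._entries[key] = (author, content)
            return True

    def query(self, keyword: str) -> list:
        """Return all entries whose key or content mentions the keyword."""
        with self._lock:
            return [
                (k, a, c) for k, (a, c) in self._entries.items()
                if keyword in k or keyword in c
            ]
```

The key-based deduplication is what suppresses contradictory or redundant conclusions: a second agent attempting the same subtask sees that a result already exists and builds on it instead.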

For complex project coordination, systems such as MetaGPT (Hong et al., 2024) and GameGPT (Chen et al., 2023b) utilize shared message pools as central workspaces for publishing plans and partial results. Similarly, G-Memory (Zhang et al., 2025e) employs hierarchical memory graphs as a unified coordination medium. These architectures facilitate consistency maintenance around the current project state, which reduces communication overhead and enables the extraction of reusable workflows from historical collaborations.

In the domain of social simulation, platforms like Generative Agents (Park et al., 2023) and S³ (Gao et al., 2023a), alongside large-scale simulators such as OASIS (Yang et al., 2025) and AgentSociety (Piao et al., 2025), model the global environment and public interaction logs as a shared memory substrate. This substrate is incrementally updated and observed by the population, allowing information to diffuse naturally among agents and supporting coherent, history-aware social dynamics at scale.

Summary Environment factual memory furnishes a continuously updatable, auditable, and reusable external fact layer. On the knowledge axis, it improves completeness, interpretability, and editability of factual recall through structured organization and long-term memory modules. On the collaboration axis, it maintains cross-agent and cross-stage consistency through sharing and governance, thereby enabling robust decision-making and execution under long horizons, multiple actors, and multi-source information.

Figure 7 Taxonomy of experiential memory paradigms. We classify approaches based on the abstraction level of stored knowledge: (1) Case-based Memory preserves raw trajectories and solutions as concrete exemplars; (2) Strategy-based Memory abstracts experiences into high-level strategies, templates, or workflows; (3) Skill-based Memory distills procedural knowledge into executable functions and APIs; and (4) Hybrid Memory integrates multiple representations. Together, these systems mirror human procedural memory to enable continual learning and self-evolution. This figure draws inspiration from Gao et al. (2025).



While semantic summarization captures the global semantics of raw data at a macro level, knowledge distillation operates at a finer granularity, extracting reusable knowledge from interaction trajectories or documents. In a broad sense, knowledge refers to the various forms of factual and experiential memory described in Section 4, depending on the task's underlying functions.

Distilling Factual Memory This process focuses on transforming raw interactions and documents into explicit, declarative knowledge regarding users and environmental states. It ensures that the agent maintains consistency and adaptability by retaining verifiable facts rather than transient context. In the domain of user modeling, systems such as TiM (Liu et al., 2023a), RMM (Tan et al., 2025c), and EMem (Zhou and Han, 2025b) employ abstraction mechanisms to convert dialogue turns into high-level thoughts or enriched elementary discourse units, thereby preserving long-term persona coherence. For modeling user objectives, approaches like MemGuide (Du et al., 2025b) extract user intent descriptions from dialogues; during reasoning, MemGuide captures and updates goal states, separating confirmed constraints from unresolved intents to mitigate goal drift. This distillation also extends to multimodal environments, where agents like ESR (Shen et al., 2024) and M3-Agent (Long et al., 2025) compress egocentric visual observations into text-addressable facts about object locations and user routines. Similarly, by equipping agents with multimodal understanding capabilities, Video-RAG (Luo et al., 2025b) converts audio, subtitles, and objects in videos into textual notes, i.e., factual memories, to enhance long-video understanding.

Distilling Experiential Memory This process focuses on extracting the strategies underlying task execution from historical trajectories. By deriving planning principles from successful rollouts and corrective signals from failures, this paradigm enhances the agent's problem-solving ability on specific tasks. Through abstraction and generalization, it further supports cross-task knowledge transfer. As a result, experiential generalization enables the agent to continually refine its competence and move toward lifelong learning.

This line of research aims to derive high-level planning strategies and key insights from both successful and failed trajectories. Some approaches focus on success-based distillation, where systems such as AgentRR (Feng et al., 2025) and AWM (Wang et al., 2024m) summarize overall task plans from successful cases. Memp (Fang et al., 2025d) analyzes and summarizes the gold trajectories from the training set, distilling them into abstract procedural knowledge. Others adopt failure-driven reflection, exemplified by Matrix (Liu et al., 2024), SAGE (Liang et al., 2025), and R2D2 (Huang et al., 2025c), which compare reasoning traces against ground-truth answers to identify error sources and extract reflective insights. Combining both, ExpeL (Zhao et al., 2024), From Experience to Strategy (Xia et al., 2025), and ReMe (Cao et al., 2025b) contrast successful and failed experiences to uncover holistic planning insights.
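A crude, non-LLM stand-in can illustrate the success/failure contrast: compare a successful and a failed trajectory and surface the first divergent step as a candidate "insight". This is only a sketch of the idea; systems such as ExpeL perform this reflection with an LLM over full reasoning traces rather than string comparison:

```python
def divergence_insight(success: list, failure: list) -> str:
    """Report the first step at which a failed trajectory departs from a
    successful one, as a minimal proxy for reflective insight extraction."""
    for i, (ok, bad) in enumerate(zip(success, failure)):
        if ok != bad:
            # The contrast itself becomes a reusable, human-readable insight.
            return f"step {i}: prefer '{ok}' over '{bad}'"
    if len(success) != len(failure):
        return f"step {min(len(success), len(failure))}: trajectory lengths differ"
    return "no divergence found"
```

Such distilled one-line insights are what gets stored and retrieved, rather than the full (and much longer) trajectories they were derived from.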

However, prior work primarily focuses on summarizing task-level planning knowledge, lacking fine-grained, step-level insights. To address this gap, H²R (Ye et al., 2025b) introduces a two-tier reflection mechanism: it follows ExpeL to construct a pool of high-level planning insights, while further segmenting trajectories by subgoal sequences to derive step-wise execution insights.

Earlier methods relied on fixed prompts for insight extraction, making their performance sensitive to prompt design and the underlying LLM's capacity. Recently, trainable distillation methods have become prevalent. Learn-to-Memorize (Zhang et al., 2025u) optimizes task-specific prompts for different agents. In contrast, Memory-R1 (Yan et al., 2025c) uses an LLMExtract module to obtain experiential and factual knowledge, while only the subsequent fusion component is trained to integrate these outputs into the memory bank. Although these approaches adopt an end-to-end framework, they still fall short of enhancing the LLM's intrinsic ability to distill insights. To overcome this limitation, Memα (Wang et al., 2025p) explicitly trains the LLM on what insights to extract and how to preserve them.

Summary This part focuses on extracting function-specific knowledge from the raw context, without addressing the structure of memory storage. Each piece of knowledge can be viewed as a flat memory unit, and simply storing multiple units in an unstructured table ignores the semantic and hierarchical relations among them. To address this, the memory formation process can apply structured rules to derive insights and store them within a hierarchical architecture. Though simple, the knowledge distillation methods introduced here serve as a foundational component for more complex and structured memory formation mechanisms.


Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
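The episodic-to-semantic pipeline described above can be sketched as follows. The regular-expression patterns are toy stand-ins for the LLM-based summarization, entity extraction, and fact induction the cited systems actually perform, and all names are illustrative:

```python
import re
from collections import defaultdict

def induce_facts(episodes: list) -> dict:
    """Reduce raw episodic traces to a deduplicated entity -> facts map.
    Matching 'X is Y' / 'X likes Y' stands in for LLM fact induction;
    the set-based store performs the deduplication step."""
    facts = defaultdict(set)
    pattern = re.compile(r"^(\w+) (is|likes) (.+)$")
    for ep in episodes:
        m = pattern.match(ep.strip())
        if m:
            entity, rel, obj = m.groups()
            facts[entity].add(f"{rel} {obj}")   # set() drops exact repeats
    return dict(facts)
```

Repeated observations collapse into a single fact, and non-factual chatter is filtered out, leaving a compact semantic fact base indexed by entity.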

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.


Experiential Memory

Experiential memory encapsulates the mechanism by which agents encode historical trajectories, distilled strategies, and interaction outcomes into durable, retrievable representations. Unlike working memory, which manages transient context, experiential memory focuses on the long-term accumulation and transfer of knowledge across distinct episodes.

Theoretically grounded in cognitive science, this paradigm parallels human nondeclarative memory , specifically the procedural and habit systems (Squire, 2004; Seger and Spiering, 2011). Biological systems rely on distributed neural circuits for implicit skill acquisition (Reber, 2013). In contrast, agentic experiential memory typically employs explicit data structures, such as vector databases or symbolic logs. This implementation difference grants agents a unique capability absent in biological counterparts: the ability to introspect, edit, and reason over their own procedural knowledge.

Crucially, experiential memory serves as a foundation for continual learning and self-evolution in the era of experience (Sutton, 2025; Gao et al., 2025). By maintaining a repository of structured experiences, agents achieve a non-parametric path to adaptation and avoid the prohibitive costs of frequent parametric updates. This mechanism effectively closes the learning loop by converting interaction feedback into reusable knowledge. Through this process, agents rectify past errors, abstract generalizable heuristics, and compile routine behaviors. Consequently, such adaptation minimizes redundant computations and refines decision-making over time (Zhao et al., 2024; Shinn et al., 2023b).

To systematically analyze existing literature, we classify experiential memory based on the abstraction level of the stored information. An overview of this abstraction-based taxonomy and representative paradigms is illustrated in Figure 7. Representative methods under this abstraction-based taxonomy, together with their storage carriers, representation forms, and optimization strategies, are summarized in Table 5.

Case-based Memory

Case-based memory stores minimally processed records of historical events, prioritizing fidelity to ensure that episodes can be replayed or reused as in-context exemplars. Unlike strategy templates or skill modules, cases avoid extensive abstraction, thereby preserving the original alignment between situations and solutions.

Trajectories This category preserves interaction sequences to enable replay and evidence-driven learning. To optimize retrieval in text-based environments, Memento (Zhou et al., 2025a) employs soft Q-learning to dynamically refine the probability of selecting high-utility past trajectories. In multimodal settings, JARVIS-1 (Wang et al., 2025q), EvoVLA (Liu et al., 2025i), and Auto-scaling Continuous Memory (Wu et al., 2025e) retain visual context: JARVIS-1 stores survival experiences in Minecraft, while Auto-scaling Continuous Memory compresses GUI history into continuous embeddings. Furthermore, the early experience paradigm (Zhang et al., 2025k) constructs reward-free, agent-generated interaction traces and integrates them into model parameters via mid-training to enhance generalization.

Solutions This category treats memory as a repository of proven solutions. ExpeL (Zhao et al., 2024) autonomously gathers experience through trial-and-error, storing successful trajectories as exemplars while extracting textual insights to guide future actions. Synapse (Zheng et al., 2024a) similarly injects abstracted state-action episodes as contextual examples to align problem-solving patterns. In program synthesis, MapCoder (Islam et al., 2024) keeps relevant example code as a playbook-like case that multi-agent pipelines retrieve and adapt to improve reliability on complex tasks. In the financial domain, FinCon (Yu et al., 2024) maintains an episodic memory of past actions, PnL trajectories, and belief updates to facilitate robust cross-round decision-making.
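Retrieving stored solutions as in-context exemplars can be sketched as a similarity search over the case library. The bag-of-words cosine below is an illustrative stand-in for the dense-embedding retrieval that systems like ExpeL and Synapse rely on, and all names are hypothetical:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query: str, cases: list, k: int = 2) -> list:
    """Rank stored (problem, solution) cases by similarity to the query
    and return the top-k to inject as in-context exemplars."""
    qv = Counter(query.lower().split())
    scored = sorted(
        cases,
        key=lambda case: cosine(qv, Counter(case[0].lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The retrieved pairs would be prepended to the prompt verbatim, which is what preserves the original situation-solution alignment that case-based memory prioritizes.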

Summary Case-based memory offers high informational fidelity and provides verifiable evidence for imitation. However, the reliance on raw data imposes challenges regarding retrieval efficiency and context window consumption. Distinguished from executable skills or abstract strategies, cases do not encompass orchestration logic or function interfaces. Instead, they serve as the factual substrate upon which higher-level reasoning operates.


Transform-type latent memory methods focus on modifying, compressing, or restructuring existing latent states rather than generating entirely new ones or directly reusing raw KV caches. These approaches treat KV caches and hidden activations as malleable memory units, reshaping them through selection, aggregation, or structural transformation. In doing so, they occupy a conceptual middle ground between generate-type and reuse-type memory: the model does not create fresh latent representations, but it also does more than simply replay stored KV pairs.

Figure 5 Overview of three complementary memory paradigms for LLM agents. Token-level, parametric, and latent memories differ in their representational form, update dynamics, interpretability, and efficiency, leading to distinct strengths, limitations, and application domains in long-horizon and interactive agent systems.

A major line of work focuses on compressing KV caches while preserving essential semantics. Some methods reduce memory usage by keeping only the most influential tokens. Scissorhands (Liu et al., 2023b) prunes tokens based on attention scores when cache capacity is exceeded, whereas SnapKV (Li et al., 2024b) aggregates high-importance prefix KV representations via a head-wise voting mechanism. PyramidKV (Cai et al., 2024) reallocates KV budgets across layers. SirLLM (Yao et al., 2024b) builds on this perspective by estimating token importance with a token-entropy criterion and selectively retaining only informative KV entries. Memory³ (Yang et al., 2024a) stores only the most critical attention key-value pairs, significantly shrinking storage requirements. RazorAttention (Tang et al., 2025a) introduces a more explicit compression scheme: it computes the effective attention span of each head, retains only a limited local window, and uses compensation tokens to preserve information from discarded entries. From a more efficiency-oriented perspective, H2O (Zhang et al., 2023) adopts a simpler eviction strategy, retaining only the most recent tokens along with heavy-hitter (H2) tokens identified by accumulated attention scores to reduce memory footprint.
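Attention-score-based eviction in the spirit of H2O and Scissorhands can be illustrated with a simplified sketch (not either paper's exact algorithm): keep a recent window of positions plus the highest-scoring "heavy hitter" positions, and evict everything else:

```python
def evict_kv(attn_scores, recent_window=2, heavy_budget=2):
    """Select which KV-cache positions to keep.

    attn_scores[i] is the accumulated attention mass received by token i.
    We always keep the last `recent_window` positions, plus the
    `heavy_budget` older positions with the highest accumulated attention."""
    n = len(attn_scores)
    recent = set(range(max(0, n - recent_window), n))
    # Rank the remaining positions by accumulated attention ("heavy hitters").
    older = sorted(
        (i for i in range(n) if i not in recent),
        key=lambda i: attn_scores[i],
        reverse=True,
    )
    heavy = set(older[:heavy_budget])
    return sorted(recent | heavy)   # indices of KV entries to retain
```

In a real decoder this selection would run per head and per layer against the actual cache tensors; the sketch only shows the index-selection logic that determines the memory footprint.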

Discussion These methods demonstrate how latent memory can be transformed, through selection, retrieval enhancement, or compressed re-encoding, into more effective memory representations, enabling LLMs to extend their usable context length and improve reasoning performance without relying on raw cache reuse.

Their main advantage lies in producing more compact and information-dense memory representations, which reduce storage cost and enable efficient retrieval over long contexts. By reshaping latent states, these methods allow the model to access distilled semantic signals that may be more useful than raw activations. However, transformation introduces the risk of information loss, and the compressed states can become harder to interpret or verify compared with directly reused KV caches. The additional computation required for pruning, aggregation, or re-encoding also increases system complexity.


Memory consolidation aims to transform newly acquired short-term traces into structured and generalizable long-term knowledge. Its core mechanism is to identify semantic relationships between new and existing memories and to integrate them into higher-level abstractions or insights. This process serves two main purposes. First, it reorganizes fragmented pieces of information into coherent structures, preventing the loss of critical details during short-term retention and enabling the formation of stable knowledge schemas. Second, by abstracting, compressing, and generalizing experiential data, consolidation extracts reusable patterns from specific events, yielding insights that support cross-task generalization.

A central challenge is determining the granularity at which new memories should be matched and merged with existing ones. Prior work spans a spectrum of consolidation strategies, from local content merging to cluster-level fusion and global integration.

Figure 9 The landscape of Memory Evolution mechanisms. We categorize the evolution process into three distinct branches that maintain the central MemoryDatabase: (a) Consolidation synthesizes insights by processing raw materials through local consolidation, cluster fusion, and global integration; (b) Updating ensures accuracy and consistency by performing conflict resolution on external databases and applying parameter updates to the internal model; and (c) Forgetting optimizes efficiency by pruning data based on specific criteria: time expiration, low access frequency, and low informational value. The outer ring displays representative frameworks and agents associated with each evolutionary mechanism.

Local Consolidation This operation focuses on fine-grained updates involving highly similar memory fragments. In RMM (Tan et al., 2025c), each new topic memory retrieves its top-K most similar candidates, and an LLM decides whether merging is appropriate, thereby reducing the risk of incorrect generalization. In multimodal settings, VLN (Song et al., 2025b) triggers a pooling mechanism when capacity is saturated: it identifies the most similar or redundant memory pairs and compresses them into higher-level abstractions. These approaches refine detailed knowledge while preserving the global structure of the memory store, improving precision and storage efficiency. However, they cannot fully capture cluster-level relations or the higher-order dependencies that emerge across semantically related memories.
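The retrieve-then-merge loop of local consolidation can be sketched as follows. The Jaccard similarity and string concatenation are illustrative stand-ins: in systems like RMM, an LLM both makes the merge decision and rewrites the merged memory:

```python
def jaccard(a: str, b: str) -> float:
    # Word-overlap similarity as a toy proxy for embedding similarity.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def consolidate(memories, new_memory, sim_fn, threshold=0.5):
    """Merge a new memory into its most similar stored entry if the
    similarity clears the threshold; otherwise append it as a new entry."""
    best_i, best_sim = None, 0.0
    for i, mem in enumerate(memories):
        s = sim_fn(mem, new_memory)
        if s > best_sim:
            best_i, best_sim = i, s
    if best_i is not None and best_sim >= threshold:
        # An LLM would rewrite a coherent merged memory here.
        memories[best_i] = memories[best_i] + " | " + new_memory
    else:
        memories.append(new_memory)
    return memories
```

The threshold is the lever that trades precision against over-generalization: a high threshold only merges near-duplicates, which is exactly the conservative behavior local consolidation aims for.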

Cluster-level Fusion Adopting cluster-level fusion is essential for capturing cross-instance regularities as memory grows. Across clusters, PREMem (Kim et al., 2025b) aligns new memory clusters with similar existing ones and applies fusion modes such as generalization and refinement to form higher-order reasoning units, substantially improving interpretability and reasoning depth. EverMemOS (Hu et al., 2026a) computes the similarity between a newly generated MemCell and the centroids of all MemScenes, and merges it into the MemScene that is sufficiently similar. Within a cluster, TiM (Liu et al., 2023a) periodically invokes an LLM to examine memories that share the same hashing bucket and merges semantically redundant entries. CAM (Li et al., 2025g) merges all nodes within the target cluster into a representative summary, yielding higher-level and consistent cross-sample representations. These methods reorganize the memory structure at a broader scale and mark an important step toward structured knowledge.

Global Integration This operation performs holistic consolidation to maintain global coherence and to distill system-level insights from accumulated experience. Whereas the semantic summarization of Section 5.1.1 derives a global summary from the existing context and can be viewed as the initial construction of that summary, the methods here integrate new information into an existing summary as additional data arrives. For user factual memory, MOOM (Chen et al., 2025e) constructs stable role profiles by integrating temporary role snapshots with historical traces using rule-based processing, embedding methods, and LLM-driven abstraction. For experiential memory, Matrix (Liu et al., 2024) performs iterative optimization to combine execution trajectories and reflective insights with global memory, distilling task-agnostic principles that support reuse across scenarios. As single-step reasoning contexts and environmental feedback lengthen, methods like AgentFold (Ye et al., 2025a) and Context Folding (Zhang et al., 2025r) internalize the ability to compress working memory. In multi-step interactions, such as web navigation, these methods automatically summarize and condense the global context after each step, supporting efficient and effective reasoning. Global integration consolidates high-level, structured knowledge from the complete history of experience, providing a reliable contextual foundation while improving generalization, reasoning accuracy, and personalized decision-making.
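The fold-after-each-step behavior can be sketched with a running summary under a fixed budget. Here "compression" simply drops the oldest entries; AgentFold-style systems instead use an LLM to rewrite the summary, so this is a minimal illustration of the control flow, not their method:

```python
def fold_context(summary: str, new_step: str, budget: int = 80) -> str:
    """Append a step's observation to a running summary, then compress
    (here: drop oldest entries) until the summary fits the budget."""
    entries = summary.split(" ; ") if summary else []
    entries.append(new_step)
    # Evict from the front until the folded summary fits, keeping >= 1 entry.
    while len(" ; ".join(entries)) > budget and len(entries) > 1:
        entries.pop(0)
    return " ; ".join(entries)
```

Calling this after every step keeps the working context bounded while retaining the most recent, most decision-relevant history, which is the property that makes long web-navigation rollouts tractable.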

Summary Consolidation is the cognitive process of reorganizing fragmented short-term traces into coherent long-term schemas. It moves beyond simple storage to synthesize connections between isolated entries, forming a structured worldview. It enhances generalization and reduces storage redundancy. However, it risks information smoothing, where outlier events or unique exceptions are lost during the abstraction process, potentially reducing the agent's sensitivity to anomalies and specific events.


Strategy-based Memory

Unlike case libraries that retain what happened, strategy-based memory extracts transferable knowledge of how to act, encompassing reusable reasoning patterns, task decompositions, insights, abstractions, and cross-situational workflows. It elevates experiences into editable, auditable, and composable high-level knowledge, thereby reducing dependence on lengthy trajectory replay and improving cross-task generalization and efficiency. We focus on non-code or weakly code-based templates and workflows in this section, while executable functions, APIs, MCP protocols, and code snippets are classified under Section 4.2.3. Based on the granularity and structural complexity of the retained knowledge, we categorize strategy-based memory into three distinct types: atomic Insights, sequential Workflows, and schematic Patterns.

Insights This category of approaches focuses on distilling discrete pieces of knowledge, such as granular decision rules and reflective heuristics, from past trajectories. H²R (Ye et al., 2025b) explicitly decouples planning-level and execution-level memories, enabling high-level planning insights and low-level operational rules to be retrieved separately for fine-grained transfer in multi-task scenarios. R2D2 (Huang et al., 2025c) integrates remembering, reflecting, and dynamic decision-making for web navigation, deriving corrective insights from both failed and successful cases to inform subsequent episodes. For long-horizon web automation, BrowserAgent (Yu et al., 2025d) persists key conclusions as explicit memory to stabilize extended chains of reasoning and mitigate context drift.

Workflows Distinct from atomic, static insights, workflows encapsulate strategies as structured sequences of actions: executable routines abstracted from prior trajectories to guide multi-step execution at inference time. Agent Workflow Memory (AWM) (Wang et al., 2024m) induces reusable workflows on Mind2Web (Deng et al., 2023) and WebArena (Zhou et al., 2023a) and uses them as high-level scaffolds to guide subsequent generation, improving success rates and reducing steps without updating base model weights. This demonstrates that strategy templates can act as a top-level controller that complements case-level evidence. Agent KB (Tang et al., 2025d) establishes a unified knowledge base that treats workflows as transferable procedural knowledge. It employs hierarchical retrieval, accessing workflows first to structure the strategic approach and enabling problem-solving logic reuse across diverse agent architectures.
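In spirit, workflow memory reduces to an induce-and-retrieve loop over past successes. The sketch below is a toy illustration, not AWM's actual API: keyword overlap stands in for LLM-based induction and learned retrieval, and all names are ours.

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """A reusable action sequence abstracted from a past trajectory."""
    name: str
    steps: list                       # ordered high-level actions
    keywords: set = field(default_factory=set)

class WorkflowMemory:
    """Toy store: induce workflows from successful episodes, retrieve by keyword overlap."""
    def __init__(self):
        self.workflows: list[Workflow] = []

    def induce(self, task: str, trajectory: list) -> Workflow:
        # In systems like AWM this abstraction step is performed by an LLM;
        # here we simply store the trajectory verbatim under the task name.
        wf = Workflow(name=task, steps=list(trajectory),
                      keywords=set(task.lower().split()))
        self.workflows.append(wf)
        return wf

    def retrieve(self, query: str, k: int = 1) -> list[Workflow]:
        # Rank stored workflows by keyword overlap with the new task description.
        words = set(query.lower().split())
        ranked = sorted(self.workflows,
                        key=lambda w: len(w.keywords & words), reverse=True)
        return ranked[:k]

mem = WorkflowMemory()
mem.induce("search flight price", ["open site", "type destination", "submit", "read price"])
mem.induce("post forum comment", ["open thread", "click reply", "type text", "submit"])
best = mem.retrieve("find cheapest flight price")[0]
print(best.name)  # the flight workflow wins on keyword overlap
```

The retrieved workflow would then be injected into the prompt as a high-level scaffold rather than executed directly, which is exactly the "structural guideline, not executable action" distinction drawn later in this section.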

Patterns At a higher level of abstraction, reasoning patterns function as cognitive templates that encapsulate the structure of problem-solving, enabling agents to tackle complex reasoning tasks by instantiating these generalizable skeletons. Buffer of Thoughts (Yang et al., 2024b) maintains a meta-buffer of thought templates that are retrieved and instantiated to solve new problems. Similarly, ReasoningBank (Ouyang et al., 2025) abstracts both successes and failures into reusable reasoning units, facilitating test-time expansion and robust learning. RecMind's self-inspiring planning algorithm (Wang et al., 2024h) generates intermediate self-guidance to structure subsequent planning and tool use. In the domain of dialogue agents, PRINCIPLES (Kim et al., 2025a) builds a synthetic strategy memory via offline self-play to guide strategy planning at inference, thereby eliminating the need for additional training. These advances indicate a paradigmatic shift from descriptive rules to portable reasoning structures.
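A minimal sketch of the retrieve-then-instantiate idea behind such template buffers follows. The template names, slots, and keyword heuristic are invented for illustration; real systems such as Buffer of Thoughts use embedding-based retrieval over a learned meta-buffer.

```python
TEMPLATES = {
    # Invented "thought templates": problem-solving skeletons with slots.
    "two_pointer": "Sort {items}, then move left and right pointers toward {goal}.",
    "divide_conquer": "Split {items} into halves, solve each for {goal}, then merge.",
}

def retrieve_template(problem: str) -> str:
    # Real systems embed the problem and templates; a keyword check stands in here.
    return "two_pointer" if ("pair" in problem or "sum" in problem) else "divide_conquer"

def instantiate(name: str, **slots) -> str:
    # Instantiation fills the generic skeleton with the concrete problem's entities.
    return TEMPLATES[name].format(**slots)

key = retrieve_template("find a pair summing to 10")
plan = instantiate(key, items="the array", goal="a pair summing to 10")
print(plan)
```

The instantiated skeleton is what gets handed to the model as a reasoning scaffold; the template itself stays in the buffer, editable and reusable across tasks.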

Summary Strategy-based memory, which encompasses insights, workflows, and patterns, serves as a high-level scaffold to guide generative reasoning. Unlike case-based memory that relies on retrieving specific, raw trajectories which may be noisy or context-dependent, this form of memory distills generalizable schemas to effectively constrain the search space and improve robustness on unseen tasks. However, a key distinction is that these strategies function as structural guidelines rather than executable actions; they direct the planning process but do not interact with the environment directly. This limitation necessitates skill-based memory, discussed in the following section, which stores callable capabilities and tools. Ultimately, robust agents typically synergize these components: strategies provide the abstract planning logic, while skills handle the grounded execution.


Figure 1 Overview of agent memory organized by the unified taxonomy of forms (Section 3), functions (Section 4), and dynamics (Section 5). The diagram positions memory artifacts by their dominant form and primary function. It further maps representative systems into this taxonomy to provide a consolidated landscape.



Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum. Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
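This episodic-to-semantic pipeline can be caricatured in a few lines. In the toy sketch below, a regex stands in for LLM-based fact induction and a last-write-wins rule stands in for consistency checking; all names are illustrative.

```python
import re

episodic_log = []   # raw interaction traces (the episodic end of the continuum)
fact_base = {}      # deduplicated semantic facts: subject -> value

def log_turn(speaker: str, text: str):
    # Stage 1: log concrete interaction history as episodic traces.
    episodic_log.append((speaker, text))

def consolidate():
    """Toy fact induction: extract 'my X is Y' statements from the log,
    deduplicating by subject and letting newer facts overwrite older ones
    (a crude stand-in for real consistency checking)."""
    for speaker, text in episodic_log:
        for subject, value in re.findall(r"my (\w+) is (\w+)", text.lower()):
            fact_base[subject] = value

log_turn("user", "Hi, my name is Ada and my language is Python.")
log_turn("user", "Actually my language is Rust now.")
consolidate()
print(fact_base)  # {'name': 'ada', 'language': 'rust'}
```

The raw turns remain available for episodic recall, while the fact base holds the compact, queryable semantic view, mirroring the continuum described above.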

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Skill-based Memory

Skill memory captures an agent's procedural capacity and operationalizes abstract strategy into verifiable actions. It encodes what the agent can do, complements declarative knowledge of what the agent knows, and anchors the perception-reasoning-action loop by providing invocable, testable, and composable executables. Recent evidence shows that language models can learn when and how to call tools and scale reliably with large tool repertoires, establishing skill memory as the execution substrate of modern agents.

Skill memory spans a continuum from internal, fine-grained code to externalized, standardized interfaces. The unifying criteria are straightforward: skills must be callable by the agent, their outcomes must be verifiable to support learning, and they must compose with other skills to form larger routines.
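These three criteria suggest a minimal interface for a stored skill. The sketch below is our own illustration of that interface, not any particular system's API: each skill is callable, its outcome is verifiable, and skills compose into larger routines.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Skill:
    """A stored capability meeting the three criteria above:
    callable (run), verifiable (check), composable (then)."""
    name: str
    run: Callable[..., Any]        # callable by the agent
    check: Callable[[Any], bool]   # verifies the outcome to support learning

    def then(self, other: "Skill") -> "Skill":
        # Compose two skills into a larger routine; verification of the
        # intermediate result gates the handoff between them.
        def composed(x):
            out = self.run(x)
            assert self.check(out), f"{self.name} failed verification"
            return other.run(out)
        return Skill(f"{self.name}->{other.name}", composed, other.check)

double = Skill("double", lambda x: 2 * x, lambda y: y % 2 == 0)
inc = Skill("inc", lambda x: x + 1, lambda y: isinstance(y, int))
pipeline = double.then(inc)
print(pipeline.run(3))  # 7
```

In a real skill library the `run` bodies would be generated code, scripts, or API calls, and `check` would be an environment-grounded test rather than a predicate on the return value.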

Code Snippets Executable code stored as reusable snippets offers the fastest path from experience to capability. In open-ended tasks, agents distill successful sub-trajectories into interpretable programs and reuse them across environments. Voyager (Wang et al., 2024b) exemplifies this pattern with an ever-growing skill library; the Darwin Gödel Machine (Zhang et al., 2025i) goes further by safely rewriting its own code under empirical validation, yielding self-referential and progressively more capable skill sets.

Functions and Scripts Abstracting complex behaviors into modular functions or scripts enhances reusability and generalization. Recent advancements empower agents to autonomously create specialized tools for problem-solving (Qian et al., 2023; Yuan et al., 2024a), and to refine tool-use capabilities through demonstrations and environmental feedback across diverse domains such as mobile GUIs, web navigation, and software engineering (Fang et al., 2025d; Zheng et al., 2025a; Bouzenia et al., 2024). Furthermore, emergent mechanisms for procedural memory enable agents to distill execution trajectories into retrievable scripts, facilitating efficient generalization to novel scenarios (Liu et al., 2025b; Han et al., 2025a).

APIs APIs serve as the universal interface for encapsulated skills. While earlier work focused on fine-tuning models to correctly invoke tools (Schick et al., 2023; Patil et al., 2024), the exponential growth of API libraries has shifted the primary bottleneck to retrieval. Standard information retrieval methods often fail to capture the functional semantics of tools (Shi et al., 2025c). Consequently, recent approaches have moved towards learning-based retrieval and reranking strategies that account for tool documentation quality, hierarchical relationships, and collaborative usage patterns to bridge the gap between user intent and executable functions (Zheng et al., 2024b; Gao and Zhang, 2024c; Qu et al., 2024, 2025a).
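The baseline these works improve upon, plain lexical retrieval over tool documentation, can be sketched as follows. The tool registry here is hypothetical, and the reliance on surface word overlap is precisely why such baselines miss functional semantics.

```python
def tokenize(text: str) -> set:
    return set(text.lower().replace(",", " ").split())

# Hypothetical tool registry: name -> documentation string.
TOOL_DOCS = {
    "get_weather": "return current temperature and forecast for a city",
    "book_flight": "search and reserve airline tickets between two airports",
    "convert_currency": "convert an amount between two currencies at spot rate",
}

def retrieve_tools(intent: str, k: int = 1) -> list[str]:
    # Rank tools by raw word overlap between the user intent and each doc.
    q = tokenize(intent)
    scored = [(len(q & tokenize(doc)), name) for name, doc in TOOL_DOCS.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

print(retrieve_tools("what is the temperature in Paris"))  # ['get_weather']
```

A query phrased without any doc vocabulary ("will I need an umbrella?") would score zero everywhere, which is the failure mode that motivates learned retrievers and rerankers conditioned on documentation quality and usage patterns.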

MCPs To reduce protocol fragmentation in API-based ecosystems, the Model Context Protocol provides an open standard that unifies how agents discover and use tools and data, including code-execution patterns that load tools on demand and cut context overhead (Qiu et al., 2025c,b). Broad platform support indicates a convergence toward a common interface layer.

Beyond standard executables, research explores learnable memories of tool capabilities to handle uncertain neural tools, parametric integration that embeds tool symbols to unify retrieval and calling, and architecture-as-skill perspectives where specialized agents are callable modules within a modular design space (Xiao et al., 2025b; Wang et al., 2025i; Zhao et al., 2025a). Collectively, these strands reframe skill memory as a learnable, evolving, and orchestrable capability layer.

Summary In conclusion, skill-based memory constitutes the active execution substrate of the agent, evolving from static code snippets and modular scripts to standardized APIs and learnable architectures. It bridges the gap between abstract planning and environmental interaction by operationalizing insights from case-based and strategy-based memories into verifiable procedures. As mechanisms for tool creation, retrieval, and interoperability (e.g., MCP) mature, skill memory moves beyond simple storage, enabling a continuous loop of capability synthesis, refinement, and execution that drives open-ended agent evolution.




Hybrid Memory

Advanced agent architectures increasingly adopt a hybrid design that integrates multiple forms of experiential memory to balance grounded evidence with generalizable logic. By maintaining a spectrum of knowledge spanning raw episodes, distilled rules, and executable skills, these systems dynamically select the most appropriate memory format, ensuring both retrieval precision and broad generalization across contexts.

A prominent direction involves coupling case-based and strategy-based memories to facilitate complementary reasoning. For example, ExpeL (Zhao et al., 2024) synergizes concrete trajectories with abstract textual insights, allowing agents to recall specific solutions while applying general heuristics. Agent KB (Tang et al., 2025d) employs a hierarchical structure where high-level workflows guide planning and specific solution paths provide execution details. Similarly, R2D2 (Huang et al., 2025c) integrates a replay buffer of historical traces with a reflective mechanism that refines decision strategies from past errors, effectively bridging case retrieval and strategic abstraction. Complementing these, Dynamic Cheatsheet (Suzgun et al., 2025) prevents redundant computation by storing accumulated strategies and problem-solving insights for immediate reuse at inference time.

Furthermore, recent frameworks strive to unify the lifecycle of memory, incorporating skill-based components or establishing comprehensive cognitive architectures (Sun et al., 2025a; Cai et al., 2025a). In scientific reasoning, ChemAgent (Tang et al., 2025c) constructs a self-updating library that pairs execution cases with decomposable skill modules, enabling the model to refine its chemical reasoning through accumulated experience. Taking a holistic approach, LARP (Yan et al., 2023) establishes a cognitive architecture for open-world games that harmonizes semantic memory for world knowledge, episodic memory for interaction cases, and procedural memory for learnable skills, ensuring consistent role-playing and robust decision-making. Finally, evolutionary systems like G-Memory (Zhang et al., 2025c) and Memp (Fang et al., 2025d) implement dynamic transitions, where repeated successful cases are gradually compiled into efficient skills, automating the shift from heavy retrieval to rapid execution. A recent effort, MemVerse (Liu et al., 2025e), combines parametric memory with token-level procedural memory.

Working Memory

In cognitive science, working memory is defined as a capacity-limited, dynamically controlled mechanism that supports higher-order cognition by selecting, maintaining, and transforming task-relevant information in the moment (Baddeley, 2012). Beyond mere temporary storage, it implies active control under resource constraints. This perspective is grounded in frameworks such as the multicomponent model and the embedded-processes account, both of which emphasize attentional focus, interference control, and bounded capacity (Cowan, 2014).

When transposed to LLMs, the standard context window functions primarily as a passive, read-only buffer. Although the model can consume the window's contents during inference, it lacks explicit mechanisms to select, sustain, or transform the current workspace dynamically. Recent behavioral evidence suggests that current models do not exhibit human-like working memory characteristics, underscoring the necessity for explicitly engineered, operable working memory mechanisms (Huang et al., 2025a).

Throughout this section, we define working memory as the set of mechanisms for the active management and manipulation of context within a single episode (Zhang et al., 2025r). The objective is to transform the context window from a passive buffer into a controllable, updatable, and interference-resistant workspace. This transition offers immediate benefits: it increases the density of task-relevant information under fixed attention budgets, suppresses redundancy and noise, and enables the rewriting or compression of representations to preserve coherent chains of thought. We categorize these mechanisms based on the interaction dynamics.

Representative working memory approaches under this interaction-based taxonomy, together with their storage carriers, task domains, and optimization strategies, are systematically summarized in Table 6.

Single-turn Working Memory

Single-turn working memory addresses the challenge of processing massive immediate inputs, including long documents (Chevalier et al., 2023) and high-dimensional multimodal streams (Wang et al., 2024g), within a single forward pass. Rather than passively consuming the entire context, the objective is to actively construct a writable workspace. This involves filtering and transforming raw information to increase density and operability under fixed attention and memory budgets (Jiang et al., 2023, 2024). We categorize these mechanisms into input condensation, which reduces physical token count, and observation abstraction, which transforms data into structured semantic representations.

Input Condensation Input condensation techniques aim to preprocess the context to minimize token usage while preserving essential information (Jiang et al., 2023). These methods generally fall into three paradigms: hard, soft, and hybrid condensation (Liao et al., 2025a).

Hard condensation discretely selects tokens based on importance metrics. Approaches like LLMLingua (Jiang et al., 2023) and LongLLMLingua (Jiang et al., 2024) estimate token perplexity to discard predictable or task-irrelevant content, while CompAct (Yoon et al., 2024) adopts an iterative strategy to retain segments that maximize information gain. Although efficient, hard selection risks severing syntactic or semantic dependencies. Soft condensation encodes variable-length contexts into dense latent vectors (memory slots). Methods such as Gist (Mu et al., 2023), In-Context Autoencoder (ICAE) (Ge et al., 2024), and AutoCompressors (Chevalier et al., 2023) train models to compress prompts into compact summary tokens or distinct memory embeddings. This achieves high compression ratios but requires additional training and may obscure fine-grained details. Hybrid approaches like HyCo2 (Liao et al., 2025a) attempt to reconcile these trade-offs by combining global semantic adapters (soft) with token-level retention probabilities (hard).
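A toy version of hard condensation is shown below. Unigram surprisal (rare words score higher) stands in for the small-LM perplexity scores that LLMLingua-style methods actually use; the function name and threshold are ours.

```python
import math
from collections import Counter

def hard_condense(context: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-surprisal fraction of tokens, preserving order.

    Surprisal here is a unigram frequency proxy; real systems score
    tokens with a small language model's perplexity instead."""
    tokens = context.split()
    freq = Counter(t.lower() for t in tokens)
    total = sum(freq.values())
    # Rare (informative) tokens get high surprisal; frequent ones get low.
    surprisal = {t: -math.log(freq[t.lower()] / total) for t in tokens}
    budget = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)),
                      key=lambda i: surprisal[tokens[i]],
                      reverse=True)[:budget])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

text = "the the the meeting is on Friday at noon in the the blue room"
condensed = hard_condense(text, 0.4)
print(condensed)  # drops the repeated low-information "the"
```

Even this caricature exhibits the paradigm's known failure mode: dropping individually predictable tokens can sever syntactic dependencies, which is what soft and hybrid condensation aim to repair.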

Observation Abstraction While condensation focuses on reduction, observation abstraction aims to transform raw observations into structured formats that facilitate reasoning. This mechanism maps dynamic, high-dimensional observation spaces into fixed-size memory states, preventing agents from being overwhelmed by raw data.

In complex interactive environments, abstraction converts verbose inputs into concise state descriptions. Synapse (Zheng et al., 2024a) rewrites unstructured HTML DOM trees into task-relevant state summaries to guide GUI automation. Similarly, in multimodal settings, processing every frame of a video stream is computationally prohibitive. Working memory mechanisms address this by extracting semantic structures: Context as Memory (Yu et al., 2025b) filters frames based on field-of-view overlap, VideoAgent (Wang et al., 2024g) converts streams into temporal event descriptions, and MA-LMM (He et al., 2024) maintains a bank of visual features. These methods effectively rewrite high-dimensional, redundant streams into low-dimensional, semantically rich representations operable within a limited context window for efficient processing.

Table 6 Taxonomy of working memory methods. We categorize approaches into Single-turn and Multi-turn settings based on interaction dynamics. Methods are compared across three technical dimensions: (1) Carrier (Section 3) identifies the storage medium, (2) Task specifies the evaluation domain or application scenario, and (3) Optimization denotes the integration strategy, where PE encompasses prompt engineering and inference-time techniques without parameter updates, distinct from gradient-based methods like SFT and RL.
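Field-of-view filtering of the kind Context as Memory describes can be caricatured as an interval-overlap test. One-dimensional yaw ranges stand in for real camera geometry here, and the threshold and field names are invented.

```python
def interval_overlap(a, b) -> float:
    """Overlap length of two [start, end] view intervals (e.g. camera yaw ranges)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def select_frames(frames, query_view, min_overlap=10.0):
    """Toy field-of-view filter: keep only frames whose view overlaps
    the region the agent is currently reasoning about."""
    return [f for f in frames if interval_overlap(f["view"], query_view) >= min_overlap]

frames = [
    {"t": 0, "view": (0, 60)},    # looking at the doorway
    {"t": 1, "view": (90, 150)},  # looking at the window
    {"t": 2, "view": (30, 100)},  # sweep covering both
]
relevant = select_frames(frames, query_view=(80, 120))
print([f["t"] for f in relevant])  # [1, 2]
```

The effect is the same as the text describes: a redundant stream is reduced to the few frames that are operable within the limited context window.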

Summary Single-turn working memory functions as an active compression layer that maximizes the utility of the context window for immediate reasoning. By employing input condensation and observation abstraction, these mechanisms effectively increase the information density of the operational workspace, ensuring that critical evidence is retained despite capacity constraints. However, this optimization is strictly intra-turn ; it addresses the breadth and complexity of static inputs rather than the temporal continuity of dynamic interactions.


Memory consolidation aims to transform newly acquired short-term traces into structured and generalizable long-term knowledge. Its core mechanism is to identify semantic relationships between new and existing memories and to integrate them into higher-level abstractions or insights. This process serves two main purposes. First, it reorganizes fragmented pieces of information into coherent structures, preventing the loss of critical details during short-term retention and enabling the formation of stable knowledge schemas. Second, by abstracting, compressing, and generalizing experiential data, consolidation extracts reusable patterns from specific events, yielding insights that support cross-task generalization.

A central challenge is determining the granularity at which new memories should be matched and merged with existing ones. Prior work spans a spectrum of consolidation strategies, from local content merging to cluster-level fusion and global integration.

Local Consolidation This operation focuses on fine-grained updates involving highly similar memory fragments. In RMM (Tan et al., 2025c), each new topic memory retrieves its top-K most similar candidates, and an LLM decides whether merging is appropriate, thereby reducing the risk of incorrect generalization. In multimodal settings, VLN (Song et al., 2025b) triggers a pooling mechanism when capacity is saturated. It identifies the most similar or redundant memory pairs and compresses them into higher-level abstractions. These approaches refine detailed knowledge while preserving the global structure of the memory store, improving precision and storage efficiency. However, they cannot fully capture cluster-level relations or the higher-order dependencies that emerge across semantically related memories.

Figure 9 The landscape of Memory Evolution mechanisms. We categorize the evolution process into three distinct branches that maintain the central Memory Database: (a) Consolidation synthesizes insights by processing raw materials through local consolidation, cluster fusion, and global integration; (b) Updating ensures accuracy and consistency by performing conflict resolution on external databases and applying parameter updates to the internal model; and (c) Forgetting optimizes efficiency by pruning data based on specific criteria: time expiration, low access frequency, and low informational value. The outer ring displays representative frameworks and agents associated with each evolutionary mechanism.

Cluster-level Fusion Adopting cluster-level fusion is essential for capturing cross-instance regularities as memory grows. Across clusters, PREMem (Kim et al., 2025b) aligns new memory clusters with similar existing ones and applies fusion modes such as generalization and refinement to form higher-order reasoning units, substantially improving interpretability and reasoning depth. EverMemOS (Hu et al., 2026a) computes the similarity between a newly generated MemCell and the centroids of all MemScenes, and merges it into the MemScene that is sufficiently similar. Within a cluster, TiM (Liu et al., 2023a) periodically invokes an LLM to examine memories that share the same hashing bucket and merges semantically redundant entries. CAM (Li et al., 2025g) merges all nodes within the target cluster into a representative summary, yielding higher-level and consistent cross-sample representations. These methods reorganize the memory structure at a broader scale and mark an important step toward structured knowledge.

Global Integration This operation performs holistic consolidation to maintain global coherence and to distill system-level insights from accumulated experience. Compared with Section 5.1.1, semantic summarization focuses on deriving a global summary from the existing context and can be viewed as the initial construction of the summary. In contrast, this paragraph emphasizes how new information is integrated into an existing summary as additional data arrives. For user factual memory, MOOM (Chen et al., 2025e) constructs stable role profiles by integrating temporary role snapshots with historical traces using rule-based processing, embedding methods, and LLM-driven abstraction. For experiential memory, Matrix (Liu et al., 2024) performs iterative optimization to combine execution trajectories and reflective insights with global memory, distilling task-agnostic principles that support reuse across scenarios. As single-step reasoning contexts and environmental feedback lengthen, methods like AgentFold (Ye et al., 2025a) and Context Folding (Zhang et al., 2025r) internalize the ability to compress working memory. In multi-step interactions, including web navigation, these methods automatically summarize and condense the global context after each step, supporting efficient and effective reasoning. Global integration consolidates high-level, structured knowledge from the complete history of experience, providing a reliable contextual foundation while improving generalization, reasoning accuracy, and personalized decision-making.

Summary Consolidation is the cognitive process of reorganizing fragmented short-term traces into coherent long-term schemas. It moves beyond simple storage to synthesize connections between isolated entries, forming a structured worldview. It enhances generalization and reduces storage redundancy. However, it risks information smoothing, where outlier events or unique exceptions are lost during the abstraction process, potentially reducing the agent's sensitivity to anomalies and specific events.

Query Construction

After initiating the retrieval process, the next challenge lies in transforming the raw query into an effective retrieval signal aligned with the memory index. Query construction acts as the translation layer between the user's surface utterance and the memory's latent storage. Traditional approaches typically perform retrieval directly based on the user query, which is simple but fails to align the query semantics with those of the memory index. To bridge this gap, agentic memory systems proactively perform query decomposition or query rewriting, generating intermediate retrieval signals that better match the latent structure of the memory.

Query Decomposition This approach breaks down a complex query into simpler sub-queries, allowing the system to retrieve more fine-grained and relevant information. Such decomposition alleviates the one-shot retrieval bottleneck by enabling modular retrieval and reasoning over intermediate results. For instance, Visconde (Pereira et al., 2023) and ChemAgent (Tang et al., 2025c) employ LLMs to decompose the original question into sub-problems, retrieve candidate results for each from the memory, and finally aggregate them into a coherent answer. However, these methods lack global planning. To address this issue, PRIME (Tran et al., 2025) and MA-RAG (Nguyen et al., 2025) introduce a Planner Agent, inspired by the ReAct (Yao et al., 2023b) paradigm, that first formulates a global retrieval plan before decomposing it into sub-queries. Yet, these approaches mainly rely on problem-driven decomposition and thus cannot explicitly identify what specific knowledge the model is missing. To make sub-queries more targeted, Agent KB (Tang et al., 2025d) adopts a two-stage retrieval process in which a teacher model observes the student model's failures and generates fine-grained sub-queries accordingly. This targeted decomposition improves retrieval precision and reduces irrelevant results, particularly in knowledge-intensive tasks.
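The decompose-retrieve-aggregate loop described above can be sketched as follows. This is a minimal illustration, not any cited method's implementation: `decompose` stands in for an LLM-based decomposer and `score` uses toy lexical overlap in place of dense retrieval.

```python
# Sketch of decomposition-based retrieval: break the query into sub-queries,
# retrieve the best-matching memory entry for each, and return the per-sub-query
# hits for downstream aggregation.

def decompose(query: str) -> list[str]:
    # A real system would prompt an LLM; splitting on "and" is a toy stand-in.
    return [part.strip() for part in query.split(" and ")]

def score(query: str, entry: str) -> int:
    # Toy lexical overlap in place of embedding similarity.
    return len(set(query.lower().split()) & set(entry.lower().split()))

def retrieve_per_subquery(query: str, memory: list[str]) -> dict[str, str]:
    results = {}
    for sub in decompose(query):
        results[sub] = max(memory, key=lambda m: score(sub, m))
    return results

memory = [
    "Paris is the capital of France",
    "The Seine flows through Paris",
]
hits = retrieve_per_subquery("capital of France and river through Paris", memory)
```

Each sub-query retrieves its own evidence, so modular reasoning over intermediate results becomes possible even when no single retrieval covers the whole question.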

Query Rewriting Instead of decomposing, this strategy rewrites the original query or generates a hypothetical document to refine its semantics before retrieval. Such rewriting mitigates the mismatch between user intent and the memory index. HyDE (Gao et al., 2023b), for example, instructs the LLM to generate a hypothetical document in a zero-shot manner and performs retrieval using its semantic embedding. The generated document encapsulates the desired semantics, effectively bridging the gap between the user query and the target memory. MemoRAG (Qian et al., 2025) extends this idea by incorporating global memory into hypothetical document generation. It first compresses the global memory and then generates a draft answer conditioned on both the query and the compressed memory; this draft is then used as a rewritten query. Since the draft has access to the global memory context, it captures user intent more faithfully and uncovers implicit information needs. Similarly, MemGuide (Du et al., 2025b) leverages the dialogue context to prompt an LLM to produce a concise, command-like phrase that serves as a high-level intent description for retrieval. Beyond directly prompting an LLM to rewrite the query, Rewrite-Retrieve-Read (Ma et al., 2023b) trains a small language model as a dedicated rewriter through reinforcement learning, while ToC (Kim et al., 2023a) employs a Tree of Clarifications to progressively refine and specify the user's retrieval objective.
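The HyDE-style flow can be sketched as below. The `generate_hypothetical` stub and the bag-of-words embedding are illustrative assumptions; a real system would call an LLM and a dense encoder.

```python
# HyDE-style retrieval sketch: embed a hypothetical answer instead of the raw
# query, then retrieve the memory entry with the highest cosine similarity.
from collections import Counter
from math import sqrt

def generate_hypothetical(query: str) -> str:
    # Illustrative stub; a real system would ask an LLM to draft an answer.
    return f"A plausible answer: {query} relates to photosynthesis in plants."

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy bag-of-words embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_retrieve(query: str, memory: list[str]) -> str:
    q_vec = embed(generate_hypothetical(query))  # embed the draft, not the query
    return max(memory, key=lambda m: cosine(q_vec, embed(m)))

memory = [
    "photosynthesis converts light into chemical energy in plants",
    "mitochondria produce ATP through respiration",
]
best = hyde_retrieve("how do leaves make food", memory)
```

Because the hypothetical document shares vocabulary and semantics with the target memory, it bridges the lexical gap between the user's phrasing and the stored content.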

Summary These two paradigms, decomposition and rewriting, are not mutually exclusive. Auto-RAG (Kim et al., 2024a) integrates both by evaluating HyDE and Visconde under identical retrieval conditions and then selecting the strategy that performs best for the given task. The findings of this work demonstrate that the quality of the memory-retrieval query has a substantial impact on reasoning performance. In contrast to earlier research, which primarily focused on designing sophisticated memory architectures, recent studies (Yan et al., 2025b) place increasing emphasis on the retrieval construction process, shifting the role of memory toward serving retrieval. The choice of what to retrieve with is, unsurprisingly, a critical component of this process.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
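The episodic-to-semantic pipeline above can be sketched as a small loop: raw turns are logged as episodic traces, then distilled into a deduplicated fact store. The `induce_facts` pattern rule is a toy stand-in for LLM-based fact induction, and the overwrite-on-conflict policy is an illustrative assumption.

```python
# Sketch of the episodic-to-semantic processing continuum.

episodic_log: list[dict] = []   # append-only event traces
fact_store: dict = {}           # (subject, attribute) -> value; latest wins

def log_turn(turn_id: int, speaker: str, text: str) -> None:
    episodic_log.append({"turn": turn_id, "speaker": speaker, "text": text})

def induce_facts(text: str) -> list[tuple]:
    # Toy "X likes Y" pattern in place of LLM fact induction.
    words = text.split()
    if "likes" in words:
        i = words.index("likes")
        return [(words[i - 1], "likes", " ".join(words[i + 1:]))]
    return []

def consolidate() -> None:
    for event in episodic_log:
        for subj, attr, value in induce_facts(event["text"]):
            fact_store[(subj, attr)] = value  # dedupe: later facts overwrite

log_turn(1, "user", "Alice likes tea")
log_turn(2, "user", "Alice likes green tea")
consolidate()
```

The episodic log preserves re-playable history, while the fact store holds the reusable semantic abstraction, mirroring the two ends of the continuum.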

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Multi-turn Working Memory

Multi-turn working memory addresses a fundamentally different problem space from the single-turn setting. In long-horizon interactions, the primary bottleneck shifts from instantaneous context capacity to the continuous maintenance of task state and historical relevance. Even with extended context windows, the accumulation of history inevitably saturates attention budgets, increases latency, and induces goal drift (Lu et al., 2025b). To mitigate this, working memory in multi-turn settings functions as an externalized state carrier, organizing a continuous loop of reading, evaluation, and writing. The objective is to keep critical state information accessible and consistent within a bounded resource budget. We categorize these mechanisms by their state management strategies: state consolidation, hierarchical folding, and cognitive planning.

State Consolidation In continuous interaction streams, state consolidation maps an ever-growing trajectory into a fixed-size state space through dynamic updates. Treating interaction as a streaming environment, MemAgent (Yu et al., 2025a) and MemSearcher (Yuan et al., 2025a) employ recurrent mechanisms to update fixed-budget memory and discard redundancy, answering queries from a compact, evolving state. ReSum (Wu et al., 2025f) extends this by periodically distilling history into reasoning states, utilizing reinforcement learning to optimize summary-conditioned behavior for indefinite exploration.

Beyond heuristic summarization, ACON (Kang et al., 2025c) frames state consolidation as an optimization problem, jointly compressing environment observations and interaction histories into a bounded condensation and iteratively refining compression guidelines from failure cases. IterResearch (Chen et al., 2025b) further adopts an MDP-inspired formulation with iterative workspace reconstruction, where an evolving report serves as persistent memory and periodic synthesis mitigates context suffocation and noise contamination in long-horizon research.

Regarding state representation, approaches vary to ensure constant-size footprints. MEM1 (Zhou et al., 2025b) maintains a shared internal state that merges new observations with prior memory. Distinct from explicit text, MemGen (Zhang et al., 2025d) injects latent memory tokens directly into the reasoning stream.
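The recurrent fixed-budget update shared by these systems can be sketched as a fold over the interaction stream. The deterministic `fuse` function and the token budget are toy stand-ins for the RL-trained summarizers used by MemAgent and MEM1.

```python
# Recurrent state consolidation sketch: each incoming chunk is fused into a
# bounded memory and the raw stream is discarded, keeping the footprint constant.
BUDGET = 8  # maximum number of tokens kept in the state

def fuse(state: list[str], chunk: str) -> list[str]:
    merged = state + chunk.split()
    seen, compact = set(), []
    for tok in merged:          # drop exact repeats, keep first mention
        if tok not in seen:
            seen.add(tok)
            compact.append(tok)
    return compact[-BUDGET:]    # evict oldest tokens beyond the budget

state: list[str] = []
stream = ["the user asked about rust", "the user prefers rust over go"]
for chunk in stream:
    state = fuse(state, chunk)  # answer queries from `state`, never the stream
```

However large the stream grows, the agent reasons only over the compact evolving state, which is what decouples cost from interaction length.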

Hierarchical Folding For complex, long-horizon tasks, state maintenance requires structure beyond linear summarization. Hierarchical folding decomposes the task trajectory based on subgoals, maintaining fine-grained traces only while a subtask is active, and folding the completed sub-trajectory into a concise summary upon completion.

This decompose-then-consolidate strategy allows the working memory to expand and contract dynamically. HiAgent (Hu et al., 2025a) instantiates this by using subgoals as memory units, retaining only active action-observation pairs and writing back a summary after subgoal completion. Context-Folding (Sun et al., 2025b) and AgentFold (Ye et al., 2025a) extend this by making the folding operation a learnable policy, training agents to autonomously determine when to branch into sub-trajectories and how to abstract them into high-level states. DeepAgent (Li et al., 2025i) further applies this to tool-use reasoning, compressing interactions into structured episodic and working memories to support fine-grained credit assignment. By replacing finished sub-trajectories with stable high-level abstractions, these methods preserve essential context while keeping the active window small.
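The decompose-then-consolidate strategy can be sketched with a minimal folding memory in the spirit of HiAgent; the one-line `finish` summary is an illustrative stand-in for LLM-written abstractions.

```python
# Hierarchical folding sketch: raw steps are kept only for the active subgoal,
# and each finished sub-trajectory is replaced by a concise summary.

class FoldingMemory:
    def __init__(self):
        self.folded: list[str] = []   # summaries of finished subgoals
        self.active: list[str] = []   # raw steps of the current subgoal
        self.subgoal = None

    def start(self, subgoal: str) -> None:
        self.subgoal = subgoal

    def record(self, step: str) -> None:
        self.active.append(step)

    def finish(self) -> None:
        summary = f"{self.subgoal}: done in {len(self.active)} steps"
        self.folded.append(summary)   # fold the completed sub-trajectory
        self.active.clear()
        self.subgoal = None

    def context(self) -> list[str]:
        return self.folded + self.active  # what the agent actually reads

mem = FoldingMemory()
mem.start("find login page")
mem.record("open site"); mem.record("click login")
mem.finish()
mem.start("enter credentials")
mem.record("type username")
```

The context the agent reads expands while a subtask is active and contracts on completion, which is exactly the dynamic expansion and contraction described above.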

Cognitive Planning At the highest level of abstraction, working memory creates and maintains an externalized plan or world model. The state functions not merely as a summary of the past, but as a forward-looking structure that guides future actions.

PRIME (Tran et al., 2025) integrates retrieval directly into the planning loop, ensuring that memory updates actively support complex reasoning steps. In embodied and agentic environments, treating the language model as a high-level planner elevates the plan to the core of working memory. Approaches like SayPlan employ 3D scene graphs as queryable environmental memory to scale planning across large spaces (Rana et al., 2023). In GUI and household tasks, systems like Agent-S (Agashe et al., 2025) and KARMA (Wang et al., 2025r) stabilize long-horizon performance by anchoring reasoning to a hierarchical plan, using memory-augmented retrieval to bridge long-term knowledge with short-term execution.

By making plans and structured environment representations the readable and writable core of working memory, agents can maintain goal consistency and revise strategies robustly against perception failures (Song et al., 2023).
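A plan that is both readable and writable can be sketched as a small data structure; the step names and the revise-on-failure policy here are purely illustrative, not any cited system's interface.

```python
# Plan-centric working memory sketch: the plan itself is the state, so the
# agent can mark progress and revise strategy without re-reading raw history.

class PlanMemory:
    def __init__(self, steps: list[str]):
        self.plan = [{"step": s, "status": "pending"} for s in steps]

    def next_step(self):
        for item in self.plan:
            if item["status"] == "pending":
                return item["step"]
        return None

    def mark_done(self, step: str) -> None:
        for item in self.plan:
            if item["step"] == step:
                item["status"] = "done"

    def revise(self, failed_step: str, replacement: list[str]) -> None:
        # Replace a failed step with a new sub-plan (strategy revision).
        idx = next(i for i, it in enumerate(self.plan) if it["step"] == failed_step)
        self.plan[idx:idx + 1] = [{"step": s, "status": "pending"} for s in replacement]

plan = PlanMemory(["grab mug", "pour coffee"])
plan.mark_done("grab mug")
plan.revise("pour coffee", ["locate pot", "pour coffee"])
```

Because completed steps persist with their status, goal consistency survives perception failures: the agent rewrites only the failed portion of the plan rather than its entire history.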

Summary Multi-turn working memory pivots on the construction of an operable state carrier rather than the retention of raw history. By integrating state consolidation to compress continuous streams, hierarchical folding to structure sub-trajectories, and cognitive planning to anchor future actions, these mechanisms effectively decouple reasoning performance from interaction length. This paradigm enables agents to maintain temporal coherence and goal alignment over indefinite horizons while adhering to strict computational and memory constraints.

Memory Consolidation

Memory consolidation aims to transform newly acquired short-term traces into structured and generalizable long-term knowledge. Its core mechanism is to identify semantic relationships between new and existing memories and to integrate them into higher-level abstractions or insights. This process serves two main purposes. First, it reorganizes fragmented pieces of information into coherent structures, preventing the loss of critical details during short-term retention and enabling the formation of stable knowledge schemas. Second, by abstracting, compressing, and generalizing experiential data, consolidation extracts reusable patterns from specific events, yielding insights that support cross-task generalization.

A central challenge is determining the granularity at which new memories should be matched and merged with existing ones. Prior work spans a spectrum of consolidation strategies, from local content merging to cluster-level fusion and global integration.

Figure 9 The landscape of Memory Evolution mechanisms. We categorize the evolution process into three distinct branches that maintain the central Memory Database: (a) Consolidation synthesizes insights by processing raw materials through local consolidation, cluster fusion, and global integration; (b) Updating ensures accuracy and consistency by performing conflict resolution on external databases and applying parameter updates to the internal model; and (c) Forgetting optimizes efficiency by pruning data based on specific criteria: time expiration, low access frequency, and low informational value. The outer ring displays representative frameworks and agents associated with each evolutionary mechanism.

Local Consolidation This operation focuses on fine-grained updates involving highly similar memory fragments. In RMM (Tan et al., 2025c), each new topic memory retrieves its top-K most similar candidates, and an LLM decides whether merging is appropriate, thereby reducing the risk of incorrect generalization. In multimodal settings, VLN (Song et al., 2025b) triggers a pooling mechanism when capacity is saturated. It identifies the most similar or redundant memory pairs and compresses them into higher-level abstractions. These approaches refine detailed knowledge while preserving the global structure of the memory store, improving precision and storage efficiency. However, they cannot fully capture cluster-level relations or the higher-order dependencies that emerge across semantically related memories.
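The retrieve-then-judge pattern of local consolidation can be sketched as follows. The Jaccard `similarity` and the threshold-based `should_merge` judge are toy stand-ins for the embedding retrieval and LLM merge decision used in systems like RMM.

```python
# Local consolidation sketch: a new memory retrieves its most similar existing
# entries, and a (stubbed) judge decides whether to merge or store separately.

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)  # Jaccard overlap as a toy metric

def should_merge(a: str, b: str) -> bool:
    # Stand-in for an LLM judgment; merge only near-duplicates.
    return similarity(a, b) >= 0.5

def consolidate(store: list[str], new: str, k: int = 2) -> list[str]:
    candidates = sorted(store, key=lambda m: similarity(new, m), reverse=True)[:k]
    for cand in candidates:
        if should_merge(new, cand):
            store.remove(cand)
            store.append(f"{cand} | {new}")  # fold both into one merged entry
            return store
    store.append(new)                         # no near-duplicate: keep as-is
    return store

store = ["user drinks coffee every morning", "user owns a cat"]
consolidate(store, "user drinks coffee each morning")
```

Restricting the judge to top-K near-duplicates is what keeps the risk of incorrect generalization low: dissimilar memories are never forced into a merge.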

Cluster-level Fusion Adopting cluster-level fusion is essential for capturing cross-instance regularities as memory grows. Across clusters, PREMem (Kim et al., 2025b) aligns new memory clusters with similar existing ones and applies fusion modes such as generalization and refinement to form higher-order reasoning units, substantially improving interpretability and reasoning depth. EverMemOS (Hu et al., 2026a) computes the similarity between a newly generated MemCell and the centroids of all MemScenes, and merges it into a MemScene whose similarity is sufficiently high. Within a cluster, TiM (Liu et al., 2023a) periodically invokes an LLM to examine memories that share the same hashing bucket and merges semantically redundant entries. CAM (Li et al., 2025g) merges all nodes within the target cluster into a representative summary, yielding higher-level and consistent cross-sample representations. These methods reorganize the memory structure at a broader scale and mark an important step toward structured knowledge.
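The centroid-matching step behind EverMemOS-style fusion can be sketched as below; the two-dimensional vectors and the 0.9 threshold are illustrative assumptions.

```python
# Centroid-based cluster fusion sketch: a new memory cell joins the cluster
# whose centroid is closest, provided similarity clears a threshold; otherwise
# it seeds a new cluster.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a)); nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def assign(clusters: list, cell, threshold: float = 0.9) -> None:
    if clusters:
        best = max(clusters, key=lambda c: cosine(centroid(c), cell))
        if cosine(centroid(best), cell) >= threshold:
            best.append(cell)        # fuse into the most similar cluster
            return
    clusters.append([cell])          # otherwise start a new cluster

clusters: list = []
assign(clusters, [1.0, 0.0])
assign(clusters, [0.9, 0.1])   # close to the first centroid: fused
assign(clusters, [0.0, 1.0])   # dissimilar: becomes a new cluster
```

The threshold controls the granularity of the resulting structure: lower values yield fewer, broader clusters, while higher values keep clusters tight at the cost of fragmentation.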

Global Integration This operation performs holistic consolidation to maintain global coherence and to distill system-level insights from accumulated experience. Whereas the semantic summarization of Section 5.1.1 derives a global summary from the existing context and can be viewed as the initial construction of that summary, the methods here emphasize how new information is integrated into an existing summary as additional data arrives. For user factual memory, MOOM (Chen et al., 2025e) constructs stable role profiles by integrating temporary role snapshots with historical traces using rule-based processing, embedding methods, and LLM-driven abstraction. For experiential memory, Matrix (Liu et al., 2024) performs iterative optimization to combine execution trajectories and reflective insights with global memory, distilling task-agnostic principles that support reuse across scenarios. As single-step reasoning contexts and environmental feedback lengthen, methods like AgentFold (Ye et al., 2025a) and Context Folding (Zhang et al., 2025r) internalize the ability to compress working memory. In multi-step interactions, including web navigation, these methods automatically summarize and condense the global context after each step, supporting efficient and effective reasoning. Global integration consolidates high-level, structured knowledge from the complete history of experience, providing a reliable contextual foundation while improving generalization, reasoning accuracy, and personalized decision-making.

Summary Consolidation is the cognitive process of reorganizing fragmented short-term traces into coherent long-term schemas. It moves beyond simple storage to synthesize connections between isolated entries, forming a structured worldview. It enhances generalization and reduces storage redundancy. However, it risks information smoothing, where outlier events or unique exceptions are lost during the abstraction process, potentially reducing the agent's sensitivity to anomalies and specific events.


Dynamics: How Memory Operates and Evolves?

The preceding sections introduced the architectural forms (Section 3) and functional roles (Section 4) of memory, outlining a relatively static conceptual framework for agent memory. However, such a static view overlooks the inherent dynamism that fundamentally characterizes agentic memory. Unlike knowledge that is statically encoded in model parameters or fixed databases, an agentic memory system can dynamically construct and update its memory store and perform customized retrieval conditioned on different queries. This adaptive capability is crucial for enabling agents to self-evolve and engage in lifelong learning.

Accordingly, this section investigates the paradigm shift from static storage to dynamic memory management and utilization. This shift reflects the foundational operational advantages of agentic memory over static database approaches. In practice, an agentic memory system can autonomously extract refined, generalizable knowledge from reasoning traces and environmental feedback. By dynamically fusing and updating this newly extracted knowledge with the existing memory base, the system ensures continuous adaptation to evolving environments and mitigates cognitive conflicts. Based on the constructed memory base, the system executes targeted retrieval from designated memory modules at precise moments, thereby effectively enhancing reasoning. To systematically analyze how the memory system operates and evolves, we examine the complete memory lifecycle by decomposing it into three fundamental processes. Figure 8 provides a holistic illustration of this dynamic memory lifecycle, highlighting how memory formation, evolution, and retrieval interact to support adaptive and self-evolving agent behavior.

Figure 8 The operational dynamics of agent memory. We decouple the complete memory lifecycle into three fundamental processes that drive the system's adaptability and self-evolution: (1) Memory Formation transforms raw interactive experiences into information-dense knowledge units by selectively identifying patterns with long-term utility; (2) Memory Evolution dynamically integrates new memories into the existing repository through consolidation, updating, and forgetting mechanisms to ensure the knowledge base remains coherent and efficient; and (3) Memory Retrieval executes context-aware queries to access specific memory modules, thereby optimizing reasoning performance with precise information support. The alphabetical order denotes the sequence of operations within the memory systems.


Memory Formation

We define memory formation as the process of encoding raw contexts (e.g., dialogues or images) into compact knowledge. The necessity for memory formation emerges from the scaling limitations inherent in processing lengthy, noisy, and highly redundant raw contexts. Full-context prompting often incurs heavy computational overhead, prohibitive memory footprints, and degraded reasoning performance on out-of-distribution input lengths. To mitigate these issues, recent memory systems distill essential information into efficiently storable and precisely retrievable representations, enabling more efficient and effective inference.

Memory formation is not independent of the preceding sections. Depending on the task type, the memory formation process selectively extracts different architectural memories described in Section 3 to fulfill the corresponding functions outlined in Section 4. Based on the granularity of information compression and the logic of encoding, we categorize the memory formation process into five distinct types. Table 7 summarizes representative methods under each category, comparing their sub-types, representation forms, and key mechanisms.

Semantic Summarization

Semantic summarization transforms raw observational data into compact and semantically rich summaries. The resulting summaries capture the global, high-level information of the original data, rather than specific factual or experiential details (Zhao et al., 2024; Anokhin et al., 2024). Typical examples of such summaries include the overarching narrative of a document (Kim and Kim, 2025; Yu et al., 2025a), the procedural flow of a task (Ye et al., 2025a; Zhang et al., 2025r), or a user's historical profile (Zhang, 2024; Westhäußer et al., 2025). By filtering out redundant content while preserving task-relevant global semantics, semantic summarization provides a high-level guiding blueprint for subsequent reasoning without introducing excessive contextual overhead. To achieve these effects, the compression process can be implemented in two primary ways: incremental and partitioned semantic summarization.

Table 7 Taxonomy of memory formation methods. We classify approaches based on the memory formation operations. Methods are analyzed across three technical dimensions: (1) Sub-Type identifies the specific variation or scope, (2) Representation Form specifies the output format, and (3) Key Mechanism denotes the core algorithmic strategy.

Incremental Semantic Summarization This paradigm employs a temporal integration mechanism that continuously fuses newly observed information with the existing summary, producing an evolving representation of global semantics. This chunk-by-chunk paradigm supports incremental learning (McCloskey and Cohen, 1989), circumvents the O(n^2) computational burden of full-sequence processing (Yu et al., 2025a), and promotes progressive convergence toward global semantics (Chen et al., 2024b). Early implementations such as MemGPT (Packer et al., 2023a) and Mem0 (Chhikara et al., 2025) directly merged new chunks with existing summaries at appropriate moments, relying solely on the LLM's inherent summarization ability. However, this approach was constrained by the model's limited capacity, often resulting in inconsistency or semantic drift. To alleviate these issues, Chen et al. (2024b) and Wu et al. (2025i) incorporated external evaluators to filter redundant or incoherent content, using a convolution-based discriminator for consistency verification and DeBERTa (He et al., 2020) for filtering trivial content, respectively. Instead of relying on auxiliary networks, subsequent methods, such as MEM1 (Zhou et al., 2025b) and MemAgent (Yu et al., 2025a), enhanced the LLM's own summarization capability through reinforcement learning with PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024).
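The chunk-by-chunk fusion at the heart of this paradigm can be sketched as a simple loop; the deduplicating `fuse_summary` and the word budget are toy stand-ins for an LLM (or RL-trained) summarizer.

```python
# Incremental summarization sketch: each chunk is fused with the running
# summary and then discarded, keeping cost linear in the number of chunks.
MAX_WORDS = 12

def fuse_summary(summary: str, chunk: str) -> str:
    seen, merged = set(), []
    for word in (summary + " " + chunk).split():
        if word.lower() not in seen:       # crude redundancy filter
            seen.add(word.lower())
            merged.append(word)
    return " ".join(merged[-MAX_WORDS:])   # enforce a fixed summary budget

summary = ""
for chunk in [
    "the report covers Q3 revenue growth",
    "revenue growth was driven by cloud services",
]:
    summary = fuse_summary(summary, chunk)
```

Because only the bounded summary survives each step, the process avoids full-sequence reprocessing, at the cost of the semantic drift and forgetting discussed below.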

As incremental summarization advanced from heuristic fusion to filtered integration and ultimately to learning-based optimization, summarization competence became increasingly internalized within the model, reducing cumulative errors across iterations. Nevertheless, the serial nature of updates still poses computational bottlenecks (Fang et al., 2025b) and risks information forgetting, motivating the development of partitioned semantic summarization approaches.

Partitioned Semantic Summarization This paradigm adopts a spatial decomposition mechanism, dividing information into distinct semantic partitions and generating separate summaries for each. Early studies typically adopted heuristic partitioning strategies for handling long contexts. MemoryBank (Zhong et al., 2024) and COMEDY (Chen et al., 2025d) summarize and aggregate long-term dialogues by treating each day or session as a basic unit. Along the structural dimension, Wu et al. (2021) and Bailly et al. (2025) generate summaries of summaries by segmenting long documents into chapters or paragraphs. While intuitive, such approaches often suffer from semantic discontinuity across partitions. To address this issue, methods such as ReadAgent (Lee et al., 2024a) and LightMem (Fang et al., 2025b) introduce semantic or topical clustering before summarization, thereby enhancing inter-chunk coherence. Extending beyond textual compression, DeepSeek-OCR (Wei et al., 2025a) pioneers the idea of compressing long contexts via optical 2D mapping, achieving higher compression ratios in multimodal scenarios. In the video memory domain, FDVS (You et al., 2024) and LangRepo (Kahatapitiya et al., 2025) segment long videos into clips and generate textual summaries by integrating multi-source signals such as subtitles, object detection, and scene descriptions, which are then hierarchically aggregated into a global long video story.
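A minimal sketch of the partitioned paradigm, with a stubbed per-partition summarizer standing in for the LLM call; the fixed-size word windows are an illustrative stand-in for day-, session-, or chapter-level boundaries:

```python
def partition(text: str, size: int = 50) -> list[str]:
    """Split the input into fixed-size word windows (a heuristic proxy
    for session- or chapter-level partitions)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def summarize_chunk(chunk: str, k: int = 5) -> str:
    """Stub summarizer: keep the first k words as the chunk's gist.
    A deployed system would prompt an LLM instead."""
    return " ".join(chunk.split()[:k])


def partitioned_summarize(text: str, size: int = 50, k: int = 5) -> str:
    """Summaries of summaries: compress each partition independently,
    then aggregate the partial summaries into one global summary."""
    partials = [summarize_chunk(c, k) for c in partition(text, size)]
    return summarize_chunk(" ".join(partials), k * 2)
```

Because each partition is compressed in isolation, the stub also makes the stated weakness visible: any dependency spanning two windows is invisible to the per-chunk step.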

Compared with incremental summarization, the partitioned approach offers superior efficiency and captures finer-grained semantics. However, its independent processing of each sub-chunk can lead to the loss of cross-partition semantic dependencies.

Summary Semantic summarization operates as a lossy compression mechanism, aiming to distill the gist from lengthy interaction logs. Unlike verbatim storage, it prioritizes global semantic coherence over local factual precision, transforming linear streams into compact narrative blocks. The primary strength of semantic summarization is efficiency: it drastically reduces context length, making it ideal for long-term dialogue. The trade-off, however, is resolution loss: specific details or subtle cues may be smoothed out, limiting the summary's utility in evidence-critical tasks.

Factual Memory

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
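The episodic-to-semantic pipeline can be illustrated with a toy fact base; `extract_facts` is a hypothetical stand-in for LLM-based entity extraction and fact induction, and the overwrite rule is a deliberately crude consistency check:

```python
from dataclasses import dataclass, field


def extract_facts(trace: str) -> list[tuple]:
    """Hypothetical fact-induction stub: parses 'entity attr=value' lines.
    A deployed system would use an LLM or IE model here."""
    entity, _, rest = trace.partition(" ")
    attr, _, value = rest.partition("=")
    return [(entity, attr.strip(), value.strip())] if value else []


@dataclass
class FactBase:
    """Toy episodic-to-semantic pipeline: raw traces are logged as episodic
    records, an extractor induces (entity, attribute, value) facts, and a
    deduplication/consistency rule governs the semantic store."""
    episodes: list = field(default_factory=list)
    facts: dict = field(default_factory=dict)   # (entity, attr) -> value

    def log(self, trace: str) -> None:
        self.episodes.append(trace)             # raw episodic trace
        for fact in extract_facts(trace):       # abstraction stage
            self.consolidate(fact)

    def consolidate(self, fact: tuple) -> None:
        entity, attr, value = fact
        # Consistency rule: a newer value for the same slot overwrites
        # the old one rather than coexisting with it.
        self.facts[(entity, attr)] = value
```

The same skeleton generalizes to vector or graph backends by swapping the dictionary for the corresponding store.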

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.

· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.

· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Knowledge Distillation

While semantic summarization captures the global semantics of raw data at a macro level, knowledge distillation operates at a finer granularity, extracting reusable knowledge from interaction trajectories or documents. In a broad sense, knowledge refers to the various forms of factual and experiential memory described in Section 4, depending on the task's underlying functions.

Distilling Factual Memory This process focuses on transforming raw interactions and documents into explicit, declarative knowledge regarding users and environmental states, ensuring that the agent maintains consistency and adaptability by retaining verifiable facts rather than transient context. In the domain of user modeling, systems such as TiM (Liu et al., 2023a), RMM (Tan et al., 2025c), and EMem (Zhou and Han, 2025b) employ abstraction mechanisms to convert dialogue turns into high-level thoughts or enriched elementary discourse units, thereby preserving long-term persona coherence. For modeling user objectives, approaches like MemGuide (Du et al., 2025b) extract user intent descriptions from dialogues; during reasoning, MemGuide captures and updates goal states, separating confirmed constraints from unresolved intents to mitigate goal drift. This distillation also extends to multimodal environments, where agents like ESR (Shen et al., 2024) and M3-Agent (Long et al., 2025) compress egocentric visual observations into text-addressable facts about object locations and user routines. Furthermore, by equipping agents with multimodal understanding capabilities, Video-RAG (Luo et al., 2025b) converts audio, subtitles, and objects in videos into textual notes, i.e., factual memories, to enhance long-video understanding.

Distilling Experiential Memory This process focuses on extracting the strategies underlying task execution from historical trajectories. By deriving planning principles from successful rollouts and corrective signals from failures, this paradigm enhances the agent's problem-solving ability on specific tasks. Through abstraction and generalization, it further supports cross-task knowledge transfer. As a result, experiential generalization enables the agent to continually refine its competence and move toward lifelong learning.

This line of research aims to derive high-level planning strategies and key insights from both successful and failed trajectories. Some approaches focus on success-based distillation, where systems such as AgentRR (Feng et al., 2025) and AWM (Wang et al., 2024m) summarize overall task plans from successful cases. Memp (Fang et al., 2025d) analyzes and summarizes the gold trajectories from the training set, distilling them into abstract procedural knowledge. Others adopt failure-driven reflection, exemplified by Matrix (Liu et al., 2024), SAGE (Liang et al., 2025), and R2D2 (Huang et al., 2025c), which compare reasoning traces against ground-truth answers to identify error sources and extract reflective insights. Combining both, ExpeL (Zhao et al., 2024), From Experience to Strategy (Xia et al., 2025), and ReMe (Cao et al., 2025b) contrast successful and failed experiences to uncover holistic planning insights.
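The success/failure contrast can be sketched as a set operation (our simplification; systems like ExpeL prompt an LLM to verbalize richer, natural-language insights from the same contrast):

```python
def distill_insights(trajectories: list[dict]) -> list[str]:
    """Contrastive distillation sketch: bucket the actions seen in
    successful vs. failed rollouts, then keep as 'insights' the actions
    that only ever appear in successes.

    A real system would verbalize rules with an LLM; the set difference
    below just illustrates the contrast operation itself.
    """
    success_actions, failure_actions = set(), set()
    for traj in trajectories:
        bucket = success_actions if traj["success"] else failure_actions
        bucket.update(traj["actions"])
    return sorted(success_actions - failure_actions)
```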

However, prior work primarily focuses on summarizing task-level planning knowledge, lacking fine-grained, step-level insights. To address this gap, H2R (Ye et al., 2025b) introduces a two-tier reflection mechanism: it follows ExpeL to construct a pool of high-level planning insights, while further segmenting trajectories by subgoal sequences to derive step-wise execution insights.

Earlier methods relied on fixed prompts for insight extraction, making their performance sensitive to prompt design and the underlying LLM's capacity. Recently, trainable distillation methods have become prevalent. Learn-to-Memorize (Zhang et al., 2025u) optimizes task-specific prompts for different agents, while Memory-R1 (Yan et al., 2025c) uses an LLMExtract module to obtain experiential and factual knowledge, training only the subsequent fusion component to integrate these outputs into the memory bank. Although these approaches adopt an end-to-end framework, they still fall short of enhancing the LLM's intrinsic ability to distill insights. To overcome this limitation, Memα (Wang et al., 2025p) explicitly trains the LLM on what insights to extract and how to preserve them.

Summary This part focuses on extracting function-specific knowledge from the raw context, without addressing the structure of memory storage. Each piece of knowledge can be viewed as a flat memory unit, and simply storing multiple units in an unstructured table ignores the semantic and hierarchical relations among them. To address this, the memory formation process can apply structured rules to derive insights and store them within a hierarchical architecture. Though simple, the knowledge distillation methods introduced here serve as foundational components for more complex and structured memory formation mechanisms.

User Factual Memory

User factual memory persists verifiable facts about a specific user across sessions and tasks, including identity, preferences, routines, historical commitments, and salient events.

Its primary function is to prevent characteristic failure modes of stateless interaction, such as coreference drift, repeated elicitation, and contradictory responses, thereby reducing interruptions to long-horizon goals (Tan et al., 2025c; Zhong et al., 2024). Engineering practice typically comprises selection and compression, structured organization, retrieval and reuse, and consistency governance, aiming to sustain long-range dialogic and behavioral coherence under bounded access cost.

Dialogue Coherence Dialogue coherence requires an agent to preserve conversational context, user-specific facts, and a stable persona over extended periods. This ensures that later turns remain sensitive to earlier disclosures and affective cues, rather than degrading into repeated clarifications or inconsistent replies. To achieve this, modern systems implement user factual memory through two complementary strategies: heuristic selection and semantic abstraction.

To navigate finite context windows efficiently, a primary strategy is to selectively retain and rank interaction histories. Rather than retaining all raw logs, systems (Xi and Wang, 2025; Zhong et al., 2024; Park et al., 2023; Lei et al., 2025) maintain structured stores of past interactions, ranking entries by metrics such as relevance, recency, importance, or distinctiveness. By filtering retrieval based on these scores, high-value items are preserved and periodically condensed into higher-level summaries, conditioning subsequent responses to maintain continuity without overwhelming the agent's working memory.
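Such scoring can be sketched as a weighted sum of relevance, recency, and importance, loosely in the spirit of Park et al. (2023); the weights and decay constant below are illustrative assumptions, not values from any cited system:

```python
import math
import time


def memory_score(entry: dict, query_sim: float, now: float,
                 w_rel: float = 1.0, w_rec: float = 1.0,
                 w_imp: float = 1.0) -> float:
    """Weighted sum of relevance (query similarity), recency
    (exponential decay over elapsed hours), and stored importance."""
    hours = (now - entry["timestamp"]) / 3600.0
    recency = math.exp(-0.1 * hours)  # fades over a few days
    return w_rel * query_sim + w_rec * recency + w_imp * entry["importance"]


def top_k(entries: list[dict], sims: list[float], k: int = 2) -> list[dict]:
    """Rank candidate memories and keep the k highest-scoring ones."""
    now = time.time()
    ranked = sorted(zip(entries, sims),
                    key=lambda pair: memory_score(pair[0], pair[1], now),
                    reverse=True)
    return [entry for entry, _ in ranked[:k]]
```

Periodic condensation then summarizes the surviving high-value items, keeping the store within budget.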

Beyond mere selection, advanced frameworks emphasize the transformation and abstraction of raw dialogue fragments into higher-level semantic representations. Approaches such as Think in Memory (Liu et al., 2023a) and Reflective Memory Management (Tan et al., 2025c) convert raw interaction traces into thought representations or reflections via iterative update operations. This allows the agent to query a stable semantic memory, keeping later replies topically consistent and less repetitive. Similarly, COMEDY (Chen et al., 2025d) employs a single language model to generate, compress, and reuse memory while updating compact user profiles. These methods effectively stabilize persona and preference expression over long conversational histories by decoupling memory storage from the raw token surface form.

Goal Consistency Goal consistency requires an agent to maintain and refine an explicit task representation over time. This ensures that clarifying questions, information requests, and actions remain strictly aligned with the primary objective, minimizing intent drift.

To mitigate such drift, systems utilize factual memory to dynamically track and update the task state. Approaches like RecurrentGPT (Zhou et al., 2023b), Memolet (Yen and Zhao, 2024), and MemGuide (Du et al., 2025b) retain confirmed information while highlighting unresolved elements. By guiding retrieval based on task intent, these methods help agents satisfy missing constraints and maintain focus across sessions.

For complex, long-horizon tasks, memory forms are often structured to facilitate localized retrieval centered on the active goal (Wu et al., 2025h). For instance, A-Mem (Xu et al., 2025c) organizes memories as an interconnected graph of linked notes, while H-Mem (Limbacher and Legenstein, 2020) employs associative mechanisms to recall prerequisite facts when subsequent steps depend on prior observations.

In embodied scenarios, factual memory grounds agent behavior in user-specific habits and environmental context. Systems such as M3-Agent (Long et al., 2025) and MEMENTO (Kwon et al., 2025) persist data on household members, object locations, and routines, reusing this information to minimize redundant exploration and repeated instructions. Similarly, Encode-Store-Retrieve (Shen et al., 2024) processes egocentric visual streams into text-addressable entries, allowing agents to answer questions based on past visual experiences without requiring user repetition.

Summary Collectively, these mechanisms transform ephemeral interaction traces into a persistent cognitive substrate. By integrating retrieval-based ranking with generative abstraction, user factual memory upgrades the system from simple similarity matching to the active maintenance of explicit goals and constraints. This foundation yields a dual benefit: it fosters a sense of familiarity and trust through long-term behavioral coherence, while simultaneously enhancing operational efficiency by increasing task success rates, reducing redundancy, and lowering error recovery overhead.

Experiential Memory

Experiential memory encapsulates the mechanism by which agents encode historical trajectories, distilled strategies, and interaction outcomes into durable, retrievable representations. Unlike working memory, which manages transient context, experiential memory focuses on the long-term accumulation and transfer of knowledge across distinct episodes.

Theoretically grounded in cognitive science, this paradigm parallels human nondeclarative memory, specifically the procedural and habit systems (Squire, 2004; Seger and Spiering, 2011). Biological systems rely on distributed neural circuits for implicit skill acquisition (Reber, 2013). In contrast, agentic experiential memory typically employs explicit data structures, such as vector databases or symbolic logs. This implementation difference grants agents a unique capability absent in biological counterparts: the ability to introspect, edit, and reason over their own procedural knowledge.

Crucially, experiential memory serves as a foundation for continual learning and self-evolution in the era of experience (Sutton, 2025; Gao et al., 2025). By maintaining a repository of structured experiences, agents achieve a non-parametric path to adaptation and avoid the prohibitive costs of frequent parametric updates. This mechanism effectively closes the learning loop by converting interaction feedback into reusable knowledge. Through this process, agents rectify past errors, abstract generalizable heuristics, and compile routine behaviors. Consequently, such adaptation minimizes redundant computations and refines decision-making over time (Zhao et al., 2024; Shinn et al., 2023b).

To systematically analyze existing literature, we classify experiential memory based on the abstraction level of the stored information. An overview of this abstraction-based taxonomy and representative paradigms is illustrated in Figure 7. Representative methods under this abstraction-based taxonomy, together with their storage carriers, representation forms, and optimization strategies, are summarized in Table 5.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts-the what , where , and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain,

these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance. · Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances. · Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Structured Construction

While Semantic Summarization (Section 5.1.1) and Knowledge Distillation (Section 5.1.2) effectively compress summaries and knowledge at different levels of granularity, they often treat memory as isolated units. In contrast, Structured Construction transforms amorphous data into organized topological representations. This process is not merely a change in storage format, but an active structural operation that determines how information is linked and layered. Unlike unstructured plaintext summarization, structured extraction significantly enhances both interpretability and retrieval efficiency. Crucially, such structural priors excel at capturing complex logic and dependencies in multi-hop reasoning tasks, offering substantial advantages over traditional retrieval-augmented methods.

Based on the operational granularity of how the underlying structure is derived, we categorize existing methods into two paradigms: Entity-Level Construction, which builds underlying topology by atomizing text into entities and relations, and Chunk-Level Construction, which builds structure by organizing intact text segments or memory items.

Entity-Level Construction The foundational structure of this paradigm is derived from relational triple extraction, which decomposes the raw context into its finest-grained semantic atoms: entities and relations. Traditional approaches model memory as a planar knowledge graph. For instance, KGT (Sun et al., 2024) introduces a real-time personalization mechanism where user preferences and feedback are directly encoded as nodes and edges within a user-specific knowledge graph. Similarly, Mem0g (Chhikara et al., 2025) utilizes LLMs to convert conversation messages directly into entity and relation triplets during the extraction phase.
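As a rough illustration of entity-level construction (not any particular system's implementation), the sketch below stores extracted triples in a graph keyed by subject. `extract_triples` is a stub that parses a toy `subject|relation|object` line format in place of the LLM extraction call systems like Mem0g would make.

```python
from collections import defaultdict

def extract_triples(message: str):
    # stand-in for an LLM extraction call; assumes one toy
    # "subject|relation|object" triple per line
    return [tuple(line.split("|")) for line in message.splitlines() if "|" in line]

class KnowledgeGraphMemory:
    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object)]

    def write(self, message: str):
        for s, r, o in extract_triples(message):
            if (r, o) not in self.edges[s]:  # consistency check: skip duplicates
                self.edges[s].append((r, o))

    def neighbors(self, entity: str):
        return self.edges.get(entity, [])

kg = KnowledgeGraphMemory()
kg.write("Ada|likes|chess\nAda|works_at|AcmeCorp")
kg.write("Ada|likes|chess")  # duplicate triple is ignored
```

Retrieval then becomes graph traversal from an entity node, which is what gives entity-level memories their multi-hop advantage over flat text chunks.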

However, these direct extraction methods are often limited by the inherent capabilities of the LLM, leading to potential noise or structural errors. To improve the quality of the constructed graph, D-SMART (Lei et al., 2025) adopts a refined approach: it first employs an LLM to distill core semantic content into concise, assertion-like natural language statements, and subsequently extracts an OWL-compliant knowledge graph fragment through a neuro-symbolic pipeline. Additionally, Ret-LLM (Modarressi et al., 2023) applies supervised fine-tuning to the LLM, enabling more robust read-write interactions with the relational graph.

While the aforementioned methods focus on planar structures, recent advancements have progressed towards constructing hierarchical memory to capture high-level abstractions. For example, GraphRAG (Edge et al., 2025) derives an entity knowledge graph from source documents and applies community detection algorithms to extract graph communities and generate community summaries iteratively. This hierarchical approach identifies higher-level cluster associations between entities, enabling the extraction of generalized insights and facilitating flexible retrieval at varying granularities.
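A minimal sketch of the GraphRAG-style hierarchy, with connected components standing in for real community detection (GraphRAG uses algorithms such as Leiden) and a string join standing in for LLM-generated community summaries:

```python
from collections import defaultdict

def communities(edges):
    # connected components as a minimal stand-in for community detection
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def summarize(comp):
    # placeholder for an LLM-generated community summary
    return "community of: " + ", ".join(sorted(comp))

# entity graph extracted from source documents (toy data)
edges = [("Ada", "chess"), ("chess", "tournament"), ("Bob", "cooking")]
summaries = [summarize(c) for c in communities(edges)]
```

The summaries form the coarse level of the hierarchy; queries can then be answered against either the fine-grained entity graph or the coarse community summaries, matching the varying retrieval granularities described above.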

To better reflect the internal coherence and temporal information of the original data, some works extend the semantic knowledge graph by incorporating an episodic graph. AriGraph (Anokhin et al., 2024) and HippoRAG (Gutierrez et al., 2024) establish a dual-layer structure comprising the semantic and episodic graph. They extract semantic triplets from dialogues while connecting nodes that occur simultaneously or establishing node-paragraph indices. Zep (Rasmussen et al., 2025) further formalizes this into a three-layer temporal graph architecture: an episodic subgraph (G_e) that logs the occurrence and processing times of raw messages via a bi-temporal model, a semantic subgraph (G_s) for entities and time-bounded facts, and a community subgraph (G_c) for high-level clustering and summarization of entities.

Chunk-Level Construction This paradigm treats continuous text spans or discrete memory items as nodes, preserving local semantic integrity while organizing them into topological structures. The evolution of this field progresses from static, planar (2D) extraction from fixed corpora to dynamic adaptation with incoming trajectories, and ultimately to hierarchical (3D) architectures.

Early approaches focused on organizing fixed text libraries into static planar structures. HAT (A et al., 2024) processes long texts by segmenting them and progressively aggregating summaries to construct a hierarchical tree. Similarly, RAPTOR (Sarthi et al., 2024) recursively clusters text chunks using UMAP for dimensionality reduction and Gaussian Mixture Models for soft clustering, iteratively summarizing these clusters to form a tree. However, these static methods lack the flexibility to handle streaming data without costly reconstruction.
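RAPTOR's recursive cluster-and-summarize loop can be caricatured without UMAP or Gaussian mixtures: below, fixed-fanout grouping replaces soft clustering and string concatenation replaces LLM summarization, but the resulting multi-level tree has the same shape, with leaves at level 0 and a single root summary at the top.

```python
def summarize(texts):
    # placeholder for the LLM summarization applied per cluster
    return " / ".join(texts)

def build_tree(chunks, fanout=2):
    # recursively group `fanout` siblings and summarize each group,
    # a stdlib stand-in for RAPTOR's UMAP+GMM soft clustering
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [summarize(prev[i:i + fanout]) for i in range(0, len(prev), fanout)]
        levels.append(nxt)
    return levels  # levels[0] = leaf chunks, levels[-1][0] = root summary

levels = build_tree(["c1", "c2", "c3", "c4"])
```

Retrieval can then target any level: leaves for detail, upper levels for abstracted context. The static limitation noted above is visible here too: adding a chunk means rebuilding every level above it.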

To address this, dynamic planar approaches incrementally build memory structures as new trajectories arrive, differing based on their foundational elements. Methods based on raw text include MemTree (Rezazadeh et al., 2025c) and H-MEM (Sun and Zeng, 2025). MemTree adopts a bottom-up approach where new text fragments retrieve the most similar nodes and are inserted as children or iteratively into a subtree, triggering bottom-up summary updates for all parent nodes. Conversely, H-MEM utilizes a top-down strategy, prompting the LLM to organize data into a four-level JSON hierarchy comprising the domain, category, memory trace, and episode layers. Alternatively, A-MEM (Xu et al., 2025c) and PREMem (Kim et al., 2025b) focus on reorganizing the extracted memory items. A-MEM summarizes knowledge into discrete notes and links relevant ones to construct a networked memory. PREMem clusters extracted factual, experiential, and subjective memories to identify and store higher-dimensional cross-session reasoning patterns.
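A toy version of MemTree-style dynamic insertion follows, with Jaccard word overlap standing in for embedding similarity; the single threshold and one-level routing rule are deliberate simplifications of the actual algorithm, but the bottom-up summary refresh after each insert is the key mechanic.

```python
class Node:
    def __init__(self, text, parent=None):
        self.text, self.summary = text, text
        self.parent, self.children = parent, []

def sim(a, b):
    # Jaccard word overlap as a stand-in for embedding similarity
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def insert(root, fragment, thresh=0.2):
    # route the new fragment to the most similar existing topic node;
    # if nothing is similar enough it opens a new top-level topic
    best = max(root.children, key=lambda c: sim(c.text, fragment), default=None)
    parent = best if best and sim(best.text, fragment) >= thresh else root
    parent.children.append(Node(fragment, parent=parent))
    node = parent
    while node is not None:  # bottom-up summary refresh to the root
        node.summary = " | ".join(c.text for c in node.children)
        node = node.parent
    return parent

root = Node("root")
insert(root, "ada likes chess")
insert(root, "ada plays chess daily")  # similar: nests under the first topic
insert(root, "bob cooks pasta")        # dissimilar: new top-level topic
```

Because only the ancestors of the touched node are re-summarized, the structure adapts to streaming input without the full reconstruction that static trees require.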

Recent advancements move beyond planar layouts to construct hierarchical structures, offering richer semantic depth. SGMem (Wu et al., 2025h) constructs a hierarchy by using NLTK to split text into sentences, forming a KNN graph across all sentence nodes, and subsequently calling an LLM to extract summaries, facts, and insights corresponding to each dialogue. To support the incremental construction of hierarchical structures as streaming data arrives, CAM (Li et al., 2025g) establishes edges between text blocks based on semantic relevance and narrative coherence. It iteratively summarizes the ego graph and handles new memory insertions by explicitly disentangling overlapping clusters through node replication. In multi-agent scenarios, G-memory (Zhang et al., 2025c) extends this dynamic 3D approach by maintaining three distinct graphs: an interaction graph for raw chat history, a query graph for specific tasks, and an insight graph. This structure enables each agent to receive customized memory at varying levels of granularity.
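The sentence-level KNN graph underlying SGMem-style construction can be sketched as follows, again with word overlap substituting for the learned sentence embeddings a real system would use:

```python
def knn_graph(sentences, k=2):
    # build a k-nearest-neighbour graph over sentence nodes using
    # word-overlap similarity in place of learned embeddings
    def sim(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(len(wa | wb), 1)
    edges = {}
    for i, s in enumerate(sentences):
        ranked = sorted(
            (j for j in range(len(sentences)) if j != i),
            key=lambda j: sim(s, sentences[j]),
            reverse=True,
        )
        edges[i] = ranked[:k]
    return edges

edges = knn_graph(["ada likes chess", "ada plays chess", "bob cooks pasta"], k=1)
```

On top of such a graph, an LLM pass can then attach summaries, facts, and insights per dialogue, as the systems above do; the graph itself only supplies the connectivity.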

Summary The main advantage of structured construction is explainability and the ability to handle complex relational queries. Such methods capture intricate semantic and hierarchical relationships between memory elements, support reasoning over multi-step dependencies, and facilitate integration with symbolic or graph-based reasoning frameworks. However, the downside is schema rigidity: pre-defined structures may fail to represent nuanced or ambiguous information, and the extraction and maintenance costs are typically high.

Query Construction

After initiating the retrieval process, the next challenge lies in transforming the raw query into an effective retrieval signal aligned with the memory index. Query construction acts as the translation layer between the user's surface utterance and the memory's latent storage. Traditional approaches typically perform retrieval directly based on the user query, which is simple but fails to align the query semantics with those of the memory index. To bridge this gap, agentic memory systems proactively perform query decomposition or query rewriting, generating intermediate retrieval signals that better match the latent structure of the memory.

Query Decomposition This approach breaks down a complex query into simpler sub-queries, allowing the system to retrieve more fine-grained and relevant information. Such decomposition alleviates the one-shot retrieval bottleneck by enabling modular retrieval and reasoning over intermediate results. For instance, Visconde (Pereira et al., 2023) and ChemAgent (Tang et al., 2025c) employ LLMs to decompose the original question into sub-problems, retrieve candidate results for each from the memory, and finally aggregate them into a coherent answer. However, these methods lack global planning. To address this issue, PRIME (Tran et al., 2025) and MA-RAG (Nguyen et al., 2025) introduce a Planner Agent, inspired by the ReAct (Yao et al., 2023b) paradigm, that first formulates a global retrieval plan before decomposing it into sub-queries. Yet, these approaches mainly rely on problem-driven decomposition and thus cannot explicitly identify what specific knowledge the model is missing. To make sub-queries more targeted, Agent KB (Tang et al., 2025d) adopts a two-stage retrieval process in which a teacher model observes the student model's failures and generates fine-grained sub-queries accordingly. This targeted decomposition improves retrieval precision and reduces irrelevant results, particularly in knowledge-intensive tasks.
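The decompose-retrieve-aggregate loop shared by these systems reduces to a few lines. `decompose` and `retrieve` below are naive stand-ins (splitting on "and", keyword overlap) for the LLM decomposition and dense retrieval used in practice; only the control flow is the point.

```python
def decompose(query):
    # stand-in for LLM query decomposition (Visconde-style)
    return [q.strip() for q in query.split(" and ")]

def retrieve(memory, sub_query):
    # naive keyword-overlap retrieval over memory entries
    words = set(sub_query.lower().split())
    return [m for m in memory if words & set(m.lower().split())]

def answer(memory, query):
    # retrieve per sub-query, then aggregate intermediate results
    hits = []
    for sq in decompose(query):
        hits.extend(retrieve(memory, sq))
    return sorted(set(hits))

memory = ["Ada likes chess", "Bob cooks pasta", "Paris is in France"]
result = answer(memory, "what does Ada like and what does Bob cook")
```

Planner-based variants such as PRIME or MA-RAG would replace the fixed split with a plan over sub-queries, but the modular retrieve-then-aggregate structure is the same.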

Query Rewriting Instead of decomposing, this strategy rewrites the original query or generates a hypothetical document to refine its semantics before retrieval. Such rewriting mitigates the mismatch between user intent and the memory index. HyDE (Gao et al., 2023b), for example, instructs the LLM to generate a hypothetical document in a zero-shot manner and performs retrieval using its semantic embedding. The generated document encapsulates the desired semantics, effectively bridging the gap between the user query and the target memory. MemoRAG (Qian et al., 2025) extends this idea by incorporating global memory into hypothetical document generation. It first compresses the global memory and then generates a draft answer conditioned on both the query and the compressed memory; this draft is then used as a rewritten query. Since the draft has access to the global memory context, it captures user intent more faithfully and uncovers implicit information needs. Similarly, MemGuide (Du et al., 2025b) leverages the dialogue context to prompt an LLM to produce a concise, command-like phrase that serves as a high-level intent description for retrieval. Beyond directly prompting an LLM to rewrite the query, Rewrite-Retrieve-Read (Ma et al., 2023b) trains a small language model as a dedicated rewriter through reinforcement learning, while ToC (Kim et al., 2023a) employs a Tree of Clarifications to progressively refine and specify the user's retrieval objective.
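HyDE's core trick fits in a short sketch: generate a hypothetical answer document, embed it, and retrieve by similarity to that embedding rather than to the raw query. Both the "generator" and the bag-of-words "embedding" below are toy stand-ins for the zero-shot LLM and dense encoder the method actually uses.

```python
def hypothetical_document(query):
    # placeholder for HyDE's zero-shot LLM generation; here we just
    # append plausible answer phrasing to the query
    return query + " the answer involves chess openings and endgames"

def embed(text):
    # bag-of-words set as a stand-in for a dense embedding
    return set(text.lower().split())

def sim(a, b):
    return len(a & b) / max(len(a | b), 1)

def hyde_retrieve(memory, query):
    # retrieve against the hypothetical document, not the raw query
    doc_vec = embed(hypothetical_document(query))
    return max(memory, key=lambda m: sim(embed(m), doc_vec))

memory = ["notes on chess endgames", "pasta recipes"]
best = hyde_retrieve(memory, "how to improve at chess")
```

The hypothetical document shares vocabulary with the target memory that the raw query lacks, which is exactly the semantic gap rewriting is meant to bridge.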

Summary These two paradigms, decomposition and rewriting, are not mutually exclusive. Auto-RAG (Kim et al., 2024a) integrates both by evaluating HyDE and Visconde under identical retrieval conditions and then selecting the strategy that performs best for the given task. The findings of this work demonstrate that the quality of the memory-retrieval query has a substantial impact on reasoning performance. In contrast to earlier research, which primarily focused on designing sophisticated memory architectures, recent studies (Yan et al., 2025b) place increasing emphasis on the retrieval construction process, shifting the role of memory toward serving retrieval. The choice of what to retrieve with is, unsurprisingly, a critical component of this process.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

Latent Representation

The previous chapters focused on how to build token-level memory; this part focuses on encoding memory into the machine's native latent representation. Latent representation encodes raw experiences into embeddings that reside in a latent space. Unlike semantic compression and structured extraction, which summarize experiences before embedding them into vectors, latent encoding inherently stores experiences in latent space, thereby reducing information loss during summarization and text embedding. Furthermore, latent encoding is more conducive to machine cognition, enabling a unified representation across different modalities and ensuring both density and semantic richness in memory representation.

Textual Latent Representation Although originally designed to accelerate inference, the KV cache can also be viewed as a form of latent representation within the context of memory (Li et al., 2025c; Jiang et al., 2025b). It utilizes additional memory to store past information, thereby avoiding redundant computation. MEMORYLLM (Wang et al., 2024j) and M+ (Wang et al., 2025n) represent memory as self-updatable latent embeddings, which are injected into transformer layers during inference. Moreover, MemGen (Zhang et al., 2025d) introduces a memory trigger to monitor the agent's reasoning state and determine when to explicitly invoke memory, as well as a memory waiver that leverages the agent's current state to construct a latent token sequence. This sequence acts as machine-native memory, enriching the agent's reasoning capabilities.
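The sense in which a KV cache acts as memory can be made concrete with a toy incremental encoder: states computed for a prefix shared with earlier turns are reused rather than recomputed. The class below is an invented illustration of that reuse pattern, not a real transformer cache.

```python
class LatentKVMemory:
    """Toy sketch of a KV cache used as latent memory: per-token states
    are cached so a prefix shared with earlier turns is not recomputed."""

    def __init__(self):
        self.cache = []          # list of (token, state) pairs
        self.compute_calls = 0

    def encode(self, token):
        # stand-in for an expensive transformer layer computation
        self.compute_calls += 1
        return ("state", token)

    def forward(self, tokens):
        # reuse the longest cached prefix, compute only the new suffix
        shared = 0
        while (shared < min(len(self.cache), len(tokens))
               and self.cache[shared][0] == tokens[shared]):
            shared += 1
        self.cache = self.cache[:shared]
        for t in tokens[shared:]:
            self.cache.append((t, self.encode(t)))
        return [state for _, state in self.cache]

mem = LatentKVMemory()
mem.forward(["Hello", "Ada"])
states = mem.forward(["Hello", "Ada", "how", "are", "you"])  # prefix reused
```

Trading storage for recomputation in exactly this way is what lets the KV cache double as a latent record of past context.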

Multimodal Latent Representation In multimodal memory research, CoMEM (Wu et al., 2025d) compresses vision-language inputs into fixed-length tokens via a Q-Former, enabling dense, continuous memory and supporting plug-and-play usage for infinite context lengths. Encode-Store-Retrieve (Shen et al., 2024) converts egocentric video frames into language encodings using Ego-LLaVA, which are subsequently transformed into vector representations through an embedding model. Although embedding models are employed to ensure semantic alignment, these methods often face a trade-off between compression loss and computational overhead, particularly in handling gradient flow in long-context sequences.
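A Q-Former-style compressor reduces any input length to a fixed number of latent slots via cross-attention from learned query vectors. The pure-Python sketch below shows only the pooling mechanics (no learned projections, a single head); the query vectors here are hand-picked rather than trained.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def compress(seq, queries):
    # cross-attention pooling: each query attends over the whole input
    # sequence and yields one fixed slot, so any input length is
    # compressed to len(queries) latent vectors (Q-Former-like)
    out = []
    for q in queries:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in seq])
        out.append([sum(w * k[d] for w, k in zip(scores, seq))
                    for d in range(len(q))])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # variable-length input features
queries = [[1.0, 0.0], [0.0, 1.0]]           # 2 fixed "learned" queries
latent = compress(seq, queries)
```

Whatever the input length, the memory footprint is fixed at the number of query slots, which is what makes this form of latent memory plug-and-play for long contexts.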

When integrated with Embodied AI, multimodal latent memory can fuse data from multiple sensors. For example, Mem2Ego (Zhang et al., 2025m) dynamically aligns global contextual information with local perception, embedding landmark semantics as latent memory to enhance spatial reasoning and decision-making in long-horizon tasks. KARMA (Wang et al., 2025r) adopts a hybrid long- and short-term memory form that encodes object information into multimodal embeddings, achieving a balance between immediate responsiveness and consistent representation. These explorations underscore the advantages of latent encoding in providing unified and semantically rich representations across modalities.

Summary Latent representation bypasses human-readable formats, encoding experiences directly into machine-native vectors or KV caches. This high-density format preserves rich semantic signals that might be lost in text decoding, enables smoother integration with the model's internal computations, and supports multimodal alignment seamlessly. However, it suffers from opaqueness: latent memory is a black box, making it difficult for humans to debug, edit, or verify the knowledge it stores.

Textual Latent Representation

The previous chapters focused on how to build token-level memory; this part focuses on encoding memory into the machine's native latent representation. Latent representation encodes raw experiences into embeddings that reside in a latent space. Unlike semantic compression and structured extraction, which summarize experiences before embedding them into vectors, latent encoding inherently stores experiences in latent space, thereby reducing information loss during summarization and text embedding. Furthermore, latent encoding is more conducive to machine cognition, enabling a unified representation across different modalities and ensuring both density and semantic richness in memory representation.

Textual Latent Representation Although originally designed to accelerate inference, the KV cache can also be viewed as a form of latent representation within the context of memory (Li et al., 2025c; Jiang et al., 2025b). It utilizes additional memory to store past information, thereby avoiding redundant computation. MEMORYLLM (Wang et al., 2024j) and M+ (Wang et al., 2025n) represent memory as self-updatable latent embeddings, which are injected into transformer layers during inference. Moreover, MemGen (Zhang et al., 2025d) introduces a memory trigger to monitor the agent's reasoning state and determine when to explicitly invoke memory, as well as a memory waiver that leverages the agent's current state to construct a latent token sequence. This sequence acts as machine-native memory, enriching the agent's reasoning capabilities.

Multimodal Latent Representation In multimodal memory research, CoMEM (Wu et al., 2025d) compresses vision-language inputs into fixed-length tokens via a Q-Former, enabling dense, continuous memory and supporting plug-and-play usage for infinite context lengths. Encode-Store-Retrieve (Shen et al., 2024) converts egocentric video frames into language encodings using Ego-LLaVA, which are subsequently transformed into vector representations through an embedding model. Although embedding models are employed to ensure semantic alignment, these methods often face a trade-off between compression loss and computational overhead, particularly in handling gradient flow in long-context sequences.

When integrated with Embodied AI, multimodal latent memory can fuse data from multiple sensors. For example, Mem2Ego (Zhang et al., 2025m) dynamically aligns global contextual information with local perception, embedding landmark semantics as latent memory to enhance spatial reasoning and decisionmaking in long-horizon tasks. KARMA (Wang et al., 2025r) adopts a hybrid long- and short-term memory form that encodes object information into multimodal embeddings, achieving a balance between immediate

responsiveness and consistent representation. These explorations underscore the advantages of latent encoding in providing unified and semantically rich representations across modalities.

Summary Latent representation bypasses human-readable formats, encoding experiences directly into machine-native vectors or KV-caches. This high-density format preserves rich semantic signals that might be lost in text decoding, enabling smoother integration with the model's internal computations. And it supports multimodal alignment seamlessly. However, it suffers from opaqueness. The latent memory is a black box, making it difficult for humans to debug, edit, or verify the knowledge it stores.

Multimodal Latent Representation

The previous chapters focused on how to build token-level memory; this part focuses on encoding memory into the machine's native latent representation. Latent representation encodes raw experiences into embeddings that reside in a latent space. Unlike semantic compression and structured extraction, which summarize experiences before embedding them into vectors, latent encoding inherently stores experiences in latent space, thereby reducing information loss during summarization and text embedding. Furthermore, latent encoding is more conducive to machine cognition, enabling a unified representation across different modalities and ensuring both density and semantic richness in memory representation.

Textual Latent Representation Although originally designed to accelerate inference, the KV cache can also be viewed as a form of latent representation within the context of memory (Li et al., 2025c; Jiang et al., 2025b). It utilizes additional memory to store past information, thereby avoiding redundant computation. MEMORYLLM (Wang et al., 2024j) and M+ (Wang et al., 2025n) represent memory as self-updatable latent embeddings, which are injected into transformer layers during inference. Moreover, MemGen (Zhang et al., 2025d) introduces a memory trigger to monitor the agent's reasoning state and determine when to explicitly invoke memory, as well as a memory waiver that leverages the agent's current state to construct a latent token sequence. This sequence acts as machine-native memory, enriching the agent's reasoning capabilities.

Multimodal Latent Representation In multimodal memory research, CoMEM (Wu et al., 2025d) compresses vision-language inputs into fixed-length tokens via a Q-Former, enabling dense, continuous memory and supporting plug-and-play usage for infinite context lengths. Encode-Store-Retrieve (Shen et al., 2024) converts egocentric video frames into language encodings using Ego-LLaVA, which are subsequently transformed into vector representations through an embedding model. Although embedding models are employed to ensure semantic alignment, these methods often face a trade-off between compression loss and computational overhead, particularly in handling gradient flow in long-context sequences.

When integrated with Embodied AI, multimodal latent memory can fuse data from multiple sensors. For example, Mem2Ego (Zhang et al., 2025m) dynamically aligns global contextual information with local perception, embedding landmark semantics as latent memory to enhance spatial reasoning and decisionmaking in long-horizon tasks. KARMA (Wang et al., 2025r) adopts a hybrid long- and short-term memory form that encodes object information into multimodal embeddings, achieving a balance between immediate

responsiveness and consistent representation. These explorations underscore the advantages of latent encoding in providing unified and semantically rich representations across modalities.

Summary Latent representation bypasses human-readable formats, encoding experiences directly into machine-native vectors or KV-caches. This high-density format preserves rich semantic signals that might be lost in text decoding, enabling smoother integration with the model's internal computations. And it supports multimodal alignment seamlessly. However, it suffers from opaqueness. The latent memory is a black box, making it difficult for humans to debug, edit, or verify the knowledge it stores.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts-the what , where , and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain,

these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency, coherence, and adaptability.

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability denotes the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains.

Parametric Internalization

As LLMs increasingly incorporate memory systems to support long-term adaptation, a central research question is how these external memories should be consolidated into parametric form. While the latent representation methods discussed above parameterize memory externally to the model, parametric internalization directly adjusts the model's internal parameters. It leverages the model's capacity to encode and generalize information through its learned parameter space. This paradigm fundamentally enhances the model's intrinsic competence, eliminating the overhead of external storage and retrieval while seamlessly supporting continual updates. As we discussed in Section 4, not all memory content serves the same function: some entries provide declarative knowledge, while others encode procedural strategies that shape an agent's reasoning and behavior. This distinction motivates a finer-grained view of memory internalization, separating it into knowledge internalization and capability internalization.

Knowledge Internalization This strategy entails converting externally stored factual memories, such as conceptual definitions or domain knowledge, into the model's parameter space. Through this process, the model can directly recall and utilize these facts without relying on explicit retrieval or external memory modules. In practice, knowledge internalization is typically achieved through model editing (Sinitsin et al., 2020; De Cao et al., 2021). Early work, such as MEND (Mitchell et al., 2022), introduced an auxiliary network that enables rapid, single-step edits by decomposing fine-tuning gradients, thereby minimizing interference with unrelated knowledge. Building on this line of work, ROME (Meng et al., 2022) refined the editing process by using causal tracing to precisely locate the MLP layers that store specific facts and applying rank-one updates to inject new information with higher precision and better generalization. MEMIT (Meng et al., 2023) further advanced this line by supporting batch edits, enabling thousands of facts to be updated simultaneously through multi-layer residual distributions and batch formulas, which substantially improves scalability. With the rise of parameter-efficient paradigms like LoRA (Hu et al., 2022), knowledge internalization can be performed through lightweight adapters rather than direct parameter modification. For instance, CoLoR (Wistuba et al., 2023) freezes the pretrained Transformer parameters and trains only small LoRA adapters to internalize new knowledge, avoiding the high cost of full-parameter fine-tuning. Despite these advances, these approaches can still incur off-target effects (De Cao et al., 2021) and remain vulnerable to catastrophic forgetting in continual learning scenarios.
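Both the editing and adapter families modify behavior through a low-rank change to a weight matrix. The numpy sketch below is illustrative arithmetic rather than the actual ROME or LoRA algorithms: it contrasts a rank-one edit with a LoRA-style adapter whose B matrix starts at zero, so the adapted model is initially identical to the frozen one:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight

# Rank-one edit in the style of ROME: W' = W + u v^T
u = rng.standard_normal((d_out, 1))
v = rng.standard_normal((d_in, 1))
W_edited = W + u @ v.T

# LoRA-style adapter: y = W x + B (A x), with A, B low-rank and trainable
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                    # B starts at zero: no-op before training
x = rng.standard_normal((d_in,))
y = W @ x + B @ (A @ x)

assert np.allclose(y, W @ x)                        # adapter is inert at init
assert np.linalg.matrix_rank(W_edited - W) == 1     # the edit is exactly rank-one
print("LoRA delta rank <= r:", np.linalg.matrix_rank(B @ A) <= r)
```

The shared design choice is that both deltas live in a low-dimensional subspace, which is what keeps edits cheap and (ideally) localized; the off-target effects noted above arise when that subspace nonetheless overlaps with unrelated knowledge.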

Capability Internalization This strategy seeks to embed experiential knowledge, such as procedural expertise or strategic heuristics, into the model's parameter space. The paradigm represents a memory formation operation in a broad sense, shifting from the acquisition of factual knowledge to the internalization of experiential capabilities. Specifically, these capabilities include domain-specific solution schemas, strategic planning, and the effective deployment of agentic skills, among others. Technically, capability internalization is achieved by learning from reasoning traces, through supervised fine-tuning (Wei et al., 2022; Zelikman et al., 2022; Schick et al., 2023; Mukherjee et al., 2023) or preference-guided optimization methods such as DPO (Rafailov et al., 2023; Tunstall et al., 2023; Yuan et al., 2024c; Grattafiori et al., 2024) and GRPO (Shao et al., 2024; DeepSeek-AI et al., 2025). Bridging external RAG and parameterized training, Memory Decoder (Cao et al., 2025a) is a plug-and-play method: like external RAG, it leaves the base model unmodified, yet it attains the inference speed of parameter-internalized memory by eliminating external retrieval overhead. Such plug-and-play parametric memory may have broad potential.
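For concreteness, the DPO objective on a single preference pair can be written out directly. The log-probabilities below are made-up scalars; real implementations sum token-level log-probs of each trace under the trained policy and a frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective for one (chosen, rejected) pair of trajectories.
    Inputs are sequence log-probs under the policy and the reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# Policy already prefers the chosen trace more than the reference does: low loss.
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# Policy prefers the rejected trace: loss above log(2), pushing a gradient.
bad = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
assert good < math.log(2) < bad
print(round(good, 3), round(bad, 3))
```

At a zero margin the loss equals log 2, the neutral point; minimizing it widens the policy's preference gap relative to the reference, which is precisely how successful reasoning traces get internalized as a behavioral tendency rather than stored as retrievable text.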

Summary Parametric internalization represents the ultimate consolidation of memory, where external knowledge is fused into the model's weights via gradients. This shifts the paradigm from retrieving information to possessing capability, mimicking biological long-term potentiation. As knowledge becomes effectively instinctive, access is zero-latency, enabling the model to respond immediately without querying external memory. However, this approach faces several challenges, including catastrophic forgetting and high update costs. Unlike external memory, parametrically internalized knowledge is difficult to modify or remove precisely without unintended side effects, limiting flexibility and adaptability.

Knowledge Distillation

While semantic summarization captures the global semantics of raw data at a macro level, knowledge distillation operates at a finer granularity, extracting reusable knowledge from interaction trajectories or documents. In a broad sense, knowledge here covers the forms of factual and experiential memory described in Section 4, depending on the function it serves in the task.

Distilling Factual Memory This process focuses on transforming raw interactions and documents into explicit, declarative knowledge regarding users and environmental states. It ensures that the agent maintains consistency and adaptability by retaining verifiable facts rather than transient context. In the domain of user modeling, systems such as TiM (Liu et al., 2023a), RMM (Tan et al., 2025c), and EMem (Zhou and Han, 2025b) employ abstraction mechanisms to convert dialogue turns into high-level thoughts or enriched elementary discourse units, thereby preserving long-term persona coherence. For modeling user objectives, approaches like MemGuide (Du et al., 2025b) extract user intent descriptions from dialogues; during reasoning, MemGuide captures and updates goal states, separating confirmed constraints from unresolved intents to mitigate goal drift. This distillation further extends to multimodal environments, where agents like ESR (Shen et al., 2024) and M3-Agent (Long et al., 2025) compress egocentric visual observations into text-addressable facts about object locations and user routines. Similarly, by equipping agents with multimodal understanding capabilities, Video-RAG (Luo et al., 2025b) converts audio, subtitles, and objects in videos into textual notes, i.e., factual memories, to enhance long-video understanding.

Distilling Experiential Memory This process focuses on extracting the strategies underlying task execution from historical trajectories. By deriving planning principles from successful rollouts and corrective signals from failures, this paradigm enhances the agent's problem-solving ability on specific tasks. Through abstraction and generalization, it further supports cross-task knowledge transfer. As a result, experiential generalization enables the agent to continually refine its competence and move toward lifelong learning.

This line of research aims to derive high-level planning strategies and key insights from both successful and failed trajectories. Some approaches focus on success-based distillation, where systems such as AgentRR (Feng et al., 2025) and AWM (Wang et al., 2024m) summarize overall task plans from successful cases. Memp (Fang et al., 2025d) analyzes and summarizes the gold trajectories from the training set, distilling them into abstract procedural knowledge. Others adopt failure-driven reflection, exemplified by Matrix (Liu et al., 2024), SAGE (Liang et al., 2025), and R2D2 (Huang et al., 2025c), which compare reasoning traces against ground-truth answers to identify error sources and extract reflective insights. Combining both, ExpeL (Zhao et al., 2024), From Experience to Strategy (Xia et al., 2025), and ReMe (Cao et al., 2025b) contrast successful and failed experiences to uncover holistic planning insights.
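A toy version of this contrastive distillation can be written by diffing action sets between paired successful and failed rollouts on the same task, where an ExpeL-style system would instead prompt an LLM to articulate the insight (all task and action names below are hypothetical):

```python
def distill_insights(trajectories):
    """Contrast successful and failed rollouts per task to surface strategy hints.
    Each trajectory is (task, [action, ...], success: bool). Purely illustrative:
    real systems derive insights with an LLM, not a set difference."""
    insights = []
    by_task = {}
    for task, actions, success in trajectories:
        bucket = by_task.setdefault(task, {"ok": [], "fail": []})
        bucket["ok" if success else "fail"].append(actions)
    for task, groups in by_task.items():
        for ok in groups["ok"]:
            for fail in groups["fail"]:
                # Actions present only in the success are candidate key steps.
                for a in (x for x in ok if x not in fail):
                    insights.append(
                        f"[{task}] prefer '{a}': present in success, absent in failure")
    return insights

trajs = [
    ("book-flight", ["search", "filter-by-price", "select", "pay"], True),
    ("book-flight", ["search", "select", "pay"], False),
]
for line in distill_insights(trajs):
    print(line)  # → [book-flight] prefer 'filter-by-price': present in success, absent in failure
```

The design choice being illustrated is the pairing itself: insights are grounded in a success/failure contrast on the same task, which is what makes them prescriptive ("do this step") rather than merely descriptive.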

However, prior work primarily focuses on summarizing task-level planning knowledge, lacking fine-grained, step-level insights. To address this gap, H2R (Ye et al., 2025b) introduces a two-tier reflection mechanism: it follows ExpeL to construct a pool of high-level planning insights, while further segmenting trajectories by subgoal sequences to derive step-wise execution insights.

Earlier methods relied on fixed prompts for insight extraction, making their performance sensitive to prompt design and the underlying LLM's capacity. Recently, trainable distillation methods have become prevalent. Learn-to-Memorize (Zhang et al., 2025u) optimizes task-specific prompts for different agents. Memory-R1 (Yan et al., 2025c) uses an LLMExtract module to obtain experiential and factual knowledge, while only the subsequent fusion component is trained to integrate these outputs into the memory bank. Although such approaches adopt an end-to-end framework, they still fall short of enhancing the LLM's intrinsic ability to distill insights. To overcome this limitation, Memα (Wang et al., 2025p) explicitly trains the LLM on what insights to extract and how to preserve them.

Summary This part focuses on extracting function-specific knowledge from the raw context, without addressing the structure of memory storage. Each piece of knowledge can be viewed as a flat memory unit, and simply storing multiple units in an unstructured table ignores the semantic and hierarchical relations among them. To address this, the memory formation process can apply structured rules to derive insights and store them within a hierarchical architecture. Though simple, the knowledge distillation methods introduced here serve as foundational components for more complex and structured memory formation mechanisms.



Memory Evolution

Memory Formation introduced in Section 5.1 extracts memory from raw data. The next important step is to integrate the newly extracted memories with the existing memory repository, enabling the dynamic evolution of the memory system. A naive strategy is simply appending new entries to the existing memory bank. However, it overlooks the semantic dependencies and potential contradictions between memory entries and neglects the temporal validity of information. To address these limitations, we introduce Memory Evolution. This mechanism consolidates new and existing memories to synthesize high-level insights, resolve logical conflicts, and prune obsolete data. By ensuring the compactness, consistency, and relevance of long-term knowledge, this approach enables the memory system to adapt its cognitive processes and contextual understanding as environments and tasks evolve.

Based on the objectives of memory evolution, we categorize it into the following mechanisms:

Consolidation

Memory consolidation aims to transform newly acquired short-term traces into structured and generalizable long-term knowledge. Its core mechanism is to identify semantic relationships between new and existing memories and to integrate them into higher-level abstractions or insights. This process serves two main purposes. First, it reorganizes fragmented pieces of information into coherent structures, preventing the loss of critical details during short-term retention and enabling the formation of stable knowledge schemas. Second, by abstracting, compressing, and generalizing experiential data, consolidation extracts reusable patterns from specific events, yielding insights that support cross-task generalization.

A central challenge is determining the granularity at which new memories should be matched and merged with existing ones. Prior work spans a spectrum of consolidation strategies, from local content merging to cluster-level fusion and global integration.

Figure 9 The landscape of Memory Evolution mechanisms. We categorize the evolution process into three distinct branches that maintain the central MemoryDatabase: (a) Consolidation synthesizes insights by processing raw materials through local consolidation, cluster fusion, and global integration; (b) Updating ensures accuracy and consistency by performing conflict resolution on external databases and applying parameter updates to the internal model; and (c) Forgetting optimizes efficiency by pruning data based on specific criteria: time expiration, low access frequency, and low informational value. The outer ring displays representative frameworks and agents associated with each evolutionary mechanism.

Local Consolidation This operation focuses on fine-grained updates involving highly similar memory fragments. In RMM (Tan et al., 2025c), each new topic memory retrieves its top-K most similar candidates, and an LLM decides whether merging is appropriate, thereby reducing the risk of incorrect generalization. In multimodal settings, VLN (Song et al., 2025b) triggers a pooling mechanism when capacity is saturated. It identifies the most similar or redundant memory pairs and compresses them into higher-level abstractions. These approaches refine detailed knowledge while preserving the global structure of the memory store, improving precision and storage efficiency. However, they cannot fully capture cluster-level relations or the higher-order dependencies that emerge across semantically related memories.
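The RMM-style local consolidation step can be sketched as retrieve-then-judge. In the sketch below, a cosine threshold stands in for the LLM merge decision, and memories are (text, embedding) pairs with hypothetical contents:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def local_consolidate(store, new_entry, k=2, threshold=0.9, should_merge=None):
    """Retrieve the top-k most similar existing memories and merge only when a
    judge approves. `should_merge` stands in for the LLM judge; by default it
    is just a cosine threshold. Memories are (text, embedding) tuples."""
    should_merge = should_merge or (lambda a, b: cosine(a[1], b[1]) >= threshold)
    ranked = sorted(store, key=lambda m: cosine(m[1], new_entry[1]), reverse=True)[:k]
    for cand in ranked:
        if should_merge(cand, new_entry):
            merged_text = cand[0] + " | " + new_entry[0]
            vec = [(x + y) / 2 for x, y in zip(cand[1], new_entry[1])]
            store.remove(cand)
            store.append((merged_text, vec))
            return store
    store.append(new_entry)   # nothing similar enough: keep as a new entry
    return store

store = [("likes tea", [1.0, 0.0]), ("owns a dog", [0.0, 1.0])]
local_consolidate(store, ("likes green tea", [0.98, 0.05]))
print([m[0] for m in store])  # → ['owns a dog', 'likes tea | likes green tea']
```

Gating the merge behind a judge (rather than always merging the nearest neighbor) is what limits incorrect generalization: dissimilar entries fall through to the append branch and survive as distinct memories.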

Cluster-level Fusion Adopting cluster-level fusion is essential for capturing cross-instance regularities as memory grows. Across clusters, PREMem (Kim et al., 2025b) aligns new memory clusters with similar existing ones and applies fusion modes such as generalization and refinement to form higher-order reasoning units, substantially improving interpretability and reasoning depth. EverMemOS (Hu et al., 2026a) computes the similarity between a newly generated MemCell and the centroids of all MemScenes, and merges it into the MemScene that is sufficiently similar. Within a cluster, TiM (Liu et al., 2023a) periodically invokes an LLM to examine memories that share the same hashing bucket and merges semantically redundant entries. CAM (Li et al., 2025g) merges all nodes within the target cluster into a representative summary, yielding higher-level and consistent cross-sample representations. These methods reorganize the memory structure at a broader scale and mark an important step toward structured knowledge.
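Centroid-based routing in the spirit of EverMemOS's MemCell-to-MemScene matching can be sketched as follows; the threshold, the running-mean centroid, and the plain-list vectors are all simplifying assumptions:

```python
def assign_to_scene(scenes, cell, threshold=0.8):
    """Merge a new memory cell (embedding) into the scene whose centroid is
    similar enough, else open a new scene. Each scene tracks its centroid as
    the running mean of its members' embeddings."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))

    best = max(scenes, key=lambda s: cos(s["centroid"], cell), default=None)
    if best is not None and cos(best["centroid"], cell) >= threshold:
        n = len(best["members"])
        # Incremental mean keeps the centroid consistent with all members.
        best["centroid"] = [(c * n + x) / (n + 1)
                            for c, x in zip(best["centroid"], cell)]
        best["members"].append(cell)
    else:
        scenes.append({"centroid": list(cell), "members": [cell]})
    return scenes

scenes = []
for cell in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]):
    assign_to_scene(scenes, cell)
print(len(scenes), [len(s["members"]) for s in scenes])  # 2 scenes with sizes [2, 1]
```

The centroid comparison is what distinguishes cluster-level fusion from the pairwise matching of local consolidation: a new entry is compared against a summary of each group, so cross-instance regularities accumulate in the centroid rather than in any single memory.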

Global Integration This operation performs holistic consolidation to maintain global coherence and to distill system-level insights from accumulated experience. The semantic summarization of Section 5.1.1 derives a global summary from the existing context and can be viewed as the initial construction of that summary; in contrast, global integration emphasizes how new information is merged into an existing summary as additional data arrives. For user factual memory, MOOM (Chen et al., 2025e) constructs stable role profiles by integrating temporary role snapshots with historical traces using rule-based processing, embedding methods, and LLM-driven abstraction. For experiential memory, Matrix (Liu et al., 2024) performs iterative optimization to combine execution trajectories and reflective insights with global memory, distilling task-agnostic principles that support reuse across scenarios. As single-step reasoning contexts and environmental feedback lengthen, methods like AgentFold (Ye et al., 2025a) and Context Folding (Zhang et al., 2025r) internalize the ability to compress working memory. In multi-step interactions, such as web navigation, these methods automatically summarize and condense the global context after each step, supporting efficient and effective reasoning. Global integration thus consolidates high-level, structured knowledge from the complete history of experience, providing a reliable contextual foundation while improving generalization, reasoning accuracy, and personalized decision-making.
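The step-wise folding idea can be illustrated with a budgeted running summary. Systems such as AgentFold rewrite the summary with an LLM; this sketch merely truncates old entries to their step labels once a character budget is exceeded (all strings hypothetical):

```python
def fold_context(summary, step, budget=80):
    """Append each step's outcome to a running summary, then compress the
    oldest entries to bare step labels until the summary fits the budget.
    A rule-based stand-in for LLM-driven context folding."""
    summary = summary + [step]
    i = 0
    while sum(len(s) for s in summary) > budget and i < len(summary) - 1:
        summary[i] = summary[i].split(":")[0]   # keep only the step label
        i += 1
    return summary

ctx = []
for step in ["step1: opened search page",
             "step2: queried 'hotels in Rome'",
             "step3: sorted results by rating"]:
    ctx = fold_context(ctx, step, budget=60)
print(ctx)  # old steps are folded to labels; the latest step stays verbatim
```

The invariant this illustrates is recency-weighted detail: the most recent step remains fully specified for the next decision, while older steps persist only as compressed traces, keeping the working context bounded across long horizons.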

Summary Consolidation is the cognitive process of reorganizing fragmented short-term traces into coherent long-term schemas. It moves beyond simple storage to synthesize connections between isolated entries, forming a structured worldview. It enhances generalization and reduces storage redundancy. However, it risks information smoothing, where outlier events or unique exceptions are lost during the abstraction process, potentially reducing the agent's sensitivity to anomalies and specific events.



The preceding sections focused on how to build token-level memory; this part turns to encoding memory in the machine's native latent representation. Latent representation encodes raw experiences into embeddings that reside in a latent space. Unlike semantic compression and structured extraction, which summarize experiences before embedding them into vectors, latent encoding stores experiences directly in latent space, avoiding the information loss incurred by summarization and text embedding. Latent encoding is also better suited to machine cognition: it enables a unified representation across modalities and keeps memory representations both dense and semantically rich.

Textual Latent Representation Although originally designed to accelerate inference, the KV cache can also be viewed as a form of latent representation within the context of memory (Li et al., 2025c; Jiang et al., 2025b). It utilizes additional memory to store past information, thereby avoiding redundant computation. MEMORYLLM (Wang et al., 2024j) and M+ (Wang et al., 2025n) represent memory as self-updatable latent embeddings, which are injected into transformer layers during inference. Moreover, MemGen (Zhang et al., 2025d) introduces a memory trigger that monitors the agent's reasoning state and determines when to explicitly invoke memory, as well as a memory weaver that leverages the agent's current state to construct a latent token sequence. This sequence acts as machine-native memory, enriching the agent's reasoning capabilities.

Multimodal Latent Representation In multimodal memory research, CoMEM (Wu et al., 2025d) compresses vision-language inputs into fixed-length tokens via a Q-Former, enabling dense, continuous memory and supporting plug-and-play usage for infinite context lengths. Encode-Store-Retrieve (Shen et al., 2024) converts egocentric video frames into language encodings using Ego-LLaVA, which are subsequently transformed into vector representations through an embedding model. Although embedding models are employed to ensure semantic alignment, these methods often face a trade-off between compression loss and computational overhead, particularly in handling gradient flow in long-context sequences.

When integrated with Embodied AI, multimodal latent memory can fuse data from multiple sensors. For example, Mem2Ego (Zhang et al., 2025m) dynamically aligns global contextual information with local perception, embedding landmark semantics as latent memory to enhance spatial reasoning and decision-making in long-horizon tasks. KARMA (Wang et al., 2025r) adopts a hybrid long- and short-term memory form that encodes object information into multimodal embeddings, achieving a balance between immediate responsiveness and consistent representation. These explorations underscore the advantages of latent encoding in providing unified and semantically rich representations across modalities.

Summary Latent representation bypasses human-readable formats, encoding experiences directly into machine-native vectors or KV-caches. This high-density format preserves rich semantic signals that might be lost in text decoding, enabling smoother integration with the model's internal computations, and it supports seamless multimodal alignment. However, it suffers from opaqueness: latent memory is a black box, making it difficult for humans to debug, edit, or verify the knowledge it stores.


Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Updating

Memory Update refers to the process by which an agent revises or replaces its existing memory when conflicts arise or new information is acquired. The goal is to maintain factual consistency and continual adaptation without full model retraining. Unlike memory consolidation described in Section 5.2.1, which focuses on abstraction and generalization, memory update emphasizes localized correction and synchronization, enabling the agent to remain aligned with an evolving environment.

Through continuous updating, agentic memory systems preserve the accuracy and timeliness of knowledge, preventing outdated information from biasing reasoning. It is thus a core mechanism for achieving lifelong learning and self-evolution. Depending on where the memory resides, updates fall into two categories: (1) External Memory Update: updates to external memory stores and (2) Model Editing: model-internal editing within the parameter space.

External Memory Update Entries in vector databases or knowledge graphs are revised whenever contradictions or new facts emerge. Instead of altering model weights, this approach maintains factual alignment through dynamic modifications of external storage. Static memories inevitably accumulate outdated or conflicting entries, leading to logical inconsistencies and reasoning errors. Updating external memories enables lightweight corrections while avoiding the cost of full retraining or re-indexing.

The development of external memory update mechanisms has progressed along a trajectory, moving from rule-based corrections to temporally aware soft deletion, then to delayed-consistency strategies, and ultimately to fully learned update policies. Early systems such as MemGPT (Packer et al., 2023a), D-SMART (Lei et al., 2025), and Mem0 g (Chhikara et al., 2025) followed a straightforward pipeline in which the LLM detects conflicts between new information and existing memories and then invokes replace or delete operations to update the memory. Although effective for basic factual repair, these systems relied on destructive replacement, erasing valuable historical context and breaking temporal continuity. To address this issue, Zep (Rasmussen et al., 2025) introduced temporal annotations, marking conflicting facts with invalid timestamps rather than deleting them, thereby preserving both semantic consistency and temporal integrity. This marked a shift from hard replacement to soft, time-aware updating. However, real-time updates impose significant computational and I/O burdens under high-frequency interaction. MOOM (Chen et al., 2025e) and LightMem (Fang et al., 2025b) therefore introduced dual-phase updating: a soft online update for real-time responsiveness, followed by an offline reflective consolidation phase where similar entries are merged and conflicts resolved via LLM reasoning. This eventual-consistency paradigm balances latency and coherence. As agentic reinforcement learning matured, it became possible to enhance the LLM's intrinsic memory update decision-making through reinforcement learning. Memα (Wang et al., 2025p) formulated memory updating as a policy-learning problem, enabling the LLM to learn when, how, and whether to update, thereby achieving dynamic trade-offs between stability and freshness.

Overall, external memory updates have transitioned from manually triggered corrections to self-regulated, temporally aware learning processes, maintaining factual consistency and structural stability through LLM-driven retrieval, conflict detection, and revision.
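A minimal sketch of the soft, time-aware update pattern attributed to Zep, where a conflicting fact is invalidated with a timestamp rather than deleted. Class and field names here are illustrative, not Zep's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    subject: str
    value: str
    valid_at: float
    invalid_at: Optional[float] = None  # None means still current

class FactStore:
    """Soft, time-aware updates: conflicting facts are invalidated
    with a timestamp instead of being deleted, so the full temporal
    history of a subject remains queryable."""
    def __init__(self):
        self.facts = []

    def upsert(self, subject: str, value: str, now: float):
        # Conflict detection: any still-current fact about the same
        # subject with a different value gets soft-deleted.
        for f in self.facts:
            if f.subject == subject and f.invalid_at is None and f.value != value:
                f.invalid_at = now
        self.facts.append(Fact(subject, value, valid_at=now))

    def current(self, subject: str) -> Optional[str]:
        for f in reversed(self.facts):
            if f.subject == subject and f.invalid_at is None:
                return f.value
        return None
```

Real systems delegate the conflict check itself to an LLM, since "different value" is rarely a simple string comparison.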

Model Editing Model editing performs direct modifications within the model's parameter space to correct or inject knowledge without full retraining, representing implicit knowledge updates. Retraining is costly and prone to catastrophic forgetting. Model editing enables precise, low-cost corrections that enhance adaptability and internal knowledge retention.

Approaches to model editing fall into two main categories. (1) Explicit localization and modification: ROME (Tan et al., 2025b) identifies the parameter region encoding specific knowledge via causal tracing and performs targeted weight updates; Model Editor Networks (Tang et al., 2025c) trains an auxiliary meta-editor network to predict optimal parameter adjustments. (2) Latent-space self-updating: MEMORYLLM (Xu et al., 2025c) embeds a memory pool within Transformer layers, periodically replacing memory tokens to integrate new knowledge; M+ (Wang et al., 2025n) maintains dual-layer memories, discarding obsolete short-term entries and compressing key information into long-term storage.

Hybrid approaches such as ChemAgent (Tang et al., 2025c) further combine external memory updates with internal model editing, synchronizing factual and representational changes for rapid cross-domain adaptation.

Summary From an implementation standpoint, memory updating focuses on resolving conflicts and revising knowledge triggered by the arrival of new memories, whereas memory consolidation emphasizes the integration and abstraction of new and existing knowledge. The two memory updating strategies discussed above establish a dual-pathway mechanism involving conflict resolution in external databases and parameter editing within the model, enabling agents to perform continuous self-correction and support long-term evolution. The key challenge is the stability-plasticity dilemma: determining when to overwrite existing knowledge versus when to treat new information as noise. Incorrect updates can overwrite critical information, leading to knowledge degradation and faulty reasoning.


Experiential memory encapsulates the mechanism by which agents encode historical trajectories, distilled strategies, and interaction outcomes into durable, retrievable representations. Unlike working memory, which manages transient context, experiential memory focuses on the long-term accumulation and transfer of knowledge across distinct episodes.

Theoretically grounded in cognitive science, this paradigm parallels human nondeclarative memory , specifically the procedural and habit systems (Squire, 2004; Seger and Spiering, 2011). Biological systems rely on distributed neural circuits for implicit skill acquisition (Reber, 2013). In contrast, agentic experiential memory typically employs explicit data structures, such as vector databases or symbolic logs. This implementation difference grants agents a unique capability absent in biological counterparts: the ability to introspect, edit, and reason over their own procedural knowledge.

Crucially, experiential memory serves as a foundation for continual learning and self-evolution in the era of experience (Sutton, 2025; Gao et al., 2025). By maintaining a repository of structured experiences, agents achieve a non-parametric path to adaptation and avoid the prohibitive costs of frequent parametric updates. This mechanism effectively closes the learning loop by converting interaction feedback into reusable knowledge. Through this process, agents rectify past errors, abstract generalizable heuristics, and compile routine behaviors. Consequently, such adaptation minimizes redundant computations and refines decision-making over time (Zhao et al., 2024; Shinn et al., 2023b).

To systematically analyze existing literature, we classify experiential memory based on the abstraction level of the stored information. An overview of this abstraction-based taxonomy and representative paradigms is illustrated in Figure 7. Representative methods under this abstraction-based taxonomy, together with their storage carriers, representation forms, and optimization strategies, are summarized in Table 5.


Memory forgetting refers to the deliberate removal of outdated, redundant, or low-value information to free capacity and maintain focus on salient knowledge. Unlike update mechanisms, which resolve conflicts between memories, forgetting prioritizes eliminating outdated information to ensure efficiency and relevance. Over time, unbounded memory accumulation leads to increased noise, retrieval delays, and interference from outdated knowledge. Controlled forgetting helps mitigate overload and maintain cognitive focus. Yet, overly aggressive pruning risks erasing rare but essential knowledge, harming reasoning continuity in long-term contexts.

Forgetting mechanisms can be categorized into Time-based Forgetting, Frequency-based Forgetting, and Importance-driven Forgetting, corresponding respectively to creation time, retrieval activity, and integrated semantic valuation.

Time-based Forgetting Time-driven forgetting considers only the creation time of memories, gradually decaying their strength over time to emulate human memory fading. MemGPT (Packer et al., 2023a) evicts the earliest messages upon context overflow. Xu et al. (2025c) and Wang et al. (2025n) employ stochastic token replacement, with a replacement ratio of K/N, to simulate exponential forgetting in human cognition, discarding the oldest entries once the pool exceeds capacity. Unlike explicit deletion of old memories, MAICC (Jiang et al., 2025c) implements soft forgetting by gradually decaying the weights of memories over time. This process mirrors natural forgetting, ensuring continuous adaptation without historical overload.
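Both flavors of time-based forgetting above can be sketched in a few lines: stochastic slot replacement over a fixed-size pool (in the spirit of the K/N token replacement), and exponential weight decay (in the spirit of MAICC's soft forgetting). Function names and parameters are illustrative:

```python
import random

def stochastic_replace(pool, new_items, rng=random):
    """Each incoming item overwrites a uniformly chosen slot, so any
    old entry survives t updates with probability ((N-1)/N)^t, i.e.
    an exponential-style forgetting curve over a fixed-size pool."""
    n = len(pool)
    for item in new_items:
        pool[rng.randrange(n)] = item
    return pool

def decayed_weight(created_at, now, half_life=3600.0):
    # Soft forgetting: a memory's weight halves every `half_life`
    # seconds; low-weight memories are down-ranked, not deleted.
    return 0.5 ** ((now - created_at) / half_life)
```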

Frequency-based Forgetting Frequency-driven forgetting prioritizes memory based on retrieval behavior, retaining frequently accessed entries while discarding inactive ones. XMem (Cheng and Schwing, 2022) employs an LFU policy to remove low-frequency entries; KARMA (Wang et al., 2025r) uses counting Bloom filters to track access frequency; MemOS (Li et al., 2025l) applies an LRU strategy, removing long-unused items while archiving highly active ones. This ensures efficient retrieval and storage equilibrium. By distinguishing between creation time and retrieval frequency, these two axes form a more orthogonal taxonomy: time-based decay captures natural temporal aging, while frequency-based forgetting reflects usage dynamics, together maintaining system efficiency and recency.
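An LFU eviction policy of the kind XMem employs can be sketched as follows. This is a generic LFU sketch, not XMem's implementation:

```python
from collections import Counter

class LFUMemory:
    """Frequency-based forgetting: track access counts and evict the
    least-frequently-used entry when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.hits = Counter()

    def get(self, key):
        if key in self.store:
            self.hits[key] += 1  # retrieval counts as a "use"
        return self.store.get(key)

    def put(self, key, value):
        if key not in self.store and len(self.store) >= self.capacity:
            # evict the entry with the fewest recorded accesses
            victim = min(self.store, key=lambda k: self.hits[k])
            del self.store[victim]
            del self.hits[victim]
        self.store[key] = value
        self.hits[key] += 0  # ensure a counter entry exists
```

Swapping `min(...)` for a recency-ordered structure yields the LRU variant used by MemOS.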

Importance-driven Forgetting Importance-driven forgetting integrates temporal, frequency, and semantic signals to retain high-value knowledge while pruning redundancy. Early works such as Zhong et al. (2024) and Chen et al. (2025e) quantified importance via composite scores combining temporal decay and access frequency, achieving numeric-based selective forgetting. Later methods evolved toward semantic-level evaluation: VLN (Song et al., 2025b) pools semantically redundant memories via similarity clustering, while Livia (Xi and Wang, 2025) incorporates emotional salience and contextual relevance to model emotion-driven selective forgetting. As LLMs develop increasingly powerful judgment capabilities, TiM (Liu et al., 2023a) and MemTool (Lumer et al., 2025) leverage LLMs to assess memory importance and explicitly prune or forget less important memories. This shift reflects a transition from static numeric scoring to semantic intelligence. Agents can now perform conscious forgetting and selectively retain memories most pertinent to the task context, semantics, and affective cues.
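A composite retention score in the spirit of the early numeric approaches combines recency decay, access frequency, and an assigned importance value; the weights, half-life, and log-scaled frequency term are illustrative assumptions:

```python
import math

def retention_score(mem, now, half_life=86400.0,
                    w_rec=1.0, w_freq=1.0, w_imp=1.0):
    """Composite forgetting score: recency decays exponentially,
    frequency rewards repeated access (log-scaled to damp heavy
    hitters), and `importance` is a model- or rule-assigned value."""
    recency = 0.5 ** ((now - mem["last_access"]) / half_life)
    frequency = math.log1p(mem["access_count"])
    return w_rec * recency + w_freq * frequency + w_imp * mem["importance"]

def prune(memories, now, keep=2):
    # Retain the top-`keep` entries by score; the rest are forgotten.
    ranked = sorted(memories, key=lambda m: retention_score(m, now),
                    reverse=True)
    return ranked[:keep]
```

LLM-judged approaches like TiM and MemTool effectively replace the fixed scoring function with a model call, but the select-and-prune loop is the same.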

Summary Time-based decay reflects the natural temporal fading of memory, frequency-based forgetting ensures efficient access to frequently used memories, and importance-driven forgetting introduces semantic discernment. These three forgetting mechanisms jointly govern how agentic memory remains timely, efficiently accessible, and semantically relevant. However, heuristic forgetting mechanisms like LRU may eliminate long-tail knowledge that is seldom accessed but essential for correct decision-making. Therefore, when storage cost is not a critical constraint, many memory systems avoid deleting memories outright.

The above discussion describes how prior work manually designs memory architectures at different stages, enabling agents' memory contents to evolve online. More recently, MemEvolve (Zhang et al., 2025h) proposes a meta-evolutionary framework that jointly evolves both agents' experiential knowledge and their underlying memory architecture, allowing the memory framework itself to continuously learn and adapt.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts-the what , where , and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain,

these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance. · Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances. · Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entitycentric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Forgetting

Memory forgetting refers to the deliberate removal of outdated, redundant, or low-value information to free capacity and maintain focus on salient knowledge. Unlike update mechanisms, which resolve conflicts between memories, forgetting prioritizes eliminating outdated information to ensure efficiency and relevance. Over time, unbounded memory accumulation leads to increased noise, retrieval delays, and interference from outdated knowledge. Controlled forgetting helps mitigate overload and maintain cognitive focus. Yet, overly aggressive pruning risks erasing rare but essential knowledge, harming reasoning continuity in long-term contexts.

Forgetting mechanisms can be categorized into Time-based Forgetting, Frequency-based Forgetting, and Importance-driven Forgetting, corresponding respectively to creation time, retrieval activity, and integrated semantic valuation.

Time-based Forgetting Time-driven forgetting considers only the creation time of memories, gradually decaying their strength over time to emulate human memory fading. MemGPT (Packer et al., 2023a) evicts the earliest messages upon context overflow. Xu et al. (2025c) and Wang et al. (2025n) employ stochastic token replacement, with a replacement ratio of K/N, to simulate exponential forgetting in human cognition, discarding the oldest entries once the pool exceeds capacity. Unlike explicit deletion of old memories, MAICC (Jiang et al., 2025c) implements soft forgetting by gradually decaying the weights of memories over time. This process mirrors natural forgetting, ensuring continuous adaptation without historical overload.

Frequency-based Forgetting Frequency-driven forgetting prioritizes memory based on retrieval behavior, retaining frequently accessed entries while discarding inactive ones. XMem (Cheng and Schwing, 2022) employs an LFU policy to remove low-frequency entries; KARMA (Wang et al., 2025r) uses counting Bloom filters to track access frequency; MemOS (Li et al., 2025l) applies an LRU strategy, removing long-unused items while archiving highly active ones. This ensures efficient retrieval and storage equilibrium. By distinguishing between creation time and retrieval frequency, these two axes form a more orthogonal taxonomy: time-based decay captures natural temporal aging, while frequency-based forgetting reflects usage dynamics, together maintaining system efficiency and recency.

Importance-driven Forgetting Importance-driven forgetting integrates temporal, frequency, and semantic signals to retain high-value knowledge while pruning redundancy. Early works such as Zhong et al. (2024) and Chen et al. (2025e) quantified importance via composite scores combining temporal decay and access frequency, achieving numeric-based selective forgetting. Later methods evolved toward semantic-level evaluation: VLN (Song et al., 2025b) pools semantically redundant memories via similarity clustering, while Livia (Xi and Wang, 2025) incorporates emotional salience and contextual relevance to model emotion-driven selective forgetting. As LLMs develop increasingly powerful judgment capabilities, TiM (Liu et al., 2023a) and MemTool (Lumer et al., 2025) leverage LLMs to assess memory importance and explicitly prune or forget less important memories. This shift reflects a transition from static numeric scoring to semantic intelligence. Agents can now perform conscious forgetting and selectively retain memories most pertinent to the task context, semantics, and affective cues.

Summary Time-based decay reflects the natural temporal fading of memory, frequency-based forgetting ensures efficient access to frequently used memories, and importance-driven forgetting introduces semantic discernment. These three forgetting mechanisms jointly govern how agentic memory remains timely, efficiently accessible, and semantically relevant. However, heuristic forgetting mechanisms like LRU may eliminate long-tail knowledge, which is seldom accessed but essential for correct decision-making. Therefore, when storage cost is not a critical constraint, many memory systems avoid directly deleting certain memories.

The above discussion describes how prior work manually designs memory architectures at different stages, enabling agents' memory contents to evolve online. More recently, MemEvolve (Zhang et al., 2025h) proposes a meta-evolutionary framework that jointly evolves both agents' experiential knowledge and their underlying memory architecture, allowing the memory framework itself to continuously learn and adapt.

Time-based Forgetting

Memory forgetting refers to the deliberate removal of outdated, redundant, or low-value information to free capacity and maintain focus on salient knowledge. Unlike update mechanisms, which resolve conflicts between memories, forgetting prioritizes eliminating outdated information to ensure efficiency and relevance. Over time, unbounded memory accumulation leads to increased noise, retrieval delays, and interference from outdated knowledge. Controlled forgetting helps mitigate overload and maintain cognitive focus. Yet, overly aggressive pruning risks erasing rare but essential knowledge, harming reasoning continuity in long-term contexts.

Forgetting mechanisms can be categorized into Time-based Forgetting, Frequency-based Forgetting, and Importance-driven Forgetting, corresponding respectively to creation time, retrieval activity, and integrated semantic valuation.

Time-based Forgetting Time-driven forgetting considers only the creation time of memories, gradually decaying their strength over time to emulate human memory fading. MemGPT (Packer et al., 2023a) evicts the earliest messages upon context overflow. Xu et al. (2025c) and Wang et al. (2025n) employ stochastic token replacement, with a replacement ratio of K/N, to simulate exponential forgetting in human cognition, discarding the oldest entries once the pool exceeds capacity. Unlike explicit deletion of old memories, MAICC (Jiang et al., 2025c) implements soft forgetting by gradually decaying the weights of memories over time. This process mirrors natural forgetting, ensuring continuous adaptation without historical overload.

Frequency-based Forgetting Frequency-driven forgetting prioritizes memory based on retrieval behavior, retaining frequently accessed entries while discarding inactive ones. XMem (Cheng and Schwing, 2022) employs an LFU policy to remove low-frequency entries; KARMA (Wang et al., 2025r) uses counting Bloom filters to track access frequency; MemOS (Li et al., 2025l) applies an LRU strategy, removing long-unused items while archiving highly active ones. This ensures efficient retrieval and storage equilibrium. By distinguishing between creation time and retrieval frequency, these two axes form a more orthogonal taxonomy: time-based decay captures natural temporal aging, while frequency-based forgetting reflects usage dynamics, together maintaining system efficiency and recency.

Importance-driven Forgetting Importance-driven forgetting integrates temporal, frequency, and semantic signals to retain high-value knowledge while pruning redundancy. Early works such as Zhong et al. (2024) and Chen et al. (2025e) quantified importance via composite scores combining temporal decay and access frequency, achieving numeric-based selective forgetting. Later methods evolved toward semantic-level evaluation: VLN (Song et al., 2025b) pools semantically redundant memories via similarity clustering, while Livia (Xi and Wang, 2025) incorporates emotional salience and contextual relevance to model emotion-driven selective forgetting. As LLMs develop increasingly powerful judgment capabilities, TiM (Liu et al., 2023a) and MemTool (Lumer et al., 2025) leverage LLMs to assess memory importance and explicitly prune or forget less important memories. This shift reflects a transition from static numeric scoring to semantic intelligence. Agents can now perform conscious forgetting and selectively retain memories most pertinent to the task context, semantics, and affective cues.

Summary Time-based decay reflects the natural temporal fading of memory, frequency-based forgetting ensures efficient access to frequently used memories, and importance-driven forgetting introduces semantic discernment. These three forgetting mechanisms jointly govern how agentic memory remains timely, efficiently accessible, and semantically relevant. However, heuristic forgetting mechanisms like LRU may eliminate long-tail knowledge, which is seldom accessed but essential for correct decision-making. Therefore, when storage cost is not a critical constraint, many memory systems avoid directly deleting certain memories.

The above discussion describes how prior work manually designs memory architectures at different stages, enabling agents' memory contents to evolve online. More recently, MemEvolve (Zhang et al., 2025h) proposes a meta-evolutionary framework that jointly evolves both agents' experiential knowledge and their underlying memory architecture, allowing the memory framework itself to continuously learn and adapt.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts, that is, the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum. Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
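The episodic-to-semantic pipeline above can be sketched as a single pass over logged traces. In practice the extraction step is an LLM summarization or fact-induction call; here it is stubbed with an arbitrary callable, and all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    turn: int
    speaker: str
    text: str

def induce_facts(episodes, extract):
    """Transform raw episodic traces into a deduplicated semantic
    fact base. `extract` maps one episode to a list of fact strings
    (an LLM call in a real system)."""
    fact_base = []
    seen = set()
    for ep in episodes:
        for fact in extract(ep):
            key = fact.lower()
            if key not in seen:  # deduplication / consistency check
                seen.add(key)
                fact_base.append(fact)
    return fact_base
```

Real systems would additionally index `fact_base` into a vector store or knowledge graph; the point here is only the episodic-to-semantic transformation with deduplication.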

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.
· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.
· Adaptability denotes the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Memory Retrieval

Building on the memory bank established in Section 5.1 and Section 5.2, the next critical step is how to retrieve and utilize memories during reasoning. We define memory retrieval as the process of extracting relevant and concise knowledge fragments from a given memory repository, at the right moment, to support the current reasoning task. The key challenge lies in efficiently and accurately locating the required knowledge fragments within a large-scale memory store. To address this, many algorithms employ heuristic strategies or learnable models to optimize various stages of the retrieval process. Based on the execution order of retrieval, this process can be decomposed into four aspects. Figure 10 provides a structured overview of this retrieval pipeline, organizing existing methods according to their roles across retrieval stages.

Retrieval Timing and Intent

The retrieval intent and timing determine when to trigger the retrieval mechanism and which memory store to query. Existing memory systems adopt different design choices in this regard, ranging from always-on retrieval to retrieval triggered by explicit instructions or internal signals (Zhao et al., 2024; Wang et al., 2025p; Fang et al., 2025b). For example, MIRIX (Wang and Chen, 2025) performs retrieval from all six memory databases for each query and concatenates the retrieved contents, reflecting a design that prioritizes comprehensive memory access. Other approaches instead aim to trigger retrieval more selectively, allowing the model to decide both the timing and scope of memory access, which can lead to more targeted and efficient use of memory resources. In this subsection, we review the literature from two complementary perspectives: automated retrieval timing and automated retrieval intent.

Automated Retrieval Timing This term refers to the model's ability to autonomously determine when to trigger a memory retrieval operation during reasoning. The simplest strategy is to delegate the decision to either the LLM or an external controller, allowing it to determine solely from the query whether retrieval is necessary. For example, MemGPT (Packer et al., 2023a) and MemTool (Lumer et al., 2025) allow the LLM itself to invoke retrieval functions, enabling efficient access to external memory within an operating-system-like framework. However, these methods rely on static judgments from the query alone, neglecting the model's dynamically evolving cognitive state during reasoning.

To address this limitation, recent work integrates fast-slow thinking mechanisms into retrieval timing. ComoRAG (Wang et al., 2025f) and PRIME (Tran et al., 2025), for instance, first produce a fast response and then let the agent evaluate its adequacy. If the initial reasoning is deemed insufficient, the system triggers deeper retrieval and reasoning based on failure feedback. MemGen (Zhang et al., 2025d) further refines the triggering mechanism by converting the explicit agent-level decision into a latent, trainable process. It introduces memory triggers that detect critical retrieval moments from latent rollout states, thereby improving the precision of retrieval timing while preserving end-to-end differentiability.
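The fast-slow triggering pattern can be sketched as a control loop in which retrieval fires only when a self-check rejects the draft answer. Every callable below is a placeholder for an LLM, judge, or retriever component, not an API of the cited systems:

```python
def answer_with_adaptive_retrieval(query, fast_answer, is_adequate,
                                   retrieve, slow_answer):
    """Fast path: answer from parametric knowledge alone.
    Slow path: triggered only when the self-check deems the draft
    inadequate, mirroring fast-slow retrieval timing."""
    draft = fast_answer(query)
    if is_adequate(query, draft):
        return draft, False      # retrieval skipped entirely
    memories = retrieve(query)   # deeper retrieval on the failure signal
    return slow_answer(query, memories), True
```

MemGen-style triggers would replace `is_adequate` with a learned detector over latent rollout states rather than an explicit judgment on the draft text.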

Automated Retrieval Intent This aspect concerns the model's ability to autonomously decide which memory source to access within a hierarchical storage form. AgentRR (Feng et al., 2025), for example, dynamically switches between low-level procedural templates and high-level experiential abstractions based on environmental feedback. However, its reliance on explicit feedback limits applicability in open-ended reasoning settings.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent , which governs the initiation of the process; Query Construction , covering techniques for query decomposition and rewriting; Retrieval Strategies , categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing , which focuses on refining outputs through re-ranking, filtering, and aggregation.

To overcome this constraint, MemOS (Li et al., 2025l) employs a MemScheduler that dynamically selects among parametric, activation, and plaintext memory based on user-, task-, or organization-level context. Yet, this flat selection scheme overlooks the hierarchical structure of the memory system. H-MEM (Sun and Zeng, 2025) addresses this by introducing an index-based routing mechanism, which performs coarse-to-fine retrieval, moving from the domain layer to the episode layer and gradually narrowing the search space to the most relevant sub-memories. This hierarchical routing not only improves retrieval precision but also mitigates information overload.
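Coarse-to-fine routing of this kind can be sketched with a two-level index. The keyword-overlap scoring below is an illustrative stand-in for the learned index representations used in H-MEM:

```python
def hierarchical_retrieve(query_terms, memory_tree):
    """Two-stage routing: first select the best-matching domain
    (coarse), then rank only that domain's episodes (fine),
    narrowing the search space before any episode is scored."""
    def overlap(terms, text):
        return len(terms & set(text.lower().split()))

    # Coarse step: route to the domain whose label matches best.
    domain = max(memory_tree, key=lambda d: overlap(query_terms, d))
    # Fine step: rank episodes inside the selected domain only.
    episodes = memory_tree[domain]
    return sorted(episodes, key=lambda e: overlap(query_terms, e), reverse=True)
```

The benefit is that episodes in unselected domains are never scored, which is what mitigates information overload as the memory store grows.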

Summary Autonomous timing and intent help reduce computational overhead and suppress unnecessary noise, but they also create a potential vulnerability. When an agent overestimates its internal knowledge and fails to initiate retrieval when needed, the system can fall into a silent failure mode in which knowledge gaps may lead to hallucinated outputs. Therefore, a balance needs to be achieved: providing the agent with essential information at the right moments while avoiding excessive retrieval that introduces noise.

Automated Retrieval Timing

The retrieval intent and timing determine when to trigger the retrieval mechanism and which memory store to query. Existing memory systems adopt different design choices in this regard, ranging from always-on retrieval to retrieval triggered by explicit instructions or internal signals (Zhao et al., 2024; Wang et al., 2025p; Fang et al., 2025b). For example, MIRIX (Wang and Chen, 2025) performs retrieval from all six memory databases for each query and concatenates the retrieved contents, reflecting a design that prioritizes comprehensive memory access. Other approaches instead aim to trigger retrieval more selectively, allowing the model to decide both the timing and scope of memory access, which can lead to more targeted and efficient use of memory resources. In this subsection, we review the literature from two complementary perspectives: automated retrieval timing and automated retrieval intent.

Automated Retrieval Timing This term refers to the model's ability to autonomously determine when to trigger a memory retrieval operation during reasoning. The simplest strategy is to delegate the decision to either the LLM or an external controller, allowing it to determine solely from the query whether retrieval is necessary. For example, MemGPT (Packer et al., 2023a) and MemTool (Lumer et al., 2025) allow the LLM itself to invoke retrieval functions, enabling efficient access to external memory within an operating-system-like framework. However, these methods rely on static judgments from the query alone, neglecting the model's dynamically evolving cognitive state during reasoning.

To address this limitation, recent work integrates fast-slow thinking mechanisms into retrieval timing. ComoRAG (Wang et al., 2025f) and PRIME (Tran et al., 2025), for instance, first produce a fast response and then let the agent evaluate its adequacy. If the initial reasoning is deemed insufficient, the system triggers deeper retrieval and reasoning based on failure feedback. MemGen (Zhang et al., 2025d) further refines the triggering mechanism by converting the explicit agent-level decision into a latent, trainable process. It introduces memory triggers that detect critical retrieval moments from latent rollout states, thereby improving the precision of retrieval timing while preserving end-to-end differentiability.

Automated Retrieval Intent This aspect concerns the model's ability to autonomously decide which memory source to access within a hierarchical storage form. AgentRR (Feng et al., 2025), for example, dynamically switches between low-level procedural templates and high-level experiential abstractions based on environmental feedback. However, its reliance on explicit feedback limits applicability in open-ended reasoning settings.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent , which governs the initiation of the process; Query Construction , covering techniques for query decomposition and rewriting; Retrieval Strategies , categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing , which focuses on refining outputs through re-ranking, filtering, and aggregation.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent , which governs the initiation of the process; Query Construction , covering techniques for query decomposition and rewriting; Retrieval Strategies , categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing , which focuses on refining outputs through re-ranking, filtering, and aggregation.

To overcome this constraint, MemOS (Li et al., 2025l) employs a MemScheduler that dynamically selects among parametric, activation, and plaintext memory based on user-, task-, or organization-level context. Yet, this flat selection scheme overlooks the hierarchical structure of the memory system. H-MEM (Sun and Zeng, 2025) addresses this by introducing an index-based routing mechanism, which performs coarse-to-fine retrieval, moving from the domain layer to the episode layer and gradually narrowing the search space to the most relevant sub-memories. This hierarchical routing not only improves retrieval precision but also mitigates information overload.

Summary Autonomous timing and intent help reduce computational overhead and suppress unnecessary noise, but they also create a potential vulnerability. When an agent overestimates its internal knowledge and fails to initiate retrieval when needed, the system can fall into a silent failure mode in which knowledge gaps may lead to hallucinated outputs. Therefore, a balance needs to be achieved: providing the agent with essential information at the right moments while avoiding excessive retrieval that introduces noise.

Automated Retrieval Intent

The retrieval intent and timing determine when to trigger the retrieval mechanism and which memory store to query. Existing memory systems adopt different design choices in this regard, ranging from always-on retrieval to retrieval triggered by explicit instructions or internal signals (Zhao et al., 2024; Wang et al., 2025p; Fang et al., 2025b). For example, MIRIX (Wang and Chen, 2025) performs retrieval from all six memory databases for each query and concatenates the retrieved contents, reflecting a design that prioritizes comprehensive memory access. Other approaches instead aim to trigger retrieval more selectively, allowing the model to decide both the timing and scope of memory access, which can lead to more targeted and efficient use of memory resources. In this subsection, we review the literature from two complementary perspectives: automated retrieval timing and automated retrieval intent.

Automated Retrieval Timing This term refers to the model's ability to autonomously determine when to trigger a memory retrieval operation during reasoning. The simplest strategy is to delegate the decision to either the LLM or an external controller, allowing it to determine solely from the query whether retrieval is necessary. For example, MemGPT (Packer et al., 2023a) and MemTool (Lumer et al., 2025) allow the LLM itself to invoke retrieval functions, enabling efficient access to external memory within an operating-system-like framework. However, these methods rely on static judgments from the query alone, neglecting the model's dynamically evolving cognitive state during reasoning.

To address this limitation, recent work integrates fast-slow thinking mechanisms into retrieval timing. ComoRAG (Wang et al., 2025f) and PRIME (Tran et al., 2025), for instance, first produce a fast response and then let the agent evaluate its adequacy. If the initial reasoning is deemed insufficient, the system triggers deeper retrieval and reasoning based on failure feedback. MemGen (Zhang et al., 2025d) further refines the triggering mechanism by converting the explicit agent-level decision into a latent, trainable process. It introduces memory triggers that detect critical retrieval moments from latent rollout states, thereby improving the precision of retrieval timing while preserving end-to-end differentiability.

Automated Retrieval Intent This aspect concerns the model's ability to autonomously decide which memory source to access within a hierarchical storage form. AgentRR (Feng et al., 2025), for example, dynamically switches between low-level procedural templates and high-level experiential abstractions based on environmental feedback. However, its reliance on explicit feedback limits applicability in open-ended reasoning settings.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent , which governs the initiation of the process; Query Construction , covering techniques for query decomposition and rewriting; Retrieval Strategies , categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing , which focuses on refining outputs through re-ranking, filtering, and aggregation.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent , which governs the initiation of the process; Query Construction , covering techniques for query decomposition and rewriting; Retrieval Strategies , categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing , which focuses on refining outputs through re-ranking, filtering, and aggregation.

To overcome this constraint, MemOS (Li et al., 2025l) employs a MemScheduler that dynamically selects among parametric, activation, and plaintext memory based on user-, task-, or organization-level context. Yet, this flat selection scheme overlooks the hierarchical structure of the memory system. H-MEM (Sun and Zeng, 2025) addresses this by introducing an index-based routing mechanism, which performs coarse-to-fine retrieval, moving from the domain layer to the episode layer and gradually narrowing the search space to the most relevant sub-memories. This hierarchical routing not only improves retrieval precision but also mitigates information overload.

Summary Autonomous timing and intent help reduce computational overhead and suppress unnecessary noise, but they also create a potential vulnerability. When an agent overestimates its internal knowledge and fails to initiate retrieval when needed, the system can fall into a silent failure mode in which knowledge gaps may lead to hallucinated outputs. Therefore, a balance needs to be achieved: providing the agent with essential information at the right moments while avoiding excessive retrieval that introduces noise.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts-the what , where , and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain,

these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance. · Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances. · Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Query Construction

After initiating the retrieval process, the next challenge lies in transforming the raw query into an effective retrieval signal aligned with the memory index. Query construction acts as the translation layer between the user's surface utterance and the memory's latent storage. Traditional approaches typically perform retrieval directly based on the user query, which is simple but fails to align the query semantics with those of the memory index. To bridge this gap, agentic memory systems proactively perform query decomposition or query rewriting, generating intermediate retrieval signals that better match the latent structure of the memory.

Query Decomposition This approach breaks down a complex query into simpler sub-queries, allowing the system to retrieve more fine-grained and relevant information. Such decomposition alleviates the one-shot retrieval bottleneck by enabling modular retrieval and reasoning over intermediate results. For instance, Visconde (Pereira et al., 2023) and ChemAgent (Tang et al., 2025c) employ LLMs to decompose the original question into sub-problems, retrieve candidate results for each from the memory, and finally aggregate them into a coherent answer. However, these methods lack global planning. To address this issue, PRIME (Tran et al., 2025) and MA-RAG (Nguyen et al., 2025) introduce a Planner Agent, inspired by the ReAct (Yao et al., 2023b) paradigm, that first formulates a global retrieval plan before decomposing it into sub-queries. Yet, these approaches mainly rely on problem-driven decomposition and thus cannot explicitly identify what specific knowledge the model is missing. To make sub-queries more targeted, Agent KB (Tang et al., 2025d) adopts a two-stage retrieval process in which a teacher model observes the student model's failures and generates fine-grained sub-queries accordingly. This targeted decomposition improves retrieval precision and reduces irrelevant results, particularly in knowledge-intensive tasks.
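The decompose-retrieve-aggregate pattern shared by these systems can be sketched as follows. Here `decompose` hard-codes what an LLM planner would generate, and the two-document corpus is a toy stand-in for a real memory store; none of these names come from the cited works.

```python
# Toy corpus standing in for a memory bank.
CORPUS = {
    "doc1": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "doc2": "Marie Curie won the Nobel Prize in Chemistry in 1911.",
}

def decompose(query: str) -> list[str]:
    # An LLM planner would produce these sub-queries; here they are hard-coded.
    if "both" in query:
        return ["Curie Nobel Physics", "Curie Nobel Chemistry"]
    return [query]

def retrieve(sub_query: str) -> str:
    # Score documents by keyword overlap with the sub-query.
    def overlap(doc: str) -> int:
        return len(set(sub_query.lower().split()) & set(doc.lower().split()))
    return max(CORPUS.values(), key=overlap)

def answer(query: str) -> list[str]:
    # Retrieve per sub-query, then aggregate the intermediate results.
    return [retrieve(sq) for sq in decompose(query)]

evidence = answer("Which prizes did Curie win in both fields?")
```

A single one-shot retrieval for the original question would favor only one document; decomposition surfaces both pieces of evidence.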

Query Rewriting Instead of decomposing, this strategy rewrites the original query or generates a hypothetical document to refine its semantics before retrieval. Such rewriting mitigates the mismatch between user intent and the memory index. HyDE (Gao et al., 2023b), for example, instructs the LLM to generate a hypothetical document in a zero-shot manner and performs retrieval using its semantic embedding. The generated document encapsulates the desired semantics, effectively bridging the gap between the user query and the target memory. MemoRAG (Qian et al., 2025) extends this idea by incorporating global memory into hypothetical document generation. It first compresses the global memory and then generates a draft answer conditioned on both the query and the compressed memory; this draft is then used as a rewritten query. Since the draft has access to the global memory context, it captures user intent more faithfully and uncovers implicit information needs. Similarly, MemGuide (Du et al., 2025b) leverages the dialogue context to prompt an LLM to produce a concise, command-like phrase that serves as a high-level intent description for retrieval. Beyond directly prompting an LLM to rewrite the query, Rewrite-Retrieve-Read (Ma et al., 2023b) trains a small language model as a dedicated rewriter through reinforcement learning, while ToC (Kim et al., 2023a) employs a Tree of Clarifications to progressively refine and specify the user's retrieval objective.
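A HyDE-style rewrite can be illustrated with a minimal sketch. The `hypothesize` function is a stub for zero-shot LLM generation, and the bag-of-words "embedding" is a placeholder for a dense encoder; both are assumptions for illustration only.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Placeholder for a dense encoder: a simple bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hypothesize(query: str) -> str:
    # Stub for LLM generation of a hypothetical answer document.
    return "the user prefers a window seat and vegetarian meals on flights"

MEMORY = [
    "user profile: vegetarian meals, window seat preference",
    "calendar: dentist appointment on tuesday",
]

query = "seating and food?"
hypo = hypothesize(query)  # retrieve with the hypothetical document, not the raw query
best = max(MEMORY, key=lambda m: cosine(embed(hypo), embed(m)))
```

The terse raw query shares no vocabulary with either memory entry, while the hypothetical document matches the stored profile, which is exactly the gap HyDE-style rewriting bridges.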

Summary These two paradigms, decomposition and rewriting, are not mutually exclusive. Auto-RAG (Kim et al., 2024a) integrates both by evaluating HyDE and Visconde under identical retrieval conditions and then selecting the strategy that performs best for the given task. The findings of this work demonstrate that the quality of the memory-retrieval query has a substantial impact on reasoning performance. In contrast to earlier research, which primarily focused on designing sophisticated memory architectures, recent studies (Yan et al., 2025b) place increasing emphasis on the retrieval construction process, shifting the role of memory toward serving retrieval. The choice of what to retrieve with is, unsurprisingly, a critical component of this process.

Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

Retrieval Strategies

After clarifying the retrieval objective, we obtain a query with a well-defined intent. The next core challenge lies in leveraging this query to efficiently and accurately retrieve truly relevant knowledge from a large and complex memory repository. Retrieval strategies serve as the bridge between queries and the memory base, and their design directly determines both retrieval efficiency and result quality. In this section, we systematically review various retrieval paradigms and analyze their strengths, limitations, and application scenarios: from traditional sparse retrieval based on keyword matching, to modern dense retrieval using semantic embeddings, to graph-based retrieval for structured knowledge, to the emerging class of generative retrieval methods, and finally to hybrid retrieval techniques that integrate multiple paradigms.

Lexical Retrieval This strategy relies on keyword matching to locate relevant documents, with representative methods including TF-IDF (Spärck Jones, 1972) and BM25 (Robertson and Zaragoza, 2009). TF-IDF measures the importance of keywords based on term frequency and inverse document frequency, enabling fast and interpretable retrieval. BM25 further refines this approach by incorporating term frequency saturation and document length normalization. Such methods are often employed in precision-oriented retrieval scenarios, where accuracy and relevance of results take precedence over recall (Tang et al., 2025d; Wang et al., 2025p; Pan et al., 2025). However, purely lexical matching struggles to capture semantic variations and contextual relationships, making it highly sensitive to linguistic expression differences and thus less effective in open-domain knowledge or multimodal memory settings.
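As a concrete reference, the BM25 scoring described above fits in a few lines. The scorer below follows the standard Robertson-Zaragoza formulation over a toy memory bank; the parameter defaults k1=1.5 and b=0.75 are common conventions, not values prescribed by the surveyed systems.

```python
import math

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N  # average document length
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)  # document frequency
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)
            # Term-frequency saturation and length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply",
]
scores = bm25_scores("cat mat", docs)
best = docs[max(range(len(docs)), key=scores.__getitem__)]
```

Note the lexical brittleness the text mentions: "cats" in the second document never matches the query term "cat", so only exact-token documents score above zero.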

Semantic Retrieval This strategy encodes queries and memory entries into a shared embedding space and matches them based on semantic similarity rather than lexical overlap. Representative approaches utilize semantic encoders, including Sentence-BERT (Reimers and Gurevych, 2019) and CLIP (Radford et al., 2021). Within memory systems, this approach better captures task context and supports semantic generalization and fuzzy matching, making it the default choice in most agentic memory frameworks (Lewis et al., 2020; Wang et al., 2024b; Yang et al., 2024a; Xu et al., 2025c; Tan et al., 2025c; Nguyen et al., 2025; Qian et al., 2025; Hassell et al., 2025; Huang et al., 2025c). However, semantic drift and forced top-K retrieval often introduce retrieval noise and spurious recall. To address these issues, recent systems incorporate dynamic retrieval policies, reranking modules, and hybrid retrieval schemes.
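To keep the sketch self-contained, the example below substitutes a character-trigram "embedding" for a dense encoder such as Sentence-BERT; the encoding is an illustrative assumption, but the mechanism (matching by similarity in a shared vector space rather than exact term overlap) is the one described above.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a dense encoder: character-trigram counts.
    s = f"  {text.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

memory = ["feeding schedule for the cats", "quarterly revenue report"]
query = "cat feeding"

# Rank memory entries by embedding similarity to the query.
ranked = sorted(memory, key=lambda m: cosine(embed(query), embed(m)), reverse=True)
```

Unlike exact token matching, the sub-word representation lets "cat" partially match "cats", illustrating the fuzzy matching that makes semantic retrieval the default in most agentic memory frameworks.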

Graph Retrieval This strategy leverages not only semantic signals but also the explicit topological structure of graphs, enabling inherently more precise and structure-aware retrieval. By directly accessing structural paths, these methods exhibit stronger multi-hop reasoning capabilities and can more effectively explore long-range dependencies. Moreover, treating relational structure as a constraint on inference paths naturally supports retrieval governed by exact rules and symbolic constraints. Representative approaches such as AriGraph (Anokhin et al., 2024), EMG-RAG (Wang et al., 2024l), Mem0ᵍ (Chhikara et al., 2025), and SGMem (Wu et al., 2025h) first identify the most relevant nodes or triples and then expand to their semantically related K-hop neighbors to construct an ego-graph. HippoRAG (Gutierrez et al., 2024) performs personalized PageRank (Page et al., 1999) seeded on the retrieved nodes and ranks the rest of the graph by their proximity to these seeds, enabling effective multi-hop retrieval. Going beyond fixed expansion rules, CAM (Li et al., 2025g) and D-SMART (Lei et al., 2025) employ LLMs to steer subgraph exploration: CAM uses an LLM to select informative neighbors and children of a central node for associative exploration, while D-SMART treats the LLM as a planner that performs beam search over a KG memory to retrieve one-hop neighbors of target entities and the relations connecting a given entity pair. For temporal graphs, Zep (Rasmussen et al., 2025) and MemoTime (Tan et al., 2025b) further enable entity-subgraph construction and relation retrieval under explicit temporal constraints, ensuring that the returned results satisfy the required time rules.
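The K-hop ego-graph expansion shared by these methods reduces to a bounded breadth-first search. The adjacency list and seed node below are a toy knowledge-graph fragment, assumed for illustration.

```python
from collections import deque

# Toy KG adjacency list; a real system would derive this from stored triples.
EDGES = {
    "Alice": ["Acme", "Bob"],
    "Acme": ["Paris"],
    "Bob": ["Carol"],
    "Paris": [],
    "Carol": [],
}

def ego_graph(seeds: list, k: int) -> set:
    """Expand seed nodes to their K-hop neighborhood via BFS."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:          # stop expanding beyond K hops
            continue
        for nb in EDGES[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen

subgraph = ego_graph(["Alice"], k=2)
```

Raising k widens the retrieved subgraph: 1-hop retrieval returns only Alice's direct neighbors, while 2-hop expansion reaches entities such as Paris and Carol that support multi-hop questions.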

Generative Retrieval This strategy replaces lexical or semantic retrieval with a model that directly generates the identifiers of relevant documents (Tay et al., 2022; Wang et al., 2022b). By framing retrieval as a conditional generation task, the model implicitly stores candidate documents in its parameters and performs deep query-document interaction during decoding (Li et al., 2025k). Leveraging the semantic capabilities of pretrained language models, this paradigm often outperforms traditional retrieval methods, particularly in small-scale settings (Zeng et al., 2024). However, generative retrieval requires additional training to internalize the semantics of all candidate documents, resulting in limited scalability when the corpus evolves (Yuan et al., 2024b). For these reasons, agentic memory systems have paid relatively little attention to this paradigm, although its tight integration of generation and retrieval suggests untapped potential.

Hybrid Retrieval This strategy integrates the strengths of multiple retrieval paradigms. Systems such as Agent KB (Tang et al., 2025d) and MIRIX (Wang and Chen, 2025) combine lexical and semantic retrieval to balance precise term or tool matching with broader semantic alignment. Similarly, Semantic Anchoring (Chatterjee and Agarwal, 2025) performs parallel searches over semantic embeddings and symbolic inverted indices to achieve complementary coverage. Other methods combine multiple evaluation signals to guide retrieval. Generative Agents (Kaiya et al., 2023), for example, illustrate this multi-factor approach through a scoring scheme that accumulates recency, importance, and relevance. MAICC (Jiang et al., 2025c) adopts a mixed-utility scoring function that integrates similarity with both global and predicted individual returns. In graph-based settings, retrieval typically proceeds in two stages: semantic retrieval first identifies relevant nodes or triples, and graph topology is subsequently leveraged to expand the search space (Anokhin et al., 2024; Wang et al., 2024l; Gutierrez et al., 2024; Li et al., 2025g).

At the database infrastructure level, MemoriesDB (Ward, 2025) introduces a temporal-semantic-relational database designed for long-term agent memory, providing a hybrid retrieval architecture that integrates these dimensions into a unified storage and access framework.

By fusing heterogeneous retrieval signals, hybrid approaches preserve the precision of keyword matching while incorporating the contextual understanding of semantic methods, ultimately yielding more comprehensive and relevant results.
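A minimal sketch of the multi-factor scoring used in this family of systems follows. The exponential decay rate and the equal weights are illustrative defaults, not values taken from the cited papers, and the keyword-overlap relevance term stands in for an embedding similarity.

```python
def score(mem: dict, query_terms: set, now: int,
          decay: float = 0.99, w: tuple = (1.0, 1.0, 1.0)) -> float:
    """Sum of decayed recency, stored importance, and query relevance."""
    recency = decay ** (now - mem["t"])
    relevance = (len(query_terms & set(mem["text"].lower().split()))
                 / max(len(query_terms), 1))
    return w[0] * recency + w[1] * mem["importance"] + w[2] * relevance

memories = [
    {"t": 0,  "importance": 0.9, "text": "signed the lease for the apartment"},
    {"t": 95, "importance": 0.2, "text": "bought coffee beans"},
]
q = set("coffee beans".split())
ranked = sorted(memories, key=lambda m: score(m, q, now=100), reverse=True)
```

Even though the lease memory carries high importance, the recent and highly relevant coffee memory wins, showing how the fused signals trade off against each other.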

Building on the memory bank established in Section 5.1 and Section 5.2, the next critical step is how to retrieve and utilize memories during reasoning. We define memory retrieval as the process of retrieving relevant and concise knowledge fragments from a memory repository to support the current reasoning task at the right moment. The key challenge lies in efficiently and accurately locating the required knowledge fragments within a large-scale memory store. To address this, many algorithms employ heuristic strategies or learnable models to optimize various stages of the retrieval process. Based on the execution order of retrieval, this process can be decomposed into four aspects. Figure 10 provides a structured overview of this retrieval pipeline, organizing existing methods according to their roles across retrieval stages.

Post-Retrieval Processing

Initial retrieval often returns fragments that are redundant, noisy, or semantically inconsistent. Directly injecting these results into the prompt can lead to excessively long contexts, conflicting information, and reasoning distracted by irrelevant content. Post-retrieval processing, therefore, becomes essential for ensuring prompt quality. Its goal is to distill the retrieved results into a concise, accurate, and semantically coherent context. In practice, two components are central: (1) Re-ranking and Filtering: performing fine-grained relevance estimation to remove irrelevant or outdated memories and reorder the remaining fragments, thereby reducing noise and redundancy. (2) Aggregation and Compression: integrating the retrieved memories with the original query, eliminating duplication, merging semantically similar information, and reconstructing a compact and coherent final context.

Re-ranking and Filtering To maintain a concise and coherent context, initial retrieval results are re-ranked and filtered to remove low-relevance items. Early approaches rely on heuristic criteria for evaluating semantic consistency. For example, Semantic Anchoring (Chatterjee and Agarwal, 2025) integrates vector similarity with entity- and discourse-level alignment, whereas RCR-Router (Liu et al., 2025d) combines multiple handcrafted signals, including role relevance, task-stage priority, and recency. These methods, however, often require extensive hyperparameter tuning to balance heterogeneous importance scores. To alleviate this burden, learn-to-memorize (Zhang et al., 2025u) formulates score aggregation as a reinforcement-learning problem, enabling the model to learn optimal weights over retrieval signals. While these techniques primarily optimize semantic coherence, scenarios demanding strict temporal reasoning require additional constraints: Rasmussen et al. (2025) and Tan et al. (2025b) filter memories based on their timestamps and validity windows to satisfy complex temporal dependencies.

With the increasing capability of LLMs, recent methods leverage their intrinsic language understanding to assess memory quality directly. Memory-R1 (Yan et al., 2025c) and Westhäußer et al. (2025) both introduce LLM-based evaluators (Answer Agents or Self-Validator Agents) that filter retrieved content before producing the final response. However, prompt-based filtering remains limited by the LLM's inherent capacity and by mismatches between prompt semantics and downstream usage. Consequently, many systems train auxiliary models to estimate memory importance more robustly (Tan et al., 2025c). Memento (Zhou et al., 2025a) uses Q-learning (Watkins and Dayan, 1992) to predict the probability that a retrieved item contributes to a correct answer, and MemGuide (Du et al., 2025b) fine-tunes LLaMA-8B (Grattafiori et al., 2024) to re-rank candidates using marginal slot-completion gain. Together, these re-ranking and filtering strategies refine retrieval results without modifying the underlying retriever, enabling compatibility with any pre-trained retrieval model while supporting task-specific optimization.
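The combination of validity-window filtering and weighted re-ranking discussed above can be sketched as follows; the field names, weights, and recency heuristic are hypothetical choices for illustration, not a specific system's design.

```python
def rerank(candidates: list, now: int, w_sim: float = 0.7, w_rec: float = 0.3) -> list:
    # 1) Filter: drop memories whose validity window has expired.
    valid = [c for c in candidates if c["valid_from"] <= now <= c["valid_to"]]
    # 2) Re-rank: blend retriever similarity with a simple recency signal.
    for c in valid:
        recency = 1.0 / (1.0 + (now - c["valid_from"]))
        c["score"] = w_sim * c["sim"] + w_rec * recency
    return sorted(valid, key=lambda c: c["score"], reverse=True)

cands = [
    {"text": "old address",  "sim": 0.9, "valid_from": 0,  "valid_to": 50},
    {"text": "new address",  "sim": 0.8, "valid_from": 60, "valid_to": 10**9},
    {"text": "weather chat", "sim": 0.4, "valid_from": 0,  "valid_to": 10**9},
]
top = rerank(cands, now=100)
```

The outdated address is removed despite its high retriever similarity, illustrating why temporal constraints cannot be left to similarity scoring alone.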

Aggregation and Compression Another way post-retrieval processing improves both the quality and efficiency of downstream reasoning is through aggregation and compression. This process integrates the retrieved evidence with the query to form a coherent and compact context. Unlike filtering and re-ranking, which mainly address noise and prioritization, this stage focuses on merging multiple fragmented memory items into higher-level, distilled knowledge representations, and on refining these representations when task-specific adaptations are required. ComoRAG (Wang et al., 2025f) illustrates this idea through its Integration Agent, which identifies historical signals that are semantically aligned with the query and combines them into an abstract global summary that provides broad contextual grounding. The Extractor Agent in MA-RAG (Nguyen et al., 2025) performs fine-grained content selection over the retrieved documents, retaining only the key information that is strongly relevant to the current subquery and producing concise snippets tailored to local reasoning needs.

Furthermore, G-Memory (Zhang et al., 2025c) extends aggregation and compression into the personalization for multi-agent systems. It consolidates retrieved high-level insights and sparsified trajectories, and then uses an LLM to customize these condensed experiences according to the agent's role. This process refines general knowledge into role-specific prompts that populate the agent's personalized memory.

Summary In conclusion, post-retrieval processing acts as a crucial intermediate step that transforms noisy, fragmented retrieval results into a precise and coherent context for reasoning. Through the mechanisms above, post-retrieval processing not only enhances the density and fidelity of the memories supplied to the model but also aligns the information with task requirements and agent characteristics.

Timing and Intent

The retrieval intent and timing determine when to trigger the retrieval mechanism and which memory store to query. Existing memory systems adopt different design choices in this regard, ranging from always-on retrieval to retrieval triggered by explicit instructions or internal signals (Zhao et al., 2024; Wang et al., 2025p; Fang et al., 2025b). For example, MIRIX (Wang and Chen, 2025) performs retrieval from all six memory databases for each query and concatenates the retrieved contents, reflecting a design that prioritizes comprehensive memory access. Other approaches instead aim to trigger retrieval more selectively, allowing the model to decide both the timing and scope of memory access, which can lead to more targeted and efficient use of memory resources. In this subsection, we review the literature from two complementary perspectives: automated retrieval timing and automated retrieval intent.

Automated Retrieval Timing This term refers to the model's ability to autonomously determine when to trigger a memory retrieval operation during reasoning. The simplest strategy is to delegate the decision to either the LLM or an external controller, allowing it to determine solely from the query whether retrieval is necessary. For example, MemGPT (Packer et al., 2023a) and MemTool (Lumer et al., 2025) allow the LLM itself to invoke retrieval functions, enabling efficient access to external memory within an operating-system-like framework. However, these methods rely on static judgments from the query alone, neglecting the model's dynamically evolving cognitive state during reasoning.

To address this limitation, recent work integrates fast-slow thinking mechanisms into retrieval timing. ComoRAG (Wang et al., 2025f) and PRIME (Tran et al., 2025), for instance, first produce a fast response and then let the agent evaluate its adequacy. If the initial reasoning is deemed insufficient, the system triggers deeper retrieval and reasoning based on failure feedback. MemGen (Zhang et al., 2025d) further refines the triggering mechanism by converting the explicit agent-level decision into a latent, trainable process. It introduces memory triggers that detect critical retrieval moments from latent rollout states, thereby improving the precision of retrieval timing while preserving end-to-end differentiability.
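The fast-slow triggering loop described above can be sketched as a small control function; all callables here (`fast_answer_fn`, `judge_fn`, `retrieve_fn`, `slow_answer_fn`) are hypothetical placeholders for model calls, not interfaces from the cited systems.

```python
def answer_with_adaptive_retrieval(query, fast_answer_fn, judge_fn,
                                   retrieve_fn, slow_answer_fn):
    """Fast-slow retrieval timing: first try answering from the current
    context alone; only if a self-evaluation judge deems the draft
    inadequate do we pay for retrieval and a deeper reasoning pass.
    Returns the answer and whether retrieval was triggered."""
    draft = fast_answer_fn(query)
    if judge_fn(query, draft):          # draft judged adequate
        return draft, False             # no retrieval triggered
    memories = retrieve_fn(query)       # deeper, memory-grounded pass
    return slow_answer_fn(query, memories), True
```

The judge is the crux: an overconfident judge reproduces the silent-failure mode discussed later, while an overly cautious one degenerates into always-on retrieval.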

Automated Retrieval Intent This aspect concerns the model's ability to autonomously decide which memory source to access within a hierarchical storage form. AgentRR (Feng et al., 2025), for example, dynamically switches between low-level procedural templates and high-level experiential abstractions based on environmental feedback. However, its reliance on explicit feedback limits applicability in open-ended reasoning settings.

Figure 10 Taxonomy of memory retrieval methodologies in agentic systems. The mindmap organizes existing literature into four distinct phases of the retrieval pipeline: Timing and Intent, which governs the initiation of the process; Query Construction, covering techniques for query decomposition and rewriting; Retrieval Strategies, categorizing search paradigms into lexical, semantic, graph-based, and hybrid approaches; and Post-Retrieval Processing, which focuses on refining outputs through re-ranking, filtering, and aggregation.


To overcome this constraint, MemOS (Li et al., 2025l) employs a MemScheduler that dynamically selects among parametric, activation, and plaintext memory based on user-, task-, or organization-level context. Yet, this flat selection scheme overlooks the hierarchical structure of the memory system. H-MEM (Sun and Zeng, 2025) addresses this by introducing an index-based routing mechanism, which performs coarse-to-fine retrieval, moving from the domain layer to the episode layer and gradually narrowing the search space to the most relevant sub-memories. This hierarchical routing not only improves retrieval precision but also mitigates information overload.
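A minimal sketch of coarse-to-fine routing in the spirit of H-MEM follows; the two-level dictionary layout and the `dot` similarity are our own illustrative assumptions, not the paper's data structures.

```python
dot = lambda u, v: sum(a * b for a, b in zip(u, v))

def hierarchical_retrieve(query_vec, domains, sim=dot, top_domains=1, top_k=2):
    """Coarse-to-fine routing: score coarse domain-level index vectors
    first, then search episodes only inside the best-matching domain(s),
    progressively narrowing the candidate space."""
    ranked = sorted(domains, key=lambda d: -sim(query_vec, d["index_vec"]))
    hits = []
    for dom in ranked[:top_domains]:
        hits += sorted(dom["episodes"],
                       key=lambda e: -sim(query_vec, e["vec"]))[:top_k]
    return hits
```

Restricting the fine-grained search to `top_domains` clusters is what yields the precision and information-overload benefits noted above, at the risk of missing relevant episodes filed under a mis-scored domain.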

Summary Autonomous timing and intent help reduce computational overhead and suppress unnecessary noise, but they also create a potential vulnerability. When an agent overestimates its internal knowledge and fails to initiate retrieval when needed, the system can fall into a silent failure mode in which knowledge gaps may lead to hallucinated outputs. Therefore, a balance needs to be achieved: providing the agent with essential information at the right moments while avoiding excessive retrieval that introduces noise.


Summary

Factual memory refers to the capacity of an agent to store and retrieve explicit, declarative facts about past events, user-specific information, and the state of the external environment. This information encompasses a wide range of content, including dialogue history, user preferences, and relevant properties of the external world. By allowing the agent to exploit historical information when interpreting current inputs, factual memory serves as the cornerstone for context awareness, personalized responses, and extended task planning.

To understand the structural composition of agent memory, we draw upon the cognitive science framework of declarative memory (Riedel and Blokland, 2015). In neuroscience, declarative memory denotes long-term storage for information that can be consciously accessed and is commonly analyzed in terms of two major components: episodic and semantic memory (Squire, 2004). Episodic memory stores personally experienced events associated with specific temporal and spatial contexts: the what, where, and when of an episode (Tulving, 1972, 2002). Its central characteristic is the capacity to mentally re-experience past events. Semantic memory retains general factual knowledge, concepts, and word meanings independent of the specific occasion on which they were acquired (Squire, 2004). While supported by a unitary declarative system in the human brain, these components represent distinct levels of abstraction.

In agent systems, this biological distinction is operationalized not as a rigid dichotomy but as a processing continuum . Systems typically initiate this process by logging concrete interaction histories as episodic traces, such as dialogue turns, user actions, and environment states (Zhong et al., 2024; Wang et al., 2024h; Chhikara et al., 2025). Subsequent processing stages apply summarization (Wang et al., 2025h; Chen et al., 2025d), reflection (Tan et al., 2025c; Park et al., 2023; Wang et al., 2025h), entity extraction (Gutierrez et al., 2024), and fact induction (Rasmussen et al., 2025). The resulting abstractions are stored in structures such as vector databases (Zhong et al., 2024), key-value stores, or knowledge graphs (Rasmussen et al., 2025; Sun et al., 2024), governed by procedures for deduplication and consistency checking. Through this sequence, raw event streams are gradually transformed into reusable semantic fact bases.
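This episodic-to-semantic pipeline can be caricatured in a few lines; `extract_fn` stands in for the LLM-driven summarization, reflection, or fact-induction stage, and simple deduplication stands in for the richer consistency-checking procedures described above.

```python
def consolidate(episodes, extract_fn):
    """Distill raw episodic traces into a deduplicated semantic fact base.
    extract_fn maps one episode to a list of atomic facts; in a real
    system it would be an LLM summarization / fact-induction step."""
    fact_base = {}
    for ep in episodes:
        for fact in extract_fn(ep):
            key = fact.lower().strip()
            fact_base.setdefault(key, fact)  # keep first phrasing, drop repeats
    return list(fact_base.values())
```

The resulting fact list is what would be indexed into a vector database, key-value store, or knowledge graph for later retrieval.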

Functionally, this architecture ensures that the agent exhibits three fundamental properties during interaction: consistency , coherence , and adaptability .

· Consistency implies stable behavior and self-presentation over time. By maintaining a persistent internal state regarding user-specific facts and its own commitments, the agent avoids contradictions and arbitrary changes of stance.

· Coherence is reflected in robust context awareness. The agent can recall and integrate relevant interaction history, refer to past user inputs, and preserve topical continuity, ensuring responses form a logically connected dialogue rather than isolated utterances.

· Adaptability demonstrates the ability to personalize behavior based on stored user profiles and historical feedback. Consequently, response style and decision-making progressively align with the user's specific needs and characteristics.

For exposition, we further organize factual memory according to the primary entity it refers to. This entity-centric taxonomy, together with representative methods and their technical design choices, is systematically summarized in subsection 4.1. This perspective highlights two central application domains:

Resources and Frameworks

Benchmarks and Datasets

In this section, we survey representative benchmarks and datasets that have been used (or could be used) to evaluate the memory, long-term interaction, continual-learning, or long-context capabilities of LLM-based agents. We classify these benchmarks into two broad categories: (1) those explicitly designed for memory / lifelong learning / self-evolving agents, and (2) those originally developed for other purposes (e.g., tool-use capacity, web search, embodied action) but nevertheless relevant for memory evaluation due to their long-horizon, multi-task, or sequential nature.

Benchmarks for Memory / Lifelong / Self-Evolving Agents

Memory-oriented benchmarks focus primarily on how well an agent can construct, maintain, and exploit an explicit memory of past interactions or world facts. These tasks typically probe the retention and retrieval of information across multi-turn dialogues, user-specific sessions, or long synthetic narratives, sometimes including multimodal signals.

A consolidated overview of these benchmarks, including their memory focus, environment type, modality, and evaluation scale, is provided in Table 8, which serves as a structured reference for comparing their design objectives and evaluation settings. Representative examples such as MemBench (Tan et al., 2025a), LoCoMo (Maharana et al., 2024), WebChoreArena (Miyai et al., 2025), MT-Mind2Web (Deng et al., 2024), PersonaMem (Jiang et al., 2025a), PerLTQA (Du et al., 2024), MPR (Zhang et al., 2025v), PrefEval (Zhao et al., 2025d), LOCCO (Jia et al., 2025), StoryBench (Wan and Ma, 2025), Madial-Bench (He et al., 2025), DialSim (Zheng et al., 2025b), LongBench (Bai et al., 2024), LongBench v2 (Bai et al., 2025), RULER (Hsieh et al., 2024), BABILong (Kuratov et al., 2024), MM-Needle (Wang et al., 2025e), and HaluMem (Chen et al., 2025a) stress user modeling, preference tracking, and conversation-level consistency, often under simulated settings where ground-truth memories can be precisely controlled.

Table 8 Overview of benchmarks relevant to LLM agent memory, long-term, lifelong learning, and self-evolving evaluation. The table covers two categories of benchmarks: (i) benchmarks explicitly designed for memory-, lifelong learning-, or self-evolving agent evaluation, and (ii) other agent-oriented benchmarks that implicitly stress long-horizon memory through sequential, multi-step, or multi-task interactions. Fac. and Exp. indicate whether a benchmark evaluates factual memory or experiential (interaction-derived) memory, respectively. MM. denotes the presence of multimodal inputs, while Env. indicates whether the benchmark is conducted in a simulated or real environment. Feature summarizes the primary capability under evaluation, and Scale reports the approximate benchmark size in terms of samples (s.) or tasks (t.). PDDL denotes commonly used PDDL-based planning subsets.

Lifelong-learning benchmarks extend beyond isolated memory retrieval to examine how agents continually acquire, consolidate, and update knowledge over long horizons and evolving task distributions. Benchmarks such as LongMemEval (Wu et al., 2025a), MemoryBank (Zhong et al., 2024), MemoryBench (Ai et al., 2025), LifelongAgentBench (Zheng et al., 2025b), and StreamBench (Wu et al., 2024a) are designed around sequences of tasks or episodes in which new information gradually arrives and earlier information may become obsolete or conflicting. These setups emphasize phenomena like catastrophic forgetting, forward and backward transfer, and test-time adaptation, making them suitable for studying how memory mechanisms interact with continual-learning objectives. In many cases, performance is tracked not only on the current task but also on previously seen tasks or conversations, thereby quantifying how well the agent preserves useful knowledge while adapting to new users, domains, or interaction patterns.
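The forgetting and transfer effects these benchmarks track are commonly quantified from an accuracy matrix, following the standard continual-learning convention (an illustrative sketch; the matrix layout is an assumption, not tied to any specific benchmark above).

```python
def backward_transfer(R):
    """R[i][j]: accuracy on task j measured after finishing training on
    task i (0-indexed, row i >= column j filled in). BWT averages how
    much performance on earlier tasks changed by the end of the task
    sequence; negative values indicate catastrophic forgetting."""
    T = len(R)
    last = T - 1
    return sum(R[last][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Example: accuracy on task 0 drops from 0.9 to 0.7 after learning task 1.
R = [[0.9, 0.0],
     [0.7, 0.8]]
print(round(backward_transfer(R), 2))  # → -0.2 (forgetting)
```

A forward-transfer analogue compares accuracy on a task before it is trained on against a memory-free baseline, using the upper triangle of the same matrix.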

Self-evolving-agent benchmarks go a step further by treating the agent as an open-ended system that can iteratively refine its own memory, skills, and strategies through interaction. Here, the focus is not only on storing and recalling information, but also on meta-level behaviors such as self-reflection, memory editing, tool-augmented storage, and policy improvement over multiple episodes or games. Benchmarks like MemoryAgentBench (Hu et al., 2025c), Evo-Memory (Wei et al., 2025e), and other multi-episode or mission-style environments can be instantiated in a self-evolving setting by allowing the agent to accumulate trajectories, synthesize higher-level abstractions, and adjust its behavior in future runs based on its own past performance. When viewed through this lens, these benchmarks provide a testbed for evaluating whether an agent can autonomously bootstrap more capable behaviors over time, turning static tasks into arenas for long-term adaptation, strategy refinement, and genuinely self-improving memory use.

Beyond benchmarks explicitly designed for memory or lifelong learning, a wide range of agent-oriented and long-horizon evaluation suites are also relevant for studying memory-related capabilities in LLM-based agents. Although these benchmarks were originally introduced to assess other aspects such as tool use, embodied interaction, or knowledge-intensive reasoning, their sequential, multi-step, and multi-task nature implicitly places strong demands on long-term information retention, context management, and state tracking.

Embodied and interactive environments constitute a major class of such benchmarks. Frameworks like ALFWorld (Shridhar et al., 2021) and ScienceWorld (Wang et al., 2022a) evaluate agents in simulated text-based or partially grounded environments where success requires remembering past observations, intermediate goals, and environment dynamics across extended action sequences. Similarly, BabyAI (Chevalier-Boisvert et al., 2019) focuses on language-conditioned instruction following over temporally extended episodes, implicitly testing an agent's ability to maintain task-relevant state throughout interaction. While these benchmarks do not explicitly model external memory modules, effective performance often depends on the agent's capacity to preserve and reuse information over long horizons.

Another prominent category includes web-based and tool-augmented interaction benchmarks. WebShop (Yao et al., 2023a), WebArena (Zhou et al., 2024b), and MMInA (Tian et al., 2025b) assess agents operating in realistic or semi-realistic web environments involving multi-step navigation, information gathering, and decision making. These settings naturally induce long-context trajectories in which earlier actions, retrieved information, or user constraints must be recalled and integrated at later stages. ToolBench (Qin et al., 2024a) further extends this paradigm by evaluating an agent's ability to select and invoke APIs across complex workflows, where memory of prior tool outputs and tool-use experience is critical for coherent execution.

Multi-task and general agent evaluation platforms also provide indirect but valuable signals about memory usage. AgentGym (Xi et al., 2024b) and AgentBoard (Xi et al., 2024b) aggregate diverse environments or tasks into unified evaluation suites, requiring agents to adapt across tasks while retaining task-specific knowledge and strategies. PDDL-based planning environments, commonly used in agent benchmarks, evaluate strategic reasoning over structured action spaces, where agents benefit from accumulating and reusing experience across episodes to improve long-horizon planning performance.

Finally, several recent benchmarks target demanding real-world or near-real-world reasoning scenarios that inherently stress long-context and cross-step consistency. SWE-Bench Verified (Jimenez et al., 2024) evaluates code repair over realistic software repositories, where agents must reason over long files and evolving code states. GAIA (Mialon et al., 2023) and xBench (Chen et al., 2025c) assess deep research and search-intensive tasks that require synthesizing information gathered across multiple steps and sources. GenAI-Bench (Li et al., 2024a), while focusing on multimodal generation quality, similarly involves complex workflows in which memory of prior prompts, intermediate outputs, or visual constraints plays a nontrivial role.

Taken together, these benchmarks complement explicitly memory-oriented evaluations by situating LLM-based agents in rich, interactive, and long-horizon settings. Although memory is not always an explicit target of measurement, sustained performance in these environments implicitly depends on an agent's ability to manage long contexts, preserve relevant information, and integrate past experience into ongoing decision making, making them valuable testbeds for studying memory-related behaviors in practice.

Table 9 Overview of representative open-source memory frameworks for LLM-based agents. The table compares widely used frameworks in terms of the types of memory they support (factual vs. experiential), multimodality, internal memory structure, and reported evaluation benchmarks. Fac. and Exp. denote factual and experiential memory, respectively, MM. indicates multimodal memory support, and Structure summarizes the core memory abstraction or organization mechanism adopted by each framework. Evaluation lists publicly reported benchmarks used to assess memory-related capabilities, when available.

Open-Source Frameworks

A rapidly growing ecosystem of open-source memory frameworks aims to provide reusable infrastructure for building memory-augmented LLM agents. A structured comparison of representative open-source memory frameworks, including their supported memory types, architectural abstractions, and evaluation coverage, is summarized in Table 9. Most of these frameworks support factual memory via vector or structured stores, and an increasing subset also models experiential traces, such as dialogue histories, user actions, and episodic summaries, with multimodal memory emerging more recently. Open-source memory frameworks for LLM agents span a spectrum from agent-centric systems with rich, hierarchical memory abstractions to more general-purpose retrieval or memory-as-a-service backends, e.g., MemGPT (Packer et al., 2023b), Mem0 (Chhikara et al., 2025), Memobase, MemoryOS (Kang et al., 2025a), MemOS (Li et al., 2025l), Zep (Rasmussen et al., 2025), LangMem (LangChain, 2025), SuperMemory (Supermemory, 2025), Cognee (Cognee, 2025), Memary (Memary, 2025), Pinecone, Chroma, Weaviate, Second Me, MemU, MemEngine (Zhang et al., 2025t), Memori, ReMe (AgentScope, 2025), AgentMemory, and MineContext (MineContext, 2025). Many of them explicitly separate short- and long-term stores and offer graph-based, profile-based, or modular memory spaces, and some have begun to report results on memory-based benchmarks. The others typically provide scalable vector or graph databases, APIs, and semantic or streaming entity layers that help organize context but often leave agent behavior and evaluation protocols to the application. Overall, these frameworks are rapidly maturing in their representational flexibility and system design.

Positions and Frontiers

This section articulates key positions and emerging frontiers in the design of memory systems for LLM-based agents. Moving beyond descriptive surveys of existing methods, we focus on paradigm-level shifts that redefine how memory is constructed, managed, and optimized in long-horizon agentic settings. Specifically, we examine the transition from retrieval-centric to generative memory, from manually engineered to autonomously managed memory systems, and from heuristic pipelines to reinforcement learning-driven memory control. We further discuss how these shifts intersect with multimodal reasoning, multi-agent collaboration, and trustworthiness, outlining open challenges and research directions that are likely to shape the next generation of agent memory architectures.

Memory Retrieval vs. Memory Generation

Look Back: From Memory Retrieval to Memory Generation

Historically, the dominant paradigm in agent memory research has centered on memory retrieval. Under this paradigm, the primary objective is to identify, filter, and select the most relevant memory entries from an existing memory store given the current context. A large body of prior work focuses on improving retrieval accuracy through better indexing strategies, similarity metrics, reranking models, or structured representations such as knowledge graphs (Tan et al., 2025c; Memobase, 2025). In practice, this includes techniques such as vector similarity search with dense embeddings, hybrid retrieval combining lexical and semantic signals, hierarchical filtering, and graph-based traversal. These methods emphasize precision and recall in accessing stored information, implicitly assuming that the memory base itself is already well formed.
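For concreteness, a common instantiation of hybrid retrieval interpolates a dense cosine similarity with a lexical relevance score (e.g., from BM25); the weighting `alpha` and function names below are illustrative, not drawn from any cited system.

```python
import math

def hybrid_score(query_vec, doc_vec, lexical_score, alpha=0.7):
    """Hybrid retrieval score: interpolate a dense (cosine) similarity
    with a lexical signal such as a normalized BM25 score, combining
    semantic matching with exact-term evidence."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    nq = math.sqrt(sum(q * q for q in query_vec))
    nd = math.sqrt(sum(d * d for d in doc_vec))
    dense = dot / (nq * nd) if nq and nd else 0.0
    return alpha * dense + (1 - alpha) * lexical_score
```

Documents are then ranked by this combined score; tuning `alpha` per domain trades off paraphrase recall (dense) against precise entity matching (lexical).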

Recently, however, increasing attention has shifted toward memory generation. Rather than treating memory as a static repository to be queried, memory generation emphasizes the agent's ability to actively synthesize new memory representations on demand. The goal is not merely to retrieve and concatenate existing fragments, but to integrate, compress, and reorganize information in a manner that is tailored to the current context and future utility. This shift reflects a growing recognition that effective memory usage often requires abstraction and recomposition, especially when raw stored information is noisy, redundant, or misaligned with the immediate task.

Existing approaches to memory generation can be broadly grouped into two directions. One line of work adopts a retrieve-then-generate strategy, where retrieved memory items serve as raw material for reconstruction. In this setting, the agent first accesses a subset of relevant memories and then generates a refined memory representation that is more concise, coherent, and context specific, as implemented in ComoRAG (Wang et al., 2025f), G-Memory (Zhang et al., 2025c) and CoMEM (Wu et al., 2025d). This approach preserves grounding in historical information while enabling adaptive summarization and restructuring. A second line of work explores direct memory generation, in which memory is produced without any explicit retrieval step. Instead, the agent generates memory representations directly from the current context, interaction history, or latent internal states. Systems such as MemGen (Zhang et al., 2025d) and VisMem (Yu et al., 2025e) exemplify this direction by constructing latent memory tokens that are customized to the task at hand, bypassing explicit memory lookup altogether.
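The two directions can be contrasted in skeletal form; the function names are ours, and the callables stand in for retrievers, LLM-based refiners, and latent memory generators such as those in MemGen.

```python
def retrieve_then_generate(query, retrieve_fn, refine_fn):
    """Retrieve-then-generate: retrieved items are raw material that a
    generator refines into one compact, context-specific memory, keeping
    the result grounded in stored history."""
    raw_items = retrieve_fn(query)
    return refine_fn(query, raw_items)

def direct_generation(context, memory_model):
    """Direct memory generation: produce a memory representation from
    the current context alone, with no explicit lookup step."""
    return memory_model(context)
```

The structural difference is where grounding comes from: the first path inherits it from the store, while the second must encode it in the generator's parameters or latent state.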

Future Perspective

Looking ahead, we anticipate that generative approaches will play an increasingly central role in agent memory systems. We highlight three properties that future generative memory mechanisms should ideally exhibit.

First, generative memory should be context-adaptive. Rather than storing generic summaries, the memory system should generate representations that are explicitly optimized for the agent's anticipated future needs. This includes adapting the granularity, abstraction level, and semantic focus of memory to different tasks, stages of problem solving, or interaction regimes.

Second, generative memory should support integration across heterogeneous signals. Agents increasingly operate over diverse modalities and information sources, including text, code, tool outputs, and environmental feedback. Memory generation provides a natural mechanism for fusing these fragmented signals into unified representations that are more useful for downstream reasoning than raw concatenation or retrieval alone. We hypothesize that latent memory (as discussed in Section 3.3) might be a promising technical path toward this goal.

Third, generative memory should be learned and self-optimizing. Rather than relying on manually specified generation rules, future systems should learn when and how to generate memory through optimization signals, such as reinforcement learning or long-horizon task performance. In this view, memory generation becomes an integral component of the agent's policy, co-evolving with reasoning and decision making.

Automated Memory Management

Look Back: From Hand-Crafted to Automatically Constructed Memory Systems

Existing agent memory systems (Xu et al., 2025c; Packer et al., 2023a) typically rely on manually designed strategies to determine what information to store, when to use it, and how to update or retrieve it. By guiding fixed LLMs with detailed instructions (Chhikara et al., 2025), predefined thresholds (Kang et al., 2025a), or explicit rules drafted by human experts (Xu et al., 2025c), system designers can integrate memory modules into current agent frameworks with relatively low computational and engineering cost, enabling rapid prototyping and deployment. Such designs also offer interpretability, reproducibility, and controllability, allowing developers to precisely specify the state and behavior of memory. However, like expert systems in other areas, such manually curated approaches suffer from significant limitations: they are inherently inflexible and often fail to generalize across diverse, dynamic environments. Consequently, these systems tend to underperform in long-term or open-ended interactions.

Recent developments in agent memory research begin to address these limitations by enabling the agents themselves to autonomously manage memory evolution and retrieval. For example, CAM (Li et al., 2025g) empowers LLM agents to automatically cluster fine-grained memory entries into high-level abstract units. Memory-R1 (Yan et al., 2025c) introduces an auxiliary agent equipped with a dedicated 'memory manager' tool to handle memory updates. Despite these advances, current solutions remain constrained: many are still driven by manually engineered rules or are optimized for narrow, task-specific learning objectives, making them difficult to generalize to open-ended settings.
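To make the contrast concrete, the following sketch (all names, thresholds, and rules are hypothetical, not drawn from any cited system) shows the kind of fixed-rule memory manager this line of work relies on: the rules are transparent and reproducible, but nothing in them adapts to the environment.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    text: str
    importance: float  # score assigned at write time (e.g. by an LLM judge)

@dataclass
class RuleBasedMemory:
    """Hand-crafted memory manager: fixed, human-specified rules decide
    what is stored and what is evicted."""
    store_threshold: float = 0.5
    capacity: int = 100
    entries: list = field(default_factory=list)

    def write(self, entry):
        if entry.importance < self.store_threshold:
            return False  # rule 1: discard items below a fixed importance threshold
        self.entries.append(entry)
        if len(self.entries) > self.capacity:
            # rule 2: on overflow, keep only the most important entries
            self.entries.sort(key=lambda e: e.importance, reverse=True)
            self.entries = self.entries[: self.capacity]
        return True

mem = RuleBasedMemory(store_threshold=0.5, capacity=2)
mem.write(MemoryEntry("user prefers metric units", 0.9))
mem.write(MemoryEntry("small talk about weather", 0.2))  # rejected by rule 1
mem.write(MemoryEntry("deadline is Friday", 0.8))
mem.write(MemoryEntry("user's name is Ada", 0.95))       # triggers eviction
```

Every behavior above is decided in advance by the designer, which is precisely what limits generalization across dynamic environments.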

Future Perspective

Looking ahead, we anticipate that generative approaches will play an increasingly central role in agent memory systems. We highlight three properties that future generative memory mechanisms should ideally exhibit.

First, generative memory should be context-adaptive. Rather than storing generic summaries, the memory system should generate representations that are explicitly optimized for the agent's anticipated future needs. This includes adapting the granularity, abstraction level, and semantic focus of memory to different tasks, stages of problem solving, or interaction regimes.

Second, generative memory should support integration across heterogeneous signals. Agents increasingly operate over diverse modalities and information sources, including text, code, tool outputs, and environmental feedback. Memory generation provides a natural mechanism for fusing these fragmented signals into unified representations that are more useful for downstream reasoning than raw concatenation or retrieval alone. We hypothesize that latent memory (as discussed in Section 3.3) might be a promising technical path toward this goal.

Third, generative memory should be learned and self-optimizing. Rather than relying on manually specified generation rules, future systems should learn when and how to generate memory through optimization signals, such as reinforcement learning or long-horizon task performance. In this view, memory generation becomes an integral component of the agent's policy, co-evolving with reasoning and decision making.

Reinforcement Learning Meets Agent Memory

Look-Back: RL is Internalizing Memory Management Abilities for Agents.

Reinforcement learning is rapidly reshaping the development paradigm of modern LLM-based agents. Across a wide spectrum of agentic capabilities, including planning, reasoning, and tool use, as well as across diverse task domains such as mathematical reasoning, deep research, and software engineering, RL has begun to play a central role in driving agent performance (Zhang et al., 2025f,l; Feng et al., 2024). Memory, as one of the foundational components of agentic capability, follows a similar trend from pipeline-based to model-native paradigms (Sang et al., 2025). The agent memory research community is collectively transitioning from early heuristic and manually engineered designs to approaches in which RL increasingly governs key decisions. Looking ahead, it is reasonable to expect that fully RL-based memory systems may eventually become the dominant direction. Before discussing this trajectory in detail, we briefly outline the first stage of development. This transition, in which memory management is progressively internalized and optimized through reinforcement learning, is schematically illustrated in Figure 11.

RL-free Memory Systems A substantial portion of the agent memory literature surveyed earlier can be categorized as RL-free memory systems. These approaches typically rely on heuristic or manually specified mechanisms, such as fixed thresholding rules inspired by forgetting curves, rigid semantic search pipelines found in frameworks such as MemOS (Li et al., 2025l), Mem0 (Chhikara et al., 2025), and MemoBase (Memobase, 2025), or simple concatenation-based strategies for storing memory chunks. In some systems, an LLM participates in memory management in a way that appears agentic, yet the underlying behavior is entirely prompt-driven. The LLM is asked to generate memory entries but has not received any dedicated training for effective memory control, as seen in systems such as Dynamic Cheatsheet (Suzgun et al., 2025), ExpeL (Zhao et al., 2024), EvolveR (Wu et al., 2025c), and G-Memory (Zhang et al., 2025c). This class of methods has dominated early work in the field and is likely to remain influential for some time due to its simplicity and practical accessibility.
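As one concrete instance of such heuristics, a forgetting-curve threshold rule can be sketched in a few lines. The exponential form and all constants below are illustrative, not taken from any cited system:

```python
import math

def retention(elapsed, strength):
    """Ebbinghaus-style retention curve: exp(-t / S), where a larger
    strength S means slower forgetting."""
    return math.exp(-elapsed / strength)

def prune(memories, now, threshold=0.3):
    """Drop entries whose decayed retention falls below a fixed threshold."""
    return [
        m for m in memories
        if retention(now - m["last_access"], m["strength"]) >= threshold
    ]

memories = [
    {"text": "rarely used fact",    "last_access": 0.0,  "strength": 10.0},
    {"text": "recently used fact",  "last_access": 90.0, "strength": 10.0},
    {"text": "well-rehearsed fact", "last_access": 0.0,  "strength": 200.0},
]
kept = prune(memories, now=100.0)  # keeps the recent and the well-rehearsed entries
```

The threshold and decay constants are set once by the designer, which is exactly why such rules struggle to adapt across environments.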

RL-assisted Memory Systems As the field progressed, many works began to incorporate RL-based methods into selected components of the memory pipeline. An early attempt in this direction is RMM (Tan et al., 2025c), which employed a lightweight policy-gradient learner to rank memory chunks after an initial retrieval stage based on BM25 or other semantic similarity metrics. Later systems explored substantially more ambitious designs. For example, Memα (Wang et al., 2025p) delegates the entire process of memory construction to an agent trained with RL, and Memory-R1 (Yan et al., 2025c) employs a similar philosophy. A rapidly expanding line of research investigates how an agent can autonomously fold, compress, and manage context in ultra-long multi-turn tasks. This setting corresponds to the management of working memory (Kang et al., 2025c; Ye et al., 2025a). Many of the leading systems in this area are trained with RL, including but not limited to Context Folding (Sun et al., 2025b), Memory-as-Action (Zhang et al., 2025r), MemSearcher (Yuan et al., 2025a), and IterResearch (Chen et al., 2025b). These RL-assisted approaches have already demonstrated strong capabilities and point toward the increasing role of RL in future memory system design.
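To illustrate the general idea behind policy-gradient re-ranking (this is a toy sketch, not RMM's actual architecture), the code below trains a linear softmax policy with REINFORCE to prefer the first-stage retrieval candidate that yields task reward. The two-feature setup, rewards, and hyperparameters are all hypothetical.

```python
import math
import random

class ReinforceReranker:
    """Toy REINFORCE re-ranker over candidates returned by a first-stage
    retriever (e.g. BM25): a linear policy scores candidate feature vectors,
    one is sampled, and downstream reward updates the weights."""
    def __init__(self, n_features, lr=0.1, seed=0):
        self.w = [0.0] * n_features
        self.lr = lr
        self.rng = random.Random(seed)

    def _softmax(self, scores):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    def _scores(self, feats):
        return [sum(wi * xi for wi, xi in zip(self.w, x)) for x in feats]

    def select(self, feats):
        probs = self._softmax(self._scores(feats))
        r, acc = self.rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i, probs
        return len(feats) - 1, probs

    def update(self, feats, chosen, probs, reward):
        # REINFORCE gradient for a softmax policy: (1[i == chosen] - p_i) * x_i
        for i, x in enumerate(feats):
            coef = self.lr * reward * ((1.0 if i == chosen else 0.0) - probs[i])
            self.w = [wi + coef * xi for wi, xi in zip(self.w, x)]

feats = [[1.0, 0.0], [0.0, 1.0]]   # two retrieved memory chunks
rr = ReinforceReranker(n_features=2)
for _ in range(300):
    i, probs = rr.select(feats)
    reward = 1.0 if i == 0 else -1.0  # the task succeeds only with chunk 0
    rr.update(feats, i, probs, reward)
# after training, the policy strongly prefers chunk 0
```

Real systems replace the hand-built features with learned embeddings and the scalar reward with long-horizon task outcomes, but the update rule is the same in spirit.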

RL-free Memory Systems

While an LLM-based agent interacts with an environment, its instantaneous observation $o_t^i$ is often insufficient for effective decision-making. Agents therefore rely on additional information derived from prior interactions, both within the current task and across previously completed tasks. We formalize this capability through a unified agent memory system, represented as an evolving memory state

$$ \mathcal{M}_t \in \mathbb{M}, $$

where $\mathbb{M}$ denotes the space of admissible memory configurations. No specific internal structure is imposed on $\mathcal{M}_t$; it may take the form of a text buffer, key-value store, vector database, graph structure, or any hybrid representation. At the beginning of a task, $\mathcal{M}_t$ may already contain information distilled from prior trajectories (cross-trial memory). During task execution, new information accumulates and functions as short-term, task-specific memory. Both roles are supported within a single memory container, with temporal distinctions emerging from usage patterns rather than architectural separation.

Memory Lifecycle: Formation, Evolution, and Retrieval The dynamics of the memory system are characterized by three conceptual operators.

Memory Formation At time step $t$, the agent produces informational artifacts $\phi_t$, which may include tool outputs, reasoning traces, partial plans, self-evaluations, or environmental feedback. A formation operator

$$ \mathcal{M}_{t+1}^{\mathrm{form}} = F(\mathcal{M}_t, \phi_t), $$

selectively transforms these artifacts into memory candidates, extracting information with potential future utility rather than storing the entire interaction history verbatim.

Memory Evolution Formed memory candidates are integrated into the existing memory base through an evolution operator

$$ \mathcal{M}_{t+1} = E(\mathcal{M}_{t+1}^{\mathrm{form}}), $$

which may consolidate redundant entries (Zhao et al., 2024), resolve conflicts (Rasmussen et al., 2025; Li et al., 2025l), discard low-utility information (Wang et al., 2025r), or restructure memory for efficient retrieval. The resulting memory state persists across subsequent decision steps and tasks.

Memory Retrieval When selecting an action, agent $i$ retrieves a context-dependent memory signal

$$ m_t^i = R(\mathcal{M}_t, o_t^i, \mathcal{Q}), $$

where $R$ denotes a retrieval operator that constructs a task-aware query and returns relevant memory content. The retrieved signal $m_t^i$ is formatted for direct consumption by the LLM policy, for example as a sequence of textual snippets or a structured summary.

Temporal Roles Within the Agent Loop Although memory is represented as a unified state $\mathcal{M}_t$, the three lifecycle operators (formation $F$, evolution $E$, and retrieval $R$) need not be invoked at every time step. Instead, different memory effects arise from distinct temporal invocation patterns. For instance, some systems perform retrieval only once at task initialization,

$$ m_t^i = \begin{cases} R(\mathcal{M}_0, o_0^i, \mathcal{Q}), & t = 0, \\[4pt] \bot, & t > 0, \end{cases} $$

where $\bot$ denotes a null retrieval strategy. Others may retrieve memory intermittently or continuously based on contextual triggers. Similarly, memory formation may range from minimal accumulation of raw observations,

$$ \mathcal{M}_{t+1}^{\mathrm{form}} = \mathcal{M}_t \cup \{ o_t^i \}, $$

to sophisticated extraction and refinement of reusable patterns or abstractions. Thus, inside a task, short-term memory effects may arise from lightweight logging, as in Yao et al. (2023b) and Chen et al. (2023a), or from more elaborate iterative refinement (Hu et al., 2025a); across tasks, long-term memory may be updated episodically at task boundaries or continuously throughout operation. Short-term and long-term memory phenomena therefore emerge not from discrete architectural modules but from the temporal patterns with which formation, evolution, and retrieval are engaged.

Memory-Agent Coupling The interaction between memory and the agent's decision process is similarly flexible. In general, the agent policy is written as

$$ a_t = \pi_i(o_t^i, m_t^i, \mathcal{Q}), $$

where the retrieved memory signal $m_t^i$ may be present or absent depending on the retrieval schedule. When retrieval is disabled at a given step, $m_t^i$ can be treated as a distinguished null input.

Consequently, the overall agent loop consists of observing the environment, optionally retrieving memory, computing an action, receiving feedback, and optionally updating memory through formation and evolution. Different agent implementations instantiate different subsets of these operations at different temporal frequencies, giving rise to memory systems that range from passive buffers to actively evolving knowledge bases.
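The loop just described can be instantiated minimally as follows. The `form`, `evolve`, and `retrieve` methods below are toy stand-ins for the operators $F$, $E$, and $R$, with keyword matching in place of semantic retrieval; all names are hypothetical.

```python
class MemorySystem:
    """Toy instantiation of the lifecycle operators F (formation),
    E (evolution), and R (retrieval) over a plain-text memory list."""
    def __init__(self):
        self.M = []  # memory state M_t, here just a list of strings

    def form(self, artifacts):
        # F: keep only artifacts flagged as useful, not the verbatim history
        return [a["text"] for a in artifacts if a["useful"]]

    def evolve(self, candidates):
        # E: integrate candidates, skipping exact duplicates (toy consolidation)
        for c in candidates:
            if c not in self.M:
                self.M.append(c)

    def retrieve(self, query):
        # R: keyword overlap standing in for semantic retrieval
        return [m for m in self.M if any(w in m for w in query.split())]

def agent_step(mem, observation, policy):
    m_t = mem.retrieve(observation)           # optional retrieval
    action, artifacts = policy(observation, m_t)
    mem.evolve(mem.form(artifacts))           # optional formation + evolution
    return action

def toy_policy(obs, retrieved):
    artifacts = [
        {"text": f"note about {obs}", "useful": True},
        {"text": "scratch reasoning", "useful": False},
    ]
    return f"act_on:{obs}", artifacts

mem = MemorySystem()
agent_step(mem, "door", toy_policy)
agent_step(mem, "key", toy_policy)
```

Different systems differ mainly in which of these calls they make, how often, and how much intelligence sits behind each operator.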



Multimodal Memory

Look-Back

As research on text-based memory becomes increasingly mature and extensively explored, and as multimodal large language models and unified models that jointly support multimodal understanding and generation continue to advance, attention has naturally expanded toward multimodal memory. This shift reflects a broader recognition that many real-world agentic settings are inherently multimodal, and that memory systems limited to text alone are insufficient to support long-horizon reasoning and interaction in complex environments.

Existing efforts on multimodal memory can be broadly grouped into two complementary directions. The first focuses on enabling multimodal agents to store, retrieve, and utilize memories derived from diverse sensory inputs (Long et al., 2025; Zuo et al., 2025). This direction is a natural extension of agent memory, since agents operating in realistic environments inevitably encounter heterogeneous data sources, including images, audio, video, and other non-textual signals (Xie et al., 2024). The degree of progress in multimodal memory closely follows the maturity of corresponding modalities. Visual modalities such as images and videos have received the most attention, leading to a growing body of work on visual and video memory mechanisms that support tasks such as visual grounding, temporal tracking, and long-term scene consistency (Long et al., 2025; Wang et al., 2024g; Gurukar and Kadav, 2025; Yu et al., 2025e; Bo et al., 2025; Wang et al., 2025q; Li et al., 2024d). In contrast, memory systems for audio and other modalities remain relatively underexplored (Li et al., 2025a).

The second direction treats memory as an enabling component for unified models. In this setting, memory is leveraged not primarily to support agent decision making, but to enhance multimodal generation and consistency. For example, in image and video generation systems, memory mechanisms are often used to preserve entity consistency, maintain world state across frames, or ensure coherence across long generation horizons (Yu et al., 2025b). Here, memory serves as a stabilizing structure that anchors generation to previously produced content, rather than as a record of agent experience per se.
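Both directions presuppose a store that can index non-textual content. One common practical pattern, sketched below with hypothetical names, is to attach a textual surrogate (e.g. a caption) to every entry so that a single retrieval interface spans modalities:

```python
from dataclasses import dataclass

@dataclass
class MultimodalEntry:
    modality: str    # "text", "image", "audio", ...
    content: object  # raw payload or a reference such as a file path
    caption: str     # textual surrogate used for cross-modal retrieval

class MultimodalMemory:
    """Toy store indexing heterogeneous entries via textual surrogates,
    so one retrieval interface spans modalities."""
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def retrieve(self, query, modality=None):
        # substring match on captions stands in for embedding-based search
        hits = [e for e in self.entries if query.lower() in e.caption.lower()]
        if modality is not None:
            hits = [e for e in hits if e.modality == modality]
        return hits

mm = MultimodalMemory()
mm.add(MultimodalEntry("image", "frame_0421.png", "photo of the kitchen whiteboard"))
mm.add(MultimodalEntry("text", "user message", "shopping list for the kitchen"))
```

Production systems would replace captions with joint embedding spaces, but the design question is the same: how non-textual content is made addressable by a shared retrieval mechanism.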


Shared Memory in Multi-Agent Systems

Look-Back: From Isolated Memories to Shared Cognitive Substrates

As LLM-based multi-agent systems (MAS) have gained prominence, shared memory has emerged as a key mechanism for enabling coordination, consistency, and collective intelligence. Early multi-agent frameworks primarily relied on isolated local memories coupled with explicit message passing, where agents exchanged information through dialogue histories or task-specific communication protocols (Qian et al., 2024; Wu et al., 2024b; Hu et al., 2025b; Zhang et al., 2025j). While this design avoided direct interference between agents, it often suffered from redundancy, fragmented context, and high communication overhead, especially as team size and task horizon increased.

Subsequent work introduced centralized shared memory structures , such as global vector stores, blackboard systems, or shared documents (Hong et al., 2024), accessible to all agents. These designs enabled a form of team-level memory that supported joint attention, reduced duplication, and facilitated long-horizon coordination. Representative systems demonstrated that shared memory could serve as a persistent common ground for planning, role handoff, and consensus building (Rezazadeh et al., 2025b; Xu et al., 2025a). However, naive global sharing also exposed new challenges, including memory clutter, write contention, and the lack of role- or permission-aware access control.
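The blackboard pattern and its missing access-control layer can be made concrete in a few lines. The sketch below is a hypothetical API, not drawn from any cited framework; it adds per-role write permissions and a lock that serializes writes:

```python
import threading

class Blackboard:
    """Toy shared memory for a multi-agent team: all agents may read,
    but only whitelisted roles may write, and a lock serializes writes
    to avoid write contention."""
    def __init__(self, write_roles):
        self._entries = []  # (author, role, text)
        self._write_roles = set(write_roles)
        self._lock = threading.Lock()

    def post(self, author, role, text):
        if role not in self._write_roles:
            raise PermissionError(f"role '{role}' may not write")
        with self._lock:
            self._entries.append((author, role, text))

    def read(self):
        return list(self._entries)  # the team's persistent common ground

bb = Blackboard(write_roles={"planner", "executor"})
bb.post("agent_a", "planner", "subgoal: open the door")
denied = False
try:
    bb.post("agent_b", "critic", "overwrite the plan")
except PermissionError:
    denied = True
```

Naive global sharing corresponds to `write_roles` containing every role; the clutter and contention problems noted above arise precisely when such restrictions are absent.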



Trustworthy Memory

Look-Back: From Trustworthy RAG to Trustworthy Memory

As shown throughout this survey, memory plays a foundational role in enabling agentic behavior, supporting persistence, personalization, and continual learning. However, as memory systems become more deeply embedded into LLM-based agents, the question of trustworthiness has become paramount.

Earlier concerns around hallucination and factuality in retrieval-augmented generation (RAG) systems (Niu et al., 2024; Sun et al., 2025f; Lu et al., 2025c) have now evolved into a broader trust discourse for memory-augmented agents. Similar to RAG, one major motivation for using external or long-term memory is to reduce hallucinations by grounding model outputs in retrievable, factual content (Ru et al., 2024; Wang et al., 2025c). However, unlike RAG, agent memory often stores user-specific, persistent, and potentially sensitive content, ranging from factual knowledge to past interactions, preferences, or behavioral traces. This introduces additional challenges in privacy, interpretability, and safety.

Recent work by Wang et al. (2025b) demonstrates that memory modules can leak private data through indirect prompt-based attacks, highlighting the risk of memorization and over-retention. Concurrently, Wu et al. (2025g) argues that agent memory systems must support explicit mechanisms for access control , verifiable forgetting , and auditable updates to remain trustworthy. Notably, such threats are magnified in agent scenarios where memory persists across long time horizons.
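To make these desiderata concrete, the toy sketch below (all names hypothetical) pairs an append-only audit log with hash-based deletion receipts. A production system would add cryptographic commitments and genuine access control on top of this.

```python
import hashlib

class AuditedMemory:
    """Toy store sketching auditable updates and verifiable forgetting:
    every operation is logged, and deletion returns a hash receipt of the
    removed content instead of retaining the content itself."""
    def __init__(self):
        self._items = {}
        self._audit = []  # append-only log of (operation, key)

    def write(self, key, text):
        self._items[key] = text
        self._audit.append(("write", key))

    def forget(self, key):
        digest = hashlib.sha256(self._items.pop(key).encode()).hexdigest()
        self._audit.append(("forget", key))
        return digest  # receipt identifying what was deleted, without keeping it

    def audit_log(self):
        return list(self._audit)

store = AuditedMemory()
store.write("pref", "user lives in Berlin")
receipt = store.forget("pref")  # content is gone; only the receipt remains
```

The receipt lets an auditor later verify that a specific piece of content was the one deleted, without the store having to retain that content, which is the essence of verifiable forgetting.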

Explainability also remains a critical bottleneck. While explicit memory, such as text logs or key-value stores, offers some transparency, users and developers still lack tools to trace which memory items were retrieved, how they influenced generation, or whether they were misused. In this regard, diagnostic tools like RAGChecker (Ru et al., 2024) and conflict-resolution frameworks such as RAMDocs with MADAM-RAG (Wang et al., 2025d) provide inspiration for tracing memory usage and reasoning under uncertainty.

Moreover, beyond individual memory, Shi et al. (2025d) and Rezazadeh et al. (2025a) highlight the emerging importance of collective privacy in shared or federated memory systems, which may operate across multi-agent deployments or organizations. All these developments collectively signal a need to elevate trust as a first-class principle in memory design.


Human-Cognitive Connections

Look-Back

The architecture of contemporary agent memory systems has converged with foundational models of human cognition established over the last century. The prevailing design, which couples a capacity-limited context window with massive external vector databases, mirrors the Atkinson-Shiffrin multi-store model (Atkinson and Shiffrin, 1968), effectively instantiating an artificial counterpart to the distinction between working memory and long-term memory (Baddeley, 2012). Furthermore, the partitioning of agent memory into interaction logs, world knowledge, and code-based skills exhibits a striking structural alignment with Tulving's classification of episodic , semantic , and procedural memory (Tulving, 1972; Squire, 2004). Current frameworks (Zhong et al., 2024; Park et al., 2023; Gutierrez et al., 2024; Li et al., 2025l) operationalize these biological categories into engineering artifacts, where episodic memory provides autobiographical continuity and semantic memory offers generalized world knowledge.

Despite these structural parallels, a fundamental divergence remains in the dynamics of retrieval and maintenance. Human memory operates as a constructive process , where the brain actively reconstructs past events based on current cognitive states rather than replaying exact recordings (Schacter and Addis, 2007). In contrast, the majority of existing agent memory systems rely on verbatim retrieval mechanisms like RAG, treating memory as a repository of immutable tokens to be queried via semantic similarity (Packer et al., 2023b; Chhikara et al., 2025). Consequently, while agents possess a veridical record of the past, they lack the biological capacity for memory distortion, abstraction, and the dynamic remodeling of history that characterizes human intelligence.

Future Perspective

Looking ahead, we anticipate that generative approaches will play an increasingly central role in agent memory systems. We highlight three properties that future generative memory mechanisms should ideally exhibit.

First, generative memory should be context-adaptive. Rather than storing generic summaries, the memory system should generate representations that are explicitly optimized for the agent's anticipated future needs. This includes adapting the granularity, abstraction level, and semantic focus of memory to different tasks, stages of problem solving, or interaction regimes.
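As a toy illustration (the function and its stage-to-granularity mapping are our own assumptions, not a published mechanism), the same trajectory can be rendered at different granularities depending on which stage of problem solving will consume it:

```python
def generate_memory(steps: list[str], stage: str) -> str:
    # Coarse abstraction for planning, full detail for execution,
    # and a short recency window otherwise (illustrative heuristic).
    if stage == "planning":
        return f"{len(steps)}-step trajectory ending in: {steps[-1]}"
    if stage == "execution":
        return " -> ".join(steps)
    return "; ".join(steps[-3:])

steps = ["open site", "search flights", "filter morning", "select option"]
print(generate_memory(steps, "planning"))
# → 4-step trajectory ending in: select option
```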

Second, generative memory should support integration across heterogeneous signals. Agents increasingly operate over diverse modalities and information sources, including text, code, tool outputs, and environmental feedback. Memory generation provides a natural mechanism for fusing these fragmented signals into unified representations that are more useful for downstream reasoning than raw concatenation or retrieval alone. We hypothesize that latent memory (as discussed in Section 3.3) might be a promising technical path toward this goal.
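A minimal sketch of such fusion, with an assumed record schema and a truncating stand-in for an LLM summarizer (none of these names come from a specific system):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    task: str
    summary: str
    sources: list[str] = field(default_factory=list)

def summarise(fragments: dict[str, str], budget: int = 120) -> str:
    # Stand-in for an LLM-based generator: concatenate labelled
    # fragments and truncate to a character budget.
    joined = "; ".join(f"{kind}: {text}" for kind, text in fragments.items())
    return joined[:budget]

def fuse(task: str, fragments: dict[str, str]) -> MemoryRecord:
    # One generated record replaces several raw, fragmented signals.
    return MemoryRecord(
        task=task,
        summary=summarise(fragments),
        sources=sorted(fragments),
    )

record = fuse(
    "book flight",
    {
        "dialogue": "user wants a morning flight",
        "tool_output": "search API returned 3 options",
        "env_feedback": "booking page loaded",
    },
)
print(record.sources)  # → ['dialogue', 'env_feedback', 'tool_output']
```

The design choice is that downstream reasoning consumes a single coherent record instead of re-reading each heterogeneous fragment.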

Third, generative memory should be learned and self-optimizing. Rather than relying on manually specified generation rules, future systems should learn when and how to generate memory through optimization signals, such as reinforcement learning or long-horizon task performance. In this view, memory generation becomes an integral component of the agent's policy, co-evolving with reasoning and decision making.
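In the simplest RL reading of this idea, even the decision of when to write memory can be learned from downstream reward. The sketch below is a deliberately toy two-armed bandit (the reward model and constants are our assumptions), not a full agent policy:

```python
import random

class WriteGate:
    """Toy two-action policy deciding whether to commit a memory."""

    def __init__(self) -> None:
        self.value = {"write": 0.0, "skip": 0.0}  # action-value estimates

    def act(self, eps: float = 0.1) -> str:
        if random.random() < eps:
            return random.choice(["write", "skip"])  # explore
        return max(self.value, key=self.value.get)   # exploit

    def update(self, action: str, reward: float, lr: float = 0.1) -> None:
        # Incremental action-value update toward the observed reward.
        self.value[action] += lr * (reward - self.value[action])

random.seed(0)
gate = WriteGate()
for _ in range(500):
    action = gate.act()
    # Toy environment: committing memory pays off on long-horizon tasks.
    reward = 1.0 if action == "write" else 0.2
    gate.update(action, reward)

print(gate.value["write"] > gate.value["skip"])  # → True
```

In a real system the reward would come from long-horizon task success, and the gate would condition on the content being considered for storage.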

Conclusion

This survey has examined agent memory as a foundational component of modern LLM-based agentic systems. By framing existing research through the unified lenses of forms, functions, and dynamics, we have clarified the conceptual landscape of agent memory and situated it within the broader evolution of agentic intelligence. On the level of forms, we identify three principal realizations: token-level, parametric, and latent memory, each of which has undergone distinct and rapid advances in recent years, reflecting fundamentally different trade-offs in representation, adaptability, and integration with agent policies. On the level of functions, we move beyond the coarse long-term versus short-term dichotomy prevalent in prior surveys and instead propose a finer-grained, more encompassing taxonomy that distinguishes factual, experiential, and working memory according to their roles in knowledge retention, capability accumulation, and task-level reasoning. Together, these perspectives reveal that memory is not merely an auxiliary storage mechanism, but an essential substrate through which agents achieve temporal coherence, continual adaptation, and long-horizon competence.

Beyond organizing prior work, we have identified key challenges and emerging directions that point toward the next stage of agent memory research. In particular, the increasing integration of reinforcement learning, the rise of multimodal and multi-agent settings, and the shift from retrieval-centric to generative memory paradigms suggest a future in which memory systems become fully learnable, adaptive, and self-organizing. Such systems hold the potential to transform large language models from powerful but static generators into agents capable of sustained interaction, self-improvement, and principled reasoning over time.

We hope this survey provides a coherent foundation for future research and serves as a reference for both researchers and practitioners. As agentic systems continue to mature, the design of memory will remain a central and open problem, one that is likely to play a decisive role in the development of robust, general, and enduring artificial intelligence.

Contents

1 Introduction
2 Preliminaries: Formalizing Agents and Memory
2.1 LLM-based Agent Systems
2.2 Agent Memory Systems
2.3 Comparing Agent Memory with Other Key Concepts
2.3.1 Agent Memory vs. LLM Memory
2.3.2 Agent Memory vs. RAG
2.3.3 Agent Memory vs. Context Engineering
3 Form: What Carries Memory?
3.1 Token-level Memory
3.1.1 Flat Memory (1D)
3.1.2 Planar Memory (2D)
3.1.3 Hierarchical Memory (3D)
3.2 Parametric Memory
3.2.1 Internal Parametric Memory
3.2.2 External Parametric Memory
3.3 Latent Memory
3.3.1 Generate
3.3.2 Reuse
3.3.3 Transform
3.4 Adaptation
4 Functions: Why Agents Need Memory?
4.1 Factual Memory
4.1.1 User Factual Memory
4.1.2 Environment Factual Memory
4.2 Experiential Memory
4.2.1 Case-based Memory
4.2.2 Strategy-based Memory
4.2.3 Skill-based Memory
4.2.4 Hybrid Memory
4.3 Working Memory
4.3.1 Single-turn Working Memory
4.3.2 Multi-turn Working Memory
5 Dynamics: How Memory Operates and Evolves?
5.1 Memory Formation
5.1.1 Semantic Summarization
5.1.2 Knowledge Distillation
5.1.3 Structured Construction
5.1.4 Latent Representation
5.1.5 Parametric Internalization
5.2 Memory Evolution
5.2.1 Consolidation
5.2.2 Updating
5.2.3 Forgetting
5.3 Memory Retrieval
5.3.1 Retrieval Timing and Intent
5.3.2 Query Construction
5.3.3 Retrieval Strategies
5.3.4 Post-Retrieval Processing
6 Resources and Frameworks
6.1 Benchmarks and Datasets
6.1.1 Benchmarks for Memory / Lifelong / Self-Evolving Agents
6.1.2 Other Related Benchmarks
6.2 Open-Source Frameworks
7 Positions and Frontiers
7.1 Memory Retrieval vs. Memory Generation
7.1.1 Look Back: From Memory Retrieval to Memory Generation
7.1.2 Future Perspective
7.2 Automated Memory Management
7.2.1 Look Back: From Hand-crafted to Automatically Constructed Memory Systems
7.2.2 Future Perspective
7.3 Reinforcement Learning Meets Agent Memory
7.3.1 Look Back: RL is Internalizing Memory Management Abilities for Agents
7.3.2 Future Perspective
7.4 Multimodal Memory
7.4.1 Look Back
7.4.2 Future Perspective
7.5 Shared Memory in Multi-Agent Systems
7.5.1 Look Back: From Isolated Memories to Shared Cognitive Substrates
7.5.2 Future Perspective
7.6 Memory for World Model
7.6.1 Look Back
7.6.2 Future Perspective
7.7 Trustworthy Memory
7.7.1 Look Back: From Trustworthy RAG to Trustworthy Memory
7.7.2 Future Perspective
7.8 Human-Cognitive Connections
7.8.1 Look Back
7.8.2 Future Perspective
Method | Multi | Type | Memory Form | Task
Flat Memory Models
Reflexion (Shinn et al., 2023b)E&WTrajectory as short-term and feedback as long-termQA, Reasoning, Coding
Memento (Zhou et al., 2025a)ExpTrajectory case (success/failure).Reasoning, Game
JARVIS-1 (Wang et al., 2025q)ExpPlan-environment pairs.
Expel (Zhao et al., 2024)ExpInsights and few-shot examples.Reasoning
Buffer of Thoughts (Yang et al., 2024b)ExpHigh-level thought-templates.Game, Reasoning, Coding
SAGE (Liang et al., 2025)ExpDual-store with forgetting mechanism.Game, Reasoning, Coding
ChemAgent (Tang et al., 2025c)ExpStructured sub-tasks and principles.Chemistry
AgentKB (Tang et al., 2025d)Exp5-tuple experience nodes.Coding, Reasoning
H2R (Ye et al., 2025b)ExpPlanning and Execution layers.Game, Embodied Simulation
AWM (Wang et al., 2024m)ExpAbstracted universal workflows.Web
PRINCIPLES (Kim et al., 2025a)ExpRule templates from self-play.Emotional Companion
ReasoningBank (Ouyang et al., 2025)ExpTransferable reasoning strategy items.Web
Voyager (Wang et al., 2024b)ExpExecutable skill code library.Game
DGM (Zhang et al., 2025i)ExpRecursive self-modifiable codebase.Coding
Memp (Fang et al., 2025d)ExpInstructions and abstract scripts.Embodied Simulation, Travel Planning
UFO2 (Zhang et al., 2025a)ExpSystem docs and interaction records.Windows OS
LEGOMem (Han et al., 2025a)ExpVectorized task trajectories.Office
ToolMem (Xiao et al., 2025b)ExpTool capability.Tool Calling
SCM (Wang et al., 2025a)FactMemory stream and vector database.Long-context
MemoryBank (Zhong et al., 2024)FactHistory and user profile.Emotional Companion
MPC (Lee et al., 2023)FactPersona and summary vector pool.QA
RecMind (Wang et al., 2024h)FactUser metadata and external knowledge.Recommendation
InteRecAgent (Huang et al., 2025d)FactUser profiles and candidate item.Recommendation
Ego-LLaVA (Shen et al., 2024)FactLanguage-encoded chunk embeddings.Multimodal QA
ChatHaruhi (Li et al., 2023a)FactDialogue database from media.Role-Playing
Memochat (Lu et al., 2023)FactMemos and categorized dialogue history.Long-conv QA
RecursiveSum (Wang et al., 2025h)FactRecursive summaries of short dialogues.Long-conv QA
MemGPT (Packer et al., 2023a)FactVirtual memory (Main/External contexts).Long-conv QA, Doc QA
Method | Multi | Type | Memory Structure | Task
RoleLLM (Wang et al., 2024d)FactRole-specific QA pairs.Role-Playing
Think-in-memory (Liu et al., 2023a)FactHash table of inductive thoughts.Long-conv QA
PLA (Yuan et al., 2025b)FactEvolving records of history and summaries.QA, Human Feedback
COMEDY (Chen et al., 2025d)FactSingle-model compressed memory format.Summary, Compression, QA
Memoro (Zulfikar et al., 2024)FactSpeech-to-text vector embeddings.User Study
Memory Sharing (Gao and Zhang, 2024a)FactQuery-Response pair retrieval.Literary Creation, Logic, Plan Generation
Conv Agent (Alonso et al., 2024)FactChain-of-tables and vector entries.QA
EM-LLM (Fountas et al., 2025)FactEpisodic events with Bayesian boundaries.Long-context
Memocrs (Xi et al., 2024a)FactUser metadata and knowledge.Recommendation
SECOM (Pan et al., 2025)FactParagraph-level segmented blocks.Long-conv QA
Mem0 (Chhikara et al., 2025)FactSummary and original dialogue.Long-conv QA
RMM (Tan et al., 2025c)FactReflection-organized flat entries.Personalization
MEMENTO (Kwon et al., 2025)FactInteraction history entries.Personalization
MemGuide (Du et al., 2025b)FactDialogue-derived QA pairs.Long-conv QA
MIRIX (Wang and Chen, 2025)FactSix optimized flat memory types.Long-conv QA
SemanticAnchor (Chatterjee and Agarwal, 2025)FactSyntactic 5-tuple structure.Long-conv QA
MMS (Zhang et al., 2025b)FactDual Retrieval and Context units.Long-conv QA
Memory-R1 (Yan et al., 2025c)FactRL-managed mem0 architecture.Long-conv QA
ComoRAG (Wang et al., 2025f)FactFact/Semantic/Plot units with probes.Narrative QA
Nemori (Nan et al., 2025)FactPredictive calibration store.Long-conv QA
Livia (Xi and Wang, 2025)FactPruned interaction history.Emotional Companion
MOOM (Chen et al., 2025e)FactDecoupled plot and character stores.Role-Playing
Mem-α (Wang et al., 2025p)FactCore, Semantic, and Episodic Mem.Memory Management
Personalized Long-term Interaction (Westhäußer et al., 2025)FactHierarchical history and summaries.Personalization
LightMem (Fang et al., 2025b)FactOptimized Long/Short-term store.Long-conv QA
MEXTRA (Wang et al., 2025b)FactExtracted raw dialogue data.Privacy Attack
MovieChat (Song et al., 2024)FactShort-term features and long-term persistence.Video Understanding
MA-LMM (He et al., 2024)FactVisual and Query memory banks.Video Understanding
VideoAgent (Wang et al., 2024g)FactTemporal text descriptions and object tracking.Video Understanding
Video-RAG (Luo et al., 2025b)FactVisually-aligned information.Video Understanding
KARMA (Wang et al., 2025r)Fact3D scene graph and dynamic object states.Embodied Task
Embodied VideoAgent (Fan et al., 2025)FactPersistent object and sensor store.MultiModal
Mem2Ego (Zhang et al., 2025m)FactMap, landmark, and visited location stores.Embodied Navigation
Context-as-Memory (Yu et al., 2025b)FactGenerated context frames.Video Generation
RCR-Router (Liu et al., 2025d)FactBudget-aware semantic subsets.QA
ELL (Cai et al., 2025a)FactLifelong memory and skills.Lifelong Learning
MemRL (Zhang et al., 2026)ExpRL for memory management.Web
ReMe (Cao et al., 2025b)ExpStep level experience and insight.Web
MMAG (Zeppieri, 2025)FactFive interacting memory layers.User Study
Hindsight (Latimer et al., 2025)FactRetains, recalls, and reflects.Long-conv QA
GAM (Yan et al., 2025a)FactSimple memory but search is guided.Long-conv QA
Planar Memory Models
D-SMART (Lei et al., 2025)FactStructured memory with reasoning trees.Long-conv QA
Reflexion (Shinn et al., 2023b)WorkReflective text buffer from experiences.QA, Reasoning, Coding
Method | Multi | Type | Memory Structure | Task
PREMem (Kim et al., 2025b)FactDynamic cross-session linked triples.Long-conv QA
Query Reconstruct (Xu et al., 2025b)ExpLogic graphs built from knowledge bases.KnowledgeGraph QA
KGT (Sun et al., 2024)FactKG node from query and feedback.QA
Optimus-1 (Li et al., 2024d)F&EKnowledge graph and experience pool.Game
SALI (Pan et al., 2024)ExpTopological graph with spatial nodes.Navigation
HAT (A et al., 2024)FactHierarchical aggregate tree.Long-conv QA
MemTree (Rezazadeh et al., 2025c)FactDynamic hierarchical conversation tree.Long-conv QA
TeaFarm (iunn Ong et al., 2025)FactCausal edges connecting memories.Long-conv QA
COMET (Kim et al., 2024b)FactContext-aware memory through graph.Long-conv QA
Intrinsic Memory (Yuen et al., 2025)FactPrivate internal and shared external mem.Planning
A-MEM (Xu et al., 2025c)FactCard-based connected mem.Long-conv QA
Ret-LLM (Modarressi et al., 2023)FactTriplet table and LSH vectors.QA
HuaTuo (Wang et al., 2023a)FactMedical Knowledge Graph.Medical QA
M3-Agent (Long et al., 2025)FactMultimodal nodes in graph structure.Embodied QA
EMem (Zhou and Han, 2025a)FactEvent-centric alternative with pagerank.Long-conv QA
WorldMM (Yeo et al., 2025)FactMultiple complementary memories.Video Understanding
Memoria (Sarin et al., 2025)FactKnowledge-graph profile and summary.Long-conv QA
LingoEDU (Zhou et al., 2026)FactRelation tree of Elementary Discourse Units.Long-conv QA
Hierarchical Memory Models
GraphRAG (Edge et al., 2025)FactMulti-level community graph indices.QA, Summarization
H-Mem (Sun and Zeng, 2025)FactDecoupled index layers and content layers.Long-conv QA
EMG-RAG (Wang et al., 2024l)FactThree-tiered memory graph.QA
G-Memory (Zhang et al., 2025c)ExpQuery-centric three-layer graph structure.QA, Game, Embodied Task
Zep (Rasmussen et al., 2025)FactTemporal Knowledge Graphs.Long-conv QA
SGMem (Wu et al., 2025h)FactChunk Graph and Sentence Graph.Long-conv QA
HippoRAG (Gutierrez et al., 2024)FactKnowledge with query nodes.QA
HippoRAG 2 (Gutiérrez et al., 2025)FactKG with phrase and passage.QA
AriGraph (Anokhin et al., 2024)FactSemantic and Episodic memory graph.Game
Lyfe Agents (Kaiya et al., 2023)FactWorking, Short & Long-term layers.Social Simulation
CAM (Li et al., 2025g)FactMultilayer graph with topic.Doc QA
HiAgent (Hu et al., 2025a)E&WGoal graphs with recursive cluster.Agentic Tasks
ILM-TR (Tang et al., 2024)FactHierarchical Memory tree.Long-context
CompassMem (Hu et al., 2026b)FactHierarchical event-centric Memory.QA
MAGMA (Jiang et al., 2026)FactSemantic, temporal, causal, entity graphs.Long-conv QA
EverMemOS (Hu et al., 2026a)FactReusable memories covering multi types.Long-conv QA
RGMem (Tian et al., 2025a)FactRenormalization Group-based memory.Long-conv QA
MemVerse (Liu et al., 2025e)FactMultimodal hierarchical knowledge graphs.Reasoning, QA
Method | Type | Task | Optimization
I. Internal Parametric Memory
(a) Pre-Train Phase
TNL (Qin et al., 2024b)WorkingQA, ReasoningSFT
StreamingLLM (Xiao et al., 2024)WorkingQA, ReasoningSFT
LMLM (Zhao et al., 2025b)FactualQA, Factual GenSFT
HierMemLM (Pouransari et al., 2025)FactualQA, Language ModelingSFT
Function Token (Zhang et al., 2025o)FactualLanguage ModelingPretrain
(b) Mid-Train Phase
Agent-Founder (Su et al., 2025)ExperientialTool Calling, Deep ResearchSFT
Early Experience (Zhang et al., 2025k)ExperientialTool Calling, Embodied Simulation, Reasoning, WebSFT
(c) Post-Train Phase
Character-LM (Shao et al., 2023)FactualRole PlayingSFT
CharacterGLM (Zhou et al., 2024a)FactualRole PlayingSFT
SELF-PARAM (Wang et al., 2025o)FactualQA, RecommendationKL Tuning
Room (Kim et al., 2023b)ExperientialEmbodied TaskRL
KnowledgeEditor (Cao et al., 2021)FactualQA, Fact CheckingFT
Mend (Mitchell et al., 2022)FactualQA, Fact Checking, Model EditingFT
PersonalityEdit (Mao et al., 2024)FactualQA, Model EditingFT, PE
APP (Ma et al., 2024)FactualQAFT
DINM (Wang et al., 2024c)ExperientialQA, DetoxificationFT
AlphaEdit (Fang et al., 2025c)FactualQAFT
II. External Parametric Memory
(a) Adapter-based Modules
MLP-Memory (Wei et al., 2025d)FactualQA, Classification, Textual EntailmentSFT
K-Adapter (Wang et al., 2021)FactualQA, Entity Typing, ClassificationSFT
WISE (Wang et al., 2024e)FactualQA, Hallucination DetectionSFT
ELDER (Li et al., 2025d)FactualModel EditingSFT
T-Patcher (Huang et al., 2023)FactualQAFT
Sparse Memory FT (Lin et al., 2025a)FactualQASFT
Memory Decoder (Cao et al., 2025a)FactualQA, Language ModelingSFT
MemLoRA (Bini et al., 2025)FactualQASFT
(b) Auxiliary LM-based Modules
MAC (Tack et al., 2024)FactualQASFT
Retroformer (Yao et al., 2024a)ExperientialQA, Web NavigationRL
Method | Form | Type | Task
I. Generate
(a) Single Modal
Gist (Mu et al., 2023)Gist TokensWorkingLong-context Compression
Taking a Deep Breath (Luo et al., 2024)Sentinel TokensWorkingLong-context QA
SoftCoT (Xu et al., 2025d)Soft TokensWorkingReasoning
CARE (Choi et al., 2025)Memory TokensWorkingQA, Fact Checking
AutoCompressor (Chevalier et al., 2023)Summary VectorsWorkingQA, Compression
MemoRAG (Qian et al., 2025)Global Semantic StatesWorkingQA, Summary
MemoryLLM (Wang et al., 2024j)Persistent TokensFactualLong-conv QA, Model Editing
M+ (Wang et al., 2025n)Cross-layer Token PoolsFactualQA
LM2 (Kang et al., 2025b)Matrix SlotsWorkingQA, Reasoning
Titans (Behrouz et al., 2025b)Neural Weights (MLP)WorkingQA, Language Modeling
MemGen (Zhang et al., 2025d)LoRA FragmentsWorking, Exp.QA, Math, Code, Embodied Task, Reasoning
EMU (Na et al., 2024)Embeddings w/ ReturnsFactualGame
TokMem (Wu et al., 2025j)Memory TokensExp.Function calling
Nested Learning (Behrouz et al., 2025a)Nested OptimizationFactualLanguage Modeling
Memoria (Park and Bak, 2024)Three memory layers with engramsFactualLanguage Modeling
(b) Multi-Modal
CoMem (Wu et al., 2025d)Multimodal EmbeddingsFactualMultimodal QA
ACM (Wu et al., 2025e)Trajectory EmbeddingsWorkingWeb
Time-VLM (Zhong et al., 2025)Patch EmbeddingsWorkingVideo Understanding
Mem Augmented RL (Mezghani et al., 2022)Novelty State EncoderWorkingVisual Navigation
MemoryVLA (Shi et al., 2025a)Perceptual StatesFactual, WorkingEmbodied Task
XMem (Cheng and Schwing, 2022)Key-Value EmbeddingsWorkingVideo Segmentation
II. Reuse
Memorizing Transformers (Wu et al., 2022)External KV CacheWorkingLanguage Modeling
SirLLM (Yao et al., 2024b)Entropy-selected KVFactualLong-conv QA
Memory 3 (Yang et al., 2024a)Critical KV PairsFactualQA
FOT (Tworkowski et al., 2023)Memory-Attention KVWorkingQA, Few-shot Learning, Language Modeling
LONGMEM (Wang et al., 2023b)Residual SideNet KVWorkingLanguage Modeling and Understanding
III. Transform
Scissorhands (Liu et al., 2023b)Pruned KVWorkingImage classification & generation
SnapKV (Li et al., 2024b)Aggregated Prefix KVWorkingLanguage Modeling
PyramidKV (Cai et al., 2024)Layer-wise BudgetWorkingLanguage Modeling
RazorAttention (Tang et al., 2025a)Compensated WindowWorkingLanguage Modeling
H2O (Zhang et al., 2023)Heavy Hitter TokensWorkingQA, Language Modeling
R3Mem (Wang et al., 2025k)Virtual memory tokens with reversible compressionWorkingQA, Language Modeling
Method | Carrier | Structure | Task | Optimization
I. User Factual Memory
(a) Dialogue Coherence
MemGPT (Packer et al., 2023b)Token-level1DLong-term dialoguePE
TiM (Liu et al., 2023a)Token-level2DQAPE
MemoryBank (Zhong et al., 2024)Token-level1DEmotional CompanionPE
AI Persona (Wang et al., 2024f)Token-level1DEmotional CompanionPE
Encode-Store-Retrieve (Shen et al., 2024)Token-level1DMultimodal QAPE
Method | Carrier | Form | Task | Optimization
Livia (Xi and Wang, 2025)Token-level1DEmotional CompanionPE
mem0 (Chhikara et al., 2025)Token-level1DLong-term dialogue, QAPE
RMM (Tan et al., 2025c)Token-level2DPersonalizationPE, RL
D-SMART (Lei et al., 2025)Token-level2DReasoningPE
Comedy (Chen et al., 2025d)Token-level1DSummary, Compression, QAPE
MEMENTO (Kwon et al., 2025)Token-level1DEmbodied, PersonalizationPE
O-Mem (Wang et al., 2025g)Token-level3DPersonalized DialoguePE
DAM-LLM (Lu and Li, 2025)Token-level1DEmotional CompanionPE
MemInsight (Salama et al., 2025)Token-level1DPersonalized DialoguePE
EMem (Zhou and Han, 2025a)Token-level1DPersonalized DialoguePE
RGMem (Tian et al., 2025a)Token-level1DLong-conv QAPE
Memoria (Sarin et al., 2025)Token-level1DLong-conv QAPE
(b) Goal Consistency
RecurrentGPT (Zhou et al., 2023b)Token-level1DLong-Context Generation, Personalized Interactive FictionPE
Memolet (Yen and Zhao, 2024)Token-level2DQA, Document ReasoningPE
MemGuide (Du et al., 2025b)Token-level1DLong-conv QAPE, SFT
SGMem (Wu et al., 2025h)Token-level2DLong-contextPE
A-Mem (Xu et al., 2025c)Token-level2DQA, ReasoningPE
M3-agent (Long et al., 2025)Token-level2DMultimodal QAPE, SFT
WorldMM (Yeo et al., 2025)Token-level1DMultimodal QAPE
EverMemOS (Hu et al., 2026a)Token-level1DLong-conv QAPE
II. Environment Factual Memory
(a) Knowledge Persistence
MemGPT (Packer et al., 2023b)Token-level1DDocument QAPE
CALYPSO (Zhu et al., 2023)Token-level1DTabletop GamingPE
AriGraph (Anokhin et al., 2024)Token-level3DGame, Multi-op QAPE
HippoRAG (Gutierrez et al., 2024)Token-level3DQAPE
WISE (Wang et al., 2024e)Parametric/Document Reasoning, QASFT
MemoryLLM (Wang et al., 2024j)Parametric/Document ReasoningSFT
Memoria (Park and Bak, 2024)Latent/Language ModelingPE
Zep (Rasmussen et al., 2025)Token-level3DDocument analysisPE
MemTree (Rezazadeh et al., 2025c)Token-level2DDocument Reasoning, Dia- loguePE
LMLM (Zhao et al., 2025b)Token-level1DQASFT
M+ (Wang et al., 2025n)Latent/Document Reasoning, QASFT
CAM (Li et al., 2025g)Token-level3DMulti-hop QASFT, RFT
MemAct (Zhang et al., 2025r)Token-level1DMulti-obj QARL
Mem-α (Wang et al., 2025p)Token-level1DDocument ReasoningRL
WebWeaver (Li et al., 2025m)Token-level1DDeep ResearchSFT
MemLoRA (Bini et al., 2025)Parametric/QASFT
Memory Decoder (Cao et al., 2025a)Parametric/QA, Language ModelingSFT
(b) Shared Access
GameGPT (Chen et al., 2023b)Token-level1DGame DevelopmentPE
Generative Agent (Park et al., 2023)Token-level2DSocial SimulationPE
S³ (Gao et al., 2023a)Token-level1DSocial SimulationPE
Memory Sharing (Gao and Zhang, 2024a)Token-level1DDocument ReasoningPE
MetaGPT (Hong et al., 2024)Token-level1DSoftware DevelopmentPE
G-Memory (Zhang et al., 2025e)Token-level3DQAPE
OASIS (Yang et al., 2025)Token-level, Parametric1DSocial SimulationPE
Method | Carrier | Form | Task | Optimization
I. Case-based Memory
Expel (Zhao et al., 2024)Token-levelSolutionReasoningPE
Synapse (Zheng et al., 2024a)Token-levelSolutionWeb Interaction, Instruction-guided Web TaskPE
Fincon (Yu et al., 2024)Token-levelSolutionFinancialPE
MapCoder (Islam et al., 2024)Token-levelSolutionCodingPE
Memento (Zhou et al., 2025a)Token-levelTrajectoryReasoningRL
COLA (Zhao et al., 2025a)Token-levelTrajectoryGUI, Web Navigation, ReasoningPE
Continuous Memory (Wu et al., 2025e)LatentTrajectoryGUISFT
JARVIS-1 (Wang et al., 2025q)Token-levelTrajectoryGame, GUI InteractionPE
MemGen (Zhang et al., 2025d)LatentTrajectoryWeb Search, Embodied Simulation, Reasoning, Math, CodeRL, SFT
Early Experience (Zhang et al., 2025k)ParametricTrajectoryEmbodied Simulation, Reasoning, Web NavigationSFT
DreamGym (Chen et al., 2025f)Token-levelTrajectoryWeb Interaction, Embodied Simulation, ShoppingRL
MemRL (Zhang et al., 2026)Token-levelTrajectoryCoding, Embodied Simulation, ReasoningRL
II. Strategy-based Memory
Reflexion (Shinn et al., 2023a)Token-levelInsightEmbodied Simulation, Reasoning, CodingPE
Buffer of Thoughts (Yang et al., 2024b)Token-levelPatternGame, Reasoning, CodingPE
AWM (Wang et al., 2024m)Token-levelWorkflowWeb Interaction, Instruction-guided Web TaskPE
RecMind (Wang et al., 2024h)Token-levelPatternRecommendationPE
H2R (Ye et al., 2025b)Token-levelInsightGame, Embodied SimulationPE
ReasoningBank (Ouyang et al., 2025)Token-levelInsightWeb Interaction, Instruction-guided Web TaskPE
R2D2 (Huang et al., 2025c)Token-levelInsightWeb InteractionPE
Method | Carrier | Form | Task | Optimization
BrowserAgent (Yu et al., 2025d)Token-levelInsightGeneral QA, Web searchRL, SFT
Agent KB (Tang et al., 2025d)Token-levelWorkflowCode, ReasoningPE
ToolMem (Xiao et al., 2025b)Token-levelInsightReasoning, Image GenerationPE
PRINCIPLES (Kim et al., 2025a)Token-levelPatternEmotional CompanionPE
SE-Agent (Sun et al., 2025c)Token-levelInsightCodingPE
ACE (Zhang et al., 2025n)Token-levelInsightCoding, Tool calling, FinancialPE
Flex (Cai et al., 2025c)Token-levelInsightMath, Chemistry, BiologyPE
AgentEvolver (Zhai et al., 2025)ParametricPatternTool-augmented TaskRL
Dynamic Cheatsheet (Suzgun et al., 2025)Token-levelInsightMath, Reasoning, GamePE
Training-Free GRPO (Cai et al., 2025b)Token-levelInsightMath, Reasoning, Web SearchPE
MemEvolve (Zhang et al., 2025h)Token-levelSolution,InsightWeb Search, ReasoningPE
III. Skill-based Memory
CREATOR (Qian et al., 2023)Token-levelFunction and ScriptReasoning, MathPE
Gorilla (Patil et al., 2024)Token-levelAPITool callingSFT
ToolRerank (Zheng et al., 2024b)Token-levelAPITool callingPE
Voyager (Wang et al., 2024b)Token-levelCode SnippetGamePE
RepairAgent (Bouzenia et al., 2024)Token-levelFunction and ScriptCodingPE
COLT (Qu et al., 2024)Token-levelAPITool callingSFT
ToolLLM (Qin et al., 2024a)Token-levelAPITool CallingSFT
LEGOMem (Han et al., 2025a)Token-levelFunction and ScriptOfficePE
Darwin Gödel Machine (Zhang et al., 2025i)Token-levelCode SnippetCodePE
Huxley-Gödel Machine (Wang et al., 2025j)Token-levelCode SnippetCodePE
Memp (Fang et al., 2025d)Token-levelFunction and ScriptEmbodied Simulation, Travel PlanningPE
SkillWeaver (Zheng et al., 2025a)Token-levelFunction and ScriptWeb Interaction, Instruction-guided Web TaskPE
Alita (Qiu et al., 2025c)Token-levelMCPMath, Reasoning, VQAPE
Alita-G (Qiu et al., 2025b)Token-levelMCPMath, Reasoning, VQAPE
LearnAct (Liu et al., 2025b)Token-levelFunction and ScriptMobile GUIPE
ToolGen (Wang et al., 2025i)ParametricAPITool callingSFT
MemTool (Lumer et al., 2025)Token-levelMCPTool callingSFT
ToolRet (Shi et al., 2025c)Token-levelAPIWeb, Code, Tool RetrievalSFT
DRAFT (Qu et al., 2025a)Token-levelAPITool callingPE
ASI (Wang et al., 2025s)Token-levelFunctions and ScriptsWeb InteractionPE
Method | Carrier | Task | Optimization
I. Single-turn Working Memory
(a) Input Condensation
Gist (Mu et al., 2023)LatentInstruction Fine-tuningSFT
ICAE (Ge et al., 2024)LatentLanguage Modeling, Instruction Fine-tuningPretrain, LoRA
AutoCompressors (Chevalier et al., 2023)LatentLanguage ModelingSFT
LLMLingua (Jiang et al., 2023)Token-levelReasoning, Conversation, SummarizationPE
LongLLMLingua (Jiang et al., 2024)Token-levelMulti-doc QA, Long-context, Multi-hop QAPE
CompAct (Yoon et al., 2024)Token-levelDocument QASFT
HyCo2 (Liao et al., 2025a)HybridSummarization, Open-domain QA, Multi-hop QASFT
Sentence-Anchor (Tarasov et al., 2025)LatentDocument QASFT
MELODI (Chen et al., 2024c)HybridPretrainingPretrain
R3Mem (Wang et al., 2025k)LatentDocument QA, Language ModelingPEFT
(b) Observation Abstraction
Synapse (Zheng et al., 2024a)Token-levelComputer Control, Web NavigationPE
VideoAgent (Wang et al., 2024g)Token-levelLong-term Video UnderstandingPE
MA-LMM (He et al., 2024)LatentLong-term Video UnderstandingSFT
Context as Memory (Yu et al., 2025b)Token-levelLong-term Video GenerationPE
II. Multi-turn Working Memory
(c) State Consolidation
MEM1 (Zhou et al., 2025b) | Latent | Retrieval, Open-domain QA, Shopping | RL
MemGen (Zhang et al., 2025d) | Latent | Reasoning, Embodied Action, Web Search, Coding | RL
MemAgent (Yu et al., 2025a) | Token-level | Long-term Doc. QA | RL
ReMemAgent (Shi et al., 2025b) | Token-level | Long-term Doc. QA | RL
ReSum (Wu et al., 2025f) | Token-level | Long-horizon Web Search | RL
MemSearcher (Yuan et al., 2025a) | Token-level | Multi-hop QA | SFT, RL
ACON (Kang et al., 2025c) | Token-level | App use, Multi-objective QA | PE
IterResearch (Chen et al., 2025b) | Token-level | Reasoning, Web Navigation, Long-Horizon QA | RL
SUPO (Lu et al., 2025a) | Token-level | Long-horizon task | RL
AgentDiet (Xiao et al., 2025a) | Token-level | Long-horizon task | PE
SUMER (Zheng et al., 2025c) | Token-level | QA | RL
Sculptor (Li et al., 2025f) | Token-level | Multi-Needle QA | PE, RL
AgeMem (Yu et al., 2026) | Token-level | QA, Embodied Action | PE, RL
(d) Hierarchical Folding
HiAgent (Hu et al., 2025a) | Token-level | Long-horizon Agent Task | PE
Context-Folding (Sun et al., 2025b) | Token-level | Deep Research, SWE | RL
AgentFold (Ye et al., 2025a) | Token-level | Web Search | SFT
DeepAgent (Li et al., 2025i) | Token-level | Tool Use, Shopping, Reasoning | RL
(e) Cognitive Planning
SayPlan (Rana et al., 2023) | Token-level | 3D Scene Graph, Robotics | PE
KARMA (Wang et al., 2025r) | Token-level | Household | PE
Agent-S (Agashe et al., 2025) | Token-level | Computer Use | PE
PRIME (Tran et al., 2025) | Token-level | Multi-hop QA, Knowledge-intensive Reasoning | PE
Method | Sub-Type | Representation Form | Key Mechanism
I. Semantic Summarization
MemGPT (Packer et al., 2023a) | Incremental | Textual Summary | Merging new chunks into the working context
Mem0 (Chhikara et al., 2025) | Incremental | Textual Summary | LLM-driven summarization
Mem1 (Zhou et al., 2025b) | Incremental | Textual Summary | RL-optimized summarization (PPO)
MemAgent (Yu et al., 2025a) | Incremental | Textual Summary | RL-optimized summarization (GRPO)
MemoryBank (Zhong et al., 2024) | Partitioned | Textual Summary | Daily/Session-based segmentation
ReadAgent (Lee et al., 2024a) | Partitioned | Textual Summary | Semantic clustering before summarization
LightMem (Fang et al., 2025b) | Partitioned | Textual Summary | Topic-clustered summarization
DeepSeek-OCR (Wei et al., 2025a) | Partitioned | Visual Token Mapping | Optical 2D mapping compression
FDVS (You et al., 2024) | Partitioned | Multimodal Summary | Multi-source signal integration (Subtitle/Object)
LangRepo (Kahatapitiya et al., 2025) | Partitioned | Multimodal Summary | Hierarchical video clip aggregation
II. Knowledge Distillation
TiM (Liu et al., 2023a) | Factual | Textual Insight | Abstraction of dialogue into thoughts
RMM (Tan et al., 2025b) | Factual | Topic Insight | Abstraction of dialogue into topic-based memory
MemGuide (Du et al., 2025b) | Factual | User Intent | Capturing high-level user intent
M3-Agent (Long et al., 2025) | Factual | Text-addressable Facts | Compressing egocentric visual observations
AWM (Wang et al., 2024m) | Experiential | Workflow Patterns | Workflow extraction from success trajectories
KGT (Sun et al., 2024) | Entity-Level | User Graph | Encoding user preferences as nodes/edges
Mem0^g (Chhikara et al., 2025) | Entity-Level | Knowledge Graph | LLM-based entity and triplet extraction
D-SMART (Lei et al., 2025) | Entity-Level | Dynamic Memory Graph | Constructing an OWL-compliant graph
GraphRAG (Edge et al., 2025) | Entity-Level | Hierarchical KG | Community detection and iterative summarization
AriGraph (Anokhin et al., 2024) | Entity-Level | Semantic+Episodic Graph | Dual-layer (Semantic nodes + Episodic links)
Zep (Rasmussen et al., 2025) | Entity-Level | Temporal KG | 3-layer graph (Episodic, Semantic, Community)
RAPTOR (Sarthi et al., 2024) | Chunk-Level | Tree Structure | Recursive GMM clustering and summarization
MemTree (Rezazadeh et al., 2025c) | Chunk-Level | Tree Structure | Bottom-up insertion and summary updates
H-MEM (Sun and Zeng, 2025) | Chunk-Level | Hierarchical JSON | Top-down 4-level hierarchy organization
A-MEM (Xu et al., 2025c) | Chunk-Level | Networked Notes | Discrete notes with semantic links
PREMem (Kim et al., 2025b) | Chunk-Level | Reasoning Patterns | Cross-session reasoning pattern clustering
CAM (Li et al., 2025g) | Chunk-Level | Hierarchical Graph | Disentangling overlapping clusters via replication
G-Memory (Zhang et al., 2025c) | Chunk-Level | Hierarchical Graph | 3-tier graph (interaction, query, insight)
IV. Latent Representation
MemoryLLM (Wang et al., 2024j) | Textual | Latent Vector | Self-updatable latent embeddings
M+ (Wang et al., 2025n) | Textual | Latent Vector | Cross-layer long-term memory tokens
MemGen (Zhang et al., 2025d) | Textual | Latent Token | Latent memory trigger and weaver
ESR (Shen et al., 2024) | Multimodal | Latent Vector | Video-to-Language-to-Vector encoding
CoMEM (Wu et al., 2025d) | Multimodal | Continuous Embedding | Vision-language compression via Q-Former
Mem2Ego (Zhang et al., 2025m) | Multimodal | Multimodal Embedding | Embedding landmark semantics as latent memory
KARMA (Wang et al., 2025r) | Multimodal | Multimodal Embedding | Hybrid long/short-term memory encoding
V. Parametric Memory Internalization
MEND (Mitchell et al., 2022) | Knowledge | Gradient Decomposition | Auxiliary network for fast edits
ROME (Meng et al., 2022) | Knowledge | Model Parameters | Causal tracing and rank-one update
MEMIT (Meng et al., 2023) | Knowledge | Model Parameters | Mass-editing via residual distribution
CoLoR (Wistuba et al., 2023) | Knowledge | LoRA Parameters | Low-rank adapter training
ToolFormer (Schick et al., 2023) | Capability | Model Parameters | Supervised fine-tuning on API calls
Name | Link | Fac. | Exp. | MM. | Env. | Feature | Scale
Memory/Lifelong-learning/Self-evolving-oriented Benchmarks
MemBench | GitHub | simulated | interactive scenarios | 53,000 s.
MemoryAgentBench | GitHub | simulated | multi-turn interactions | 4 t.
LoCoMo | Website | real | conversational memory | 300 s.
WebChoreArena | GitHub | real | tedious web browsing | 4 t. / 532 s.
MT-Mind2Web | GitHub | real | conversational web navigation | 720 s.
PersonaMem | Website | simulated | dynamic user profiling | 15 t. / 180 s.
LongMemEval | GitHub | simulated | interactive memory | 5 t. / 500 s.
PerLTQA | Website | simulated | social personalized interactions | 8,593 s.
MemoryBank | Website | simulated | user memory updating | 194 s.
MPR | GitHub | simulated | user personalization | 108,000 s.
PrefEval | Website | simulated | personal preferences | 3,000 s.
LOCCO | Website | simulated | chronological conversations | 3,080 s.
StoryBench | Website | mixed | interactive fiction games | 3 t.
MemoryBench | Website | simulated | continual learning | 4 t. / ~20,000 s.
Madial-Bench | GitHub | simulated | memory recalling | 331 s.
Evo-Memory | Website | simulated | test-time learning | 10 t. / ~3,700 s.
LifelongAgentBench | Website | simulated | lifelong learning | 1,396 s.
StreamBench | Website | simulated | continuous online learning | 9,702 s.
DialSim | Website | real | multi-dialogue understanding | ~1,300 s.
LongBench | Website | mixed | long-context understanding | 21 t. / 4,750 s.
LongBench v2 | Website | mixed | long-context multitasks | 20 t. / 503 s.
RULER | GitHub | simulated | long-context retrieval | 13 t.
BABILong | GitHub | simulated | long-context reasoning | 20 t.
MM-Needle | Website | simulated | multimodal long-context retrieval | ~280,000 s.
HaluMem | GitHub | simulated | memory hallucinations | 3,467 s.
HotpotQA | Website | simulated | long-context QA | 113,000 s.
Other Related Benchmarks
ALFWorld | Website | simulated | text-based embodied environment | 3,353 t.
ScienceWorld | GitHub | simulated | interactive embodied environment | 10 t. / 30 t.
AgentGym | Website | mixed | multiple environments | 89 t. / 20,509 s.
AgentBoard | GitHub | mixed | multi-round interaction | 9 t. / 1,013 s.
PDDL* | Website | simulated | strategy game | -
BabyAI | Website | simulated | language learning | 19 t.
WebShop | Website | simulated | e-commerce web interaction | 12,087 s.
WebArena | Website | real | web interaction | 812 s.
MMInA | Website | real | multihop web interaction | 1,050 s.
SWE-Bench Verified | Website | real | code repair | 500 s.
GAIA | Website | real | human-level deep research | 466 s.
xBench-DS | Website | real | deep-search evaluation | 100 s.
ToolBench | GitHub | real | API tool use | 126,486 s.
GenAI-Bench | Website | real | visual generation evaluation | ~40,000 s.
Framework | Links | Fac. | Exp. | MM. | Structure | Evaluation
MemGPT | GitHub / Website | hierarchical (S/LTM) | LoCoMo
Mem0 | GitHub / Website | graph + vector | LoCoMo
Memobase | GitHub / Website | structured profiles | LoCoMo
MIRIX | GitHub / Website | structured memory | LoCoMo, MemoryAgentBench
MemoryOS | GitHub / Website | hierarchical (S/M/LTM) | LoCoMo, MemoryBank
MemOS | GitHub / Website | tree memory + memcube | LoCoMo, PrefEval, LongMemEval, PersonaMem
Zep | GitHub / Website | temporal knowledge graph | LongMemEval
LangMem | GitHub / Website | core API + manager | -
SuperMemory | GitHub / Website | vector + semantic | -
Cognee | GitHub / Website | knowledge graph | -
Memary | GitHub / Website | stream + entity store | -
Pinecone | GitHub / Website | vector database | -
Chroma | GitHub / Website | vector database | -
Weaviate | GitHub / Website | vector + graph | -
Second Me | GitHub / Website | agent ego | -
MemU | GitHub / Website | hierarchical layers | -
MemEngine | GitHub | modular space | -
Memori | GitHub / Website | memory database | -
ReMe | GitHub / Website | memory management | -
AgentMemory | GitHub / Website | memory management | -
MineContext | GitHub / Website | context engineering | -
Acontext | GitHub | context engineering + skill learning | -
PowerMem | GitHub | OceanBase | -
ReMe | GitHub | AgentScope | BFCL, AppWorld
HindSight | GitHub | parallel retrieval + reflection | -

Figure 3 Taxonomy of token-level memory organized by topological complexity and dimensionality: (a) Flat Memory (1D) stores information as linear sequences or independent clusters without explicit inter-unit topology, commonly used for Chunk sets, Dialogue logs, and Experience pools. (b) Planar Memory (2D) introduces a single-layer structured layout where units are linked via Tree or Graph structures to capture relational dependencies, supporting diverse node types such as images and chat records. (c) Hierarchical Memory (3D) employs multi-level forms, such as Pyramids or Multi-layer graphs, to facilitate vertical abstraction and cross-layer reasoning between different data granularities, such as raw docs and synthesized QAs.
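The structural distinction in the caption above can be made concrete with a small sketch. This is our own illustrative code, not an implementation from any surveyed system: `FlatMemory` and `HierarchicalMemory` are hypothetical names, keyword matching stands in for embedding retrieval, and a first-sentence heuristic stands in for LLM summarization.

```python
class FlatMemory:
    """Flat (1D) memory: a linear pool of chunks with no inter-unit topology."""
    def __init__(self):
        self.chunks = []

    def write(self, chunk: str):
        self.chunks.append(chunk)

    def retrieve(self, keyword: str):
        # Naive keyword scan stands in for embedding-based retrieval.
        return [c for c in self.chunks if keyword in c]


class HierarchicalMemory:
    """Hierarchical (3D) memory: raw units at level 0, abstractions above."""
    def __init__(self, levels: int = 2):
        self.levels = [[] for _ in range(levels)]

    def write(self, chunk: str):
        # Level 0 keeps the raw chunk; level 1 keeps a cheap abstraction
        # (here, the first sentence, standing in for an LLM-written summary).
        self.levels[0].append(chunk)
        self.levels[1].append(chunk.split(".")[0])

    def retrieve(self, keyword: str):
        # Coarse-to-fine: match against summaries, then fetch the raw chunks.
        hits = [i for i, s in enumerate(self.levels[1]) if keyword in s]
        return [self.levels[0][i] for i in hits]
```

The hierarchical variant trades recall for cost: retrieval scans only the compact upper level, so details absent from the abstraction (e.g., "oolong" below) become invisible, which is exactly the vertical-abstraction trade-off the figure depicts.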

Figure 6 The functional taxonomy of agent memory. We organize memory capabilities based on their functions (purpose) into three primary pillars spanning two temporal domains: (1) Factual Memory serves as a persistent declarative knowledge base to ensure interaction consistency, coherence, and adaptability; (2) Experiential Memory encapsulates procedural knowledge to enable continual learning and self-evolution across episodes; and (3) Working Memory provides mechanisms for the active management of transient context.
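The three pillars can be sketched as a minimal data model. This is an illustrative sketch under our own assumptions; the class names, fields, and eviction rule are hypothetical, not an API defined by the survey.

```python
from dataclasses import dataclass, field


@dataclass
class FactualMemory:
    """Persistent declarative knowledge (user facts, world knowledge)."""
    facts: dict = field(default_factory=dict)

    def remember(self, key, value):
        self.facts[key] = value


@dataclass
class ExperientialMemory:
    """Procedural knowledge distilled across episodes (e.g., workflows)."""
    workflows: list = field(default_factory=list)

    def distill(self, trajectory):
        # Stand-in for extracting a reusable pattern from a successful run.
        self.workflows.append({"steps": len(trajectory), "pattern": trajectory})


@dataclass
class WorkingMemory:
    """Transient context actively managed within the current task."""
    budget: int = 3
    buffer: list = field(default_factory=list)

    def push(self, item):
        self.buffer.append(item)
        if len(self.buffer) > self.budget:  # evict oldest when over budget
            self.buffer.pop(0)
```

The split mirrors the caption: factual memory persists across the two temporal domains, experiential memory grows only at episode boundaries, and working memory is the sole component with an explicit capacity policy.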

Figure 11 The evolution of RL-enabled agent memory systems. A conceptual progression from RL-free memory systems based on heuristic or prompt-driven pipelines, to partially RL-involved designs where reinforcement learning governs selected memory operations, and finally to fully RL-driven memory systems in which memory architectures and control policies are learned end-to-end. This evolution reflects a broader paradigm shift from manually engineered memory pipelines toward model-native, self-optimizing memory management in LLM-based agents.
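A toy sketch of the shift the figure describes: the memory loop stays fixed while the operation-selection policy becomes pluggable, so a learned (RL-trained) policy can replace the hand-written heuristic. All names here are hypothetical, the heuristic thresholds are arbitrary, and truncation stands in for an LLM summary.

```python
def heuristic_policy(item: str, turn: int) -> str:
    """RL-free baseline: fixed rules decide the memory operation."""
    if len(item) > 40:
        return "summarize"
    return "keep" if turn % 2 == 0 else "drop"


def run_memory_loop(items, policy):
    """Apply a per-item memory operation chosen by `policy`.

    In a fully RL-driven system, `policy` would be a trained network whose
    parameters are optimized end-to-end against task reward; here the same
    interface is served by a heuristic stub.
    """
    memory = []
    for turn, item in enumerate(items):
        op = policy(item, turn)
        if op == "keep":
            memory.append(item)
        elif op == "summarize":
            memory.append(item[:20] + "...")  # stand-in for an LLM summary
        # "drop" discards the item entirely
    return memory
```

Because the loop only depends on the `policy(item, turn) -> op` interface, moving from "RL-free" to "partially RL-involved" to "fully RL-driven" amounts to swapping in progressively more learned components without rewriting the pipeline.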

References

[dinm] Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen. (2024). Detoxifying Large Language Models via Knowledge Editing. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024. doi:10.18653/V1/2024.ACL-LONG.171.

[app] Jun-Yu Ma, et al. (2024). Neighboring Perturbations of Knowledge Editing on Large Language Models. Forty-first International Conference on Machine Learning, ICML 2024.

[personality-edit] Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, Ningyu Zhang. (2024). Editing Personality For Large Language Models. Natural Language Processing and Chinese Computing - 13th National CCF Conference, NLPCC 2024. doi:10.1007/978-981-97-9434-8_19.

[mend] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning. (2022). Fast Model Editing at Scale. The Tenth International Conference on Learning Representations, ICLR 2022.

[attention-sink] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. (2024). Efficient Streaming Language Models with Attention Sinks. The Twelfth International Conference on Learning Representations, ICLR 2024.

[Rasmussen2025Zep] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. CoRR. doi:10.48550/ARXIV.2501.13956.

[lightening-attention-2] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong. (2024). Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models. CoRR. doi:10.48550/ARXIV.2401.04658.

[lightening-attention] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong. (2024). Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention. Forty-first International Conference on Machine Learning, ICML 2024.

[flash-attention-3] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[flash-attention-2] Tri Dao. (2024). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. The Twelfth International Conference on Learning Representations, ICLR 2024.

[DBLP:conf/iclr/ChoromanskiLDSG21] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, et al. (2021). Rethinking Attention with Performers. 9th International Conference on Learning Representations, ICLR 2021.

[big-bird] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, et al. (2020). Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[longformer] Qiuhui Chen, Qiang Fu, Hao Bai, Yi Hong. (2024). LongFormer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs. IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024. doi:10.1109/WACV57701.2024.00354.

[machine-memory-system] Taewoon Kim, Michael Cochez, Vincent François-Lavet, et al. (2023). A Machine with Short-Term, Episodic, and Semantic Memory Systems. Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023. doi:10.1609/AAAI.V37I1.25075.

[self-param] Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O'Brien, Junda Wu, Julian J. McAuley. (2025). Self-Updatable Large Language Models by Integrating Context into Model Parameters. The Thirteenth International Conference on Learning Representations, {ICLR.

[zhou2023agents] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan. (2023). Agents: An Open-source Framework for Autonomous Language Agents. CoRR. doi:10.48550/ARXIV.2309.07870.

[zhu-etal-2025-oagents] He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Hanhao Li, Yi Yao, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, et al. (2025). OAgents: An Empirical Study of Building Effective Agents. Findings of the Association for Computational Linguistics: EMNLP 2025.

[qin2025flashsearcherfasteffectiveweb] Tianrui Qin, Qianben Chen, Sinuo Wang, He Xing, King Zhu, He Zhu, Dingfeng Shi, Xinxin Liu, Ge Zhang, Jiaheng Liu, Yuchen Eleanor Jiang, Xitong Gao, Wangchunshu Zhou. (2025). Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution.

[liang2025personalizeddeepresearchbenchmarks] Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou. (2025). Towards Personalized Deep Research: Benchmarks and Evaluations.

[li2025chainofagentsendtoendagentfoundation] Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou. (2025). Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL.

[zhou2023recurrentgpt] Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, Mrinmaya Sachan. (2023). RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text.

[wang2024aipersona] Tiannan Wang, Meiling Tao, Ruoyu Fang, Huilin Wang, Shuai Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou. (2024). AI PERSONA: Towards Life-long Personalization of LLMs.

[zhou2024agents2] Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang. (2024). Symbolic Learning Enables Self-Evolving Agents. CoRR. doi:10.48550/ARXIV.2406.18532.

[liu2025evovlaselfevolvingvisionlanguageactionmodel] Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang. (2025). EvoVLA: Self-Evolving Vision-Language-Action Model.

[wang2025omem] Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, Wangchunshu Zhou. (2025). O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents.

[mlp-memory] Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin. (2025). MLP Memory. CoRR. doi:10.48550/ARXIV.2508.01832.

[pt-hier-memory] Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel. (2025). Pretraining with hierarchical memories: separating long-tail and common knowledge.

[character-lm] Yunfan Shao, Linyang Li, Junqi Dai, Xipeng Qiu. (2023). Character-LLM: A Trainable Agent for Role-Playing. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023. doi:10.18653/V1/2023.EMNLP-MAIN.814.

[characterGLM] Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Pei Ke, Guanqun Bi, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang. (2024). CharacterGLM: Customizing Social Characters with Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024 Industry Track. doi:10.18653/V1/2024.EMNLP-INDUSTRY.107.

[agent-early-experience] Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu. (2025). Agent Learning via Early Experience.

[agent-founder] Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou. (2025). Scaling Agents via Continual Pre-training. CoRR. doi:10.48550/ARXIV.2509.13310.

[amortize-context] Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, Jonathan Richard Schwarz. (2024). Online Adaptation of Language Models with a Memory of Amortized Contexts. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[liu2025advances] Liu, Bang, Li, Xinfeng, Zhang, Jiayi, Wang, Jinlin, He, Tanjin, Hong, Sirui, Liu, Hongzhang, Zhang, Shaokun, Song, Kaitao, Zhu, Kunlun, others. (2025). Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990.

[wei2025aiscienceagenticscience] Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Ming Hu, Chenglong Ma, Shixiang Tang, Junjun He, Chunfeng Song, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou. (2025). From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery.

[toollearningsurvey] Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen. (2025). Reinforcement Learning for Reasoning in Large Language Models with One Training Example. Frontiers of Computer Science. doi:10.1007/s11704-024-40678-2.

[fang2025comprehensivesurveyselfevolvingai] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng. (2025). A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems.

[durante2024agentaisurveyinghorizons] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao. (2024). Agent AI: Surveying the Horizons of Multimodal Interaction.

[wang2024agentssoftwareengineeringsurvey] Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng. (2024). Agents in Software Engineering: Survey, Landscape, and Vision.

[zhang2025deepresearchsurveyautonomous] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, Xiangyu Zhao. (2025). Deep Research: A Survey of Autonomous Research Agents.

[xu2025comprehensivesurveydeepresearch] Renjun Xu, Jingwen Peng. (2025). A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications.

[luo2025largelanguagemodelagent] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang. (2025). Large Language Model Agent: A Survey on Methodology, Applications and Challenges.

[zhao2025absolute] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang. (2025). Absolute Zero: Reinforced Self-play Reasoning with Zero Data. arXiv preprint arXiv:2505.03335.

[minaee2025largelanguagemodelssurvey] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao. (2025). Large Language Models: A Survey.

[matarazzo2025surveylargelanguagemodels] Andrea Matarazzo, Riccardo Torlone. (2025). A Survey on Large Language Models with some Insights on their Capabilities and Limitations.

[huang2025rzero] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu. (2025). R-Zero: Self-Evolving Reasoning LLM from Zero Data. arXiv preprint arXiv:2508.05004.

[shao2024deepseekmath] Shao, Zhihong, Wang, Peiyi, Zhu, Qihao, Xu, Runxin, Song, Junxiao, Bi, Xiao, Zhang, Haowei, Zhang, Mingchuan, Li, YK, Wu, Yang, others. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[wu2024autogen] Wu, Qingyun, Bansal, Gagan, Zhang, Jieyu, Wu, Yiran, Li, Beibin, Zhu, Erkang, Jiang, Li, Zhang, Xiaoyun, Zhang, Shaokun, Liu, Jiale, others. (2024). Autogen: Enabling next-gen LLM applications via multi-agent conversations. First Conference on Language Modeling.

[wang2025evoagentx] Yingxu Wang, Siwei Liu, Jinyuan Fang, Zaiqiao Meng. (2025). EvoAgentX: An Automated Framework for Evolving Agentic Workflows. arXiv preprint arXiv:2507.03616.

[qiao2024agent] Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen. (2024). Agent Planning with World Knowledge Model. Advances in Neural Information Processing Systems.

[liu2025graphaugmentedlargelanguagemodel] Liu, Yixin, Zhang, Guibin, Wang, Kun, Li, Shiyuan, Pan, Shirui. (2025). Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects. arXiv preprint arXiv:2507.21407.

[yu2024netsafeexploringtopologicalsafety] Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang. (2024). NetSafe: Exploring the Topological Safety of Multi-agent Networks. arXiv preprint arXiv:2410.15686.

[yue2025masrouterlearningroutellms] Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, Yiyan Qi. (2025). MasRouter: Learning to Route LLMs for Multi-Agent Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[zhang2025g] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan. (2025). G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems. arXiv preprint arXiv:2506.07398.

[liu2025sew] Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng. (2025). SEW: Self-Evolving Agentic Workflows for Automated Code Generation. arXiv preprint arXiv:2505.18646.

[liang2025reasoning] Liang, Jintao, Su, Gang, Lin, Huifeng, Wu, You, Zhao, Rui, Li, Ziyue. (2025). Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges. arXiv preprint arXiv:2506.10408.

[liao2025marft] Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang. (2025). MARFT: Multi-Agent Reinforcement Fine-Tuning. arXiv preprint arXiv:2504.16129.

[park2025maporl] Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, Joo-Kyung Kim. (2025). MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[zhao2025sirius] Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, James Zou. (2025). SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning. arXiv preprint arXiv:2502.04780.

[sarkar2025surveyllmagentcommunication] Anjana Sarkar, Soumyendu Sarkar. (2025). Survey of LLM Agent Communication. arXiv preprint arXiv:2506.05364.

[yang2024enhancing] Yang, Weiqing, Wang, Hanbin, Liu, Zhenghao, Li, Xinze, Yan, Yukun, Wang, Shuo, Gu, Yu, Yu, Minghe, Liu, Zhiyuan, Yu, Ge. (2024). Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement. CoRR.

[yang2025agentnetdecentralizedevolutionarycoordination] Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang. (2025). AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems. arXiv preprint arXiv:2504.00587.

[yang2025surveyaiagentprotocols] Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, et al. (2025). A Survey of AI Agent Protocols. arXiv preprint arXiv:2504.16736.

[yao2023react] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations.

[zhang2025multiagentarchitecturesearchagentic] Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang. (2025). Multi-agent Architecture Search via Agentic Supernet. Forty-second International Conference on Machine Learning.

[zhang2025cut] Zhang, Guibin, Yue, Yanwei, Li, Zhixun, Yun, Sukwon, Wan, Guancheng, Wang, Kun, Cheng, Dawei, Yu, Jeffrey Xu, Chen, Tianlong. (2025). Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems. The Thirteenth International Conference on Learning Representations.

[marti2025] Liao, Junwei, Wen, Muning, Wang, Jun, Zhang, Weinan. (2025). Marft: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129.

[chen2024optima] Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, Maosong Sun. (2025). Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System. Findings of the Association for Computational Linguistics.

[agarwal2024promptwizard] Agarwal, Eshaan, Singh, Joykirat, Dani, Vivek, Magazine, Raghav, Ganu, Tanuja, Nambi, Akshay. (2024). PromptWizard: Task-aware prompt optimization framework. arXiv preprint arXiv:2405.18369.

[wei2025vflow] Wei, Yangbo, Huang, Zhen, Li, Huang, Xing, Wei W, Lin, Ting-Jung, He, Lei. (2025). Vflow: Discovering optimal agentic workflows for verilog generation. arXiv preprint arXiv:2504.03723.

[motwani2024malt] Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt. (2024). MALT: Improving Reasoning with Multi-Agent LLM Training. arXiv preprint arXiv:2412.01928.

[subramaniam2025multiagent-ft] Subramaniam, Vighnesh, Du, Yilun, Tenenbaum, Joshua B, Torralba, Antonio, Li, Shuang, Mordatch, Igor. (2025). Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint arXiv:2501.05707.

[zhang2025darwin] Zhang, Jenny, Hu, Shengran, Lu, Cong, Lange, Robert, Clune, Jeff. (2025). Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents. arXiv preprint arXiv:2505.22954.

[wang2025huxleygodelmachinehumanlevelcoding] Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jürgen Schmidhuber. (2025). Huxley-Godel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine.

[xu2022gps] Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, Zhilin Yang. (2022). GPS: Genetic Prompt Search for Efficient Few-shot Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

[prasad2023grips] Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal. (2023). GRIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.

[pan2024plum] Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, Kashun Shum, Jipeng Zhang, Renjie Pi, Tong Zhang. (2024). Plum: Prompt Learning using Metaheuristics. Findings of the Association for Computational Linguistics.

[hsieh2024automatic] Cho-Jui Hsieh, et al. (2024). Automatic Engineering of Long Prompts. Findings of the Association for Computational Linguistics.

[lu2024strings] Yao Lu, Jiayi Wang, Raphael Tang, Sebastian Riedel, Pontus Stenetorp. (2024). Strings from the Library of Babel: Random Sampling as a Strong Baseline for Prompt Optimisation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[chen2024prompt] Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, Chuchu Fan. (2024). PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[lin2024use] Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, et al. (2024). Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers. Forty-first International Conference on Machine Learning.

[zhang2025revolve] Peiyan Zhang, Haibo Jin, Leyang Hu, Xinnuo Li, Liying Kang, Man Luo, Yangqiu Song, Haohan Wang. (2025). REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization. Forty-second International Conference on Machine Learning.

[wang2024correctly] Wang, Wenyi, Alyahya, Hisham A, Ashley, Dylan R, Serikov, Oleg, Khizbullin, Dmitrii, Faccio, Francesco, Schmidhuber, Jürgen. (2024). How to correctly do semantic backpropagation on language-based agentic systems. arXiv preprint arXiv:2412.03624.

[austin2024grad] Austin, Derek, Chartock, Elliott. (2024). GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering. arXiv preprint arXiv:2407.12865.

[hu2024localized] Hu, Wenyang, Shu, Yao, Yu, Zongmin, Wu, Zhaoxuan, Lin, Xiaoqiang, Dai, Zhongxiang, Ng, See-Kiong, Low, Bryan Kian Hsiang. (2024). Localized zeroth-order prompt optimization. Advances in Neural Information Processing Systems.

[shi2024best] Shi, Chengshuai, Yang, Kun, Yang, Jing, Shen, Cong. (2024). Best arm identification for prompt learning under a limited budget. arXiv preprint arXiv:2402.09723.

[schneider2025hyperband] Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra. (2025). Hyperband-based Bayesian Optimization for Black-box Prompt Selection. Forty-second International Conference on Machine Learning.

[wan2025few] Xingchen Wan, Han Zhou, Ruoxi Sun, Sercan Ö. Arık. (2025). From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation. The Thirteenth International Conference on Learning Representations.

[zhou2023large] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba. (2023). Large Language Models are Human-Level Prompt Engineers. The Eleventh International Conference on Learning Representations.

[zhou2023survival] Han Zhou, Xingchen Wan, Ivan Vulic, Anna Korhonen. (2023). Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning. Findings of the Association for Computational Linguistics: EMNLP 2023.

[yang2024large] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen. (2024). Large Language Models as Optimizers. The Twelfth International Conference on Learning Representations.

[guo2024connecting] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang. (2024). Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. The Twelfth International Conference on Learning Representations.

[deng2022rlprompt] Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, Zhiting Hu. (2022). RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

[wang2025ragen] Wang, Zihan, Wang, Kangrui, Wang, Qineng, Zhang, Pingyue, Li, Linjie, Yang, Zhengyuan, Jin, Xing, Yu, Kefan, Nguyen, Minh Nhat, Liu, Licheng, others. (2025). RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2504.20073.

[wang2024promptagent] Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, Zhiting Hu. (2024). PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization. The Twelfth International Conference on Learning Representations.

[fernando2024promptbreeder] Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, Tim Rocktäschel. (2024). Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution. Forty-first International Conference on Machine Learning.

[xu2024reprompting] Weijia Xu, Andrzej Banburski, Nebojsa Jojic. (2024). Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling. Forty-first International Conference on Machine Learning.

[ruan2024liveideabench] Ruan, Kai, Wang, Xuan, Hong, Jixiang, Wang, Peng, Liu, Yang, Sun, Hao. (2024). LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context. arXiv preprint arXiv:2412.17596.

[wan2024teach] Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Sercan Ö. Arık. (2024). Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization. Advances in Neural Information Processing Systems.

[srivastava2022beyond] Srivastava, Aarohi, Rastogi, Abhinav, Rao, Abhishek, Shoeb, Abu Awal Md, Abid, Abubakar, Fisch, Adam, Brown, Adam R, Santoro, Adam, Gupta, Aditya, Garriga-Alonso, Adrià, others. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

[binz2023using] Binz, Marcel, Schulz, Eric. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences.

[loya2023exploring] Manikanta Loya, Divya Sinha, Richard Futrell. (2023). Exploring the Sensitivity of LLMs' Decision-Making Capabilities: Insights from Prompt Variations and Hyperparameters. Findings of the Association for Computational Linguistics: EMNLP 2023.

[zhang2023tempera] Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, Joseph E. Gonzalez. (2023). TEMPERA: Test-Time Prompt Editing via Reinforcement Learning. The Eleventh International Conference on Learning Representations.

[opsahl2024optimizing] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[guo2024evoprompt] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang. (2024). EvoPrompt: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. The Twelfth International Conference on Learning Representations.

[ye2023prompt] Ye, Qinyuan, Axmed, Maxamed, Pryzant, Reid, Khani, Fereshte. (2023). Prompt engineering a prompt engineer. arXiv preprint arXiv:2311.05661.

[wu2024strago] Yurong Wu, Yan Gao, Bin Benjamin Zhu, Zineng Zhou, Xiaodi Sun, Sheng Yang, Jian-Guang Lou, Zhiming Ding, Linjun Yang. (2024). StraGo: Harnessing Strategic Guidance for Prompt Optimization. Findings of the Association for Computational Linguistics: EMNLP 2024.

[yao2024retroformer] Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R. N., Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese. (2024). Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization. The Twelfth International Conference on Learning Representations.

[pryzant2023automatic] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, Michael Zeng. (2023). Automatic Prompt Optimization with "Gradient Descent" and Beam Search. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[mert2025optimizing] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, James Zou. (2025). Optimizing generative AI by backpropagating language model feedback. Nature.

[yuksekgonul2024textgrad] Yuksekgonul, Mert, Bianchi, Federico, Boen, Joseph, Liu, Sheng, Huang, Zhi, Guestrin, Carlos, Zou, James. (2024). TextGrad: Automatic "Differentiation" via Text. arXiv preprint arXiv:2406.07496.

[lin2024prompt] Lin, Xiaoqiang, Dai, Zhongxiang, Verma, Arun, Ng, See-Kiong, Jaillet, Patrick, Low, Bryan Kian Hsiang. (2024). Prompt optimization with human feedback. arXiv preprint arXiv:2405.17346.

[ma-etal-2024-sciagent] Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun. (2024). SciAgent: Tool-augmented Language Models for Scientific Reasoning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[tora] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen. (2024). ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. The Twelfth International Conference on Learning Representations.

[xin2025deepseekproverv] Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, Chong Ruan. (2025). DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search. The Thirteenth International Conference on Learning Representations.

[xin2024deepseek] Xin, Huajian, Guo, Daya, Shao, Zhihong, Ren, Zhizhou, Zhu, Qihao, Liu, Bo, Ruan, Chong, Li, Wenda, Liang, Xiaodan. (2024). Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. arXiv preprint arXiv:2405.14333.

[next] Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin. (2024). NExT: Teaching Large Language Models to Reason about Code Execution. Forty-first International Conference on Machine Learning.

[bi2025forestofthought] Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang. (2025). Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning. Forty-second International Conference on Machine Learning.

[zhu2024deductive] Tinghui Zhu, Kai Zhang, Jian Xie, Yu Su. (2024). Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning. First Conference on Language Modeling.

[zhu-etal-2023-solving] Zhu, Xinyu, Wang, Junjie, Zhang, Lin, Zhang, Yuxiang, Huang, Yongfeng, Gan, Ruyi, Zhang, Jiaxing, Yang, Yujiu. (2023). Solving Math Word Problems via Cooperative Reasoning induced Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[setlur2025rewarding] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar. (2025). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. The Thirteenth International Conference on Learning Representations.

[wang-etal-2024-math] Wang, Peiyi, Li, Lei, Shao, Zhihong, Xu, Runxin, Dai, Damai, Li, Yifei, Chen, Deli, Wu, Yu, Sui, Zhifang. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[jiao-etal-2024-learning] Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty. (2024). Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[liu2024skyworkrewardbagtricksreward] Liu, Chris Yuhao, Zeng, Liang, Liu, Jiacai, Yan, Rui, He, Jujie, Wang, Chaojie, Yan, Shuicheng, Liu, Yang, Zhou, Yahui. (2024). Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451.

[liu2025skyworkrewardv2scalingpreferencedata] Liu, Chris Yuhao, Zeng, Liang, Xiao, Yuzhen, He, Jujie, Liu, Jiacai, Wang, Chaojie, Yan, Rui, Shen, Wei, Zhang, Fuxiang, Xu, Jiacheng, others. (2025). Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. arXiv preprint arXiv:2507.01352.

[first2023baldur] First, Emily, Rabe, Markus N, Ringer, Talia, Brun, Yuriy. (2023). Baldur: Whole-proof generation and repair with large language models. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.

[ni2023lever] Ni, Ansong, Iyer, Srini, Radev, Dragomir, Stoyanov, Veselin, Yih, Wen-tau, Wang, Sida, Lin, Xi Victoria. (2023). LEVER: Learning to Verify Language-to-Code Generation with Execution. International Conference on Machine Learning.

[ChenZNZLLC23] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen. (2023). CodeT: Code Generation with Generated Tests. The Eleventh International Conference on Learning Representations.

[52936] Pan, Ruwei, Zhang, Hongyu, Liu, Chao. (2025). CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation. arXiv preprint arXiv:2501.07811.

[gao2024retrievalaugmentedgenerationlargelanguage] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey.

[gu2024mambalineartimesequencemodeling] Albert Gu, Tri Dao. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

[nie2025largelanguagediffusionmodels] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li. (2025). Large Language Diffusion Models.

[prabhumoye2022fewshotinstructionpromptspretrained] Shrimai Prabhumoye, Rafal Kocielnik, Mohammad Shoeybi, Anima Anandkumar, Bryan Catanzaro. (2022). Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases.

[mousavi2023ncriticsselfrefinementlargelanguage] Sajad Mousavi, Ricardo Luna Gutiérrez, Desik Rengarajan, Vineet Gundecha, Ashwin Ramesh Babu, Avisek Naug, Antonio Guillen, Soumyendu Sarkar. (2023). N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics.

[han2025stitchtimesavesnine] Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao. (2025). A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models.

[ma2023fairnessguidedfewshotpromptinglarge] Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, Huazhu Fu, Qinghua Hu, Bingzhe Wu. (2023). Fairness-guided Few-shot Prompting for Large Language Models.

[han2025retrievalaugmentedgenerationgraphsgraphrag] Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, Jiliang Tang. (2025). Retrieval-Augmented Generation with Graphs (GraphRAG).

[zhang2025leanragknowledgegraphbasedgenerationsemantic] Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi. (2025). LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval.

[lieber2024jambahybridtransformermambalanguage] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham. (2024). Jamba: A Hybrid Transformer-Mamba Language Model.

[peng2023rwkvreinventingrnnstransformer] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu. (2023). RWKV: Reinventing RNNs for the Transformer Era.

[ruan2023tptulargelanguagemodelbased] Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Ziyue Li, Xingyu Zeng, Rui Zhao. (2023). TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage.

[wu2024mathchatconversetacklechallenging] Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang. (2024). MathChat: Converse to Tackle Challenging Math Problems with LLM Agents.

[chen2023fireactlanguageagentfinetuning] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao. (2023). FireAct: Toward Language Agent Fine-tuning.

[weng2025deepscientistadvancingfrontierpushingscientific] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang. (2025). DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively.

[chang2025agentnetworkprotocoltechnical] Gaowei Chang, Eidan Lin, Chengxuan Yuan, Rizhao Cai, Binbin Chen, Xuan Xie, Yin Zhang. (2025). Agent Network Protocol Technical White Paper.

[marro2024scalablecommunicationprotocolnetworks] Samuele Marro, Emanuele La Malfa, Jesse Wright, Guohao Li, Nigel Shadbolt, Michael Wooldridge, Philip Torr. (2024). A Scalable Communication Protocol for Networks of Large Language Models.

[aleithan2024swebenchenhancedcodingbenchmark] Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, Song Wang. (2024). SWE-Bench+: Enhanced Coding Benchmark for LLMs.

[wang2022scienceworldagentsmarter5th] Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, Prithviraj Ammanabrolu. (2022). ScienceWorld: Is your Agent Smarter than a 5th Grader?.

[wan2025remalearningmetathinkllms] Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen. (2025). ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning.

[DBLP:conf/emnlp/LiDZHGLI24] Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie. (2024). Improving Multi-Agent Debate with Sparse Communication Topology. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/V1/2024.FINDINGS-EMNLP.427.

[islam2025codesim] Md. Ashraful Islam, Mohammed Eunus Ali, Md. Rizwan Parvez. (2025). CodeSim: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging. Findings of the Association for Computational Linguistics: NAACL 2025.

[luo2025o1] Luo, Haotian, Shen, Li, He, Haiying, Wang, Yibo, Liu, Shiwei, Li, Wei, Tan, Naiqiang, Cao, Xiaochun, Tao, Dacheng. (2025). O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570.

[wang2025dynamically] Wang, Yingxu, Fan, Shiqi, Wang, Mengzhu, Liu, Siwei. (2025). Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA. arXiv preprint arXiv:2508.00719.

[du2024anytool] Yu Du, Fangyun Wei, Hongyang Zhang. (2024). AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls. Forty-first International Conference on Machine Learning.

[liu2025toolace] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen. (2025). ToolACE: Winning the Points of LLM Function Calling. The Thirteenth International Conference on Learning Representations.

[wang2025toolgen] Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li. (2025). ToolGen: Unified Tool Retrieval and Calling via Generation. The Thirteenth International Conference on Learning Representations.

[deepseekai2025deepseekr1incentivizingreasoningcapability] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, others. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[hui2024qwen2] Hui, Binyuan, Yang, Jian, Cui, Zeyu, Yang, Jiaxi, Liu, Dayiheng, Zhang, Lei, Liu, Tianyu, Zhang, Jiajun, Yu, Bowen, Lu, Keming, others. (2024). Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186.

[chen2025do] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu. (2025). Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. Forty-second International Conference on Machine Learning.

[lambert2024tulu] Lambert, Nathan, Morrison, Jacob, Pyatkin, Valentina, Huang, Shengyi, Ivison, Hamish, Brahman, Faeze, Miranda, Lester James V, Liu, Alisa, Dziri, Nouha, Lyu, Shane, others. (2024). Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

[hendrycks2025agidefinition] Hendrycks, Dan, Song, Dawn, Szegedy, Christian, Lee, Honglak, Gal, Yarin, Brynjolfsson, Erik, Li, Sharon, Zou, Andy, Levine, Lionel, Han, Bo, others. (2025). A Definition of AGI. arXiv preprint arXiv:2510.18212.

[zhang2024multimodalfoundationagentfinancial] Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, Longtao Zheng, Xinrun Wang, Bo An. (2024). A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist.

[yang2025oasisopenagentsocial] Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao. (2025). OASIS: Open Agent Social Interaction Simulations with One Million Agents.

[liu2025agentcfmemoryenhancedllmbasedagents] Jiahao Liu, Shengkang Gu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu. (2025). AgentCF++: Memory-enhanced LLM-based Agents for Popularity-aware Cross-domain Recommendations.

[qiu2025alitageneralistagentenabling] Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, Mengdi Wang. (2025). Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution.

[zhao2025pyvisionagenticvisiondynamic] Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei. (2025). PyVision: Agentic Vision with Dynamic Tooling.

[qiu2025agentdistilltrainingfreeagentdistillation] Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, Mengdi Wang. (2025). AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes.

[suzgun2025dynamiccheatsheettesttimelearning] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou. (2025). Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory.

[zhang2025latentevolveselfevolvingtesttimescaling] Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang, Zhenfei Yin, Lei Bai, Shuicheng Yan. (2025). LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space.

[li2025helloagainllmpoweredpersonalized] Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, Tat-Seng Chua. (2025). Hello Again! LLM-powered Personalized Agent for Long-term Dialogue.

[zhang2025ruc_survey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2025). A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems.

[putta2024agent] Putta, Pranav, Mills, Edmund, Garg, Naman, Motwani, Sumeet, Finn, Chelsea, Garg, Divyansh, Rafailov, Rafael. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv preprint arXiv:2408.07199.

[zhang2025landscapeagenticreinforcementlearning] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai. (2025). The Landscape of Agentic Reinforcement Learning for LLMs: A Survey.

[rafailov2023direct] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Thirty-seventh Conference on Neural Information Processing Systems.

[lai2024stepdpostepwisepreferenceoptimization] Lai, Xin, Tian, Zhuotao, Chen, Yukang, Yang, Senqiao, Peng, Xiangru, Jia, Jiaya. (2024). Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs. arXiv preprint arXiv:2406.18629.

[hu2023language] Hu, Zhiting, Shu, Tianmin. (2023). Language models, agent models, and world models: The law for machine reasoning and planning. arXiv preprint arXiv:2312.05230.

[jiang2025agentsbench] Jiang, Cong, Yang, Xiaolei. (2025). AgentsBench: A Multi-Agent LLM Simulation Framework for Legal Judgment Prediction. Systems.

[xu2023multi] Xu, Tianwen, Ju, Fengkui. (2023). Multi-agent logic for reasoning about duties and powers in private law. Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law.

[yuan2024can] Yuan, Weikang, Cao, Junjie, Jiang, Zhuoren, Kang, Yangyang, Lin, Jun, Song, Kaisong, Yan, Pengwei, Sun, Changlong, Liu, Xiaozhong, others. (2024). Can large language models grasp legal theories? enhance legal reasoning with insights from multi-agent collaboration. arXiv preprint arXiv:2410.02507.

[fatemi2024finvision] Fatemi, Sorouralsadat, Hu, Yuheng. (2024). FinVision: A multi-agent framework for stock market prediction. Proceedings of the 5th ACM International Conference on AI in Finance.

[luo2025llm] Luo, Yichen, Feng, Yebo, Xu, Jiahua, Tasca, Paolo, Liu, Yang. (2025). LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management. arXiv preprint arXiv:2501.00826.

[li2025hedgeagents] Li, Xiangyu, Zeng, Yawen, Xing, Xiaofen, Xu, Jin, Xu, Xiangmin. (2025). HedgeAgents: A Balanced-aware Multi-agent Financial Trading System. Companion Proceedings of the ACM on Web Conference 2025.

[raza2025trism] Raza, Shaina, Sapkota, Ranjan, Karkee, Manoj, Emmanouilidis, Christos. (2025). TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems. arXiv preprint arXiv:2506.04133.

[sarin2024unleashing] Sarin, Saket, Singh, Sunil K, Kumar, Sudhakar, Goyal, Shivam, Gupta, Brij Bhooshan, Alhalabi, Wadee, Arya, Varsha. (2024). Unleashing the Power of Multi-Agent Reinforcement Learning for Algorithmic Trading in the Digital Financial Frontier and Enterprise Information Systems. Computers, Materials & Continua.

[wu-etal-2024-reasoning] Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, and others. (2024). Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[tong2024dartmath] Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He. (2024). DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[huang-etal-2025-opencoder] Huang, Siming, Cheng, Tianhao, Liu, Jason Klein, Xu, Weidi, Hao, Jiaran, Song, Liuyihan, Xu, Yang, Yang, Jian, Liu, Jiaheng, Zhang, Chenchen, Chai, Linzheng, Yuan, Ruifeng, Luo, Xianzhen, Wang, Qiufeng, Fan, YuanTao, Zhu, Qingfu, Zhang, Zhaoxiang, Gao, Yang, Fu, Jie, Liu, Qian, Li, Houyi, Zhang, Ge, Qi, Yuan, Xu, Yinghui, Chu, Wei, Wang, Zili. (2025). OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2025.acl-long.1591.

[zuo2025kg4diagnosis] Zuo, Kaiwen, Jiang, Yirui, Mo, Fan, Lio, Pietro. (2025). KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis. AAAI Bridge Program on AI for Medicine and Healthcare.

[feng2025m] Feng, Jinghao, Zheng, Qiaoyu, Wu, Chaoyi, Zhao, Ziheng, Zhang, Ya, Wang, Yanfeng, Xie, Weidi. (2025). M³Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging. arXiv preprint arXiv:2502.20301.

[fallahpour2025medrax] Fallahpour, Adibvafa, Ma, Jun, Munim, Alif, Lyu, Hongwei, Wang, Bo. (2025). Medrax: Medical reasoning agent for chest x-ray. arXiv preprint arXiv:2502.02673.

[chen2025enhancing] Chen, Xi, Yi, Huahui, You, Mingke, Liu, WeiZhi, Wang, Li, Li, Hairui, Zhang, Xue, Guo, Yingman, Fan, Lei, Chen, Gang, others. (2025). Enhancing diagnostic capability with multi-agents conversational large language models. NPJ digital medicine.

[huang2024o1replicationjourney] Huang, Zhen, Zou, Haoyang, Li, Xuefeng, Liu, Yixiu, Zheng, Yuxiang, Chern, Ethan, Xia, Shijie, Qin, Yiwei, Yuan, Weizhe, Liu, Pengfei. (2024). O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?. arXiv preprint arXiv:2411.16489.

[yin2024mumath] Shuo Yin, Weihao You, Zhilong Ji, Guoqiang Zhong, Jinfeng Bai. (2024). MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[bespoke_stratos] Bespoke Labs. (2025). Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation.

[cobbe2021training] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, others. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[nye2021show] Nye, Maxwell, Andreassen, Anders Johan, Gur-Ari, Guy, Michalewski, Henryk, Austin, Jacob, Bieber, David, Dohan, David, Lewkowycz, Aitor, Bosma, Maarten, Luan, David, others. (2021). Show your work: Scratchpads for intermediate computation with language models.

[ho-etal-2023-large] Ho, Namgyu, Schmid, Laura, Yun, Se-Young. (2023). Large Language Models Are Reasoning Teachers. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.830.

[zheng2025goaldirectedsearchoutperformsgoalagnostic] Yicong Zheng, Kevin L. McKee, Thomas Miconi, Zacharie Bugaud, Mick van Gelderen, Jed McCaleb. (2025). Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks.

[behrouz2025nested] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni. (2025). Nested Learning: The Illusion of Deep Learning Architectures. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[zelikman2022] Zelikman, Eric, Wu, Yuhuai, Mu, Jesse, Goodman, Noah. (2022). STaR: Bootstrapping Reasoning With Reasoning. Advances in Neural Information Processing Systems.

[turpin2023language] Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Thirty-seventh Conference on Neural Information Processing Systems.

[donner2018solving] Donner-Banzhoff, Norbert. (2018). Solving the diagnostic challenge: a patient-centered approach. The Annals of Family Medicine.

[KONONENKO200189] Igor Kononenko. (2001). Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. doi:10.1016/S0933-3657(01)00077-X.

[sun2024lawluo] Sun, Jingyun, Dai, Chengxiao, Luo, Zhongze, Chang, Yangbo, Li, Yang. (2024). LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultations. arXiv preprint arXiv:2407.16252.

[mcnaughton2024cactus] McNaughton, Andrew D, Sankar Ramalaxmi, Gautham Krishna, Kruel, Agustin, Knutson, Carter R, Varikoti, Rohith A, Kumar, Neeraj. (2024). CACTUS: Chemistry Agent Connecting Tool Usage to Science. ACS omega.

[xing2024designing] Frank Xing. (2025). Designing Heterogeneous LLM Agents for Financial Sentiment Analysis. ACM Transactions on Management Information Systems.

[wang2024peer] Wang, Yiying, Li, Xiaojing, Wang, Binzhu, Zhou, Yueyang, Lin, Yingru, Ji, Han, Chen, Hong, Zhang, Jinshi, Yu, Fei, Zhao, Zewei, others. (2024). PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods. arXiv preprint arXiv:2407.06985.

[yu2024fincon] Yu, Yangyang, Yao, Zhiyuan, Li, Haohang, Deng, Zhiyang, Jiang, Yuechen, Cao, Yupeng, Chen, Zhi, Suchow, Jordan, Cui, Zhenyu, Liu, Rong, others. (2024). FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems.

[m2024augmenting] M. Bran, Andres, Cox, Sam, Schilter, Oliver, Baldassari, Carlo, White, Andrew D, Schwaller, Philippe. (2024). Augmenting large language models with chemistry tools. Nature Machine Intelligence.

[sun2024pathgen] Sun, Yuxuan, Zhang, Yunlong, Si, Yixuan, Zhu, Chenglu, Shui, Zhongyi, Zhang, Kai, Li, Jingxiong, Lyu, Xingheng, Lin, Tao, Yang, Lin. (2024). Pathgen-1.6 m: 1.6 million pathology image-text pairs generation through multi-agent collaboration. arXiv preprint arXiv:2407.00203.

[su2025kgarevion] Su, Xiaorui, Wang, Yibo, Gao, Shanghua, Liu, Xiaolong, Giunchiglia, Valentina, Clevert, Djork-Arné, others. (2025). KGARevion: an AI agent for knowledge-intensive biomedical QA. ICLR.

[almansoori2025self] Almansoori, Mohammad, Kumar, Komal, Cholakkal, Hisham. (2025). Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions. arXiv preprint arXiv:2503.22678.

[ghezloo2025pathfinder] Ghezloo, Fatemeh, Seyfioglu, Mehmet Saygin, Soraki, Rustin, Ikezogwo, Wisdom O, Li, Beibin, Vivekanandan, Tejoram, Elmore, Joann G, Krishna, Ranjay, Shapiro, Linda. (2025). PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology. arXiv preprint arXiv:2502.08916.

[li2024mmedagent] Li, Binxu, Yan, Tiankai, Pan, Yuanting, Luo, Jie, Ji, Ruiyang, Ding, Jiayuan, Xu, Zhe, Liu, Shilong, Dong, Haoyu, Lin, Zihao, others. (2024). MMedAgent: Learning to Use Medical Tools with Multi-modal Agent. arXiv preprint arXiv:2407.02483.

[wang2025medagent] Wang, Ziyue, Wu, Junde, Low, Chang Han, Jin, Yueming. (2025). MedAgent-Pro. arXiv preprint arXiv:2503.18968.

[chen2025mdteamgpt] Chen, Kai, Li, Xinfeng, Yang, Tianpei, Wang, Hewei, Dong, Wei, Gao, Yang. (2025). MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation. arXiv preprint arXiv:2503.13856.

[zhuang2025learning] Zhuang, Yangyang, Jiang, Wenjia, Zhang, Jiayu, Yang, Ze, Zhou, Joey Tianyi, Zhang, Chi. (2025). Learning to Be A Doctor: Searching for Effective Medical Agent Architectures. arXiv preprint arXiv:2504.11301.

[xing2025designing] Xing, Frank. (2025). Designing heterogeneous LLM agents for financial sentiment analysis. arXiv preprint arXiv:2401.05799.

[madaan2023self] Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, Dziri, Nouha, Prabhumoye, Shrimai, Yang, Yiming, others. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems.

[di2023multi] Di Martino, Beniamino, Esposito, Antonio, Colucci Cante, Luigi. (2023). Multi agents simulation of justice trials to support control management and reduction of civil trials duration. Journal of Ambient Intelligence and Humanized Computing.

[rasheed2024codepori] Rasheed, Zeeshan, Sami, Malik Abdul, Kemell, Kai-Kristian, Waseem, Muhammad, Saari, Mika, Systä, Kari, others. (2024). CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Collaboration. arXiv preprint arXiv:2402.01411.

[van2023llm] van der Ouderaa, Tycho FA, Nagel, Markus, Van Baalen, Mart, Asano, Yuki M, Blankevoort, Tijmen. (2023). The LLM Surgeon. arXiv preprint arXiv:2312.17244.

[zhang2025codecriticbench] Zhang, Alexander, Dong, Marcus, Liu, Jiaheng, Zhang, Wei, Wang, Yejie, Yang, Jian, Zhang, Ge, Liu, Tianyu, Peng, Zhongyuan, Tan, Yingshui, others. (2025). CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models. arXiv preprint arXiv:2502.16614.

[rahman2025marco] Rahman, Asif, Cvetkovic, Veljko, Reece, Kathleen, Walters, Aidan, Hassan, Yasir, Tummeti, Aneesh, Torres, Bryan, Cooney, Denise, Ellis, Margaret, Nikolopoulos, Dimitrios S. (2025). MARCO. arXiv preprint arXiv:2505.03906.

[wang2024openhands] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, et al.. (2025). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. The Thirteenth International Conference on Learning Representations.

[wu2025optimas] Wu, Shirley, Sarthi, Parth, Zhao, Shiyu, Lee, Aaron, Shandilya, Herumb, Grobelnik, Adrian Mladenic, Choudhary, Nurendra, Huang, Eddie, Subbian, Karthik, Zhang, Linjun, others. (2025). Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards. arXiv preprint arXiv:2507.03041.

[tang2024codeagent] Tang, Xunzhu, Kim, Kisub, Song, Yewei, Lothritz, Cedric, Li, Bei, Ezzini, Saad, Tian, Haoye, Klein, Jacques, Bissyandé, Tegawende F. (2024). CodeAgent: Autonomous Communicative Agents for Code Review. arXiv preprint arXiv:2402.02172.

[zhang2023self] Zhang, Kechi, Li, Zhuo, Li, Jia, Li, Ge, Jin, Zhi. (2023). Self-Edit: Fault-Aware Code Editor for Code Generation. arXiv preprint arXiv:2305.04087.

[mannadiar2010debugging] Mannadiar, Raphael, Vangheluwe, Hans. (2010). Debugging in domain-specific modelling. International Conference on Software Language Engineering.

[puvvadi2025coding] Puvvadi, Meghana, Arava, Sai Kumar, Santoria, Adarsh, Chennupati, Sesha Sai Prasanna, Puvvadi, Harsha Vardhan. (2025). Coding agents: A comprehensive survey of automated bug fixing systems and benchmarks. 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT).

[he2025llm] He, Junda, Treude, Christoph, Lo, David. (2025). LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Transactions on Software Engineering and Methodology.

[lee2024unifieddebuggingapproachllmbased] Lee, Cheryl, Xia, Chunqiu Steven, Yang, Longji, Huang, Jen-tse, Zhu, Zhouruixin, Zhang, Lingming, Lyu, Michael R. (2024). A unified debugging approach via llm-based multi-agent synergy. arXiv preprint arXiv:2404.17153.

[jin2024rgd] Jin, Haolin, Sun, Zechao, Chen, Huaming. (2024). RGD. 2024 IEEE International Conference on Agents (ICA).

[dong2024self] Dong, Yihong, Jiang, Xue, Jin, Zhi, Li, Ge. (2024). Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology.

[adnan2025large] Adnan, Muntasir, Xu, Zhiwei, Kuhn, Carlos CN. (2025). Large Language Model Guided Self-Debugging Code Generation. arXiv preprint arXiv:2502.02928.

[huang2023agentcoder] Huang, Dong, Zhang, Jie M, Luck, Michael, Bu, Qingwen, Qing, Yuhao, Cui, Heming. (2023). AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010.

[yang2024finrobot] Yang, Hongyang, Zhang, Boyu, Wang, Neng, Guo, Cheng, Zhang, Xiaoli, Lin, Likun, Wang, Junlin, Zhou, Tianyu, Guan, Mao, Zhang, Runjia, others. (2024). FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models. arXiv preprint arXiv:2405.14767.

[xiao2024cellagent] Xiao, Yihang, Liu, Jinyi, Zheng, Yan, Xie, Xiaohan, Hao, Jianye, Li, Mingzhi, Wang, Ruitao, Ni, Fei, Li, Yuxiao, Luo, Jintian, others. (2024). CellAgent: An LLM-driven multi-agent framework for automated single-cell data analysis. bioRxiv.

[chen2023teaching] Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou. (2024). Teaching Large Language Models to Self-Debug. The Twelfth International Conference on Learning Representations.

[he2024agentscourt] He, Zhitao, Cao, Pengfei, Wang, Chenhao, Jin, Zhuoran, Chen, Yubo, Xu, Jiexin, Li, Huaijun, Jiang, Xiaojian, Liu, Kang, Zhao, Jun. (2024). AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowledge Augmentation. arXiv preprint arXiv:2403.02959.

[shi2024legalgpt] Shi, Juanming, Guo, Qinglang, Liao, Yong, Liang, Shenglin. (2024). LegalGPT: Legal Chain of Thought for the Legal Large Language Model Multi-agent Framework. International Conference on Intelligent Computing.

[tang2402codeagent] Tang, Xunzhu, Kim, Kisub, Song, Yewei, Lothritz, Cedric, Li, Bei, Ezzini, Saad, Tian, Haoye, Klein, Jacques, Bissyande, Tegawende F. (2024). CodeAgent: Autonomous communicative agents for code review. arXiv preprint arXiv:2402.02172.

[chen2024agentcourt] Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Hamid Alinejad-Rokny, others. (2025). AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents. Findings of the Association for Computational Linguistics.

[qian2025enhancing] Qian, Yiyue, Zhang, Shinan, Zhou, Yun, Ding, Haibo, Socolinsky, Diego, Zhang, Yi. (2025). Enhancing LLM-as-a-Judge. amazon.science.

[zhao2024auto] Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Weiwen Xu, Deli Zhao, Lidong Bing. (2025). Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[li2024generation] Li, Dawei, Jiang, Bohan, Huang, Liangjie, Beigi, Alimohammad, Zhao, Chengshuai, Tan, Zhen, Bhattacharjee, Amrita, Jiang, Yuxuan, Chen, Canyu, Wu, Tianhao, others. (2024). From generation to judgment: Opportunities and challenges of LLM-as-a-judge. arXiv preprint arXiv:2411.16594.

[li2024llms] Li, Haitao, Dong, Qian, Chen, Junjie, Su, Huixue, Zhou, Yujia, Ai, Qingyao, Ye, Ziyi, Liu, Yiqun. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv preprint arXiv:2412.05579.

[arabzadeh2024towards] Arabzadeh, Negar, Kiseleva, Julia, Wu, Qingyun, Wang, Chi, Awadallah, Ahmed, Dibia, Victor, Fourney, Adam, Clarke, Charles. (2024). Towards better human-agent alignment: Assessing task utility in llm-powered applications. arXiv preprint arXiv:2402.09015.

[mialon2023gaia] Mialon, Grégoire, Fourrier, Clémentine, Swift, Craig, Wolf, Thomas, LeCun, Yann, Scialom, Thomas. (2023). GAIA: A Benchmark for General AI Assistants. The Twelfth International Conference on Learning Representations.

[hu2025osda] Hu, Zhaolin, Zhou, Yixiao, Wang, Zhongan, Li, Xin, Yang, Weimin, Fan, Hehe, Yang, Yi. (2025). OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents. The Thirteenth International Conference on Learning Representations.

[huang2023codecot] Huang, Dong, Bu, Qingwen, Cui, Heming. (2023). Codecot and beyond: Learning to program and test like a developer. arXiv preprint arXiv:2308.08784.

[ruan2024accelerated] Ruan, Yixiang, Lu, Chenyin, Xu, Ning, Zhang, Jian, Xuan, Jun, Pan, Jianzhang, Fang, Qun, Gao, Hanyu, Shen, Xiaodong, Ye, Ning, others. (2024). Accelerated end-to-end chemical synthesis development with large language models.

[ghafarollahi2024sciagents] Ghafarollahi, Alireza, Buehler, Markus J. (2024). SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning. Advanced Materials.

[qian2024iterative] Qian, Chen, Li, Jiahao, Dang, Yufan, Liu, Wei, Wang, YiFei, Xie, Zihao, Chen, Weize, Yang, Cheng, Zhang, Yingli, Liu, Zhiyuan, others. (2024). Iterative experience refinement of software-developing agents. arXiv preprint arXiv:2405.04219.

[inoue2025drugagent] Inoue, Yoshitaka, Song, Tianci, Wang, Xinling, Luna, Augustin, Fu, Tianfan. (2025). Drugagent: Multi-agent large language model-based reasoning for drug-target interaction prediction. ICLR 2025 Workshop on Machine Learning for Genomics Explorations.

[averly2025liddia] Averly, Reza, Baker, Frazier N, Ning, Xia. (2025). LIDDIA. arXiv preprint arXiv:2502.13959.

[landrum2013rdkit] Landrum, Greg. (2013). Rdkit documentation. Release.

[schick2023toolformer] Schick, Timo, Dwivedi-Yu, Jane, Dessì, Roberto, Raileanu, Roberta, Lomeli, Maria, Zettlemoyer, Luke, Cancedda, Nicola, Scialom, Thomas. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems.

[li2023api] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[xu2023tool] Xu, Qiantong, Hong, Fenglu, Li, Bo, Hu, Changran, Chen, Zhengyu, Zhang, Jian. (2023). On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504.

[huang2023metatool] Huang, Yue, Shi, Jiawen, Li, Yuan, Fan, Chenrui, Wu, Siyuan, Zhang, Qihui, Liu, Yixin, Zhou, Pan, Wan, Yao, Gong, Neil Zhenqiang, others. (2023). MetaTool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128.

[zhuang2023toolqa] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang. (2023). ToolQA: A Dataset for LLM Question Answering with External Tools. Advances in Neural Information Processing Systems.

[novikov2025alphaevolve] Novikov, Alexander, Vũ, Ngân, others. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.

[tang2023toolalpaca] Tang, Qiaoyu, Deng, Ziliang, Lin, Hongyu, Han, Xianpei, Liang, Qiao, Cao, Boxi, Sun, Le. (2023). Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.

[wang2024gta] Wang, Jize, Ma, Zerun, Li, Yining, Zhang, Songyang, Chen, Cailian, Chen, Kai, Le, Xinyi. (2024). GTA: A Benchmark for General Tool Agents. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[trivedi2024appworld] Trivedi, Harsh, Khot, Tushar, Hartmann, Mareike, Manku, Ruskin, Dong, Vinty, Li, Edward, Gupta, Shashank, Sabharwal, Ashish, Balasubramanian, Niranjan. (2024). AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901.

[song2024beyond] Song, Yueqi, Xu, Frank, Zhou, Shuyan, Neubig, Graham. (2024). Beyond browsing: Api-based web agents. arXiv preprint arXiv:2410.16464.

[wei2025browsecomp] Wei, Jason, Sun, Zhiqing, Papay, Spencer, McKinney, Scott, Han, Jeffrey, Fulford, Isa, Chung, Hyung Won, Passos, Alex Tachard, Fedus, William, Glaese, Amelia. (2025). BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv preprint arXiv:2504.12516.

[zhang2025if] Zhang, Hangfan, Cui, Zhiyao, Wang, Xinrun, Zhang, Qiaosheng, Wang, Zhen, Wu, Dinghao, Hu, Shuyue. (2025). If multi-agent debate is the answer, what is the question?. arXiv preprint arXiv:2502.08788.

[hu2022lora] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2022). LoRA: Low-Rank Adaptation of Large Language Models. The Tenth International Conference on Learning Representations.

[pfeiffer2020adapterfusion] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, Iryna Gurevych. (2021). AdapterFusion: Non-Destructive Task Composition for Transfer Learning. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

[ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems.

[koh2024visualwebarena] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[zhou2023webarena] Zhou, Shuyan, Xu, Frank F, Zhu, Hao, Zhou, Xuhui, Lo, Robert, Sridhar, Abishek, Cheng, Xianyi, Ou, Tianyue, Bisk, Yonatan, Fried, Daniel, others. (2023). WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

[pan2024webcanvas] Pan, Yichen, Kong, Dehan, Zhou, Sida, Cui, Cheng, Leng, Yifei, Jiang, Bing, Liu, Hangyu, Shang, Yanyi, Zhou, Shuyan, Wu, Tongshuang, others. (2024). WebCanvas: Benchmarking Web Agents in Online Environments. arXiv preprint arXiv:2406.12373.

[wu2025webwalker] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang. (2025). WebWalker: Benchmarking LLMs in Web Traversal. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[liu2023agentbench] Liu, Xiao, Yu, Hao, Zhang, Hanchen, Xu, Yifan, Lei, Xuanyu, Lai, Hanyu, Gu, Yu, Ding, Hangliang, Men, Kaiwen, Yang, Kejuan, others. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688.

[ma2024agentboard] Ma, Chang, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2024). Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178.

[tang2024enhancinglongcontextperformance] Yimin Tang, Yurong Xu, Ning Yan, Masood Mortazavi. (2024). Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism.

[wang2024survey] Wang, Lei, Ma, Chen, Feng, Xueyang, Zhang, Zeyu, Yang, Hao, Zhang, Jingsen, Chen, Zhiyuan, Tang, Jiakai, Chen, Xu, Lin, Yankai, others. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science.

[ruan2025benchmarking] Ruan, Kai, Huang, Mowen, Wen, Ji-Rong, Sun, Hao. (2025). Benchmarking LLMs. arXiv preprint arXiv:2505.04364.

[vitielloevaluating] Vitiello, Rosanna, Bisk, Yonatan, Rosé, Carolyn. Evaluating Interdependent Collaboration in Multi-Agent LLM Systems.

[zhu2025multiagentbench] Zhu, Kunlun, Du, Hongyi, Hong, Zhaochen, Yang, Xiaocheng, Guo, Shuyi, Wang, Zhe, Wang, Zhenhailong, Qian, Cheng, Tang, Xiangru, Ji, Heng, others. (2025). MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents. arXiv preprint arXiv:2503.01935.

[wang2024benchmark] Wang, Siyuan, Long, Zhuohan, Fan, Zhihao, Wei, Zhongyu, Huang, Xuanjing. (2024). Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation. arXiv preprint arXiv:2402.11443.

[deng2024mobile] Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, Shuo Shang. (2024). Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[chen2024gui] Chen, Dongping, Huang, Yue, Wu, Siyuan, Tang, Jingyu, Zhou, Huichi, Zhang, Qihui, He, Zhigang, Bai, Yilin, Gao, Chujie, Chen, Liuyi, others. (2024). GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding. The Thirteenth International Conference on Learning Representations.

[lee2024mobilesafetybench] Lee, Juyong, Hahm, Dongyoon, Choi, June Suk, Knox, W Bradley, Lee, Kimin. (2024). MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control. arXiv preprint arXiv:2410.17520.

[rawles2024androidworld] Rawles, Christopher, Clinckemaillie, Sarah, Chang, Yifan, Waltz, Jonathan, Lau, Gabrielle, Fair, Marybeth, Li, Alice, Bishop, William, Li, Wei, Campbell-Ajala, Folawiyo, others. (2024). AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

[xu2024crab] Xu, Tianqi, Chen, Linyao, Wu, Dai-Jie, Chen, Yanjun, Zhang, Zecheng, Yao, Xiang, Xie, Zhiqiang, Chen, Yongchao, Liu, Shilong, Qian, Bochen, others. (2024). Crab: Cross-environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511.

[ma2024m] Ma, Zixian, Huang, Weikai, Zhang, Jieyu, Gupta, Tanmay, Krishna, Ranjay. (2024). m&m's: A Benchmark to Evaluate Tool-Use for Multi-step Multi-modal Tasks. European Conference on Computer Vision.

[wang2024mobileagentbench] Wang, Luyuan, Deng, Yongyu, Zha, Yiwei, Mao, Guodong, Wang, Qinmin, Min, Tianchen, Chen, Wei, Chen, Shoufa. (2024). MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. arXiv preprint arXiv:2406.08184.

[kapoor2024omniact] Kapoor, Raghav, Butala, Yash Parag, Russak, Melisa, Koh, Jing Yu, Kamble, Kiran, AlShikh, Waseem, Salakhutdinov, Ruslan. (2024). Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. European Conference on Computer Vision.

[xie2024osworld] Xie, Tianbao, Zhang, Danyang, Chen, Jixuan, Li, Xiaochuan, Zhao, Siheng, Cao, Ruisheng, Hua, Toh J, Cheng, Zhoujun, Shin, Dongchan, Lei, Fangyu, others. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems.

[dagan2024plancraft] Dagan, Gautier, Keller, Frank, Lascarides, Alex. (2024). Plancraft: an evaluation dataset for planning with LLM agents. arXiv preprint arXiv:2412.21033.

[bonatti2024windows] Bonatti, Rogerio, Zhao, Dan, Bonacci, Francesco, Dupont, Dillon, Abdali, Sara, Li, Yinheng, Lu, Yadong, Wagle, Justin, Koishida, Kazuhito, Bucker, Arthur, others. (2024). Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264.

[jimenez2023swe] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. The Twelfth International Conference on Learning Representations.

[styles2024workbench] Styles, Olly, Miller, Sam, Cerda-Mardini, Patricio, Guha, Tanaya, Sanchez, Victor, Vidgen, Bertie. (2024). Workbench: a benchmark dataset for agents in a realistic workplace setting. arXiv preprint arXiv:2405.00823.

[xu2024theagentcompany] Xu, Frank F, Song, Yufan, Li, Boxuan, Tang, Yuxuan, Jain, Kritanjali, Bao, Mengxue, Wang, Zora Z, Zhou, Xuhui, Guo, Zhitong, Cao, Murong, others. (2024). TheAgentCompany: Benchmarking LLM agents on consequential real-world tasks. arXiv preprint arXiv:2412.14161.

[schmidgall2024agentclinic] Schmidgall, Samuel, Ziaei, Rojin, Harris, Carl, Reis, Eduardo, Jopling, Jeffrey, Moor, Michael. (2024). AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments. arXiv preprint arXiv:2405.07960.

[zhang2024benchmarking] Zhang, Yuge, Jiang, Qiyang, Han, Xingyu, Chen, Nan, Yang, Yuqing, Ren, Kan. (2024). Benchmarking data science agents. arXiv preprint arXiv:2402.17168.

[zhang2025datascibench] Zhang, Dan, Zhoubian, Sining, Cai, Min, Li, Fengzu, Yang, Lekang, Wang, Wei, Dong, Tianjiao, Hu, Ziniu, Tang, Jie, Yue, Yisong. (2025). DataSciBench: An LLM Agent Benchmark for Data Science. arXiv preprint arXiv:2502.13897.

[jin2025elt] Jin, Tengjun, Zhu, Yuxuan, Kang, Daniel. (2025). ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines. arXiv preprint arXiv:2504.04808.

[hu2024infiagent] Hu, Xueyu, Zhao, Ziyu, Wei, Shuang, Chai, Ziwei, Ma, Qianli, Wang, Guoyin, Wang, Xuwu, Su, Jing, Xu, Jingjing, Zhu, Ming, others. (2024). InfiAgent-DABench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507.

[chan2024mle] Chan, Jun Shern, Chowdhury, Neil, Jaffe, Oliver, Aung, James, Sherburn, Dane, Mays, Evan, Starace, Giulio, Liu, Kevin, Maksin, Leon, Patwardhan, Tejal, others. (2024). MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.

[nathani2025mlgym] Nathani, Deepak, Madaan, Lovish, Roberts, Nicholas, Bashlykov, Nikolay, Menon, Ajay, Moens, Vincent, Budhiraja, Amar, Magka, Despoina, Vorotilov, Vladislav, Chaurasia, Gaurav, others. (2025). MLGym: A New Framework and Benchmark for Advancing AI Research Agents. arXiv preprint arXiv:2502.14499.

[zhou2025simplestrongbaselinelongterm] Sizhe Zhou, Jiawei Han. (2025). A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents.

[bogin2024super] Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot. (2024). SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[feng2024naturallanguagereinforcementlearning] Xidong Feng, Ziyu Wan, Mengyue Yang, Ziyan Wang, Girish A. Koushik, Yali Du, Ying Wen, Jun Wang. (2024). Natural Language Reinforcement Learning.

[ge2023openagi] Ge, Yingqiang, Hua, Wenyue, Mei, Kai, Tan, Juntao, Xu, Shuyuan, Li, Zelong, Zhang, Yongfeng, others. (2023). OpenAGI: When LLM Meets Domain Experts. Advances in Neural Information Processing Systems.

[shen2024taskbench] Shen, Yongliang, Song, Kaitao, Tan, Xu, Zhang, Wenqi, Ren, Kan, Yuan, Siyu, Lu, Weiming, Li, Dongsheng, Zhuang, Yueting. (2024). Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems.

[xie2024travelplanner] Xie, Jian, Zhang, Kai, Chen, Jiangjie, Zhu, Tinghui, Lou, Renze, Tian, Yuandong, Xiao, Yanghua, Su, Yu. (2024). Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622.

[lin2024wildbench] Lin, Bill Yuchen, Deng, Yuntian, Chandu, Khyathi, Brahman, Faeze, Ravichander, Abhilasha, Pyatkin, Valentina, Dziri, Nouha, Bras, Ronan Le, Choi, Yejin. (2024). Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770.

[andriushchenko2024agentharm] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, Xander Davies. (2025). AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. The Thirteenth International Conference on Learning Representations.

[yuan2024r] Yuan, Tongxin, He, Zhiwei, Dong, Lingzhong, Wang, Yiming, Zhao, Ruijie, Xia, Tian, Xu, Lizhen, Zhou, Binglin, Li, Fangqi, Zhang, Zhuosheng, others. (2024). R-judge: Benchmarking safety risk awareness for llm agents. arXiv preprint arXiv:2401.10019.

[guo2024redcode] Guo, Chengquan, Liu, Xun, Xie, Chulin, Zhou, Andy, Zeng, Yi, Lin, Zinan, Song, Dawn, Li, Bo. (2024). RedCode: Risky code execution and generation benchmark for code agents. Advances in Neural Information Processing Systems.

[zhuge2024agent] Zhuge, Mingchen, Zhao, Changsheng, Ashley, Dylan, Wang, Wenyi, Khizbullin, Dmitrii, Xiong, Yunyang, Liu, Zechun, Chang, Ernie, Krishnamoorthi, Raghuraman, Tian, Yuandong, others. (2024). Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.

[yehudai2025survey] Yehudai, Asaf, Eden, Lilach, Li, Alan, Uziel, Guy, Zhao, Yilun, Bar-Haim, Roy, Cohan, Arman, Shmueli-Scheuer, Michal. (2025). Survey on Evaluation of LLM-based Agents. arXiv preprint arXiv:2503.16416.

[gu2024survey] Gu, Jiawei, Jiang, Xuhui, Shi, Zhichao, Tan, Hexiang, Zhai, Xuehao, Xu, Chengjin, Li, Wei, Shen, Yinghan, Ma, Shengjie, Liu, Honghao, others. (2024). A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594.

[zheng2023judging] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric, others. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.

[li2025accuracy] Li, Zihao, Yi, Weiwei, Chen, Jiahong. (2025). Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI. arXiv preprint.

[gao2025llm] Gao, Mingqi, Hu, Xinyu, Yin, Xunjian, Ruan, Jie, Pu, Xiao, Wan, Xiaojun. (2025). Llm-based nlg evaluation: Current status and challenges. Computational Linguistics.

[kim2024prometheus] Kim, Seungone, Suk, Juyoung, Longpre, Shayne, Lin, Bill Yuchen, Shin, Jamin, Welleck, Sean, Neubig, Graham, Lee, Moontae, Lee, Kyungjae, Seo, Minjoon. (2024). Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.

[wu2025humanmemoryaimemory] Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, Yong Liu. (2025). From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs.

[wang2025mirixmultiagentmemoryllmbased] Wang, Yu, Chen, Xi. (2025). MIRIX: Multi-Agent Memory System for LLM-Based Agents. arXiv preprint arXiv:2507.07957.

[kang2025memoryosaiagent] Jiazheng Kang, Mingming Ji, Zhe Zhao, Ting Bai. (2025). Memory OS of AI Agent.

[hu2025evaluatingmemoryllmagents] Yuanzhe Hu, Yu Wang, Julian McAuley. (2025). Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions.

[lu2025agentrewardbench] Lù, Xing Han, others. (2025). AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories. arXiv preprint arXiv:2504.08942.

[zhang2025agentorchestrahierarchicalmultiagentframework] Zhang, Wentao, Cui, Ce, Zhao, Yilei, Hu, Rui, Liu, Yang, Zhou, Yahui, An, Bo. (2025). Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving. arXiv preprint arXiv:2506.12508.

[sapkota2025aiagentsvsagentic] Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee. (2025). AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges.

[qiang2025agentomleveragingllmagents] Zhangcheng Qiang, Weiqing Wang, Kerry Taylor. (2025). Agent-OM: Leveraging LLM Agents for Ontology Matching.

[lu2023memochattuningllmsuse] Lu, Junru, An, Siyu, Lin, Mingbao, Pergola, Gabriele, He, Yulan, Yin, Di, Sun, Xing, Wu, Yunsheng. (2023). MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation. arXiv preprint arXiv:2308.08239.

[anokhin2024arigraph] Anokhin, Petr, Semenov, Nikita, Sorokin, Artyom, Evseev, Dmitry, Kravchenko, Andrey, Burtsev, Mikhail, Burnaev, Evgeny. (2024). AriGraph: Learning knowledge graph world models with episodic memory for LLM agents. arXiv preprint arXiv:2407.04363.

[ong2025lifelongdialogueagentstimelinebased] Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo. (2025). Towards Lifelong Dialogue Agents via Timeline-based Memory Management.

[zhao2023narrativeplayinteractivenarrativeunderstanding] Runcong Zhao, Wenjia Zhang, Jiazheng Li, Lixing Zhu, Yanran Li, Yulan He, Lin Gui. (2023). NarrativePlay: Interactive Narrative Understanding.

[kim2024commonsenseaugmentedmemoryconstructionmanagement] Hana Kim, Kai Tzu-iunn Ong, Seoyeon Kim, Dongha Lee, Jinyoung Yeo. (2024). Commonsense-augmented Memory Construction and Management in Long-term Conversations via Context-aware Persona Refinement.

[du2025rethinkingmemoryaitaxonomy] Du, Yiming, Huang, Wenyu, Zheng, Danna, Wang, Zhaowei, Montella, Sebastien, Lapata, Mirella, Wong, Kam-Fai, Pan, Jeff Z. (2025). Rethinking memory in ai: Taxonomy, operations, topics, and future directions. arXiv preprint arXiv:2505.00675.

[zou2025surveylargelanguagemodel] Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu. (2025). A Survey on Large Language Model based Human-Agent Systems.

[zhang2024chain] Zhang, Yusen, Sun, Ruoxi, Chen, Yanfei, Pfister, Tomas, Zhang, Rui, Arik, Sercan. (2024). Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems.

[li2024survey] Li, Xinyi, Wang, Sai, Zeng, Siqi, Wu, Yu, Yang, Yi. (2024). A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth.

[sulis2023survey] Sulis, Emilio, Mariani, Stefano, Montagna, Sara. (2023). A survey on agents applications in healthcare: Opportunities, challenges and trends. Computer Methods and Programs in Biomedicine.

[belle2025agents] Belle, Nikolas, Barnes, Dakota, Amayuelas, Alfonso, Bercovich, Ivan, Wang, Xin Eric, Wang, William. (2025). Agents of Change: Self-Evolving LLM Agents for Strategic Planning. arXiv preprint arXiv:2506.04651.

[chen2025surveyllmbasedmultiagentsystem] Shuaihang Chen, Yuanxing Liu, Wei Han, Weinan Zhang, Ting Liu. (2025). A Survey on LLM-based Multi-Agent System: Recent Advances and New Frontiers in Application.

[efeoglu2024retrieval] Efeoglu, Sefika, Paschke, Adrian. (2024). Retrieval-augmented generation-based relation extraction. arXiv preprint arXiv:2404.13397.

[hua2025multi] Hua, Min, Qi, Xinda, Chen, Dong, Jiang, Kun, Liu, Zemin Eitan, Sun, Hongyu, Zhou, Quan, Xu, Hongming. (2025). Multi-agent reinforcement learning for connected and automated vehicles control: Recent advancements and future prospects. IEEE Transactions on Automation Science and Engineering.

[cao_safelawbench_2025] Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Josef Dai, Yaodong Yang, Sirui Han, Yike Guo. (2025). SafeLawBench. Findings of the Association for Computational Linguistics.

[pu2025piflow] Pu, Yingming, Lin, Tao, Chen, Hongyu. (2025). PiFlow. arXiv preprint arXiv:2505.15047.

[wang2025recursivelysummarizingenableslongterm] Wang, Qingyue, Fu, Yanhe, Cao, Yanan, Wang, Shuai, Tian, Zhiliang, Ding, Liang. (2025). Recursively summarizing enables long-term dialogue memory in large language models. Neurocomputing.

[huang2023transformerpatchermistakeworthneuron] Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, Zhang Xiong. (2023). Transformer-Patcher: One Mistake worth One Neuron.

[fang2025alphaeditnullspaceconstrainedknowledge] Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, Tat-seng Chua. (2025). AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models.

[githubGitHubMemodbioAcontext] Acontext. (2025). GitHub repository: memodb-io/Acontext.

[li2024genaibenchevaluatingimprovingcompositional] Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan. (2024). GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation.

[wang2024machineunlearningmeetsretrievalaugmented] Wang, Shang, Zhu, Tianqing, Ye, Dayong, Zhou, Wanlei. (2024). When machine unlearning meets retrieval-augmented generation (RAG). arXiv preprint arXiv:2410.15267.

[li2025webthinker] Li, Xiaoxi, Jin, Jiajie, Dong, Guanting, Qian, Hongjin, Zhu, Yutao, Wu, Yongkang, Wen, Ji-Rong, Dou, Zhicheng. (2025). WebThinker: Empowering Large Reasoning Models with Deep Research Capability. arXiv preprint arXiv:2504.21776.

[wang2023multitask] Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, Yoon Kim. (2023). Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning. The Eleventh International Conference on Learning Representations.

[alizadeh2024llmflashefficientlarge] Keivan Alizadeh, Seyed Iman Mirzadeh, others. (2024). LLM in a Flash: Efficient Large Language Model Inference with Limited Memory. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[10756271] Zou, Huhai, Li, Rongzhen, Sun, Tianhao, Wang, Fei, Li, Tao, Liu, Kai. (2024). Cooperative Scheduling and Hierarchical Memory Model for Multi-Agent Systems. 2024 IEEE International Symposium on Product Compliance Engineering - Asia (ISPCE-ASIA). doi:10.1109/ISPCE-ASIA64773.2024.10756271.

[gao2025synergizingragreasoningsystematic] Gao, Yunfan, Xiong, Yun, Zhong, Yijie, Bi, Yuxi, Xue, Ming, Wang, Haofen. (2025). Synergizing RAG and Reasoning: A Systematic Review. arXiv preprint arXiv:2504.15909.

[zhu2023ghostminecraftgenerallycapable] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai. (2023). Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory.

[jin2025disentanglingmemoryreasoningability] Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang. (2025). Disentangling Memory and Reasoning Ability in Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[chen2025improvingfactualityexplicitworking] Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Yi Sun, Luke Zettlemoyer, Gargi Ghosh, Wen-tau Yih. (2025). Improving Factuality with Explicit Working Memory. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[jin2025comprehensivesurveymultiagentcooperative] Weiqiang Jin, Hongyang Du, Biao Zhao, Xingwu Tian, Bohang Shi, Guang Yang. (2025). A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives.

[wang2025lifespancognitivesystems] Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, Nafis Sadeq, Xiusi Chen, Zexue He, Wei Wang, Gholamreza Haffari, Heng Ji, Julian McAuley. (2025). Towards LifeSpan Cognitive Systems.

[edit-fact] Nicola De Cao, Wilker Aziz, Ivan Titov. (2021). Editing Factual Knowledge in Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

[lee2024humaninspiredreadingagentgist] Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, Ian Fischer. (2024). A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts. Forty-first International Conference on Machine Learning.

[liu2025diving] Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He. (2025). Diving into Self-Evolving Training for Multimodal Reasoning. Forty-second International Conference on Machine Learning.

[DBLP:conf/nips/ROME] Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. (2022). Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

[DBLP:conf/iclr/MEMIT] Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, David Bau. (2023). Mass-Editing Memory in a Transformer. The Eleventh International Conference on Learning Representations.

[liu2025spiral] Liu, Bo, Guertler, Leon, Yu, Simon, Liu, Zichen, Qi, Penghui, Balcells, Daniel, Liu, Mickel, Tan, Cheston, Shi, Weiyan, Lin, Min, others. (2025). SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2506.24119.

[DBLP:journals/corr/CLoRA] Shishir Muralidhara, Didier Stricker, René Schuster. (2025). CLoRA: Parameter-Efficient Continual Learning with Low-Rank Adaptation. CoRR. doi:10.48550/ARXIV.2507.19887.

[DBLP:journals/corr/color] Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella. (2023). Continual Learning with Low Rank Adaptation. CoRR. doi:10.48550/ARXIV.2311.17601.

[DBLP:conf/ecir/visconde] Jayr Alencar Pereira, Robson do Nascimento Fidalgo, Roberto A. Lotufo, Rodrigo Nogueira. (2023). Visconde: Multi-document QA with GPT-3 and Neural Reranking. Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023. doi:10.1007/978-3-031-28238-6_44.

[long2025seeing] Long, Lin, He, Yichen, Ye, Wentao, Pan, Yiyuan, Lin, Yuan, Li, Hang, Zhao, Junbo, Li, Wei. (2025). Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory. arXiv preprint arXiv:2508.09736.

[yan2025memory] Yan, Sikuan, Yang, Xiufeng, Huang, Zuchao, Nie, Ercong, Ding, Zifeng, Li, Zonggen, Ma, Xiaowen, Schütze, Hinrich, others. (2025). Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv preprint arXiv:2508.19828.

[diao2025temporalworkingmemoryqueryguided] Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui. (2025). Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding.

[li2023motmemoryofthoughtenableschatgpt] Xiaonan Li, Xipeng Qiu. (2023). MoT: Memory-of-Thought Enables ChatGPT to Self-Improve. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[gao2024clova] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li. (2024). CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[li2025cort] Li, Chengpeng, Tang, Zhengyang, Li, Ziniu, Xue, Mingfeng, Bao, Keqin, Ding, Tian, Sun, Ruoyu, Wang, Benyou, Wang, Xiang, Lin, Junyang, others. (2025). CoRT: Code-integrated Reasoning within Thinking. arXiv preprint arXiv:2506.09820.

[li2025start] Li, Chengpeng, Xue, Mingfeng, Zhang, Zhenru, Yang, Jiaxi, Zhang, Beichen, Wang, Xiang, Yu, Bowen, Hui, Binyuan, Lin, Junyang, Liu, Dayiheng. (2025). START: Self-taught Reasoner with Tools. arXiv preprint arXiv:2503.04625.

[li2024structragboostingknowledgeintensive] Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, Yongbin Li. (2025). StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization. The Thirteenth International Conference on Learning Representations.

[murre2015replication] Murre, Jaap MJ, Dros, Joeri. (2015). Replication and analysis of Ebbinghaus' forgetting curve. PLoS ONE.

[safaya-yuret-2024-neurocache] Safaya, Ali, Yuret, Deniz. (2024). Neurocache: Efficient Vector Retrieval for Long-range Language Modeling. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[yuen2025intrinsicmemoryagentsheterogeneous] Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey. (2025). Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory.

[cao2023awesomegpumemoryconstrainedlong] Shuyang Cao, Lu Wang. (2024). AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[gutiérrez2025hipporagneurobiologicallyinspiredlongterm] Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, Yu Su. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems.

[trivedi2022musiquemultihopquestionssinglehop] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. (2022). MuSiQue: Multihop Questions via Single-hop Question Composition.

[ho2020constructingmultihopqadataset] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa. (2020). Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps.

[yang2018hotpotqadatasetdiverseexplainable] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.

[gutiérrez2025ragmemorynonparametriccontinual] Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su. (2025). From RAG to Memory: Non-Parametric Continual Learning for Large Language Models.

[guo2024empoweringworkingmemorylarge] Jing Guo, Nan Li, Jianchuan Qi, Hang Yang, Ruiqiao Li, Yuzhen Feng, Si Zhang, Ming Xu. (2024). Empowering Working Memory for Large Language Model Agents.

[li2024graphreaderbuildinggraphbasedagent] Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, Wenbo Su, Bo Zheng. (2024). GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024.

[hu2023chatdbaugmentingllmsdatabases] Hu, Chenxu, Fu, Jie, Du, Chenzhuang, Luo, Simian, Zhao, Junbo, Zhao, Hang. (2023). ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory. arXiv preprint arXiv:2306.03901.

[wang2024symbolicworkingmemoryenhances] Wang, Siyuan, Wei, Zhongyu, Choi, Yejin, Ren, Xiang. (2024). Symbolic Working Memory Enhances Language Models for Complex Rule Application. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[lee-etal-2024-matter] Dongkyu Lee, Chandana Satya Prakash, Jack FitzGerald, Jens Lehmann. (2024). MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources. Findings of the Association for Computational Linguistics.

[Hou_2024] Wang, Boshi, Fang, Hao, Eisner, Jason, Van Durme, Benjamin, Su, Yu. (2024). LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[chenlearning] Chen, Guoxin, Zhang, Zhong, Cong, Xin, Guo, Fangda, Wu, Yesai, Lin, Yankai, Feng, Wenzheng, Wang, Yasheng. (2025). Learning Evolving Tools for Large Language Models. The Thirteenth International Conference on Learning Representations.

[gao2024confucius] Gao, Shen, Shi, Zhengliang, Zhu, Minghang, Fang, Bowen, Xin, Xin, Ren, Pengjie, Chen, Zhumin, Ma, Jun, Ren, Zhaochun. (2024). Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum. Proceedings of the AAAI Conference on Artificial Intelligence.

[guo2024stabletoolbench] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu. (2024). StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. Findings of the Association for Computational Linguistics.

[chen2025facilitating] Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen. (2025). Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning. The Thirteenth International Conference on Learning Representations.

[yin2025magnet] Yin, Fan, Wang, Zifeng, Hsu, I, Yan, Jun, Jiang, Ke, Chen, Yanfei, Gu, Jindong, Le, Long T, Chang, Kai-Wei, Lee, Chen-Yu, others. (2025). Magnet: Multi-turn tool-use data synthesis and distillation via graph translation. arXiv preprint arXiv:2503.07826.

[prabhakar2025apigen] Prabhakar, Akshara, Liu, Zuxin, Zhu, Ming, Zhang, Jianguo, Awalgaonkar, Tulika, Wang, Shiyu, Liu, Zhiwei, Chen, Haolin, Hoang, Thai, Niebles, Juan Carlos, others. (2025). APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay. arXiv preprint arXiv:2504.03601.

[yao2025taubench] Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik R Narasimhan. (2025). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. The Thirteenth International Conference on Learning Representations.

[feng2025retool] Feng, Jiazhan, Huang, Shijue, Qu, Xingwei, Zhang, Ge, Qin, Yujia, Zhong, Baoquan, Jiang, Chengquan, Chi, Jinxin, Zhong, Wanjun. (2025). ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint arXiv:2504.11536.

[zhang2025nemotron] Zhang, Shaokun, Dong, Yi, Zhang, Jieyu, Kautz, Jan, Catanzaro, Bryan, Tao, Andrew, Wu, Qingyun, Yu, Zhiding, Liu, Guilin. (2025). Nemotron-research-tool-n1: Tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024.

[qian2025toolrl] Qian, Cheng, Acikgoz, Emre Can, He, Qi, Wang, Hongru, Chen, Xiusi, Hakkani-Tür, Dilek, others. (2025). ToolRL: Reward is All Tool Learning Needs. arXiv preprint arXiv:2504.13958.

[goldie2025synthetic] Goldie, Anna, Mirhoseini, Azalia, Zhou, Hao, Cai, Irene, Manning, Christopher D. (2025). Synthetic Data Generation & Multi-step RL for Reasoning & Tool Use. arXiv preprint arXiv:2504.04736.

[yuan-etal-2025-easytool] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, Deqing Yang. (2025). EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies.

[qu2025from] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen. (2025). From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions. The Thirteenth International Conference on Learning Representations.

[fang2025play2prompt] Fang, Wei, Zhang, Yang, Qian, Kaizhi, Glass, James, Zhu, Yada. (2025). PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play. arXiv preprint arXiv:2503.14432.

[cai2024large] Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou. (2024). Large Language Models as Tool Makers. The Twelfth International Conference on Learning Representations.

[zhang2024offline] Zhang, Shaokun, Zhang, Jieyu, Liu, Jiale, Song, Linxin, Wang, Chi, Krishna, Ranjay, Wu, Qingyun. (2024). Offline training of language model agents with functions as learnable weights. Forty-first International Conference on Machine Learning.

[chen2024autoagentsframeworkautomaticagent] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, Yemin Shi. (2024). AutoAgents: A Framework for Automatic Agent Generation. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence.

[DBLP:journals/corr/abs-2312-13382] Singhvi, Arnav, Shetty, Manish, Tan, Shangyin, Potts, Christopher, Sen, Koushik, Zaharia, Matei, Khattab, Omar. (2023). DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv preprint arXiv:2312.13382.

[liu2023dynamic] Liu, Zijun, Zhang, Yanzhe, Li, Peng, Liu, Yang, Yang, Diyi. (2023). Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization. arXiv preprint arXiv:2310.02170.

[zhang2025gdesignerarchitectingmultiagentcommunication] Zhang, Guibin, Yue, Yanwei, Sun, Xiangguo, Wan, Guancheng, Yu, Miao, Fang, Junfeng, Wang, Kun, Chen, Tianlong, Cheng, Dawei. (2024). G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks. arXiv preprint arXiv:2410.11782.

[wang2024executablecodeactionselicit] Wang, Xingyao, Chen, Yangyi, Yuan, Lifan, Zhang, Yizhe, Li, Yunzhu, Peng, Hao, Ji, Heng. (2024). Executable Code Actions Elicit Better LLM Agents. Forty-first International Conference on Machine Learning.

[wang2024agentworkflowmemory] Wang, Zora Zhiruo, Mao, Jiayuan, Fried, Daniel, Neubig, Graham. (2024). Agent Workflow Memory. Forty-second International Conference on Machine Learning.

[song2025adaptiveinconversationteambuilding] Song, Linxin, Liu, Jiale, Zhang, Jieyu, Zhang, Shaokun, Luo, Ao, Wang, Shijian, Wu, Qingyun, Wang, Chi. (2024). Adaptive in-conversation team building for language model agents. arXiv preprint arXiv:2405.19425.

[wang2025gsafeguardtopologyguidedsecuritylens] Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang. (2025). G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[li2024autoflowautomatedworkflowgeneration] Li, Zelong, Xu, Shuyuan, Mei, Kai, Hua, Wenyue, Rama, Balaji, Raheja, Om, Wang, Hao, Zhu, He, Zhang, Yongfeng. (2024). AutoFlow: Automated Workflow Generation for Large Language Model Agents. arXiv preprint arXiv:2407.12821.

[zhang2025aflow] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu. (2025). AFlow: Automating Agentic Workflow Generation. The Thirteenth International Conference on Learning Representations.

[wang2025scoreflow] Wang, Yinjie, Yang, Ling, Li, Guohao, Wang, Mengdi, Aragam, Bryon. (2025). ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization. arXiv preprint arXiv:2502.04306.

[ye2025masgpt] Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao. (2025). MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems. Forty-second International Conference on Machine Learning.

[zhuge2024gptswarm] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber. (2024). GPTSwarm: Language Agents as Optimizable Graphs. Forty-first International Conference on Machine Learning.

[zheng2025mermaidflowredefiningagenticworkflow] Zheng, Chengqi, Chen, Jianda, Lyu, Yueming, Ng, Wen Zheng Terence, Zhang, Haopeng, Ong, Yew-Soon, Tsang, Ivor, Yin, Haiyan. (2025). MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming. arXiv preprint arXiv:2505.22967.

[bilodeau2022generative] Bilodeau, Camille, Jin, Wengong, Jaakkola, Tommi, Barzilay, Regina, Jensen, Klavs F. (2022). Generative models for molecular discovery: Recent advances and challenges. Wiley Interdisciplinary Reviews: Computational Molecular Science.

[makke2024interpretable] Makke, Nour, Chawla, Sanjay. (2024). Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review.

[zhugegptswarm] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber. (2024). GPTSwarm: Language Agents as Optimizable Graphs. Forty-first International Conference on Machine Learning.

[hu2025automated] Shengran Hu, Cong Lu, Jeff Clune. (2025). Automated Design of Agentic Systems. The Thirteenth International Conference on Learning Representations.

[yuan-etal-2025-evoagent] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, Deqing Yang. (2025). $\text{EvoAgent. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies.

[liu2025learn] Liu, Wei, Zhou, Ruochen, Deng, Yiyun, Huang, Yuzhen, Liu, Junteng, Deng, Yuntian, Zhang, Yizhe, He, Junxian. (2025). Learn to reason efficiently with adaptive length-based reward shaping. arXiv preprint arXiv:2505.15612.

[gao2025flowreasonerreinforcingquerylevelmetaagents] Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang. (2025). FlowReasoner: Reinforcing Query-Level Meta-Agents.

[xia2025experience] Xia, Siyu, Xu, Zekun, Chai, Jiajun, Fan, Wentian, Song, Yan, Wang, Xiaohan, Yin, Guojun, Lin, Wei, Zhang, Haifeng, Wang, Jun. (2025). From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory. arXiv preprint arXiv:2511.07800.

[chen2025towards] Chen, Qiguang, Qin, Libo, Liu, Jinhao, Peng, Dengyun, Guan, Jiannan, Wang, Peng, Hu, Mengkang, Zhou, Yuhang, Gao, Te, Che, Wanxiang. (2025). Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.

[sun2025seagent] Sun, Zeyi, Liu, Ziyu, Zang, Yuhang, Cao, Yuhang, Dong, Xiaoyi, Wu, Tong, Lin, Dahua, Wang, Jiaqi. (2025). SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience. arXiv preprint arXiv:2508.04700.

[talebirad2023multi] Talebirad, Yashar, Nadiri, Amirhossein. (2023). Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314.

[su2025debflowautomatingagentcreation] Su, Jinwei, Xia, Yinghui, Shi, Ronghua, Wang, Jianhui, Huang, Jianuo, Wang, Yijin, Shi, Tianyu, Yang, Jingsong, He, Lewei. (2025). DebFlow: Automating Agent Creation via Agent Debate. ICML 2025 Workshop on Collaborative and Federated Agentic Workflows.

[ye2025masgpttrainingllmsbuild] Ye, Rui, Tang, Shuo, Ge, Rui, Du, Yaxin, Yin, Zhenfei, Chen, Siheng, Shao, Jing. (2025). MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems. arXiv preprint arXiv:2503.03686.

[ke2025maszerodesigningmultiagentsystems] Ke, Zixuan, Xu, Austin, Ming, Yifei, Nguyen, Xuan-Phi, Xiong, Caiming, Joty, Shafiq. (2025). MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision. arXiv preprint arXiv:2505.14996.

[ma2025agenticneuralnetworksselfevolving] Ma, Xiaowen, Lin, Chenyang, Zhang, Yao, Tresp, Volker, Ma, Yunpu. (2025). Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation. arXiv preprint arXiv:2506.09046.

[li2023camel] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. Thirty-seventh Conference on Neural Information Processing Systems.

[tran2025multi] Tran, Khanh-Tung, Dao, Dung, Nguyen, Minh-Duong, Pham, Quoc-Viet, O'Sullivan, Barry, Nguyen, Hoang D. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv preprint arXiv:2501.06322.

[krishnan2025advancingmultiagentsystemsmodel] Naveen Krishnan. (2025). Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications.

[zhang2025optimizingsequentialmultisteptasks] Zhang, Enhao, Zhu, Erkang, Bansal, Gagan, Fourney, Adam, Mozannar, Hussein, Gerrits, Jack. (2025). Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents. arXiv preprint arXiv:2507.08944.

[guo2024large] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang. (2024). Large Language Model Based Multi-agents: A Survey of Progress and Challenges. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence.

[li2025parallelizedplanningactingefficientllmbased] Li, Yaoru, Liu, Shunyu, Zheng, Tongya, Song, Mingli. (2025). Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems. arXiv preprint arXiv:2503.03505.

[wu2025joint] Bin Wu, Edgar Meij, Emine Yilmaz. (2025). A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents. Findings of the Association for Computational Linguistics: ACL 2025.

[dubey2024llama] Grattafiori, Aaron, Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Vaughan, Alex, others. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[yang2025qwen3] Yang, An, Li, Anfeng, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Gao, Chang, Huang, Chengen, Lv, Chenxu, others. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[zhao2023survey] Zhao, Wayne Xin, Zhou, Kun, Li, Junyi, Tang, Tianyi, Wang, Xiaolei, Hou, Yupeng, Min, Yingqian, Zhang, Beichen, Zhang, Junjie, Dong, Zican, others. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.

[xi2025the] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Qi Zhang, Tao Gui. (2025). The rise and potential of large language model based agents: a survey. Science China Information Sciences.

[jiang2024survey] Jiang, Juyong, Wang, Fan, Shen, Jiasi, Kim, Sungju, Kim, Sunghun. (2024). A survey on large language models for code generation. arXiv preprint arXiv:2406.00515.

[lu2024ai] Lu, Chris, Lu, Cong, Lange, Robert Tjarko, Foerster, Jakob, Clune, Jeff, Ha, David. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv preprint arXiv:2408.06292.

[lai2024autowebglm] Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang. (2024). AutoWebGLM: A Large Language Model-based Web Navigating Agent. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[kim2024mdagents] Kim, Yubin, Park, Chanwoo, Jeong, Hyewon, Chan, Yik S, Xu, Xuhai, McDuff, Daniel, Lee, Hyeonhoon, Ghassemi, Marzyeh, Breazeal, Cynthia, Park, Hae W. (2024). MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making. Advances in Neural Information Processing Systems.

[tian2025template] Tian, Yong, others. (2025). Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[yao2024lawyer] Yao, Shunyu, Ke, Qingqing, Wang, Qiwei, Li, Kangtong, Hu, Jie. (2024). Lawyer GPT. Proceedings of the 2024 3rd International Symposium on Robotics, Artificial Intelligence and Information Engineering.

[wang2023self] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.

[yao2023tree] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Tom, Cao, Yuan, Narasimhan, Karthik. (2023). Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems.

[sun2024query] Hao Sun, Alihan Hüyük, Mihaela van der Schaar. (2024). Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL. The Twelfth International Conference on Learning Representations.

[zhou2025multi] Zhou, Han, Wan, Xingchen, Sun, Ruoxi, Palangi, Hamid, Iqbal, Shariq, Vulić, Ivan, others. (2025). Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies. arXiv preprint arXiv:2502.02533.

[li2025mm] Li, Shilong, Bu, Xingyuan, Wang, Wenjie, Liu, Jiaheng, Dong, Jun, He, Haoyang, Lu, Hao, Zhang, Haozhe, Jing, Chenchen, Li, Zhen, others. (2025). MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents. arXiv preprint arXiv:2508.13186.

[huang2024understanding] Huang, Xu, Liu, Weiwen, Chen, Xiaolong, Wang, Xingmei, Wang, Hao, Lian, Defu, Wang, Yasheng, Tang, Ruiming, Chen, Enhong. (2024). Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716.

[hu2025agents] Hu, Xueyu, Xiong, Tao, Yi, Biao, Wei, Zishu, Xiao, Ruixuan, Chen, Yurun, Ye, Jiasheng, Tao, Meiling, Zhou, Xiangxin, Zhao, Ziyu, others. (2025). OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[wei2022chain] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.

[browser_use2024] Müller, Magnus, Žunič, Gregor. (2024). Browser Use: Enable AI to control your browser.

[dilara2024fine] Dilara Soylu, Christopher Potts, Omar Khattab. (2024). Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[huang2025resiliencellmbasedmultiagentcollaboration] Huang, Jen-tse, Zhou, Jiaxu, Jin, Tailin, Zhou, Xuhui, Chen, Zixi, Wang, Wenxuan, Yuan, Youliang, Lyu, Michael R, Sap, Maarten. (2024). On the resilience of llm-based multi-agent collaboration with faulty agents. arXiv preprint arXiv:2408.00989.

[han2025llmmultiagentsystemschallenges] Han, Shanshan, Zhang, Qifan, Yao, Yuhang, Jin, Weizhao, Xu, Zhaozhuo. (2024). LLM Multi-Agent Systems: Challenges and Open Problems. arXiv preprint arXiv:2402.03578.

[chen2024internetagentsweavingweb] Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun. (2025). Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence. The Thirteenth International Conference on Learning Representations.

[chevalierboisvert2019babyaiplatformstudysample] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, Yoshua Bengio. (2019). BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning.

[xu2025sedmscalableselfevolvingdistributed] Haoran Xu, Jiacong Hu, Ke Zhang, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, Bill Shi. (2025). SEDM: Scalable Self-Evolving Distributed Memory for Agents.

[lin2025creativityllmbasedmultiagentsystems] Lin, Yi-Cheng, Chen, Kang-Chieh, Li, Zhe-Yan, Wu, Tzu-Heng, Wu, Tzu-Hsuan, Chen, Kuan-Yu, Lee, Hung-yi, Chen, Yun-Nung. (2025). Creativity in LLM-based Multi-Agent Systems: A Survey. arXiv preprint arXiv:2505.21116.

[qian2024chatdevcommunicativeagentssoftware] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun. (2024). ChatDev: Communicative Agents for Software Development. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[hou2025halohierarchicalautonomouslogicoriented] Hou, Zhipeng, Tang, Junyi, Wang, Yipeng. (2025). HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems. arXiv preprint arXiv:2505.13516.

[li2025agenthospitalsimulacrumhospital] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, Yang Liu. (2025). Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents.

[zheng2023chatgpt] Zheng, Zhiling, Zhang, Oufan, Nguyen, Ha L, Rampal, Nakul, Alawadhi, Ali H, Rong, Zichao, Head-Gordon, Teresa, Borgs, Christian, Chayes, Jennifer T, Yaghi, Omar M. (2023). ChatGPT research group for optimizing the crystallinity of MOFs and COFs. ACS Central Science.

[fourney2024magenticonegeneralistmultiagentsolving] Fourney, Adam, Bansal, Gagan, Mozannar, Hussein, Tan, Cheng, Salinas, Eduardo, Niedtner, Friederike, Proebsting, Grace, Bassman, Griffin, Gerrits, Jack, Alber, Jacob, others. (2024). Magentic-One: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468.

[smolagents] Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, Erik Kaunismäki. (2025). smolagents: a smol library to build great agentic systems.

[ko2025sevensecuritychallengessolved] Ko, Ronny, Jeong, Jiseong, Zheng, Shuyuan, Xiao, Chuan, Kim, Tae-Wan, Onizuka, Makoto, Shin, Won-Yong. (2025). Seven security challenges that must be solved in cross-domain multi-agent llm systems. arXiv preprint arXiv:2505.23847.

[chen2025xbenchtrackingagentsproductivity] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo. (2025). xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations.

[dong2025youtugraphragverticallyunifiedagents] Junnan Dong, Siyu An, Yifei Yu, Qian-Wen Zhang, Linhao Luo, Xiao Huang, Yunsheng Wu, Di Yin, Xing Sun. (2025). Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning.

[edge2025localglobalgraphrag] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, Jonathan Larson. (2025). From Local to Global: A Graph RAG Approach to Query-Focused Summarization.

[peng2024graphretrievalaugmentedgenerationsurvey] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang. (2024). Graph Retrieval-Augmented Generation: A Survey.

[asai2023selfraglearningretrievegenerate] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.

[lee2024planragplanthenretrievalaugmentedgeneration] Myeonghwa Lee, Seonho An, Min-Soo Kim. (2024). PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers.

[singh2025agenticretrievalaugmentedgenerationsurvey] Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG.

[10.5555/3709347.3744042] Yingxuan Yang, Qiuying Peng, Jun Wang, Ying Wen, Weinan Zhang. (2025). Unlocking the Potential of Decentralized LLM-based MAS. Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems.

[geren2024blockchainlargelanguagemodel] Caleb Geren, Amanda Board, Gaby G. Dagher, Tim Andersen, Jun Zhuang. (2024). Blockchain for Large Language Model Security and Safety: A Literature Review. SIGKDD Explorations Newsletter.

[li2025graphteamfacilitatinglargelanguage] Li, Xin Sky, Chu, Qizhi, Chen, Yubin, Liu, Yang, Liu, Yaoqi, Yu, Zekai, Chen, Weize, Qian, Chen, Shi, Chuan, Yang, Cheng. (2024). GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration. arXiv preprint arXiv:2410.18032.

[zhang2023appagentmultimodalagentssmartphone] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu. (2025). AppAgent: Multimodal Agents as Smartphone Users. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems.

[zhang2025largelanguagemodelbrainedgui] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang. (2025). Large Language Model-Brained GUI Agents: A Survey. Transactions on Machine Learning Research.

[202502.0406] Gustavo de Aquino e Aquino, Nádila da Silva de Azevedo, Leandro Youiti Silva Okimoto, Leonardo Yuto Suzuki Camelo, Hendrio Luis de Souza Bragança, Rubens Fernandes, Andre Printes, Fábio Cardoso, Raimundo Gomes, Israel Gondres Torné. From RAG to Multi-Agent Systems: A Survey of Modern Approaches in LLM Development. Preprints.

[yao2023webshopscalablerealworldweb] Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan. (2023). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents.

[zhou2024webarenarealisticwebenvironment] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents.

[e2025rag] Gustavo de Aquino e Aquino, Nádila da Silva de Azevedo, others. (2025). From RAG to Multi-Agent Systems: A Survey of Modern Approaches in LLM Development. Preprints.

[A2A_Protocol_2025] Google LLC, A2A Project Contributors. (2025). Agent2Agent (A2A) Protocol.

[ANP_Protocol_2025] GaoWei Chang, Agent Network Protocol Contributors. (2025). Agent Network Protocol (ANP).

[MCP_Protocol_2025] Anthropic PBC, Model Context Protocol Contributors. (2025). Model Context Protocol (MCP).

[Agora_Protocol_2025] Samuele Marro, Agora Protocol Contributors. (2025). Agora Protocol (AGORA).

[li2024improvingmultiagentdebatesparse] Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie. (2024). Improving Multi-Agent Debate with Sparse Communication Topology. Findings of the Association for Computational Linguistics: EMNLP 2024.

[du2024improving] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch. (2024). Improving Factuality and Reasoning in Language Models through Multiagent Debate.

[openai2024openaio1card] Jaech, Aaron, Kalai, Adam, Lerer, Adam, Richardson, Adam, El-Kishky, Ahmed, Low, Aiden, Helyar, Alec, Madry, Aleksander, Beutel, Alex, Carney, Alex, others. (2024). OpenAI o1 System Card. CoRR.

[sang2025pipelinessurveyparadigmshift] Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, Yuhang Wang. (2025). Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI.

[zhang2025surveyreinforcementlearninglarge] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou. (2025). A Survey of Reinforcement Learning for Large Reasoning Models.

[wu2025evolverselfevolvingllmagents] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi. (2025). EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle.

[githubGitHubMemodbiomemobase] Memobase. (2025). GitHub repository.

[park2025mrsteveinstructionfollowingagentsminecraft] Junyeong Park, Junmo Cho, Sungjin Ahn. (2024). MrSteve: Instruction-Following Agents in Minecraft with What-Where-When Memory. arXiv preprint arXiv:2411.06736.

[park2023generativeagentsinteractivesimulacra] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

[fei2025mcp] Xiang Fei, Xiawu Zheng, Hao Feng. (2025). MCP-Zero. arXiv preprint arXiv:2506.01056.

[chudziak2025elliottagents] Jarosław Chudziak. (2025). ElliottAgents. arXiv preprint arXiv:2507.03435.

[li2023tradinggpt] Yang Li, Yangyang Yu, Haohang Li, Zhi Chen, Khaldoun Khashanah. (2023). TradingGPT: Multi-Agent System with Layered Memory and Distinct Characters for Enhanced Financial Trading Performance. arXiv preprint arXiv:2309.03736.

[fnu2025multi] Fnu, Himani, Kaushik, Keshav, Gaur, Priyanka, Saratchandran, Divya Valsala, Thapliyal, Anju Gairola, Sidhu, Kawerinder Singh, Singh, Vineeta. (2025). Multi-Agent Systems for Collaborative Financial Decision-Making Over Distributed Network Architectures. 2025 3rd International Conference on Advancement in Computation & Computer Technologies (InCACCT).

[liang-etal-2024-encouraging] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, Zhaopeng Tu. (2024). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[yin-etal-2023-exchange] Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, Xipeng Qiu. (2023). Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[khan2024debating] Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez. (2024). Debating with More Persuasive LLMs Leads to More Truthful Answers. Forty-first International Conference on Machine Learning.

[chang2024socrasynthmultillmreasoningconditional] Chang, Edward Y. (2024). SocraSynth: Multi-LLM reasoning with conditional statistics. arXiv preprint arXiv:2402.06634.

[eo2025debatenecessaryadaptivemultiagent] Sugyeong Eo, Hyeonseok Moon, Evelyn Hayoon Zi, Chanjun Park, Heuiseok Lim. (2025). Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning. arXiv preprint arXiv:2504.05047.

[xi2024agentgymevolvinglargelanguage] Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang. (2024). AgentGym: Evolving Large Language Model-based Agents across Diverse Environments.

[tian2025mminabenchmarkingmultihopmultimodal] Shulin Tian, Ziniu Zhang, Liangyu Chen, Ziwei Liu. (2025). MMInA: Benchmarking Multihop Multimodal Internet Agents.

[hong2023metagpt] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber. (2024). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. The Twelfth International Conference on Learning Representations.

[chun2025multiagentdebatemadsilver] Jina Chun, Qihong Chen, Jiawei Li, Iftekhar Ahmed. (2025). Is Multi-Agent Debate (MAD) the Silver Bullet? An Empirical Analysis of MAD in Code Summarization and Translation. arXiv preprint arXiv:2503.12029.

[yu2025dyntaskmasdynamictaskgraphdriven] Junwei Yu, Yepeng Ding, Hiroyuki Sato. (2025). DynTaskMAS: A Dynamic Task Graph-Driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems. arXiv preprint arXiv:2503.07675.

[wang2025all] Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, Bingsheng He. (2025). All It Takes Is One Prompt: An Autonomous LLM Agent. ICLR 2025 Workshop on Foundation Models in the Wild.

[zhuang2024toolchain] Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, Chao Zhang. (2024). ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search. The Twelfth International Conference on Learning Representations.

[zhong2023memorybankenhancinglargelanguage] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang. (2024). MemoryBank: Enhancing Large Language Models with Long-Term Memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[yan2023larplanguageagentroleplay] Ming Yan, Ruihao Li, Hao Zhang, Hao Wang, Zhilan Yang, Ji Yan. (2023). LARP: Language-Agent Role Play for Open-World Games.

[zhou2025mem1learningsynergizememory] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang. (2025). MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv preprint arXiv:2506.15841.

[liu2025toolplanner] Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yuwei Zhang, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du. (2025). Tool-Planner: Task Planning with Clusters across Multiple Tools. The Thirteenth International Conference on Learning Representations.

[qin2024toolllm] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun. (2024). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. The Twelfth International Conference on Learning Representations.

[yang2023gpt4tools] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan. (2023). GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. Advances in Neural Information Processing Systems.

[gao2025multi] Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu. (2025). Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage. The Thirteenth International Conference on Learning Representations.

[dong2025agentic] Dong, Guanting, Mao, Hangyu, Ma, Kai, Bao, Licheng, Chen, Yifei, Wang, Zhongyuan, Chen, Zhongxia, Du, Jiazhen, Wang, Huiyang, Zhang, Fuzheng, others. (2025). Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849.

[dong2025tool] Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen. (2025). Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning. arXiv preprint arXiv:2505.16410.

[li2025iterative] Li, Pengxiang, Gao, Zhi, Zhang, Bofei, Mi, Yapeng, Ma, Xiaojian, Shi, Chenrui, Yuan, Tao, Wu, Yuwei, Jia, Yunde, Zhu, Song-Chun, others. (2025). Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning. arXiv preprint arXiv:2504.21561.

[wang2024trove] Zhiruo Wang, Graham Neubig, Daniel Fried. (2024). TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks. Forty-first International Conference on Machine Learning.

[anonymous2024replacing] Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint arXiv:2404.18796.

[modarressi2023ret] Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze. (2023). RET-LLM: Towards a General Read-Write Memory for Large Language Models. arXiv preprint arXiv:2305.14322.

[xu2025memoryaugmentedqueryreconstructionllmbased] Mufan Xu, Gewen Liang, Kehai Chen, Wei Wang, Xun Zhou, Muyun Yang, Tiejun Zhao, Min Zhang. (2025). Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning.

[wang2023huatuotuningllamamodel] Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, Ting Liu. (2023). HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge.

[tang2025unleashing] Tang, Xinyu, Wang, Xiaolei, Zhao, Wayne Xin, Lu, Siyuan, Li, Yaliang, Wen, Ji-Rong. (2025). Unleashing the potential of large language models as prompt optimizers: Analogical analysis with gradient-based model optimizers. Proceedings of the AAAI Conference on Artificial Intelligence.

[wu2025inference] Wu, Yangzhen, Sun, Zhiqing, Li, Shanda, Welleck, Sean, Yang, Yiming. (2025). Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. The Thirteenth International Conference on Learning Representations.

[pan2025why] Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica. (2025). Why Do Multi-Agent LLM Systems Fail? ICLR 2025 Workshop on Building Trust in Language Models and Applications.

[zhang2025crowdcomparativereasoningunlocking] Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma. (2025). Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[li2025adaptivegraphpruningmultiagent] Li, Boyi, Zhao, Zhonghan, Lee, Der-Horng, Wang, Gaoang. (2025). Adaptive Graph Pruning for Multi-Agent Communication. arXiv preprint arXiv:2506.02951.

[leong2025dynaswarmdynamicallygraphstructure] Hui Yi Leong, Yuqing Wu. (2025). DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent Systems. arXiv preprint arXiv:2507.23261.

[zhang2025evoflowevolvingdiverseagentic] Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang. (2025). FlowReasoner: Reinforcing Query-Level Meta-Agents. arXiv preprint arXiv:2504.15257.

[lu2024morphagentempoweringagentsselfevolving] Siyuan Lu, Jiaqi Shao, Bing Luo, Tao Lin. (2024). MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration. arXiv preprint arXiv:2410.15048.

[gu2025agentgroupchatv2divideandconquerllmbasedmultiagent] Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, others. (2025). AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need. arXiv preprint arXiv:2506.15451.

[shridhar2021alfworld] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. 9th International Conference on Learning Representations.

[zheng2024steve] Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu. (2024). Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds. The Twelfth International Conference on Learning Representations.

[kong2025surveyllmdrivenaiagent] Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Muhammad Khurram Khan, Meng Han. (2025). A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures.

[cheng2025hawkhierarchicalworkflowframework] Yuyang Cheng, Yumiao Xu, Chaojia Yu, Yong Zhao. (2025). HAWK: A Hierarchical Workflow Framework. arXiv preprint arXiv:2507.04067.

[chen2024agentverse] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou. (2024). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. The Twelfth International Conference on Learning Representations.

[camel_workforce_docs] CAMEL-AI. (2025). Workforce — CAMEL-AI Documentation.

[zhang-etal-2025-parallelized] Jun Zhang, Yuwei Yan, Junbo Yan, Zhiheng Zheng, Jinghua Piao, Depeng Jin, Yong Li. (2025). A Parallelized Framework for Simulating Large-Scale LLM Agents. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). doi:10.18653/v1/2025.acl-industry.94.

[wang2025mixtureofagents] Junlin Wang, Jue WANG, Ben Athiwaratkun, Ce Zhang, James Zou. (2025). Mixture-of-Agents Enhances Large Language Model Capabilities. The Thirteenth International Conference on Learning Representations.

[niu2025flow] Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu. (2025). Flow: Modularized Agentic Workflow Automation. The Thirteenth International Conference on Learning Representations.

[zhou-etal-2024-fairer] Han Zhou, Xingchen Wan, Yinhong Liu, Nigel Collier, Ivan Vulić, Anna Korhonen. (2024). Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[zhou2024batch] Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, Subhrajit Roy. (2024). Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. The Twelfth International Conference on Learning Representations.

[wang2025efficientagentsbuildingeffective] Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou. (2025). Efficient Agents: Building Effective Agents While Reducing Cost.

[wang2024craftingpersonalizedagentsretrievalaugmented] Zheng Wang, Zhongyang Li, Zeren Jiang, Dandan Tu, Wei Shi. (2024). Crafting Personalized Agents through Retrieval-Augmented Generation on Editable Memory Graphs.

[sun2024knowledgegraphtuningrealtime] Jingwei Sun, Zhixu Du, Yiran Chen. (2024). Knowledge Graph Tuning: Real-time Large Language Model Personalization based on Human Feedback.

[a2024enhancinglongtermmemoryusing] Aadharsh Aadhithya A, Sachin Kumar S, Soman K. P. (2024). Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation.

[Lei2025DSMART] Lei, Xiang, Li, Qin, Zhang, Min. (2025). D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree. arXiv preprint arXiv:2510.13363.

[Tan2025MemoTime] Tan, Xingyu, Wang, Xiaoyang, Xu, Xiwei, Yuan, Xin, Zhu, Liming, Zhang, Wenjie. (2025). MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning. arXiv preprint arXiv:2510.13614.

[ward2025memoriesdb] Ward, Joel. (2025). MemoriesDB: A Temporal-Semantic-Relational Database for Long-Term Agent Memory/Modeling Experience as a Graph of Temporal-Semantic Surfaces. arXiv preprint arXiv:2511.06179.

[salama2025meminsightautonomousmemoryaugmentation] Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, Yassine Benajiba. (2025). MemInsight: Autonomous Memory Augmentation for LLM Agents.

[Sarthi2024RAPTOR] Haoran Sun, Shaoning Zeng. (2025). Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents. ArXiv.

[Wu2025SGMemSG] Yu Wang, Xiusi Chen, Jingbo Shang, Julian McAuley. (2024). MEMORYLLM: Towards Self-Updatable Large Language Models. ArXiv.

[Zhang2025MemGen] Guibin Zhang, Muxin Fu, Shuicheng Yan. (2025). MemGen: Weaving Generative Latent Memory for Self-Evolving Agents. ArXiv.

[Wu2025CoMEM] Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, Haoping Xu, Guowei Huang, Zhanpeng Zhang, Tongtong Cao, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang, Yingxue Zhang. (2025). Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation. CoRR. doi:10.48550/ARXIV.2502.14254.

[yu2025memagent] Yu, Hongli, Chen, Tinghong, Feng, Jiangtao, Chen, Jiangjie, Dai, Weinan, Yu, Qiying, Zhang, Ya-Qin, Ma, Wei-Ying, Liu, Jingjing, Wang, Mingxuan, others. (2025). MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. arXiv preprint arXiv:2507.02259.

[jiang2025maicc] Jiang, Tao, Lin, Zichuan, Li, Lihe, Li, Yi-Chen, Guan, Cong, Yuan, Lei, Zhang, Zongzhang, Yu, Yang, Ye, Deheng. (2025). Multi-agent In-context Coordination via Decentralized Memory Retrieval. arXiv preprint arXiv:2511.10030.

[wu2025tokmemtokenizedproceduralmemory] Zijun Wu, Yongchang Hao, Lili Mou. (2025). TokMem: Tokenized Procedural Memory for Large Language Models.

[lumer2025memtool] Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke. (2025). MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations. arXiv preprint arXiv:2507.21428.

[hassell2025learning] Hassell, Jackson, Zhang, Dan, Kim, Hannah, Mitchell, Tom, Hruschka, Estevam. (2025). Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation. arXiv preprint arXiv:2510.19897.

[shi2025lookreasonforwardrevisitable] Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang. (2025). Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents.

[kim2023treeclarificationsansweringambiguous] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang. (2023). Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models.

[ma2023queryrewrite] Ma, Xinbei, Gong, Yeyun, He, Pengcheng, Zhao, Hai, Duan, Nan. (2023). Query rewriting in retrieval-augmented large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[zuo2025videolucydeepmemorybacktracking] Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao. (2025). VideoLucy: Deep Memory Backtracking for Long Video Understanding.

[gurukar2025longvmnetacceleratinglongformvideo] Saket Gurukar, Asim Kadav. (2025). Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory.

[xie2024largemultimodalagentssurvey] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, Guanbin Li. (2024). Large Multimodal Agents: A Survey.

[ye2025agentfold] Ye, Rui, Zhang, Zhongwang, Li, Kuan, Yin, Huifeng, Tao, Zhengwei, Zhao, Yida, Su, Liangcai, Zhang, Liwen, Qiao, Zile, Wang, Xinyu, others. (2025). AgentFold: Long-Horizon Web Agents with Proactive Context Management. arXiv preprint arXiv:2510.24699.

[yu2025vismemlatentvisionmemory] Yu, Xinlei, Xu, Chengming, Zhang, Guibin, Chen, Zhangquan, Zhang, Yudong, He, Yongbo, Jiang, Peng-Tao, Zhang, Jiangning, Hu, Xiaobin, Yan, Shuicheng. (2025). VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models. arXiv preprint arXiv:2511.11007.

[Watkins1992Qlearning] Christopher Watkins, Peter Dayan. (1992). Q-learning. Machine Learning.

[DBLP:conf/acl/rmm2025] Zhen Tan, Jun Yan, I-Hung Hsu, others. (2025). In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025.

[zhang2024guided] Jiarui Zhang. (2024). Guided Profile Generation Improves Personalization with LLMs. arXiv preprint arXiv:2409.13093.

[kim2025nexussum] Kim, Hyuntak, Kim, Byung-Hak. (2025). NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization. arXiv preprint arXiv:2505.24575.

[MCCLOSKEY1989109] Michael McCloskey, Neal J. Cohen. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. doi:10.1016/S0079-7421(08)60536-8.

[wu2025incremental] Wu, Yisha, Zhao, Cen, Cao, Yuanpei, Xu, Xiaoqing, Mehdad, Yashar, Ji, Mindy, Cheng, Claire Na. (2025). Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track.

[chen2024write] Chen, Xiuying, Gao, Shen, Li, Mingzhe, Zhu, Qingqing, Gao, Xin, Zhang, Xiangliang. (2024). Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[schulman2017proximal] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, Klimov, Oleg. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[fang2025lightmem] Fang, Jizhan, Deng, Xinle, Xu, Haoming, Jiang, Ziyan, Tang, Yuqi, Xu, Ziwen, Deng, Shumin, Yao, Yunzhi, Wang, Mengru, Qiao, Shuofei, others. (2025). LightMem: Lightweight and Efficient Memory-Augmented Generation. arXiv preprint arXiv:2510.18866.

[chen2025moommaintenanceorganizationoptimization] Weishu Chen, Jinyi Tang, Zhouhui Hou, Shihao Han, Mingjie Zhan, Zhiyuan Huang, Delong Liu, Jiawei Guo, Zhicheng Zhao, Fei Su. (2025). MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues.

[li2025memosoperatingmemoryaugmentedgeneration] Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, Feiyu Xiong. (2025). MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models.

[cheng2022xmemlongtermvideoobject] Ho Kei Cheng, Alexander G. Schwing. (2022). XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model.

[DBLP:journals/corr/abs-2509-25911] Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian J. McAuley, Xiaojian Wu. (2025). Mem-α: Learning Memory Construction via Reinforcement Learning. CoRR. doi:10.48550/ARXIV.2509.25911.

[bailly2025divide] Bailly, Alexandre, Saubin, Antoine, Kocevar, Gabriel, Bodin, Jonathan. (2025). Divide and summarize: improve SLM text summarization. Frontiers in Artificial Intelligence.

[wei2025deepseekocr] Wei, Haoran, Sun, Yaofeng, Li, Yukun. (2025). DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234.

[wu2021recursively] Wu, Jeff, Ouyang, Long, Ziegler, Daniel M, Stiennon, Nisan, Lowe, Ryan, Leike, Jan, Christiano, Paul. (2021). Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

[kahatapitiya2025language] Kahatapitiya, Kumara, Ranasinghe, Kanchana, Park, Jongwoo, Ryoo, Michael S. (2025). Language repository for long video understanding. Findings of the Association for Computational Linguistics: ACL 2025.

[you2024towards] You, Zeng, Wen, Zhiquan, Chen, Yaofo, Li, Xin, Zeng, Runhao, Wang, Yaowei, Tan, Mingkui. (2024). Towards long video understanding via fine-detailed video story generation. IEEE Transactions on Circuits and Systems for Video Technology.

[DBLP:conf/cvpr/VLN] Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin. (2025). Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025). doi:10.1109/CVPR52734.2025.01128.

[DBLP:conf/aaai/00010W0025] Jiaang Li, Quan Wang, Zhongnan Wang, Yongdong Zhang, Zhendong Mao. (2025). ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-25). doi:10.1609/AAAI.V39I23.34622.

[DBLP:conf/acl/WangTDWHJCJZ21] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, Ming Zhou. (2021). K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. doi:10.18653/V1/2021.FINDINGS-ACL.121.

[DBLP:journals/corr/PREMem] Sangyeop Kim, Yohan Lee, Sanghwa Kim, Hyunjong Kim, Sungzoon Cho. (2025). Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue. CoRR. doi:10.48550/ARXIV.2509.10852.

[zhao2025pretraininglimitedmemorylanguage] Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Ryan Thomas Noonan, Dongyoung Go, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun. (2025). Pre-training Limited Memory Language Models with Internal and External Knowledge.

[DBLP:conf/nips/0104L0XY0X0C24] Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen. (2024). WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[DBLP:conf/aaai/Zhao0XLLH24] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang. (2024). ExpeL: LLM Agents Are Experiential Learners. Thirty-Eighth AAAI Conference on Artificial Intelligence. doi:10.1609/AAAI.V38I17.29936.

[DBLP:journals/tmlr/WangX0MXZFA24] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. (2024). Voyager: An Open-Ended Embodied Agent with Large Language Models. Trans. Mach. Learn. Res..

[tang2025chemagentselfupdatinglibrarylarge] Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein. (2025). ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning.

[DBLP:conf/emnlp/SentenceBert19] Nils Reimers, Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/V1/D19-1410.

[DBLP:conf/icml/CLIP21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).

[DBLP:conf/acl/HyDE23] Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023. doi:10.18653/V1/2023.ACL-LONG.99.

[DBLP:conf/cikm/XiLL0T0024] Yunjia Xi, Weiwen Liu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu. (2024). MemoCRS: Memory-enhanced Sequential Conversational Recommender Systems with Large Language Models. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM). doi:10.1145/3627673.3679599.

[tarasov2025sentenceanchoredgistcompressionlongcontext] Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov. (2025). Sentence-Anchored Gist Compression for Long-Context LLMs.

[DBLP:journals/corr/SemanticAnchor] Maitreyi Chatterjee, Devansh Agarwal. (2025). Semantic Anchoring in Agentic Memory: Leveraging Linguistic Structures for Persistent Conversational Context. CoRR. doi:10.48550/ARXIV.2508.12630.

[tang2025agentkbleveragingcrossdomain] Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou. (2025). Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving.

[DBLP:journals/corr/abs-2509-17459] Namyoung Kim, Kai Tzu-iunn Ong, others. (2025). PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents. CoRR. doi:10.48550/ARXIV.2509.17459.

[ouyang2025reasoningbankscalingagentselfevolving] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister. (2025). ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory.

[xiao2025toolmemenhancingmultimodalagents] Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, Zora Zhiruo Wang. (2025). ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory.

[githubGitHubVolcengineMineContext] MineContext. (2025). GitHub repository.

[githubGitHubAgentscopeaiReMe] AgentScope. (2025). ReMe: GitHub repository.

[zhang2025memengineunifiedmodularlibrary] Zeyu Zhang, Quanyu Dai, Xu Chen, Rui Li, Zhongyang Li, Zhenhua Dong. (2025). MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents.

[DBLP:conf/nips/RAG2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[fang2025mempexploringagentprocedural] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang. (2025). Memp: Exploring Agent Procedural Memory.

[DBLP:journals/corr/rcrrouter] Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang. (2025). RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems. CoRR. doi:10.48550/ARXIV.2508.04903.

[DBLP:journals/pami/WangCLJHZLHZYML25] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang. (2025). JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2024.3511593.

[DBLP:journals/tois/HuangLLYLX25] Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, Xing Xie. (2025). Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations. ACM Transactions on Information Systems. doi:10.1145/3731446.

[yang2024buffer] Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, Bin Cui. (2024). Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[DBLP:conf/acl/IslamAP24] Md. Ashraful Islam, Mohammed Eunus Ali, Md. Rizwan Parvez. (2024). MapCoder: Multi-Agent Code Generation for Competitive Problem Solving. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024. doi:10.18653/V1/2024.ACL-LONG.269.

[kaiya2023lyfeagentsgenerativeagents] Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, Andrew Ahn. (2023). Lyfe Agents: Generative agents for low-cost real-time social interactions.

[agasheAgentOpenAgentic2025] Agashe, Saaket, Han, Jiuzhou, Gan, Shuyu, Yang, Jiachen, Li, Ang, Wang, Xin Eric. (2025). Agent S: An Open Agentic Framework that Uses Computers Like a Human. The Thirteenth International Conference on Learning Representations.

[baddeleyWorkingMemoryTheories2012] Baddeley, Alan. (2012). Working Memory: Theories, Models, and Controversies. Annual Review of Psychology. doi:10.1146/annurev-psych-120710-100422.

[bulatovRecurrentMemoryTransformer2022] Bulatov, Aydar, Kuratov, Yuri, Burtsev, Mikhail. (2022). Recurrent Memory Transformer. Advances in Neural Information Processing Systems.

[bulatovScalingTransformer1M2023] Bulatov, Aydar, Kuratov, Yuri, Burtsev, Mikhail S.. (2023). Scaling Transformer to 1M Tokens and Beyond with RMT. CoRR. doi:10.48550/ARXIV.2304.11062.

[cowanWorkingMemoryUnderpins2014] Cowan, Nelson. (2014). Working Memory Underpins Cognitive Development, Learning, and Education. Educational Psychology Review. doi:10.1007/s10648-013-9246-y.

[daiTransformerXLAttentiveLanguage2019] Dai, Zihang, Yang, Zhilin, Yang, Yiming, Carbonell, Jaime G., Le, Quoc Viet, Salakhutdinov, Ruslan. (2019). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. doi:10.18653/V1/P19-1285.

[geIncontextAutoencoderContext2024] Ge, Tao, Hu, Jing, Wang, Lei, Wang, Xun, Chen, Si-Qing, Wei, Furu. (2024). In-Context Autoencoder for Context Compression in a Large Language Model. The Twelfth International Conference on Learning Representations.

[huangLanguageModelsNot2025] Huang, Jen-tse, Sun, Kaiser, Wang, Wenxuan, Dredze, Mark. (2025). Language Models Do Not Have Human-Like Working Memory. doi:10.48550/arXiv.2505.10571.

[huHiAgentHierarchicalWorking2025] Hu, Mengkang, Chen, Tianxing, Chen, Qiguang, Mu, Yao, Shao, Wenqi, Luo, Ping. (2025). HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

[jiangLLMLinguaCompressingPrompts2023] Jiang, Huiqiang, Wu, Qianhui, Lin, Chin-Yew, Yang, Yuqing, Qiu, Lili. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/V1/2023.EMNLP-MAIN.825.

[jiangLongLLMLinguaAcceleratingEnhancing2024] Jiang, Huiqiang, Wu, Qianhui, Luo, Xufang, Li, Dongsheng, Lin, Chin-Yew, Yang, Yuqing, Qiu, Lili. (2024). LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. doi:10.18653/V1/2024.ACL-LONG.91.

[liaoHardSoftHybrid2025] Liao, Huanxuan, Hu, Wen, Xu, Yao, He, Shizhu, Zhao, Jun, Liu, Kang. (2025). Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention. CoRR. doi:10.48550/ARXIV.2505.15774.

[liSnapKVLLMKnows2024] Li, Yuhong, Huang, Yingbing, Yang, Bowen, Venkitesh, Bharat, Locatelli, Acyr, Ye, Hanchen, Cai, Tianle, Lewis, Patrick, Chen, Deming. (2024). SnapKV: LLM Knows What You Are Looking for Before Generation. Advances in Neural Information Processing Systems.

[packerMemGPTLLMsOperating2023] Packer, Charles, Fang, Vivian, Patil, Shishir G., Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E.. (2023). MemGPT: Towards LLMs as Operating Systems. CoRR. doi:10.48550/ARXIV.2310.08560.

[ranaSayPlanGroundingLarge2023] Rana, Krishan, Haviland, Jesse, Garg, Sourav, Abou-Chakra, Jad, Reid, Ian, Suenderhauf, Niko. (2023). SayPlan: Grounding Large Language Models Using 3D Scene Graphs for Scalable Robot Task Planning. Conference on Robot Learning.

[songLLMPlannerFewShotGrounded2023] Song, Chan Hee, Sadler, Brian M., Wu, Jiaman, Chao, Wei-Lun, Washington, Clayton, Su, Yu. (2023). LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. IEEE/CVF International Conference on Computer Vision. doi:10.1109/ICCV51070.2023.00280.

[wangKARMAAugmentingEmbodied2025] Wang, Zixuan, Yu, Bo, Zhao, Junzhe, Sun, Wenhao, Hou, Sai, Liang, Shuai, Hu, Xing, Han, Yinhe, Gan, Yiming. (2025). KARMA: Augmenting Embodied AI Agents with Long-and-Short Term Memory Systems. IEEE International Conference on Robotics and Automation. doi:10.1109/ICRA55743.2025.11128047.

[xiaoEfficientStreamingLanguage2024] Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, Lewis, Mike. (2024). Efficient Streaming Language Models with Attention Sinks. The Twelfth International Conference on Learning Representations.

[yoonCompActCompressingRetrieved2024] Yoon, Chanwoong, Lee, Taewhoo, Hwang, Hyeon, Jeong, Minbyul, Kang, Jaewoo. (2024). CompAct: Compressing Retrieved Documents Actively for Question Answering. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/V1/2024.EMNLP-MAIN.1194.

[zhuCALYPSOLLMsDungeon2023] Zhu, Andrew, Martin, Lara, Head, Andrew, Callison-Burch, Chris. (2023). CALYPSO: LLMs as Dungeon Masters' Assistants. Proceedings of the Nineteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. doi:10.1609/aiide.v19i1.27534.

[heMALMMMemoryAugmentedLarge2024] He, Bo, Li, Hengduo, Jang, Young Kyun, Jia, Menglin, Cao, Xuefei, Shah, Ashish, Shrivastava, Abhinav, Lim, Ser-Nam. (2024). MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding. IEEE/CVF Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR52733.2024.01282.

[wangVideoAgentLongFormVideo2024] Wang, Xiaohan, Zhang, Yuhui, Zohar, Orr, Yeung-Levy, Serena. (2024). VideoAgent: Long-Form Video Understanding with Large Language Model as Agent. Computer Vision - ECCV 2024. doi:10.1007/978-3-031-72989-8_4.

[yuanMemSearcherTrainingLLMs2025] Yuan, Qianhao, Lou, Jie, Li, Zichao, Chen, Jiawei, Lu, Yaojie, Lin, Hongyu, Sun, Le, Zhang, Debing, Han, Xianpei. (2025). MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning. doi:10.48550/arXiv.2511.02805.

[wuReSumUnlockingLongHorizon2025] Wu, Xixi, Li, Kuan, Zhao, Yida, Zhang, Liwen, Ou, Litu, Yin, Huifeng, Zhang, Zhongwang, Jiang, Yong, Xie, Pengjun, Huang, Fei, Cheng, Minhao, Wang, Shuai, Cheng, Hong, Zhou, Jingren. (2025). ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization. CoRR. doi:10.48550/ARXIV.2509.13313.

[shinn_reflexion_2023] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

[anokhin_arigraph_2025] Anokhin, Petr, Semenov, Nikita, Sorokin, Artyom Y., Evseev, Dmitry, Kravchenko, Andrey, Burtsev, Mikhail, Burnaev, Evgeny. (2025). AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. doi:10.24963/IJCAI.2025/2.

[pan2024planningimaginationepisodicsimulation] Yiyuan Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang. (2024). Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation.

[zhang_agent_2025] Zhang, Kai, Chen, Xiangchao, Liu, Bo, Xue, Tianci, Liao, Zeyi, Liu, Zhihan, Wang, Xiyao, Ning, Yuting, Chen, Zhaorun, Fu, Xiaohan, Xie, Jian, Sun, Yuxuan, Gou, Boyu, Qi, Qi, Meng, Zihang, Yang, Jianwei, Zhang, Ning, Li, Xian, Shah, Ashish, Huynh, Dat, Li, Hengduo, Yang, Zi, Cao, Sara, Jang, Lawrence, Zhou, Shuyan, Zhu, Jiacheng, Sun, Huan, Weston, Jason, Su, Yu, Wu, Yifan. (2025). Agent Learning via Early Experience. CoRR. doi:10.48550/ARXIV.2510.08558.

[li2024optimus1hybridmultimodalmemory] Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie. (2024). Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks.

[zheng_synapse_2024] Zheng, Longtao, Wang, Rundong, Wang, Xinrun, An, Bo. (2024). Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. The Twelfth International Conference on Learning Representations.

[yu_fincon_2024] Yu, Yangyang, Yao, Zhiyuan, Li, Haohang, Deng, Zhiyang, Jiang, Yuechen, Cao, Yupeng, Chen, Zhi, Suchow, Jordan W., Cui, Zhenyu, Liu, Rong, Xu, Zhaozhuo, Zhang, Denghui, Subbalakshmi, Koduvayur, Xiong, Guojun, He, Yueru, Huang, Jimin, Li, Dong, Xie, Qianqian. (2024). FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making. Advances in Neural Information Processing Systems.

[islam_mapcoder_2024] Islam, Md Ashraful, Ali, Mohammed Eunus, Parvez, Md Rizwan. (2024). MapCoder: Multi-Agent Code Generation for Competitive Problem Solving. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. doi:10.18653/V1/2024.ACL-LONG.269.

[he2020deberta] He, Pengcheng, Liu, Xiaodong, Gao, Jianfeng, Chen, Weizhu. (2020). DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654.

[huang_r2d2_2025] Huang, Tenghao, Basu, Kinjal, Abdelaziz, Ibrahim, Kapanipathi, Pavan, May, Jonathan, Chen, Muhao. (2025). R2D2: Remembering, Reflecting and Dynamic Decision Making for Web Agents. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

[li2025omnivideobenchaudiovisualunderstandingevaluation] Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu. (2025). OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs.

[bo2025agenticlearnergrowandrefinemultimodal] Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li. (2025). Agentic Learner with Grow-and-Refine Multimodal Semantic Memory.

[yu_browseragent_2025] Yu, Tao, Zhang, Zhengbo, Lyu, Zhiheng, Gong, Junhao, Yi, Hongzhu, Wang, Xinming, Zhou, Yuxuan, Yang, Jiabing, Nie, Ping, Huang, Yan, Chen, Wenhu. (2025). BrowserAgent. doi:10.48550/arXiv.2510.10666.

[deng_mind2web_2023] Deng, Xiang, Gu, Yu, Zheng, Boyuan, Chen, Shijie, Stevens, Samual, Wang, Boshi, Sun, Huan, Su, Yu. (2023). Mind2Web: Towards a Generalist Agent for the Web. Advances in Neural Information Processing Systems.

[zhou_webarena_2024] Zhou, Shuyan, Xu, Frank F., Zhu, Hao, Zhou, Xuhui, Lo, Robert, Sridhar, Abishek, Cheng, Xianyi, Ou, Tianyue, Bisk, Yonatan, Fried, Daniel, Alon, Uri, Neubig, Graham. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. The Twelfth International Conference on Learning Representations.

[wang_recmind_2024] Wang, Yancheng, Jiang, Ziyan, Chen, Zheng, Yang, Fan, Zhou, Yingxue, Cho, Eunah, Fan, Xing, Lu, Yanbin, Huang, Xiaojiang, Yang, Yingzhen. (2024). RecMind: Large Language Model Powered Agent for Recommendation. Findings of the Association for Computational Linguistics: NAACL 2024. doi:10.18653/V1/2024.FINDINGS-NAACL.271.

[Page1999ThePC] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab.

[kim_principles_2025] Kim, Namyoung, Ong, Kai Tzu-iunn, Hwang, Yeonjun, Kang, Minseok, Jihn, Iiseo, Kim, Gayoung, Kim, Minju, Yeo, Jinyoung. (2025). PRINCIPLES. CoRR. doi:10.48550/ARXIV.2509.17459.

[wang_voyager_2024] Wang, Guanzhi, Xie, Yuqi, Jiang, Yunfan, Mandlekar, Ajay, Xiao, Chaowei, Zhu, Yuke, Fan, Linxi, Anandkumar, Anima. (2024). Voyager: An Open-Ended Embodied Agent with Large Language Models. Trans. Mach. Learn. Res..

[zhang_darwin_2025] Zhang, Jenny, Hu, Shengran, Lu, Cong, Lange, Robert T., Clune, Jeff. (2025). Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents. CoRR. doi:10.48550/ARXIV.2505.22954.

[DBLP:journals/corr/abs-2504-13805] Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng. (2025). LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark. CoRR. doi:10.48550/ARXIV.2504.13805.

[han_legomem_2025] Han, Dongge, Couturier, Camille, Diaz, Daniel Madrigal, Zhang, Xuchao, Rühle, Victor, Rajmohan, Saravan. (2025). LEGOMem. doi:10.48550/arXiv.2510.04851.

[bouzenia_repairagent_2024] Bouzenia, Islem, Devanbu, Premkumar, Pradel, Michael. (2024). RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. doi:10.48550/arXiv.2403.17134.

[skillweaver] Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, Yu Su. (2025). SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills. CoRR. doi:10.48550/ARXIV.2504.07079.

[shi_retrieval_2025] Shi, Zhengliang, Wang, Yuhan, Yan, Lingyong, Ren, Pengjie, Wang, Shuaiqiang, Yin, Dawei, Ren, Zhaochun. (2025). Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025.

[gao_ptr_2024] Gao, Hang, Zhang, Yongfeng. (2024). PTR: Precision-Driven Tool Recommendation for Large Language Models. CoRR. doi:10.48550/ARXIV.2411.09613.

[zheng_toolrerank_2024] Zheng, Yuanhang, Li, Peng, Liu, Wei, Liu, Yang, Luan, Jian, Wang, Bin. (2024). ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

[patil_gorilla_2024] Patil, Shishir G., Zhang, Tianjun, Wang, Xin, Gonzalez, Joseph E.. (2024). Gorilla: Large Language Model Connected with Massive APIs. Advances in Neural Information Processing Systems.

[qiu_alita_2025] Qiu, Jiahao, Qi, Xuan, Zhang, Tongcheng, Juan, Xinzhe, Guo, Jiacheng, Lu, Yifu, Wang, Yimin, Yao, Zixin, Ren, Qihan, Jiang, Xun, Zhou, Xing, Liu, Dongrui, Yang, Ling, Wu, Yue, Huang, Kaixuan, Liu, Shilong, Wang, Hongru, Wang, Mengdi. (2025). Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution. CoRR. doi:10.48550/ARXIV.2505.20286.

[xiao_toolmem_2025] Xiao, Yunzhong, Li, Yangmin, Wang, Hewei, Tang, Yunlong, Wang, Zora Zhiruo. (2025). ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory. CoRR. doi:10.48550/ARXIV.2510.06664.

[zhao_cola_2025] Zhao, Di, Ma, Longhui, Wang, Siwei, Wang, Miao, Lv, Zhao. (2025). COLA. CoRR. doi:10.48550/ARXIV.2503.09263.

[yuan_craft_2024] Yuan, Lifan, Chen, Yangyi, Wang, Xingyao, Fung, Yi, Peng, Hao, Ji, Heng. (2024). CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. The Twelfth International Conference on Learning Representations.

[qu_colt_2024] Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, Wen, Ji-Rong. (2024). COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models. CoRR. doi:10.48550/ARXIV.2405.16089.

[qian_creator_2023] Qian, Cheng, Han, Chi, Fung, Yi Ren, Qin, Yujia, Liu, Zhiyuan, Ji, Heng. (2023). CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/V1/2023.FINDINGS-EMNLP.462.

[liDeepAgentGeneralReasoning2025] Li, Xiaoxi, Jiao, Wenxiang, Jin, Jiarui, Dong, Guanting, Jin, Jiajie, Wang, Yinuo, Wang, Hao, Zhu, Yutao, Wen, Ji-Rong, Lu, Yuan, Dou, Zhicheng. (2025). DeepAgent: A General Reasoning Agent with Scalable Toolsets. doi:10.48550/arXiv.2510.21618.

[shenEncodeStoreRetrieveAugmentingHuman2024] Shen, Junxiao, Dudley, John J., Kristensson, Per Ola. (2024). Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception. IEEE International Symposium on Mixed and Augmented Reality. doi:10.1109/ISMAR62088.2024.00108.

[limbacherHMemHarnessingSynaptic2020] Limbacher, Thomas, Legenstein, Robert. H-Mem: Harnessing Synaptic Plasticity with Hebbian Memory Networks. Advances in Neural Information Processing Systems.

[shiRetrievalModelsArent2025] Shi, Zhengliang, Wang, Yuhan, Yan, Lingyong, Ren, Pengjie, Wang, Shuaiqiang, Yin, Dawei, Ren, Zhaochun. Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models. doi:10.48550/arXiv.2503.01763.

[quExplorationMasteryEnabling2025] Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, Wen, Ji-Rong. From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions. doi:10.48550/arXiv.2410.08197.

[qiuAlitaGSelfEvolvingGenerative2025a] Qiu, Jiahao, Qi, Xuan, Wang, Hongru, Juan, Xinzhe, Wang, Yimin, Zhao, Zelin, Geng, Jiayi, Guo, Jiacheng, Li, Peihang, Shi, Jingzhe, Liu, Shilong, Wang, Mengdi. Alita-G: Self-Evolving Generative Agent for Agent Generation. doi:10.48550/arXiv.2510.23601.

[labate2025solvingcontextwindowoverflow] Anton Bulle Labate, Valesca Moura de Sousa, Sandro Rama Fiorini, Leonardo Guerreiro Azevedo, Raphael Melo Thiago, Viviane Torres da Silva. (2025). Solving Context Window Overflow in AI Agents.

[liu2025memversemultimodalmemorylifelong] Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, Ding Wang. (2025). MemVerse: Multimodal Memory for Lifelong Learning Agents.

[xuAMEMAgenticMemory2025] Xu, Wujiang, Liang, Zujie, Mei, Kai, Gao, Hang, Tan, Juntao, Zhang, Yongfeng. (2025). A-MEM: Agentic Memory for LLM Agents. CoRR. doi:10.48550/ARXIV.2502.12110.

[yanGeneralAgenticMemory2025] Yan, B. Y., Li, Chaofan, Qian, Hongjin, Lu, Shuqi, Liu, Zheng. General Agentic Memory Via Deep Research. doi:10.48550/arXiv.2511.18423.

[rezazadehIsolatedConversationsHierarchical2025] Rezazadeh, Alireza, Li, Zichao, Wei, Wei, Bao, Yujia. (2025). From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs. The Thirteenth International Conference on Learning Representations.

[yenMemoletReifyingReuse2024] Yen, Ryan, Zhao, Jian. (2024). Memolet: Reifying the Reuse of User-AI Conversational Memories. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. doi:10.1145/3654777.3676388.

[duMemGuideIntentDrivenMemory2025] Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong. (2025). MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents.

[chenGameGPTMultiagentCollaborative2023] Chen, Dake, Wang, Hanbin, Huo, Yunhao, Li, Yuzhao, Zhang, Haoyang. (2023). GameGPT: Multi-Agent Collaborative Framework for Game Development. CoRR. doi:10.48550/ARXIV.2310.08067.

[gaoMemorySharingLarge2024] Gao, Hang, Zhang, Yongfeng. (2024). Memory Sharing for Large Language Model Based Agents. CoRR. doi:10.48550/ARXIV.2404.09982.

[gaoS^mbox3SocialnetworkSimulation2023] Gao, Chen, Lan, Xiaochong, Lu, Zhihong, Mao, Jinzhu, Piao, Jinghua, Wang, Huandong, Jin, Depeng, Li, Yong. (2023). S^3: Social-Network Simulation System with Large Language Model-Empowered Agents. CoRR. doi:10.48550/ARXIV.2307.14984.

[2013ELLA] Ruvolo, Paul, Eaton, Eric. (2013). ELLA: An Efficient Lifelong Learning Algorithm. Proceedings of the 30th International Conference on Machine Learning.

[panSeCom2025] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, et al. (2025). SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents. The Thirteenth International Conference on Learning Representations, ICLR 2025.

[TFIDF1972] Sparck Jones, Karen. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. doi:10.1108/eb026526.

[DBLP:journals/ftir/RobertsonZ09BM25] Stephen E. Robertson, Hugo Zaragoza. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr.. doi:10.1561/1500000019.

[DBLP:conf/eacl/YuanWFPLWML24] Peiwen Yuan, Xinglin Wang, Shaoxiong Feng, Boyuan Pan, Yiwei Li, Heda Wang, Xupeng Miao, Kan Li. (2024). Generative Dense Retrieval: Memory Can Be a Burden. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024.

[DBLP:conf/nips/Taytransformer22] Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, Donald Metzler. (2022). Transformer Memory as a Differentiable Search Index. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

[DBLP:conf/nips/Wangneuralcorpus0022] Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang. (2022). A Neural Corpus Indexer for Document Retrieval. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

[DBLP:journals/corr/agentrr] Erhu Feng, Wenbo Zhou, Zibin Liu, Le Chen, Yunpeng Dong, Cheng Zhang, Yisheng Zhao, Dong Du, et al. (2025). Get Experience from Practice: LLM Agents with Record & Replay. CoRR. doi:10.48550/ARXIV.2505.17716.

[DBLP:journals/corr/matrix] Jiale Liu, Yifan Zeng, et al. (2024). Memory-Augmented Agent Training for Business Document Understanding. CoRR. doi:10.48550/ARXIV.2412.15274.

[DBLP:journals/ijon/SAGE] Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, Miao Zhang, Xueqian Wang. (2025). SAGE: Self-Evolving Agents with Reflective and Memory-Augmented Abilities. Neurocomputing. doi:10.1016/J.NEUCOM.2025.130470.

[DBLP:journals/corr/learntomemorize] Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, Zhenhua Dong. (2025). Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework. CoRR. doi:10.48550/ARXIV.2508.16629.

[DBLP:conf/nips/Mu0G23] Jesse Mu, Xiang Li, Noah D. Goodman. (2023). Learning to Compress Prompts with Gist Tokens. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

[DBLP:conf/emnlp/LuoZXWLLCS24] Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui. (2024). Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/V1/2024.FINDINGS-EMNLP.233.

[DBLP:conf/emnlp/ChevalierWAC23] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen. (2023). Adapting Language Models to Compress Contexts. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023. doi:10.18653/V1/2023.EMNLP-MAIN.232.

[DBLP:conf/www/Qian0ZMLD025] Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, Tiejun Huang. (2025). MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. Proceedings of the ACM Web Conference 2025. doi:10.1145/3696410.3714805.

[DBLP:journals/corr/abs-2502-00592] Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian J. McAuley, Dan Gutfreund, Rogério Feris, et al. (2025). M+: Extending MemoryLLM with Scalable Long-Term Memory. CoRR. doi:10.48550/ARXIV.2502.00592.

[DBLP:journals/corr/abs-2501-00663] Ali Behrouz, Peilin Zhong, Vahab Mirrokni. (2025). Titans: Learning to Memorize at Test Time. CoRR. doi:10.48550/ARXIV.2501.00663.

[DBLP:conf/iclr/NaSM24] Hyungho Na, Yunkyeong Seo, Il-Chul Moon. (2024). Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning. The Twelfth International Conference on Learning Representations, ICLR 2024.

[DBLP:journals/corr/abs-2510-09038] Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang. (2025). Auto-Scaling Continuous Memory for GUI Agent. CoRR. doi:10.48550/ARXIV.2510.09038.

[DBLP:journals/corr/autorag] Dongkyu Kim, Byoungwook Kim, Donggeon Han, Matous Eibich. (2024). AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline. CoRR. doi:10.48550/ARXIV.2410.20878.

[DBLP:journals/corr/comorag] Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu. (2025). ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning. CoRR. doi:10.48550/ARXIV.2508.10419.

[DBLP:journals/corr/abs-2502-04395] Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang. (2025). Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting. CoRR. doi:10.48550/ARXIV.2502.04395.

[DBLP:journals/corr/abs-2508-19236] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang. (2025). MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation. CoRR. doi:10.48550/ARXIV.2508.19236.

[DBLP:conf/acl/00010ZM25] Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao. (2025). SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025.

[DBLP:conf/iros/MezghaniSLMBBA22] Lina Mezghani, Sainbayar Sukhbaatar, Thibaut Lavril, Oleksandr Maksymets, Dhruv Batra, Piotr Bojanowski, Karteek Alahari. (2022). Memory-Augmented Reinforcement Learning for Image-Goal Navigation. IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2022. doi:10.1109/IROS47612.2022.9981090.

[DBLP:conf/iclr/WuRHS22] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy. (2022). Memorizing Transformers. The Tenth International Conference on Learning Representations, ICLR 2022.

[DBLP:conf/acl/YaoL024] Yao Yao, Zuchao Li, Hai Zhao. (2024). SirLLM: Streaming Infinite Retentive LLM. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024. doi:10.18653/V1/2024.ACL-LONG.143.

[DBLP:journals/corr/abs-2407-01178] Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E. (2024). Memory^3: Language Modeling with Explicit Memory. CoRR. doi:10.48550/ARXIV.2407.01178.

[DBLP:journals/corr/abs-2501-02950] Samuel J. Gershman, Ila Fiete, Kazuki Irie. (2025). Key-value memory in the brain. CoRR. doi:10.48550/ARXIV.2501.02950.

[DBLP:conf/nips/LiuDLWXXKS23] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava. (2023). Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

[DBLP:conf/nips/LiHYVLYCLC24] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen. (2024). SnapKV: LLM Knows What You Are Looking for Before Generation. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[DBLP:journals/corr/PRIME] Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu. (2025). PRIME. CoRR. doi:10.48550/ARXIV.2509.22315.

[DBLP:journals/corr/abs-2406-02069] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao. (2024). PyramidKV: Dynamic KV Cache Compression Based on Pyramidal Information Funneling. CoRR. doi:10.48550/ARXIV.2406.02069.

[DBLP:conf/iclr/TangLLHKHYW25] Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, Gongyi Wang. (2025). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. The Thirteenth International Conference on Learning Representations, ICLR 2025.

[DBLP:conf/nips/TworkowskiSPWMM23] Szymon Tworkowski, Konrad Staniszewski, Mikolaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Milos. (2023). Focused Transformer: Contrastive Training for Context Scaling. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

[DBLP:conf/nips/Wang0CLYGW23] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei. (2023). Augmenting Language Models with Long-Term Memory. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

[kang2025lm2largememorymodels] Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis. (2025). LM2: Large Memory Models.

[DBLP:journals/corr/abs-2508-15253] Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee. (2025). Conflict-Aware Soft Prompting for Retrieval-Augmented Generation. CoRR. doi:10.48550/ARXIV.2508.15253.

[DBLP:conf/nips/Zhang00CZC0TRBW23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, et al. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

[DBLP:journals/corr/MA-RAG] Thang Nguyen, Peter Chin, Yu-Wing Tai. (2025). MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning. CoRR. doi:10.48550/ARXIV.2505.20096.

[DBLP:journals/corr/abs-2502-06049] Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis. (2025). LM2: Large Memory Models. CoRR. doi:10.48550/ARXIV.2502.06049.

[DBLP:conf/iclr/Editable] Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry V. Pyrkin, Sergei Popov, Artem Babenko. (2020). Editable Neural Networks. 8th International Conference on Learning Representations, ICLR 2020.

[wei2022finetunedlanguagemodelszeroshot] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le. (2022). Finetuned Language Models Are Zero-Shot Learners.

[mukherjee2023orcaprogressivelearningcomplex] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah. (2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4.

[tunstall2023zephyrdirectdistillationlm] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf. (2023). Zephyr: Direct Distillation of LM Alignment.

[DBLP:conf/icml/selfrewarding] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. (2024). Self-Rewarding Language Models. Forty-first International Conference on Machine Learning, ICML 2024.

[riedelDeclarativeMemory2015] Riedel, Wim J., Blokland, Arjan. Declarative Memory. Handbook of Experimental Pharmacology. doi:10.1007/978-3-319-16522-6_7.

[squireMemorySystemsBrain2004] Squire, Larry R.. Memory Systems of the Brain: A Brief History and Current Perspective. Neurobiology of Learning and Memory. doi:10.1016/j.nlm.2004.06.005.

[tulvingEpisodicSemanticMemory1972] Tulving, Endel. Episodic and Semantic Memory. Organization of Memory.

[tulvingEpisodicMemoryMind2002] Tulving, Endel. Episodic Memory: From Mind to Brain. Annual Review of Psychology. doi:10.1146/annurev.psych.53.100901.135114.

[piaoAgentSocietyLargeScaleSimulation2025] Piao, Jinghua, Yan, Yuwei, Zhang, Jun, Li, Nian, Yan, Junbo, Lan, Xiaochong, Lu, Zhihong, Zheng, Zhiheng, Wang, Jing Yi, Zhou, Di, Gao, Chen, Xu, Fengli, Zhang, Fang, Rong, Ke, Su, Jun, Li, Yong. AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society. doi:10.48550/arXiv.2502.08691.

[gaoSurveySelfEvolvingAgents2025b] Gao, Huan-ang, Geng, Jiayi, Hua, Wenyue, Hu, Mengkang, Juan, Xinzhe, Liu, Hongzhang, Liu, Shilong, Qiu, Jiahao, Qi, Xuan, Wu, Yiran, Wang, Hongru, Xiao, Han, Zhou, Yuhang, Zhang, Shaokun, Zhang, Jiayi, Xiang, Jinyu, Fang, Yixiong, Zhao, Qiwen, Liu, Dongrui, Ren, Qihan, Qian, Cheng, Wang, Zhenhailong, Hu, Minda, Wang, Huazheng, Wu, Qingyun, Ji, Heng, Wang, Mengdi. A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence. doi:10.48550/arXiv.2507.21046.

[meiSurveyContextEngineering2025] Mei, Lingrui, Yao, Jiayu, Ge, Yuyao, Wang, Yiwei, Bi, Baolong, Cai, Yujun, Liu, Jiazhi, Li, Mingyu, Li, Zhong-Zhi, Zhang, Duzhen, Zhou, Chenlin, Mao, Jiayi, Xia, Tianze, Guo, Jiafeng, Liu, Shenghua. (2025). A Survey of Context Engineering for Large Language Models. doi:10.48550/arXiv.2507.13334.

[reberNeuralBasisImplicit2013] Reber, Paul J.. The Neural Basis of Implicit Learning and Memory: A Review of Neuropsychological and Neuroimaging Research. Neuropsychologia. doi:10.1016/j.neuropsychologia.2013.06.019.

[segerCriticalReviewHabit2011] Seger, Carol A., Spiering, Brian J.. (2011). A Critical Review of Habit Learning and the Basal Ganglia. Frontiers in Systems Neuroscience. doi:10.3389/fnsys.2011.00066.

[suttonWelcomeEraExperience2025] Sutton, Richard S., David Silver. (2025). Welcome to the Era of Experience. The AI Innovator.

[jiang2025longtermmemoryfoundation] Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen. (2025). Long Term Memory: The Foundation of AI Self-Evolution.

[DBLP:journals/corr/abs-2510-04618] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun. (2025). Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. CoRR. doi:10.48550/ARXIV.2510.04618.

[DBLP:conf/chi/ZulfikarCM24] Wazeer Deen Zulfikar, Samantha W. T. Chan, Pattie Maes. (2024). Memoro: Using Large Language Models to Realize a Concise Interface for Real-Time Memory Augmentation. Proceedings of the {CHI. doi:10.1145/3613904.3642450.

[DBLP:journals/corr/abs-2509-05298] Rui Xi, Xianghan Wang. (2025). Livia: An Emotion-Aware AR. CoRR. doi:10.48550/ARXIV.2509.05298.

[zhang2025memoryretrievalconsolidationlarge] Shaohua Zhang, Yuan Lin, Hang Li. (2025). Memory Retrieval and Consolidation in Large Language Models through Function Tokens.

[lin2025continuallearningsparsememory] Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz. (2025). Continual Learning via Sparse Memory Finetuning.

[DBLP:conf/cvpr/SongCWZZWCG0ZLH24] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, et al. (2024). MovieChat: From Dense Token to Sparse Memory for Long Video Understanding. IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. doi:10.1109/CVPR52733.2024.01725.

[rezazadeh2025collaborativememorymultiusermemory] Alireza Rezazadeh, Zichao Li, Ange Lou, Yuying Zhao, Wei Wei, Yujia Bao. (2025). Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control.

[DBLP:journals/corr/abs-2506-03141] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu. (2025). Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval. CoRR. doi:10.48550/ARXIV.2506.03141.

[DBLP:journals/corr/abs-2501-00358] Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li. (2025). Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding. CoRR. doi:10.48550/ARXIV.2501.00358.

[DBLP:journals/corr/abs-2505-16348] Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, and others. (2025). Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance. CoRR. doi:10.48550/ARXIV.2505.16348.

[DBLP:journals/corr/abs-2510-14629] Jiani Huang, Xingchen Zou, Lianghao Xia, Qing Li. (2025). MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLMs. CoRR. doi:10.48550/ARXIV.2510.14629.

[DBLP:conf/acl/LeeHPP023] Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee. (2023). Prompted LLMs as Chatbot Modules for Long Open-domain Conversation. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/V1/2023.FINDINGS-ACL.277.

[DBLP:journals/corr/abs-2510-07925] Rebecca Westhäußer, and others. (2025). Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles. CoRR. doi:10.48550/ARXIV.2510.07925.

[DBLP:journals/tmlr/kvcachemanagement] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen. (2025). A Survey on Large Language Model Acceleration based on KV Cache Management. Transactions on Machine Learning Research.

[Jiang_towards_2025] Jiang, Jiantong, Yang, Peiyu, Zhang, Rui, Liu, Feng. (2025). Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization. TechRxiv. doi:10.36227/techrxiv.176046306.66521015/v2.

[DBLP:journals/corr/abs-2308-09597] Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi Mi, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, Haozhen Sun. (2023). ChatHaruhi: Reviving Anime Character in Reality via Large Language Model. CoRR. doi:10.48550/ARXIV.2308.09597.

[DBLP:conf/acl/WangPQLZWGGN00024] Noah Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, Junran Peng. (2024). RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/V1/2024.FINDINGS-ACL.878.

[wang2025scmenhancinglargelanguage] Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, Zhoujun Li. (2025). SCM: Enhancing Large Language Model with Self-Controlled Memory Framework.

[chen2025compress] Nuo Chen, Hongguang Li, Jianhui Chang, Juhua Huang, Baoyuan Wang, Jia Li. (2025). Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations. Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025).

[DBLP:conf/coling/YuanSLWCL25] Ruifeng Yuan, Shichao Sun, Yongqi Li, Zili Wang, Ziqiang Cao, Wenjie Li. (2025). Personalized Large Language Model Assistant with Evolving Conditional Memory. Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025).

[DBLP:journals/corr/abs-2404-09982] Hang Gao, Yongfeng Zhang. (2024). Memory Sharing for Large Language Model based Agents. CoRR. doi:10.48550/ARXIV.2404.09982.

[DBLP:journals/corr/abs-2406-00057] Nick Alonso, Tomas Figliolia, Anthony Ndirango, Beren Millidge. (2024). Toward Conversational Agents with Context and Time Sensitive Long-term Memory. CoRR. doi:10.48550/ARXIV.2406.00057.

[DBLP:journals/corr/abs-2508-15294] Gaoke Zhang, Bo Wang, Yunlong Ma, Dongming Zhao, Zifei Yu. (2025). Multiple Memory Systems for Enhancing the Long-term Memory of Agent. CoRR. doi:10.48550/ARXIV.2508.15294.

[DBLP:conf/iclr/FountasBOCLB025] Zafeirios Fountas, Martin Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang. (2025). Human-inspired Episodic Memory for Infinite Context LLMs. The Thirteenth International Conference on Learning Representations (ICLR 2025).

[DBLP:journals/corr/abs-2311-08719] Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, Guannan Zhang. (2023). Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory. CoRR. doi:10.48550/ARXIV.2311.08719.

[DBLP:journals/corr/abs-2508-03341] Jiayan Nan, Wenquan Ma, Wenlong Wu, Yize Chen. (2025). Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science. CoRR. doi:10.48550/ARXIV.2508.03341.

[DBLP:journals/corr/abs-2507-02592] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou. (2025). WebSailor: Navigating Super-human Reasoning for Web Agent. CoRR. doi:10.48550/ARXIV.2507.02592.

[DBLP:journals/corr/abs-2501-05366] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou. (2025). Search-o1: Agentic Search-Enhanced Large Reasoning Models. CoRR. doi:10.48550/ARXIV.2501.05366.

[DBLP:journals/corr/abs-2503-09516] Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, Jiawei Han. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. CoRR. doi:10.48550/ARXIV.2503.09516.

[DBLP:journals/corr/abs-2503-05592] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen. (2025). R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. CoRR. doi:10.48550/ARXIV.2503.05592.

[DBLP:journals/corr/abs-2505-22648] Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou. (2025). WebDancer: Towards Autonomous Information Seeking Agency. CoRR. doi:10.48550/ARXIV.2505.22648.

[DBLP:journals/corr/abs-2504-21776] Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou. (2025). WebThinker: Empowering Large Reasoning Models with Deep Research Capability. CoRR. doi:10.48550/ARXIV.2504.21776.

[DBLP:journals/corr/abs-2510-12635] Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang. (2025). Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks. CoRR. doi:10.48550/ARXIV.2510.12635.

[DBLP:journals/corr/abs-2510-11967] Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen. (2025). Scaling Long-Horizon LLM Agent via Context-Folding. CoRR. doi:10.48550/ARXIV.2510.11967.

[chen2025iterresearchrethinkinglonghorizonagents] Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou. (2025). IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction.

[DBLP:journals/corr/abs-2508-16153] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang. (2025). Memento: Fine-tuning LLM Agents without Fine-tuning LLMs. CoRR. doi:10.48550/ARXIV.2508.16153.

[DBLP:journals/corr/abs-2509-12810] Shicheng Ye, Chao Yu, Kaiqiang Ke, Chengdong Xu, Yinqi Wei. (2025). H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents. CoRR. doi:10.48550/ARXIV.2509.12810.

[DBLP:conf/acl/YinWPL0W25] Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, William Yang Wang. (2025). Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025.

[DBLP:journals/corr/abs-2504-14603] Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, and others. (2025). UFO2: The Desktop AgentOS. CoRR. doi:10.48550/ARXIV.2504.14603.

[wang2025unveilingprivacyrisksllm] Bo Wang, Weiyi He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, Pengfei He. (2025). Unveiling Privacy Risks in LLM Agent Memory.

[DBLP:journals/tois/LiJZZZZD25] Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, Zhicheng Dou. (2025). From Matching to Generation: A Survey of Generative Information Retrieval. ACM Transactions on Information Systems. doi:10.1145/3722552.

[DBLP:conf/www/Zeng0JSWZ24] Hansi Zeng, Chen Luo, Bowen Jin, Sheikh Muhammad Sarwar, Tianxin Wei, Hamed Zamani. (2024). Scalable and Effective Generative Information Retrieval. Proceedings of the ACM Web Conference 2024 (WWW 2024). doi:10.1145/3589334.3645477.

[bai2025longbench] Bai, Yushi, Tu, Shangqing, Zhang, Jiajie, Peng, Hao, Wang, Xiaozhi, Lv, Xin, Cao, Shulin, Xu, Jiazheng, Hou, Lei, Dong, Yuxiao, others. (2025). Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[chen2025halumem] Chen, Ding, Niu, Simin, Li, Kehang, Liu, Peng, Zheng, Xiangping, Tang, Bo, Li, Xinchi, Xiong, Feiyu, Li, Zhiyu. (2025). HaluMem: Evaluating Hallucinations in Memory Systems of Agents. arXiv preprint arXiv:2511.03506.

[tan2025membench] Tan, Haoran, Zhang, Zeyu, Ma, Chen, Chen, Xu, Dai, Quanyu, Dong, Zhenhua. (2025). MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. Findings of the Association for Computational Linguistics: ACL 2025. doi:10.18653/v1/2025.findings-acl.989.

[jiang2025know] Jiang, Bowen, Hao, Zhuoqun, Cho, Young-Min, Li, Bryan, Yuan, Yuan, Chen, Sihao, Ungar, Lyle, Taylor, Camillo J, Roth, Dan. (2025). Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225.

[zhang2025explicit] Zhang, Zeyu, Zhang, Yang, Tan, Haoran, Li, Rui, Chen, Xu. (2025). Explicit vs implicit memory: Exploring multi-hop complex reasoning over personalized information. arXiv preprint arXiv:2508.13250.

[DBLP:conf/iclr/Zhao00HL25] Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin. (2025). Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. The Thirteenth International Conference on Learning Representations, {ICLR.

[jia2025evaluating] Jia, Zixi, Liu, Qinghua, Li, Hexiao, Chen, Yuyan, Liu, Jiqiang. (2025). Evaluating the Long-Term Memory of Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025. doi:10.18653/v1/2025.findings-acl.1014.

[du2024perltqa] Du, Yiming, Wang, Hongru, Zhao, Zhengyi, Liang, Bin, Wang, Baojun, Zhong, Wanjun, Wang, Zezhong, Wong, Kam-Fai. (2024). PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval and Synthesis in Question Answering. Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10).

[DBLP:conf/iclr/WuWYZCY25] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. The Thirteenth International Conference on Learning Representations (ICLR 2025).

[ai2025memorybench] Ai, Qingyao, Tang, Yichen, Wang, Changyue, Long, Jianming, Su, Weihang, Liu, Yiqun. (2025). MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems. arXiv preprint arXiv:2510.17281.

[zheng2025lifelongagentbench] Zheng, Junhao, Cai, Xidi, Li, Qiuke, Zhang, Duzhen, Li, ZhongZhi, Zhang, Yingying, Song, Le, Ma, Qianli. (2025). LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners. arXiv preprint arXiv:2505.11942.

[DBLP:conf/nips/WuTLCL24] Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee. (2024). StreamBench: Towards Benchmarking Continuous Improvement of Language Agents. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[wan2025storybench] Wan, Luanbo, Ma, Weizhi. (2025). StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns. arXiv preprint arXiv:2506.13356.

[miyai2025webchorearena] Miyai, Atsuyuki, Zhao, Zaiying, Egashira, Kazuki, Sato, Atsuki, Sunada, Tatsumi, Onohara, Shota, Yamanishi, Hiromasa, Toyooka, Mashiro, Nishina, Kunato, Maeda, Ryoma, others. (2025). WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks. arXiv preprint arXiv:2506.01952.

[wei2025evo] Wei, Tianxin, Sachdeva, Noveen, Coleman, Benjamin, He, Zhankui, Bei, Yuanchen, Ning, Xuying, Ai, Mengting, Li, Yunzhe, He, Jingrui, Chi, Ed H, others. (2025). Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv preprint arXiv:2511.20857.

[deng2024multi] Deng, Yang, Zhang, Xuan, Zhang, Wenxuan, Yuan, Yifei, Ng, See-Kiong, Chua, Tat-Seng. (2024). On the Multi-turn Instruction Following for Conversational Web Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.477.

[hu2025evaluating] Hu, Yuanzhe, Wang, Yu, McAuley, Julian. (2025). Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257.

[maharana2024evaluating] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.747.

[he2025madial] He, Junqing, Zhu, Liang, Wang, Rui, Wang, Xi, Haffari, Gholamreza, Zhang, Jiaxing. (2025). MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2025.naacl-long.499.

[bai2024longbench] Bai, Yushi, Lv, Xin, Zhang, Jiajie, Lyu, Hongchang, Tang, Jiankai, Huang, Zhidian, Du, Zhengxiao, Liu, Xiao, Zeng, Aohan, Hou, Lei, Dong, Yuxiao, Tang, Jie, Li, Juanzi. (2024). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.172.

[hsieh2024ruler] Hsieh, Cheng-Ping, Sun, Simeng, Kriman, Samuel, Acharya, Shantanu, Rekesh, Dima, Jia, Fei, Zhang, Yang, Ginsburg, Boris. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?. arXiv preprint arXiv:2404.06654.

[DBLP:conf/nips/KuratovBARSS024] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Y. Sorokin, Mikhail Burtsev. (2024). BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.

[wang2025multimodal] Wang, Hengyi, Shi, Haizhou, Tan, Shiwei, Qin, Weiyi, Wang, Wenyuan, Zhang, Tunyu, Nambi, Akshay, Ganu, Tanuja, Wang, Hao. (2025). Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2025.naacl-long.166.

[packer2023memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G., Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.

[Chhikara2025mem0] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[githubGitHubKingjulio8238Memary] Memary. (2025). GitHub repository.

[githubGitHubTopoteretescognee] Cognee. (2025). GitHub repository.

[supermemorySupermemoryUniversal] Supermemory. (2025). Supermemory: The Universal Memory API.

[githubGitHubLangchainailangmem] LangChain. (2025). LangMem. GitHub repository.

[jiaAutoToolEfficientTool2025] Jia, Jingyi, Li, Qinbin. AutoTool: Efficient Tool Selection for Large Language Model Agents. doi:10.48550/arXiv.2511.14650.

[liCompressingContextEnhance2023a] Li, Yucheng, Dong, Bo, Guerin, Frank, Lin, Chenghua. Compressing Context to Enhance Inference Efficiency of Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2023.emnlp-main.391.

[lu2025dynamicaffectivememorymanagement] Junfeng Lu, Yueyan Li. (2025). Dynamic Affective Memory Management for Personalized LLM Agents.

[chen2024melodiexploringmemorycompression] Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, David Racz, Jesper Andersen. (2024). MELODI: Exploring Memory Compression for Long Contexts.

[xiao2025improvingefficiencyllmagent] Yuan-An Xiao, Pengfei Gao, Chao Peng, Yingfei Xiong. (2025). Improving the Efficiency of LLM Agent Systems through Trajectory Reduction.

[lu2025scalingllmmultiturnrl] Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen. (2025). Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management.

[kangACONOptimizingContext2025] Kang, Minki, Chen, Wei-Ning, Han, Dongge, Inan, Huseyin A., Wutschitz, Lukas, Chen, Yanzhi, Sim, Robert, Rajmohan, Saravan. ACON: Optimizing Context Compression for Long-Horizon LLM Agents. doi:10.48550/arXiv.2510.00615.

[liCAMConstructivistView2025] Li, Rui, Zhang, Zeyu, Bo, Xiaohe, Tian, Zihang, Chen, Xu, Dai, Quanyu, Dong, Zhenhua, Tang, Ruiming. CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension. doi:10.48550/arXiv.2510.05520.

[luScalingLLMMultiturn2025] Lu, Miao, Sun, Weiwei, Du, Weihua, Ling, Zhan, Yao, Xuesong, Liu, Kang, Chen, Jiecao. Scaling LLM Multi-Turn RL with End-to-End Summarization-Based Context Management. doi:10.48550/arXiv.2510.06727.

[tangTurnLimitsTraining2025] Tang, Qiaoyu, Xiang, Hao, Yu, Le, Yu, Bowen, Lu, Yaojie, Han, Xianpei, Sun, Le, Zhang, WenJuan, Wang, Pengbo, Liu, Shixuan, Zhang, Zhenru, Tu, Jianhong, Lin, Hongyu, Lin, Junyang. Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window. doi:10.48550/arXiv.2510.08276.

[liWebWeaverStructuringWebScale2025] Li, Zijian, Guan, Xin, Zhang, Bo, Huang, Shen, Zhou, Houquan, Lai, Shaopeng, Yan, Ming, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhang, Jun, Zhou, Jingren. WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research. doi:10.48550/arXiv.2509.13312.

[cai2025flexcontinuousagentevolution] Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou. (2025). FLEX: Continuous Agent Evolution via Forward Learning from Experience.

[zhai2025agentevolverefficientselfevolvingagent] Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, Zhaoyang Liu, Bolin Ding, Jingren Zhou. (2025). AgentEvolver: Towards Efficient Self-Evolving Agent System.

[chen2025scalingagentlearningexperience] Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh. (2025). Scaling Agent Learning via Experience Synthesis.

[wang2025inducingprogrammaticskillsagentic] Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried. (2025). Inducing Programmatic Skills for Agentic Tasks.

[cai2025trainingfreegrouprelativepolicy] Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun. (2025). Training-Free Group Relative Policy Optimization.

[jiang2025multiagentincontextcoordinationdecentralized] Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan, Lei Yuan, Zongzhang Zhang, Yang Yu, Deheng Ye. (2025). Multi-agent In-context Coordination via Decentralized Memory Retrieval.

[shinn2023reflexion] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

[niu2024ragtruth] Niu, Cheng, Wu, Yuanhao, Zhu, Juno, Xu, Siliang, Shum, Kashun, Zhong, Randy, Song, Juntong, Zhang, Tong. (2024). Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[sunredeep] Sun, ZhongXiang, Zang, Xiaoxue, Zheng, Kai, Xu, Jun, Zhang, Xiao, Yu, Weijie, Song, Yang, Li, Han. (2025). ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. The Thirteenth International Conference on Learning Representations.

[lu2025spad] Lu, Pengqian, Lu, Jie, Liu, Anjin, Zhang, Guangquan. (2025). SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG. arXiv preprint arXiv:2512.07515.

[ru2024ragchecker] Ru, Dongyu, Qiu, Lin, Hu, Xiangkun, Zhang, Tianhang, Shi, Peng, Chang, Shuaichen, Jiayang, Cheng, Wang, Cunxiang, Sun, Shichao, Li, Huanyu, others. (2024). Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems.

[wang2025astute] Wang, Fei, Wan, Xingchen, Sun, Ruoxi, Chen, Jiefeng, Arik, Sercan O. (2025). Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[wang2025retrieval] Wang, Han, Prasad, Archiki, Stengel-Eskin, Elias, Bansal, Mohit. (2025). Retrieval-augmented generation with conflicting evidence. arXiv preprint arXiv:2504.13079.

[shi2025privacy] Shi, Zitong, Wan, Guancheng, Huang, Wenke, Zhang, Guibin, Shao, Jiawei, Ye, Mang, Yang, Carl. (2025). Privacy-Enhancing Paradigms within Federated Multi-Agent Systems. arXiv preprint arXiv:2503.08175.

[rezazadeh2025collaborative] Rezazadeh, Alireza, Li, Zichao, Lou, Ange, Zhao, Yuying, Wei, Wei, Bao, Yujia. (2025). Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control. arXiv preprint arXiv:2505.18279.

[openai2024memorycontrol] OpenAI. (2024). Memory and new controls for ChatGPT.

[hu2024refchecker] Hu, Xiangkun, Ru, Dongyu, Qiu, Lin, Guo, Qipeng, Zhang, Tianhang, Xu, Yang, Luo, Yun, Liu, Pengfei, Zhang, Yue, Zhang, Zheng. (2024). RefChecker: Reference-based fine-grained hallucination checker and benchmark for large language models. arXiv preprint arXiv:2405.14486.

[sunlargepig] Sun, Zhongxiang, Si, Zihua, Zang, Xiaoxue, Zheng, Kai, Song, Yang, Zhang, Xiao, Xu, Jun. (2025). LargePiG for Hallucination-Free Query Generation: Your Large Language Model is Secretly a Pointer Generator. Proceedings of the ACM on Web Conference 2025. doi:10.1145/3696410.3714800.

[sun2025rearter] Sun, Zhongxiang, Wang, Qipeng, Yu, Weijie, Zang, Xiaoxue, Zheng, Kai, Xu, Jun, Zhang, Xiao, Song, Yang, Li, Han. (2025). Rearter: Retrieval-augmented reasoning with trustworthy process rewarding. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[de-cao-etal-2021-editing] De Cao, Nicola, Aziz, Wilker, Titov, Ivan. (2021). Editing Factual Knowledge in Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.522.

[atkinsonHumanMemoryProposed1968] Atkinson, R. C., Shiffrin, R. M. Human Memory: A Proposed System and Its Control Processes. The Psychology of Learning and Motivation: II. doi:10.1016/S0079-7421(08)60422-3.

[kumaranWhatLearningSystems2016] Kumaran, Dharshan, Hassabis, Demis, McClelland, James L. What Learning Systems Do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences. doi:10.1016/j.tics.2016.05.004.

[mcclellandWhyThereAre1995] McClelland, James L., McNaughton, Bruce L., O'Reilly, Randall C. Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights from the Successes and Failures of Connectionist Models of Learning and Memory. Psychological Review. doi:10.1037/0033-295X.102.3.419.

[schacterConstructiveMemoryGhosts2007] Schacter, Daniel L., Addis, Donna Rose. Constructive Memory: The Ghosts of Past and Future. Nature. doi:10.1038/445027a.

[andersonActiveForgettingAdaptation2021] Anderson, Michael C., Hulbert, Justin C. Active Forgetting: Adaptation of Memory by Prefrontal Control. Annual Review of Psychology. doi:10.1146/annurev-psych-072720-094140.

[mattarPrioritizedMemoryAccess2018] Mattar, Marcelo G., Daw, Nathaniel D. Prioritized Memory Access Explains Planning and Hippocampal Replay. Nature Neuroscience. doi:10.1038/s41593-018-0232-z.

[dong2025unified] Dong, Yifei, Wu, Fengyi, Chen, Guangyu, Cheng, Zhi-Qi, Hu, Qiyu, Zhou, Yuxuan, Sun, Jingdong, He, Jun-Yan, Dai, Qi, Hauptmann, Alexander G. (2025). Unified world models: Memory-augmented planning and foresight for visual navigation. arXiv preprint arXiv:2510.08713.

[hong2025relic] Hong, Yicong, Mei, Yiqun, Ge, Chongjian, Xu, Yiran, Zhou, Yang, Bi, Sai, Hold-Geoffroy, Yannick, Roberts, Mike, Fisher, Matthew, Shechtman, Eli, others. (2025). RELIC: Interactive Video World Model with Long-Horizon Memory. arXiv preprint arXiv:2512.04040.

[li2025vmem] Li, Runjia, Torr, Philip, Vedaldi, Andrea, Jakab, Tomas. (2025). VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory. arXiv preprint arXiv:2506.18903.

[liao2025genie] Liao, Yue, Zhou, Pengfei, Huang, Siyuan, Yang, Donglin, Chen, Shengcong, Jiang, Yuxin, Hu, Yue, Cai, Jingbin, Liu, Si, Luo, Jianlan, others. (2025). Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635.

[liu2025rolling] Liu, Kunhao, Hu, Wenbo, Xu, Jiale, Shan, Ying, Lu, Shijian. (2025). Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161.

[po2025long] Po, Ryan, Nitzan, Yotam, Zhang, Richard, Chen, Berlin, Dao, Tri, Shechtman, Eli, Wetzstein, Gordon, Huang, Xun. (2025). Long-context state-space video world models. arXiv preprint arXiv:2505.20171.

[savov2025statespacediffuser] Savov, Nedko, Kazemi, Naser, Zhang, Deheng, Paudel, Danda Pani, Wang, Xi, Van Gool, Luc. (2025). StateSpaceDiffuser: Bringing Long Context to Diffusion World Models. arXiv preprint arXiv:2505.22246.

[xiao2025worldmem] Xiao, Zeqi, Lan, Yushi, Zhou, Yifan, Ouyang, Wenqi, Yang, Shuai, Zeng, Yanhong, Pan, Xingang. (2025). Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369.

[yang2025longlive] Yang, Shuai, Huang, Wei, Chu, Ruihang, Xiao, Yicheng, Zhao, Yuyang, Wang, Xianbang, Li, Muyang, Xie, Enze, Chen, Yingcong, Lu, Yao, others. (2025). Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622.

[bruce2024genie] Bruce, Jake, Dennis, Michael D, Edwards, Ashley, Parker-Holder, Jack, Shi, Yuge, Hughes, Edward, Lai, Matthew, Mavalankar, Aditi, Steigerwald, Richie, Apps, Chris, others. (2024). Genie: Generative interactive environments. Forty-first International Conference on Machine Learning.

[agarwal2025cosmos] Agarwal, Niket, Ali, Arslan, Bala, Maciej, Balaji, Yogesh, Barker, Erik, Cai, Tiffany, Chattopadhyay, Prithvijit, Chen, Yongxin, Cui, Yin, Ding, Yifan, others. (2025). Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575.

[guo2025ctrl] Guo, Yanjiang, Shi, Lucy Xiaoyang, Chen, Jianyu, Finn, Chelsea. (2025). Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125.

[bjorck2025gr00t] Bjorck, Johan, Castañeda, Fernando, others. (2025). GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.

[yu2025videossm] Yu, Yifei, Wu, Xiaoshan, Hu, Xinting, Hu, Tao, Sun, Yangtian, Lyu, Xiaoyang, Wang, Bo, Ma, Lin, Ma, Yuewen, Wang, Zhongrui, others. (2025). VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory. arXiv preprint arXiv:2512.04519.

[yu2025context] Yu, Jiwen, Bai, Jianhong, Qin, Yiran, Liu, Quande, Wang, Xintao, Wan, Pengfei, Zhang, Di, Liu, Xihui. (2025). Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141.

[lscs] Wang, Yu, Han, Chi, Wu, Tongtong, He, Xiaoxin, Zhou, Wangchunshu, Sadeq, Nafis, Chen, Xiusi, He, Zexue, Wang, Wei, Haffari, Gholamreza, others. (2024). Towards lifespan cognitive systems. arXiv preprint arXiv:2409.13265.

[zhou2025ememsimplestrongbaselinelongterm] Sizhe Zhou, Jiawei Han. (2025). A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents.

[hu2026evermemosselforganizingmemoryoperating] Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng. (2026). EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning.

[luo2025videoragvisuallyalignedretrievalaugmentedlong] Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji. (2025). Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension.

[hu2026memorymattersmoreeventcentric] Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, Zhicheng Dou. (2026). Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning.

[jiang2026magmamultigraphbasedagentic] Dongming Jiang, Yi Li, Guanpeng Li, Bingzhe Li. (2026). MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents.

[wang2025R3Mem] Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu. (2025). R³Mem: Bridging Memory Retention and Retrieval via Reversible Compression. Findings of the Association for Computational Linguistics: ACL 2025.

[cao2025memorydecoder] Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin. (2025). Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models. CoRR. doi:10.48550/ARXIV.2508.09874.

[tian2025RGMem] Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, Yanfang Liu. (2025). RGMem: Renormalization Group-based Memory Evolution for Language Agent User Profile. CoRR. doi:10.48550/ARXIV.2510.16392.

[zhou2026contextedusfaithfulstructured] Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun. (2026). From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition.

[yeo2025worldmmdynamicmultimodalmemory] Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang. (2025). WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning.

[park2024memoria] Sangjun Park, JinYeong Bak. (2024). Memoria: Resolving Fateful Forgetting Problem through Human-Inspired Memory Architecture. Forty-first International Conference on Machine Learning (ICML 2024).

[sarin2025memoriascalableagenticmemory] Samarth Sarin, Lovepreet Singh, Bhaskarjit Sarmah, Dhagash Mehta. (2025). Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI.

[cai2025experiencedriven] Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He. (2025). Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark. CoRR. doi:10.48550/ARXIV.2508.19005.

[sun2025sophiapersistentagentframework] Mingyang Sun, Feng Hong, Weinan Zhang. (2025). Sophia: A Persistent Agent Framework of Artificial Life.

[zhang2026memrlselfevolvingagentsruntime] Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen. (2026). MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory.

[cao2025remembermerefineme] Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao. (2025). Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution.

[zeppieri2025mmagmixedmemoryaugmentedgeneration] Stefano Zeppieri. (2025). MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications.

[latimer2025hindsight2020buildingagent] Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava, Xuan Wang, Naren Ramakrishnan. (2025). Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects.

[yan2025generalagenticmemorydeep] B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu. (2025). General Agentic Memory Via Deep Research.

[bini2025memloradistillingexpertadapters] Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli. (2025). MemLoRA: Distilling Expert Adapters for On-Device Memory Systems.

[zhang2025memevolvemetaevolutionagentmemory] Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, Shuicheng Yan. (2025). MemEvolve: Meta-Evolution of Agent Memory Systems.

[liSculptorEmpoweringLLMs2025] Li, Mo, Xu, L. H., Tan, Qitai, Ma, Long, Cao, Ting, Liu, Yunxin. Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management. doi:10.48550/arXiv.2508.04664.

[linSleeptimeComputeInference2025] Lin, Kevin, Snell, Charlie, Wang, Yu, Packer, Charles, Wooders, Sarah, Stoica, Ion, Gonzalez, Joseph E. Sleep-time Compute: Beyond Inference Scaling at Test-time. doi:10.48550/arXiv.2504.13171.

[yuAgenticMemoryLearning2026] Yu, Yi, Yao, Liuyi, Xie, Yuexiang, Tan, Qingquan, Feng, Jiaqi, Li, Yaliang, Wu, Libing. Agentic Memory Learning. doi:10.48550/arXiv.2601.01885.