
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu

Abstract

Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent’s policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.

Introduction

In long-horizon agentic tasks involving multi-step reasoning and complex workflows (Chang et al., 2024), the effectiveness of large language model (LLM) agents is fundamentally constrained by the information they can attend to at any given time, which we collectively refer to as the agent's memory (Xiong et al., 2025; Goodyear et al., 2025). Memory typically falls into two categories: long-term memory (LTM), which persistently stores user- or task-specific knowledge (Zhong et al., 2024; Jiang et al., 2024), and short-term memory (STM), which comprises the information contained in the current input context (Wu et al., 2025b; Gao et al., 2025b). High-quality LTM supports efficient retrieval of accumulated knowledge, while effective STM management reduces redundancy and preserves salient context. Together, they mitigate the limitations of finite context windows, making their joint management crucial for improving agent performance in complex reasoning settings.

However, existing research has predominantly treated LTM and STM as independent components. STM is commonly enhanced through retrieval-augmented generation (RAG) (Pan et al., 2025b), such as in MainRAG (Chang et al., 2025) and ReSum (Wu et al., 2025a), which expand usable context via external retrieval or periodic summarization. Although effective in some tasks, these methods rely heavily on predefined schedules or heuristic rules, which can overlook infrequent but critical details and introduce unnecessary noise (Ma et al., 2025; Dong et al., 2025). In contrast, LTM management has progressed along separate lines, typically categorized into trigger-based (Kang et al., 2025; Wang and Chen, 2025; Wang et al., 2025c; Chhikara et al., 2025) and agent-based (Yan et al., 2025; Hu et al., 2025; Xu et al., 2025) paradigms. The former executes fixed memory operations at predefined moments, whereas the latter incorporates a specialized memory manager to determine what and how to store. Despite offering more flexibility, most approaches still depend on handcrafted rules or auxiliary expert models, limiting adaptability and increasing system complexity (Xiong et al., 2025).

As a consequence, LTM and STM are typically treated as separate and loosely coupled modules. As illustrated in Figure 1, existing architectures generally follow two patterns: (a) static STM with trigger-based LTM, or (b) static STM with agent-based LTM. In both settings, the two memory systems are optimized independently and later combined in an ad hoc way, leading to fragmented memory construction and suboptimal performance in long-horizon reasoning tasks. Thus, unifying the management of LTM and STM remains a necessary yet largely unexplored challenge.

Figure 1: Comparison between independent and unified memory management frameworks. (Left) Traditional framework with static STM and trigger-based LTM. (Middle) Independent framework with an additional Memory Manager controlling LTM in an agent-based manner, while STM remains static. (Right) The proposed AgeMem framework, where LTM and STM are jointly and intelligently managed via explicit tool-based operations.

Nevertheless, achieving unified memory management poses three fundamental challenges. (C1) Functional heterogeneity coordination: LTM and STM serve distinct yet complementary purposes: LTM determines what to store, update, or discard, while STM governs what to retrieve, summarize, or remove from the active context (Zhang et al., 2025b). The challenge lies in designing a unified mechanism that orchestrates their interplay synergistically. (C2) Training paradigm mismatch: Existing reinforcement learning (RL) frameworks adopt markedly different training strategies for the two memory types (Ma et al., 2024). LTM-focused training often leverages session-level information available prior to interaction, whereas STM training typically injects distractors to simulate long-horizon contexts (Sun et al., 2024). Moreover, standard RL assumes continuous trajectories with stable rewards, which conflicts with the inherently fragmented and discontinuous experiences produced by memory operations (Wu et al., 2025a), making end-to-end optimization particularly challenging. (C3) Practical deployment constraints: Many agent systems rely on an auxiliary expert LLM for memory control, significantly increasing inference cost and training complexity. How to integrate unified memory management directly into an agent without dependence on external expert models remains an open problem.

To address these challenges, we propose Agentic Memory (AgeMem), a unified framework that jointly manages LTM and STM, illustrated in Figure 1 (right). Unlike prior designs that treat memory as an external component, AgeMem integrates both memory types directly into the agent's decision-making process. Through a unified tool-based interface, the LLM autonomously invokes and executes memory operations for both LTM and STM. Furthermore, we design a three-stage progressive RL strategy: the model first acquires LTM storage capabilities, then learns STM context management, and finally coordinates both forms of memory under full task settings. To address the fragmented experience issue across training stages, we design a step-wise Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which transforms cross-stage dependencies into learnable signals, thereby alleviating the challenges posed by sparse and discontinuous rewards in RL. We evaluate AgeMem on five long-context, reasoning-intensive benchmarks. Comprehensive results show that AgeMem consistently outperforms strong baselines, validating the effectiveness of unified memory management.

Our main contributions are as follows:

· We propose Agentic Memory (AgeMem), a unified agentic memory framework that enables LLM-based agents to autonomously decide when, what, and how to manage both long-term and short-term memory.

· We develop a three-stage progressive RL strategy equipped with a step-wise GRPO mechanism, facilitating effective end-to-end learning of unified memory management behaviors.

· We conduct comprehensive evaluations across multiple models and long-horizon benchmarks, demonstrating the robustness and effectiveness of AgeMem in complex agentic tasks.

Related Work

Long-term memory (LTM). Persistent LTM is crucial for LLM-based agents operating over extended horizons (Wang et al., 2025b; Li et al., 2025). Recent work has explored diverse architectural designs for modeling LTM. LangMem (LangChain Team, 2025) provides a modular framework that supports multiple memory types, while A-Mem (Xu et al., 2025) adopts a Zettelkasten-inspired design that links structured knowledge units to facilitate consolidation. Mem0 (Chhikara et al., 2025) proposes a scalable extract-update pipeline and extends it to a graph-based variant for structured reasoning, and Zep (Rasmussen et al., 2025) represents memory as a temporal knowledge graph to enable cross-session and time-aware reasoning. Although effective in organizing and retrieving information, these approaches largely rely on predefined memory structures or heuristic update rules. As memory grows, such designs commonly suffer from increased system complexity and lack adaptive, learning-based strategies for prioritization and forgetting. In contrast, our work aims to learn an adaptive memory policy that allows agents to dynamically decide what to store, update, or forget, depending on task demands and long-term utility.

Short-term memory (STM). STM in agentic LLMs primarily concerns context selection and retrieval (Wang et al., 2024; Jin et al., 2024). Retrieval-Augmented Generation (RAG) (Pan et al., 2025b; Salama et al., 2025; Kagaya et al., 2024) is the dominant paradigm, expanding usable context by injecting retrieved content into prompts. While effective, RAG does not fundamentally prevent context explosion in long-horizon settings and may introduce irrelevant or distracting information. To address this issue, ReSum (Wu et al., 2025a) periodically compresses interaction histories into compact reasoning states, allowing agents to operate beyond fixed context-window constraints.
Yet its summarization schedule remains largely predefined, and aggressive compression risks discarding rare but crucial details. Our approach instead enables agents to learn when and how to retrieve, summarize, or filter context, achieving a more flexible balance between efficiency and information preservation.

Reinforcement learning for LLMs. Reinforcement learning has become an effective paradigm for improving the decision-making and reasoning capabilities of LLM-based agents (Yao et al., 2022; Jin et al., 2025; Qian et al., 2025; Chaudhari et al., 2025). Among recent advances, GRPO (Shao et al., 2024) enhances stability by optimizing policies based on the relative quality of sampled trajectories, removing the need for an explicit value function. GRPO and its variants (Gilabert et al., 2025; Wang et al., 2025a) have shown strong performance in complex reasoning tasks. However, existing RL-based systems generally treat memory as a static or external component, making them ill-suited for the discontinuous and fragmented trajectories associated with memory operations (Yan et al., 2025; Zhang et al., 2025a). In contrast, our work integrates RL directly into the memory management process, enabling unified training of both language generation and memory operations.

Method

We propose Agentic Memory (AgeMem), a unified memory framework that enables LLM agents to autonomously manage both LTM and STM in an end-to-end manner. As illustrated in Figure 1 (right), AgeMem integrates memory management capabilities directly into the agent via a set of specialized tools, enabling the model to learn optimal strategies for unified memory management through a three-stage progressive RL strategy.

Problem Formulation

Unified RL formulation for AgeMem. At each time step t, the agent observes a state s_t ∈ S composed of the conversation context (short-term memory) C_t, the long-term memory store M_t, and the task specification T: s_t = (C_t, M_t, T). The specification T includes the input query q, contextual information I_q, and (for training only) the expected answer A_q. This formulation enables the agent to ground its decision-making in both transient context and persistent knowledge.

Given s_t, the agent selects an action a_t ∈ A from a hybrid action space that includes language generation as well as memory operations. The decision is governed by a parameterized policy π_θ, defined as π_θ(a_t | s_t) = P(a_t | s_t; θ), where θ denotes the LLM parameters and a_t ∼ π_θ(· | s_t). For a trajectory τ = (s_1, a_1, ..., s_T, a_T), the cumulative reward is defined as:

$$
R(\tau) = \sum_{i} R_i - P_{\text{penalty}},
$$

where R_i captures task performance and memory quality, and P_penalty discourages redundant storage, excessive tool usage, and uncontrolled context expansion. The optimization objective is:

$$
\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right].
$$

This formulation treats memory management as an integral component of the agent's policy, replacing handcrafted heuristics with a learnable mechanism.

Three-stage trajectory structure. To capture long-horizon interactions and progressively train memory capabilities, each trajectory is divided into three consecutive stages: τ = (τ^(1), τ^(2), τ^(3)), with a total length of T = T_1 + T_2 + T_3. In Stage 1, the agent engages in casual interactions and may store useful information into LTM. Stage 2 introduces distracting or irrelevant content, requiring the agent to manage its STM through selective retention and compression. Stage 3 presents a task that depends on coordinated use of both retained context and earlier accumulated LTM. A key aspect of this design is that the long-term memory M_t persists across all stages, allowing early knowledge to influence later decisions. In contrast, the context C_t is reset before Stage 2 to prevent information leakage across phases; this reset ensures the agent cannot solve the final task via residual context, thereby forcing proper retrieval from LTM and enabling effective training of memory operations.

At each step, we collect an experience tuple e_t = (s_t, a_t, r_t, log π_θ_old(a_t | s_t)), where r_t is typically zero for intermediate steps and is assigned after trajectory completion, and log π_θ_old(a_t | s_t) denotes the log probability under the old policy π_θ_old. This representation enables step-wise credit assignment under GRPO (Shao et al., 2024) and allows the agent to attribute long-term rewards to specific memory decisions across stages. By structuring trajectories in this staged yet continuous manner, the agent learns temporally coherent and task-adaptive memory policies essential for robust long-horizon reasoning.
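The experience tuple and staged trajectory can be sketched with simple containers. The class and field names below are our assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """One step e_t = (s_t, a_t, r_t, log pi_theta_old(a_t | s_t))."""
    state: dict          # (C_t, M_t, T): context, LTM store, task spec
    action: str          # generated text or a memory-tool call
    reward: float        # 0.0 for intermediate steps; set after the trajectory ends
    old_log_prob: float  # log-probability under the old policy

@dataclass
class Trajectory:
    """Three consecutive stages; LTM persists across all of them,
    while the STM context is reset before Stage 2."""
    stage1: list = field(default_factory=list)  # LTM construction
    stage2: list = field(default_factory=list)  # STM control under distractors
    stage3: list = field(default_factory=list)  # integrated reasoning

    def all_steps(self):
        # flatten into tau = (tau^(1), tau^(2), tau^(3)) of length T1 + T2 + T3
        return self.stage1 + self.stage2 + self.stage3
```

Keeping the stages as separate lists makes it easy to assign stage-specific rewards later while still treating the trajectory as one continuous sequence for optimization.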

Table 1: Memory management tools in AgeMem for manipulating long-term memory (LTM) and short-term memory (STM).

Memory Management via Tool Interface

AgeMem exposes memory-related operations to the LLM agent through an explicit tool interface (Table 1). The agent can modify its persistent LTM using ADD, UPDATE, and DELETE, while exercising fine-grained control over STM through RETRIEVE, SUMMARY, and FILTER. Incorporating these tools into the action space transforms memory control from an external heuristic pipeline into an intrinsic component of decision-making. This design allows the agent to adaptively manage memory according to task structure, history, and context. Implementation details are provided in Appendix A.1.
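As an illustration of how the six tools in Table 1 might be exposed, the sketch below wraps them as plain Python methods over an LTM list and an STM message list. All names and signatures are assumptions; in particular, the keyword-overlap ranking in `retrieve` stands in for the embedding-based similarity described in Appendix A.1:

```python
class MemoryTools:
    """Illustrative tool set: ADD/UPDATE/DELETE act on LTM,
    RETRIEVE/SUMMARY/FILTER control the STM context."""

    def __init__(self):
        self.ltm = []      # long-term memory entries
        self.context = []  # short-term context (message list)

    # --- LTM tools ---
    def add(self, content, metadata=None):
        self.ltm.append({"content": content, "metadata": metadata or {}})

    def update(self, index, content):
        self.ltm[index]["content"] = content

    def delete(self, index):
        self.ltm.pop(index)

    # --- STM tools ---
    def retrieve(self, query, k=3):
        # toy keyword-overlap ranking; a real system would rank by
        # cosine similarity of dense embeddings
        scored = sorted(self.ltm,
                        key=lambda m: -sum(w in m["content"] for w in query.split()))
        return scored[:k]

    def summary(self, summarize_fn):
        # collapse the whole context into one summary message
        self.context = [summarize_fn(self.context)]

    def filter(self, keep_fn):
        # drop context messages judged irrelevant
        self.context = [u for u in self.context if keep_fn(u)]
```

In the actual framework these operations would be surfaced as tool-call schemas in the agent's action space, so the policy itself decides when to invoke each one.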

Three-Stage Progressive RL Strategy

To learn unified and stable memory behaviors, we propose a progressive three-stage training strategy. For each task instance q ∈ T , the agent generates a complete trajectory:

$$
\left\{ \tau_k = \left( \tau_k^{(1)}, \tau_k^{(2)}, \tau_k^{(3)} \right) \right\}_{k=1}^{K},
$$

where K denotes the number of independent rollouts, and each sub-trajectory τ_k^(i) corresponds to a specific training stage.

Stage 1 (LTM construction). The agent is exposed to contextual information I_q in a casual conversational setting. The goal is to identify salient information and store it into LTM M_t. During the interaction, the short-term context C_t evolves naturally, and the agent may invoke LTM-related tools when appropriate. Formally, this stage yields a sub-trajectory τ_k^(1) = {e_t}_{t=1}^{T_1}, where each experience tuple e_t follows the definition in Section 3.1.

Stage 2 (STM control under distractors). The short-term context is reset, while the constructed LTM M_t is retained. The agent is then presented with semantically related but irrelevant or misleading distractors. The objective is to learn proactive STM control through tool-based operations, such as filtering or summarizing context, in order to suppress noise and preserve useful information. This process forms the sub-trajectory τ_k^(2) = {e_t}_{t=T_1+1}^{T_1+T_2}, which emphasizes context filtering and compression capability.

Stage 3 (Integrated reasoning and memory coordination). Finally, the agent receives a formal query q requiring both accurate reasoning and effective memory retrieval. The agent must retrieve relevant knowledge from M_t, appropriately manage the context C_t, and generate a final answer. This stage produces τ_k^(3) = {e_t}_{t=T_1+T_2+1}^{T}, which evaluates the agent's ability to coordinate long-term memory, short-term context management, and task solution in an end-to-end manner.

All three segments form a complete trajectory:

$$
\tau_k = \tau_k^{(1)} \cup \tau_k^{(2)} \cup \tau_k^{(3)} = \{ e_t \}_{t=1}^{T},
$$

which is then used for policy optimization in the subsequent step-wise GRPO procedure. For a batch of B tasks, we further aggregate all experiences from K independent rollouts into a unified set E = ⋃_{q=1}^{B} ⋃_{k=1}^{K} {e_t | e_t ∈ τ_k^(q)}, with a total size of |E| = B × K × T̄, where T̄ denotes the average trajectory length. More detailed rollout processes are provided in Appendix A.3.

Step-wise GRPO for Unified Management

We adopt a step-wise variant of GRPO to connect long-range task rewards with memory decisions across all stages. For task q, let G_q = {τ_1^(q), ..., τ_K^(q)} denote the group of parallel rollouts. Each trajectory yields a terminal reward r_T^(k,q) = R(τ_k^(q)). We compute the group-normalized advantage for the terminal step as:

$$
A_T^{(k,q)} = \frac{r_T^{(k,q)} - \mu_{G_q}}{\sigma_{G_q} + \epsilon},
$$

where μ_{G_q} and σ_{G_q} are the mean and standard deviation of rewards within G_q, and ϵ prevents division by zero. This advantage is then broadcast to all preceding steps of the same trajectory, A_t^(k,q) = A_T^(k,q), which assigns a consistent learning signal to all memory and reasoning actions along the trajectory, including those in Stage 1 and Stage 2. In doing so, the final task outcome supervises every intermediate memory decision, enabling long-range credit assignment across heterogeneous stages. We then augment the experience set with advantages, E = ⋃_{q,k} {(e_t, A_t) | e_t ∈ τ_k^(q), A_t = A_t^(k,q)}.
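As a concrete sketch of this credit-assignment scheme, the helper below normalizes the K terminal rewards of one task's rollout group and broadcasts each trajectory's advantage to all of its steps. The function name and interface are our assumptions:

```python
import statistics

def stepwise_advantages(terminal_rewards, traj_lengths, eps=1e-8):
    """Group-normalize terminal rewards across K rollouts of one task,
    then broadcast each trajectory's advantage A_T to every step.

    terminal_rewards: list of K scalar rewards r_T^(k).
    traj_lengths:     list of K trajectory lengths.
    Returns a list of K per-step advantage lists.
    """
    mu = statistics.fmean(terminal_rewards)
    sigma = statistics.pstdev(terminal_rewards)  # population std over the group
    advs = [(r - mu) / (sigma + eps) for r in terminal_rewards]
    # every step of trajectory k receives the same advantage A_T^(k)
    return [[a] * length for a, length in zip(advs, traj_lengths)]
```

If all rollouts in a group receive identical rewards, the numerator is zero for each, so every advantage collapses to zero and the group contributes no gradient, which is the standard GRPO degenerate case.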

Following GRPO, we maximize the expected objective over all experiences:

$$
\mathcal{J}(\theta) = \mathbb{E}_{(e_t, A_t) \in \mathcal{E}} \left[ \min\left( \rho_t^{(k,q)} A_t^{(k,q)},\; \operatorname{clip}\left( \rho_t^{(k,q)},\, 1-\epsilon,\, 1+\epsilon \right) A_t^{(k,q)} \right) - \beta\, D_{\mathrm{KL}}^{(k,q)} \right],
$$

where the importance ratio ρ_t^(k,q) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) controls the update magnitude under the new policy, D_KL^(k,q) denotes the KL divergence penalty between the current policy π_θ and a fixed reference π_ref, and β is a coefficient that balances exploration and training stability.
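Combined with the clipped importance ratio, a per-step form of this objective can be sketched as follows (negated so it can be minimized; the clip range and beta values are illustrative assumptions, and the KL term is taken as a precomputed scalar):

```python
import math

def grpo_step_loss(logp_new, logp_old, advantage, kl, clip_eps=0.2, beta=0.01):
    """Per-step GRPO surrogate loss: clipped importance-weighted advantage
    minus a KL penalty toward the reference policy, negated for minimization."""
    ratio = math.exp(logp_new - logp_old)          # rho_t = pi_new / pi_old
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -(surrogate - beta * kl)
```

Because the advantage is broadcast from the terminal step, this same loss applies uniformly to language-generation actions and memory-tool actions anywhere in the trajectory.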

Reward Function Design

We design a composite reward that evaluates both downstream task performance and the quality of memory management. The total trajectory-level reward is defined as

$$
R(\tau) = \mathbf{w}^{\top} \mathbf{R} - P_{\text{penalty}},
$$

where w = [w_task, w_context, w_memory]^⊤ are tunable coefficients, and R = [R_task, R_context, R_memory]^⊤ correspond to rewards for task completion, context management, and long-term memory management. The penalty term P_penalty captures violations such as context overflow or exceeding the interaction limit. Below, we summarize each component; precise formulas are provided in Appendix A.2.

Task completion reward R_task. This term provides the primary learning signal by assessing whether the agent solves the task correctly. We obtain a scalar score using an LLM-based judge S_judge(A_pred, A_q) ∈ [0, 1], optionally applying a penalty when no answer is produced. This reward encourages accurate, complete task solutions and remains the dominant component to ensure alignment with task objectives.

Context management reward R_context. This component evaluates STM behavior, focusing on how effectively the agent controls the active context C_t. It combines three factors: (i) compression efficiency, promoting economical token usage; (ii) preventive actions, rewarding early summarization or filtering to avoid overflow; and (iii) information preservation, penalizing the loss of critical query-related content. Each factor is normalized, allowing the reward to balance context efficiency against retention of essential information.

Memory management reward R_memory. This term evaluates LTM operations. It aggregates signals for: (i) storage quality, measured as the fraction of stored entries labeled as high-quality and reusable; (ii) maintenance, rewarding meaningful update or delete operations to mitigate memory staleness; and (iii) semantic relevance, computed using an LLM-based score between retrieved memories and the query. Together, these signals incentivize selective, high-value memory construction and responsible upkeep over time.

Penalty terms P_penalty. Penalties discourage undesirable behaviors such as exceeding the maximum number of dialogue turns or triggering context overflow. Penalty coefficients are chosen so that such violations lead to a substantial reduction in the final trajectory reward, encouraging the agent to maintain safe and efficient memory practices.
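Putting these components together, the trajectory-level reward is a weighted sum of the three terms minus the penalty. A one-line sketch (the function name is ours; the uniform default weights mirror the coefficients of 1.0 reported in the experimental setup):

```python
def trajectory_reward(r_task, r_context, r_memory, penalty, w=(1.0, 1.0, 1.0)):
    """Composite reward R(tau) = w^T R - P_penalty, following the paper's
    description; this is an illustrative sketch, not the paper's code."""
    w_task, w_context, w_memory = w
    return w_task * r_task + w_context * r_context + w_memory * r_memory - penalty
```

Since the weights are uniform by default, tuning them only matters when one component (e.g., task completion) should dominate the learning signal.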

Experiments

Experimental Setup

Datasets. To comprehensively evaluate AgeMem, we select five widely-used datasets in LLM-based agent research: ALFWorld (Shridhar et al., 2020), SciWorld (Wang et al., 2022), PDDL (Chang et al., 2024), BabyAI (Chevalier-Boisvert et al., 2018), and HotpotQA (Yang et al., 2018). These datasets cover embodied action, game-based reasoning, and knowledge-intensive question answering, providing diverse evaluation scenarios. Since the HotpotQA dataset contains both questions and supporting facts, automatically providing Stage 1 contextual information, AgeMem is fine-tuned with RL only on the HotpotQA training set and then evaluated directly on all datasets. Detailed dataset statistics are provided in Appendix C.1.

Evaluation metrics. For the primary task completion metrics, we adopt Success Rate (SR) for ALFWorld, SciWorld, and BabyAI, Progress Rate (PR) for PDDL, and LLM-as-a-Judge (J) for HotpotQA. Additionally, we employ an LLM-based evaluator to assess the quality of stored long-term memory during knowledge reasoning, measured by Memory Quality (MQ). The prompts of the LLM-based evaluation are provided in Appendix C.2.

Baselines & LLM backbones. We compare AgeMem against four representative agent LTM systems: LangMem (LangChain Team, 2025), A-Mem (Xu et al., 2025), Mem0 (Chhikara et al., 2025), and Mem0^g (a graph-based variant officially provided as part of Mem0). To better demonstrate the effectiveness of RL training, we also include AgeMem-noRL, which is not fine-tuned with RL. In ablation studies on STM, we compare STM tools with a RAG approach. For the base agent models, we use Qwen2.5-7B-Instruct and Qwen3-4B-Instruct. More baseline configurations are in Appendix C.3.

Implementation details. We build agents using the Agentscope framework (Gao et al., 2025a) and fine-tune AgeMem using the Trinity framework (Pan et al., 2025a). For all reward weights in the reward function, we use uniform coefficients of 1.0 without manual tuning. Further implementation details are provided in Appendix C.4.

Figure 2: Memory Quality scores for different methods on HotpotQA. Higher scores indicate better relevance between stored memories and ground-truth facts.

Main Results

Comparison with counterparts. Table 2 shows that AgeMem achieves the highest average performance on both Qwen2.5-7B-Instruct (41.96%) and Qwen3-4B-Instruct (54.31%), outperforming all baselines across five datasets with relative gains of 49.59% and 23.52% over the no-memory baseline, respectively. Compared to the best baselines (Mem0 and A-Mem), AgeMem improves by 4.82 and 8.57 percentage points on average. RL training contributes improvements of 8.53 and 8.72 percentage points over AgeMem-noRL, validating the three-stage progressive RL strategy.

Quality of stored long-term memories. To evaluate the quality of stored memories, we leverage the ground-truth facts provided in the HotpotQA dataset and assess the relevance between stored memories and these facts using an LLM-based evaluator. Figure 2 presents the Memory Quality (MQ) scores for different baselines. AgeMem achieves the highest memory quality on both model backbones, with MQ scores of 0.533 and 0.605, respectively. This indicates that the unified memory management framework not only improves task performance but also promotes the storage of high-quality, reusable knowledge. The comparison with baseline methods further validates that AgeMem's tool-based memory operations lead to more selective and higher-quality memory construction.

Table 2: Performance comparison across five benchmarks. The best and second-best results are marked.

Figure 3: Average prompt token counts under different STM management configurations on HotpotQA. The suffix '-RAG' indicates the adoption of RAG in place of STM tool-based management.

Effectiveness of STM management. We evaluate the effectiveness of STM management by measuring the prompt token count under different configurations on HotpotQA. Figure 3 shows that AgeMem successfully reduces prompt token usage compared to variants without STM tools (-RAG). On Qwen2.5-7B-Instruct, AgeMem uses 2,117 tokens on average, compared to 2,186 tokens for AgeMem-RAG, representing a reduction of 3.1%. On Qwen3-4B-Instruct, the reduction is even more pronounced: AgeMem uses 2,191 tokens versus 2,310 tokens for AgeMem-RAG, a reduction of 5.1%. These results demonstrate that the learned STM management tools effectively control context expansion, enabling more efficient token usage while maintaining task performance.

Table 3: Tool usage statistics on HotpotQA. Numbers show average calls per episode.

Tool usage analysis. Table 3 reports tool usage statistics before and after RL fine-tuning on HotpotQA. RL training substantially increases the use of long-term memory tools, especially ADD and UPDATE. On Qwen2.5-7B-Instruct, ADD operations rise from 0.92 to 1.64, and UPDATE operations emerge after training (0.13 vs. nearly zero before). Similar trends are observed on Qwen3-4B-Instruct, with higher frequencies of both ADD and UPDATE. For short-term memory tools, RL leads to more balanced tool usage. The frequency of FILTER increases notably (e.g., from 0.02 to 0.31 on Qwen2.5), indicating proactive context control, while RETRIEVE remains relatively stable. Overall, these patterns suggest that RL training enables coordinated and adaptive memory management. Detailed case studies are provided in Appendix B.

Ablation Studies

LTM-STM components. To validate the contributions of individual components, we conduct ablation studies on LTM, STM, and RL training. Figure 4 presents results on three representative datasets using Qwen2.5-7B-Instruct as the backbone (results for Qwen3-4B-Instruct are provided in Appendix D.1). Adding LTM alone (+LT) yields substantial gains of +10.6%, +14.2%, and +7.4% over the baseline. Incorporating RL training (+LT/RL) further improves performance, particularly on HotpotQA (+6.3%), demonstrating the effectiveness of our reward-based optimization. The full AgeMem system (+LT/ST/RL) achieves the best results across all benchmarks, with overall improvements of +13.9%, +21.7%, and +16.1%. Notably, adding STM tools provides the most significant boost on SciWorld (+3.1%) and HotpotQA (+2.4%), validating that learned context management outperforms static RAG approaches. These progressive improvements confirm that unified memory management with end-to-end RL is essential for optimal agent performance.

Figure 4: Ablation study on LTM, STM, and RL components (Qwen2.5-7B-Instruct). Base: no-memory baseline; +LT: AgeMem-noRL-RAG (LTM tools only); +LT/RL: AgeMem-RAG (RL with LTM tools); +LT/ST/RL: AgeMem (full system with RL). Green arrows indicate performance gains over the baseline.

Figure 5: Training convergence curves on Qwen2.5-7B-Instruct comparing All-Returns (solid line) vs. Answer-Only (dashed line) reward strategies.

Reward function. To demonstrate the effectiveness of our multi-component reward function design, we compare the full reward function (All-Returns) against a variant using only R_task (Answer-Only). Figure 5 shows the reward convergence curves of Qwen2.5-7B-Instruct during GRPO training on HotpotQA. The full reward function leads to significantly faster convergence and higher final performance than the task-only variant. As detailed in Table 4, the All-Returns strategy achieves higher LLM-as-a-Judge scores (0.544 vs. 0.509) while maintaining substantially better memory quality (0.533 vs. 0.479). Notably, despite using more tokens (2,117 vs. 2,078), the All-Returns strategy achieves better overall performance, indicating that the additional context and memory operations contribute meaningfully to reasoning quality. Similar patterns are observed on Qwen3-4B-Instruct (see Appendix D.2).

Conclusion

In this work, we propose Agentic Memory (AgeMem), a unified memory management framework that enables LLM-based agents to jointly control long-term and short-term memory through learnable, tool-based actions. By integrating memory operations directly into the agent's policy and training them with a progressive reinforcement learning strategy, AgeMem replaces heuristic memory pipelines with an end-to-end optimized solution. Extensive experiments across diverse long-horizon benchmarks show that AgeMem improves both task performance and memory quality while maintaining efficient context usage. These results highlight the importance of unified, agent-centric memory policies and suggest a promising direction for building scalable and adaptive LLM agents capable of long-term reasoning.

Limitations

While AgeMem demonstrates strong performance across multiple settings, there remain opportunities for further extension. The current implementation adopts a fixed set of memory management tools, which provides a clear and effective abstraction but could be extended to support more fine-grained control in future work. In addition, although we evaluate our approach on several representative long-horizon benchmarks, broader coverage of tasks and environments may further strengthen the empirical understanding of the framework.

Acknowledgments

Detailed Design and Implementation of AgeMem

This appendix provides full technical details omitted from the main text due to space constraints. We first present precise definitions and pseudo-formulations for each memory-management tool (Appendix A.1), then give implementable formulas for the reward components used in training (Appendix A.2). Finally, we provide the complete algorithmic specification (Appendix A.3).

Memory Management Tools

AgeMem exposes a small set of structured tools that the agent may invoke as part of its action a t . Each tool is implemented as a deterministic or stochastic function that transforms the short-term context C t , the long-term memory store M t , or both. Unlike traditional memory systems that rely on external heuristics or predefined schedules, AgeMem integrates these tools directly into the agent's action space, enabling the model to learn when and how to use each tool through reinforcement learning. Below we give precise operational definitions, implementation details, and the system prompts that guide tool usage.

Notation. The long-term memory store at time $t$ is $M_t = \{ m_i \}_{i=1}^{|M_t|}$, where each memory $m_i$ contains a content string and optional metadata. The short-term context is $C_t = [u_1, u_2, \ldots, u_{n_t}]$ (a message list), and $\mathrm{enc}(\cdot)$ denotes a text encoder that returns a dense embedding. We use cosine similarity for semantic matching throughout the framework.

RETRIEVE. The RETRIEVE operation enables the agent to access relevant information from long-term memory based on semantic similarity. This operation is crucial for bringing stored knowledge into the active context when needed for reasoning. The retrieval operation returns the top-$k$ memories most similar to the query $q$:

$$
\mathrm{RETRIEVE}(q, k) \;=\; \operatorname*{top\text{-}k}_{\,m_i \in M_t} \; \mathrm{sim}(q, m_i)
$$

where the similarity function is defined as:

$$
\mathrm{sim}(q, m_i) \;=\; \frac{\mathrm{enc}(q)^{\top} \mathrm{enc}(m_i)}{\lVert \mathrm{enc}(q) \rVert \, \lVert \mathrm{enc}(m_i) \rVert}
$$

The retrieved memories are then inserted into the short-term context $C_t$, making them available for immediate reasoning. The parameter $k$ controls the number of memories retrieved, typically set to 3-5 in our experiments to balance relevance and context size.
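As a concrete sketch of this operation (our own illustration, not the paper's implementation; the toy two-dimensional embeddings stand in for $\mathrm{enc}(\cdot)$), top-$k$ retrieval with cosine similarity can be written as:

```python
import math

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def retrieve(memory, query_vec, k=3):
    # rank long-term memories by similarity to the query embedding
    # and return the top-k entries
    ranked = sorted(memory,
                    key=lambda m: cosine(m["embedding"], query_vec),
                    reverse=True)
    return ranked[:k]

memory = [
    {"content": "user prefers visual materials", "embedding": [1.0, 0.0]},
    {"content": "user bakes sourdough",          "embedding": [0.0, 1.0]},
    {"content": "user studies computer vision",  "embedding": [0.9, 0.1]},
]
top2 = retrieve(memory, [1.0, 0.0], k=2)
```

In practice the encoder would be a learned embedding model; the ranking and truncation logic is unchanged.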

ADD. The ADD operation allows the agent to store new information in long-term memory for future use. This operation is essential for accumulating knowledge across interactions and sessions. A new memory entry is created by:

$$
m_{\text{new}} \;=\; \big(c,\; \mathrm{enc}(c),\; \text{metadata}\big)
$$

where $c$ is the content to be stored, $\mathrm{enc}(c)$ is its embedding vector, and metadata includes a timestamp, source information, and optional tags. The memory store is then updated:

$$
M_{t+1} \;=\; M_t \cup \{ m_{\text{new}} \}
$$

The agent learns to identify salient information worth storing through the reward function, which encourages storing high-quality, reusable knowledge while penalizing redundant or irrelevant entries.

UPDATE and DELETE. Memory maintenance operations enable the agent to keep its long-term memory store current and relevant. The UPDATE operation modifies existing memories when new information supersedes or refines previous knowledge. For an existing memory $m_i$, the update operation is defined as:

$$
m_i \;\leftarrow\; \big(c',\; \mathrm{enc}(c'),\; \text{metadata}'\big)
$$

where $c'$ is the updated content and metadata$'$ reflects the modification timestamp. The DELETE operation removes obsolete or incorrect memories:

$$
M_{t+1} \;=\; M_t \setminus \{ m_i \}
$$

These operations are particularly important in long-horizon tasks where information may become outdated or where the agent needs to correct earlier mistakes. The reward function encourages meaningful updates and deletions that improve memory quality over time.
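The three store-side operations can be sketched together in a minimal class (our own illustration; the pluggable `encoder` callback stands in for $\mathrm{enc}(\cdot)$, and the toy length-based encoder below is for demonstration only):

```python
import time

class MemoryStore:
    """Minimal sketch of the LTM store with ADD / UPDATE / DELETE."""

    def __init__(self, encoder):
        self.encoder = encoder   # enc(.): text -> dense embedding
        self.entries = {}        # memory id -> entry
        self._next_id = 0

    def add(self, content, tags=None):
        # ADD: m_new = (c, enc(c), metadata); M <- M u {m_new}
        mid = self._next_id
        self._next_id += 1
        self.entries[mid] = {
            "content": content,
            "embedding": self.encoder(content),
            "metadata": {"timestamp": time.time(), "tags": tags or []},
        }
        return mid

    def update(self, mid, new_content):
        # UPDATE: rewrite content and embedding, refresh the timestamp
        entry = self.entries[mid]
        entry["content"] = new_content
        entry["embedding"] = self.encoder(new_content)
        entry["metadata"]["timestamp"] = time.time()

    def delete(self, mid):
        # DELETE: M <- M \ {m_i}
        self.entries.pop(mid, None)

# toy encoder: embed a string as its length (demonstration only)
store = MemoryStore(encoder=lambda s: [float(len(s))])
i = store.add("prefers 60-minute sessions")
store.update(i, "prefers 120-minute deep focus blocks")
j = store.add("obsolete fact")
store.delete(j)
```

The reward signals described later decide *when* these methods are worth calling; the store itself stays deliberately simple.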

SUMMARY. The SUMMARY operation compresses conversation history in the short-term context to prevent context overflow while preserving essential information. This operation is critical for managing long conversations that exceed context window limits. Given a subset of context indices s , the summary operation is defined as:

$$
C_{t+1} \;=\; \big( C_t \setminus \{ u_j \}_{j \in s} \big) \,\cup\, \big\{ \mathrm{Summarize}\big(\{ u_j \}_{j \in s}\big) \big\}
$$

where $\mathrm{Summarize}(\cdot)$ is implemented by the LLM with a summarization system prompt. The agent can specify which messages to summarize using the 'span' parameter, which can be:

The summarization process uses the following system prompt to ensure high-quality compression:

☛ You are a conversation summarization assistant. Your goal is to compress the given conversation span into a concise summary that preserves all important information, intentions, decisions, and unresolved questions. The summary will later be used to replace the original conversation in the context, so make sure nothing essential is lost.


Input:

Let's start the conversation summarization.

The agent learns to invoke summarization proactively before context overflow occurs, balancing information preservation with efficiency.
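The span-replacement mechanics of the SUMMARY operation can be sketched as follows (our own illustration; the `summarizer` callback stands in for the LLM call with the prompt above):

```python
def summarize_span(context, span, summarizer):
    # replace the messages at indices `span` with a single summary
    # message, inserted where the span began
    span = sorted(set(span))
    summary = summarizer([context[j] for j in span])
    kept = [u for j, u in enumerate(context) if j not in set(span)]
    kept.insert(span[0], summary)
    return kept

history = ["plan trip", "discuss flights", "discuss hotels", "final question"]
compressed = summarize_span(history, [1, 2],
                            summarizer=lambda msgs: " / ".join(msgs))
```

Replacing the span in place (rather than appending the summary at the end) preserves the conversational order of the remaining messages.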

FILTER. The FILTER operation removes irrelevant or redundant messages from the short-term context based on semantic similarity. This operation helps maintain a focused context by screening out noise and distractions. Specifically, it removes messages whose similarity to a given criterion $c$ exceeds a threshold $\theta$:

$$
C_{t+1} \;=\; \big\{\, u_j \in C_t \;:\; \mathrm{sim}(u_j, c) \le \theta \,\big\}
$$

In all experiments, we set $\theta = 0.6$ by default. The criterion $c$ can be specified by the agent (e.g., a description of what to keep) or can be automatically derived from the current task context. This operation is particularly useful in Stage 2 of training, where distractors are introduced to test the agent's ability to filter irrelevant information.
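A minimal sketch of the removal rule above (our own illustration, following the literal definition in which messages *above* the threshold are dropped, so the criterion embedding here describes content to screen out; toy embeddings stand in for $\mathrm{enc}(\cdot)$):

```python
import math

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def filter_context(context, criterion_vec, theta=0.6):
    # keep only messages whose similarity to the criterion embedding
    # does not exceed theta
    return [u for u in context
            if cosine(u["embedding"], criterion_vec) <= theta]

ctx = [
    {"content": "distractor about sourdough", "embedding": [1.0, 0.0]},
    {"content": "relevant ML question",       "embedding": [0.0, 1.0]},
]
kept = filter_context(ctx, criterion_vec=[1.0, 0.0], theta=0.6)
```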

Tool invocation as structured actions. Each tool is exposed via a schema specifying its function name and required arguments. The agent's policy outputs either language tokens (for text generation) or structured tool calls (for memory operations). The agent is guided by a system prompt that defines the tool-calling interface and response format. The system prompt used in AgeMem is as follows: ☛

You are an intelligent assistant that solves complex problems by managing context and memory with tools when needed.

Available Tools:[TOOLS]

Problem-Solving Workflow

You must follow a structured reasoning and action process for every task:

Always start with a ... block. Inside it, explain your reasoning, plan your next step, and decide whether you need to call a tool or provide a final answer.

<tool_call>[{{"name": "Retrieve_memory", "arguments": {{"query": "math problem solving strategies", "top_k": 3}}}}, {{"name": "Add_memory", "arguments": {{"content": "Strategy summary for reuse", "memory_type": "problem_solving"}}}}]</tool_call>

When you no longer need tools and are ready to present your final output, follow your last block with an ...</answer> block containing the full response.

You must never include both "<tool_call>" and "" immediately after the same "" block.

You may repeat this sequence as needed: "" -> "<tool_call>" -> "" -> "<tool_call>" ... -> "" -> "" until the problem is completely solved.
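On the framework side, a response in this format must be parsed into structured tool calls before execution. A minimal sketch of such a parser (our own illustration, assuming the `<tool_call>` body is a JSON array as in the example above):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_tool_calls(response):
    # extract the JSON array of structured tool calls, if present
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return []
    calls = json.loads(match.group(1))
    return [(c["name"], c.get("arguments", {})) for c in calls]

reply = ('<tool_call>[{"name": "Retrieve_memory", '
         '"arguments": {"query": "study plan", "top_k": 3}}]</tool_call>')
calls = parse_tool_calls(reply)
```

A production parser would also validate each call against the tool schema and report malformed JSON back to the agent.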

Reward Function Design

We design a composite reward that evaluates both downstream task performance and the quality of memory management. The total trajectory-level reward is defined as

$$
R(\tau) \;=\; \mathbf{w}^{\top} \mathbf{R} \;-\; P_{\text{penalty}} \;=\; w_{\text{task}} R_{\text{task}} + w_{\text{context}} R_{\text{context}} + w_{\text{memory}} R_{\text{memory}} \;-\; P_{\text{penalty}}
$$

where $\mathbf{w} = [w_{\text{task}}, w_{\text{context}}, w_{\text{memory}}]^{\top}$ are tunable coefficients, and $\mathbf{R} = [R_{\text{task}}, R_{\text{context}}, R_{\text{memory}}]^{\top}$ correspond to rewards for task completion, context management, and long-term memory management. The penalty term $P_{\text{penalty}}$ captures violations such as context overflow or exceeding the interaction limit. Below, we summarize each component; precise formulas are provided in Appendix A.2.

Task completion reward $R_{\text{task}}$. This term provides the primary learning signal by assessing whether the agent solves the task correctly. We obtain a scalar score using an LLM-based judge $S_{\text{judge}}(A_{\text{pred}}, A_q) \in [0, 1]$, optionally applying a penalty when no answer is produced. This reward encourages accurate, complete task solutions and remains the dominant component to ensure alignment with task objectives.

Context management reward $R_{\text{context}}$. This component evaluates STM behavior, focusing on how effectively the agent controls the active context $C_t$. It combines three factors: (i) compression efficiency, promoting economical token usage; (ii) preventive actions, rewarding early summarization or filtering to avoid overflow; and (iii) information preservation, penalizing the loss of critical query-related content. Each factor is normalized, allowing the reward to balance context efficiency against retention of essential information.

Memory management reward $R_{\text{memory}}$. This term evaluates LTM operations. It aggregates signals for: (i) storage quality, measured as the fraction of stored entries labeled as high-quality and reusable; (ii) maintenance, rewarding meaningful update or delete operations to mitigate memory staleness; and (iii) semantic relevance, computed using an LLM-based score between retrieved memories and the query. Together, these signals incentivize selective, high-value memory construction and responsible upkeep over time.

Penalty terms $P_{\text{penalty}}$. Penalties discourage undesirable behaviors such as exceeding the maximum number of dialogue turns or triggering context overflow. Penalty coefficients are chosen so that such violations lead to a substantial reduction in the final trajectory reward, encouraging the agent to maintain safe and efficient memory practices.
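Putting the components together, the trajectory-level reward $R(\tau) = \mathbf{w}^{\top}\mathbf{R} - P_{\text{penalty}}$ reduces to a weighted sum minus a penalty; a small sketch with the uniform $1/3$ weights used in training (the component scores below are illustrative, not measured values):

```python
def trajectory_reward(rewards, weights, penalty=0.0):
    # R(tau) = w_task*R_task + w_context*R_context + w_memory*R_memory - P
    assert set(rewards) == set(weights), "components must match"
    return sum(weights[k] * rewards[k] for k in rewards) - penalty

r = trajectory_reward(
    rewards={"task": 0.9, "context": 0.6, "memory": 0.3},
    weights={"task": 1 / 3, "context": 1 / 3, "memory": 1 / 3},
    penalty=0.1,  # e.g. one context-overflow violation
)
```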

Practical Implementation Notes and Hyperparameters

Training configuration. We use the Trinity RL framework (Pan et al., 2025a) for policy optimization, implementing the step-wise GRPO algorithm as described in the method section. We use $K = 8$ independent rollouts per task for group normalization. The KL divergence coefficient $\beta$ is set to 0.1.

Reward weights. All reward weights are set to 1/3: $w_{\text{task}} = w_{\text{context}} = w_{\text{memory}} = 1/3$. This uniform weighting ensures that all components contribute equally to the learning signal, allowing the agent to naturally balance task performance and memory management.

Model settings. The maximum context length is set to 8,192 tokens, and the maximum response length is set to 2,048 tokens. When the context exceeds this limit, the agent receives a penalty, encouraging proactive use of STM management tools. All experiments are conducted on 8 NVIDIA RTX 4090 GPUs with 48GB memory each.


Datasets. To comprehensively evaluate AgeMem, we select five widely used datasets in LLM-based agent research: ALFWorld (Shridhar et al., 2020), SciWorld (Wang et al., 2022), PDDL (Chang et al., 2024), BabyAI (Chevalier-Boisvert et al., 2018), and HotpotQA (Yang et al., 2018). These datasets cover embodied action, game-based reasoning, and knowledge-intensive question answering, providing diverse evaluation scenarios. Since the HotpotQA dataset contains both questions and supporting facts, which automatically provide the Stage 1 contextual information, AgeMem is fine-tuned with RL only on the HotpotQA training set and then evaluated directly on all datasets. Detailed dataset statistics are provided in Appendix C.1.

Evaluation metrics. For the primary task completion metrics, we adopt Success Rate (SR) for ALFWorld, SciWorld, and BabyAI, Progress Rate (PR) for PDDL, and LLM-as-a-Judge (J) for HotpotQA. Additionally, we employ an LLM-based evaluator to assess the quality of stored long-term memory during knowledge reasoning, measured by Memory Quality (MQ). The prompts of the LLM-based evaluation are provided in Appendix C.2.

Baselines & LLM backbones. We compare AgeMem against four representative agent LTM systems: LangMem (LangChain Team, 2025), A-Mem (Xu et al., 2025), Mem0 (Chhikara et al., 2025), and Mem0$^g$ (a graph-based variant officially provided as part of Mem0). To better demonstrate the effectiveness of RL training, we also include

Figure 2: Memory Quality scores for different methods on HotpotQA. Higher scores indicate better relevance between stored memories and ground-truth facts.


AgeMem-noRL, which is not fine-tuned with RL. In ablation studies on STM, we compare STM tools with the RAG approach. For the base agent models, we use Qwen2.5-7B-Instruct and Qwen3-4B-Instruct. More baseline configurations are in Appendix C.3.

Implementation details. We build agents using the AgentScope framework (Gao et al., 2025a) and fine-tune AgeMem using the Trinity framework (Pan et al., 2025a). For all reward weights in the reward function, we use uniform coefficients of 1.0 without manual tuning. Further implementation details are provided in Appendix C.4.



However, existing research has predominantly treated LTM and STM as independent components. STM is commonly enhanced through retrieval-augmented generation (RAG) (Pan et al., 2025b), such as in MainRAG (Chang et al., 2025) and ReSum (Wu et al., 2025a), which expand usable context via external retrieval or periodic summarization. Although effective in some tasks, these methods rely heavily on predefined schedules or heuristic rules, potentially resulting in overlooked infrequent but critical details as well as unnecessary noise (Ma et al., 2025; Dong et al., 2025). In contrast, LTM management has progressed along separate lines, typically categorized into trigger-based (Kang et al., 2025; Wang and Chen, 2025; Wang et al., 2025c; Chhikara et al., 2025) and agent-based (Yan et al., 2025; Hu et al., 2025; Xu et al., 2025) paradigms. The former executes fixed memory operations at predefined moments, whereas the latter incorporates a specialized memory manager to determine what and how to store. Despite offering more flexibility, most approaches still depend on handcrafted rules or auxiliary expert models, limiting adaptability and increasing system complexity (Xiong et al., 2025).

As a consequence, LTM and STM are typically treated as separate and loosely coupled modules. As illustrated in Figure 1, existing architectures generally follow two patterns: (a) static STM with trigger-based LTM, or (b) static STM with agent-based LTM. In both settings, the two memory systems are optimized independently and later combined in an ad hoc way, leading to fragmented memory construction and suboptimal performance

Figure 1: Comparison between independent and unified memory management frameworks. (Left) Traditional framework with static STM and trigger-based LTM. (Middle) Independent framework with an additional Memory Manager controlling LTM in an agent-based manner, while STM remains static. (Right) The proposed AgeMem framework, where LTM and STM are jointly and intelligently managed via explicit tool-based operations.

in long-horizon reasoning tasks. Thus, unifying the management of LTM and STM remains a necessary yet largely unexplored challenge.

Nevertheless, achieving unified memory management poses three fundamental challenges. (C1) Functional heterogeneity coordination: LTM and STM serve distinct yet complementary purposes: LTM determines what to store, update, or discard, while STM governs what to retrieve, summarize, or remove from the active context (Zhang et al., 2025b). The challenge lies in designing a unified mechanism that orchestrates their interplay synergistically. (C2) Training paradigm mismatch: Existing reinforcement learning (RL) frameworks adopt markedly different training strategies for the two memory types (Ma et al., 2024). LTM-focused training often leverages session-level information available prior to interaction, whereas STM training typically injects distractors to simulate long-horizon contexts (Sun et al., 2024). Moreover, standard RL assumes continuous trajectories with stable rewards, which conflicts with the inherently fragmented and discontinuous experiences produced by memory operations (Wu et al., 2025a), making end-to-end optimization particularly challenging. (C3) Practical deployment constraints: Many agent systems rely on an auxiliary expert LLM for memory control, significantly increasing inference cost and training complexity. How to integrate unified memory management directly into an agent without dependence on external expert models remains an open problem.

To address these challenges, we propose Agentic Memory (AgeMem), a unified framework that jointly manages LTM and STM, illustrated in Figure 1 (right). Unlike prior designs that treat memory as an external component, AgeMem integrates both memory types directly into the agent's decision-making process. Through a unified tool-based interface, the LLM autonomously invokes and executes memory operations for both LTM and STM. Furthermore, we design a three-stage progressive RL strategy: the model first acquires LTM storage capabilities, then learns STM context management, and finally coordinates both forms of memory under full task settings. To address the fragmented experience issue across training stages, we design a step-wise Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which transforms cross-stage dependencies into learnable signals, thereby alleviating the challenges posed by sparse and discontinuous rewards in RL. We evaluate AgeMem on five long-context, reasoning-intensive benchmarks. Comprehensive results show that AgeMem consistently outperforms strong baselines, validating the effectiveness of unified memory management.

Our main contributions are as follows:

· We propose Agentic Memory (AgeMem), a unified agentic memory framework that enables LLM-based agents to autonomously decide when, what, and how to manage both long-term and short-term memory.

· We develop a three-stage progressive RL strategy equipped with a step-wise GRPO mechanism, facilitating effective end-to-end learning of unified memory management behaviors.

· We conduct comprehensive evaluations across multiple models and long-horizon benchmarks, demonstrating the robustness and effectiveness of AgeMem in complex agentic tasks.

Reproducibility Checklist

AgeMem Algorithm

This section provides the complete algorithmic specification of AgeMem, our unified memory management framework for LLM-based agents. The training procedure integrates three progressive stages (long-term memory construction, short-term context management under distractors, and integrated task execution) into a single end-to-end reinforcement learning loop. We present the main training algorithm using a two-column layout for compactness (Algorithms 1-2), followed by detailed rollout procedures for each stage (Algorithms 3-5).

Training overview (Algorithms 1-2). The core training loop follows a generate-then-optimize paradigm. For each task $q$ in a training batch $B$, we generate $K$ independent rollout trajectories $\{\tau_k^{(q)}\}_{k=1}^{K}$ using the current policy $\pi_\theta$. Each trajectory $\tau_k^{(q)} = (\tau_k^{(1)}, \tau_k^{(2)}, \tau_k^{(3)})$ concatenates experiences from all three stages, forming a complete episode from initial memory construction to final task completion. The agent first builds long-term memory from contextual information $I_q$ (Algorithm 3), then learns to filter out distracting information while maintaining useful context (Algorithm 4), and finally retrieves stored knowledge to finish the target task (Algorithm 5). All experiences are collected into a unified buffer $E$ spanning multiple tasks and rollouts.

After the rollout phase, we apply group-based advantage normalization to enable fair comparison across tasks with different reward scales. For each task group $G_q$, terminal rewards $\{r_T^{(k,q)}\}_{k=1}^{K}$ are normalized to zero mean and unit variance, yielding advantages $A_T^{(k,q)}$ that reflect relative performance within the group. These terminal advantages are then broadcast uniformly to all timesteps within the same trajectory, establishing a consistent learning signal that connects early-stage memory decisions to final task outcomes. This step-wise GRPO mechanism enables long-range credit assignment across heterogeneous operations. The policy is then updated via gradient ascent on the expected advantage, regularized by a KL divergence term to maintain proximity to a reference policy $\pi_{\text{ref}}$ for training stability.
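The normalize-then-broadcast step can be sketched as follows (our own illustration of the mechanics described above, with population-variance normalization assumed):

```python
import math

def group_advantages(terminal_rewards):
    # normalize terminal rewards within one task group to zero mean
    # and unit variance, yielding relative advantages
    k = len(terminal_rewards)
    mean = sum(terminal_rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in terminal_rewards) / k)
    if std == 0.0:
        return [0.0] * k  # all rollouts tied: no relative signal
    return [(r - mean) / std for r in terminal_rewards]

def broadcast_advantage(advantage, trajectory_length):
    # assign the terminal advantage uniformly to every timestep,
    # linking early memory decisions to the final outcome
    return [advantage] * trajectory_length

adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Broadcasting the same advantage to every timestep is what lets a Stage 1 storage decision receive credit (or blame) for the Stage 3 outcome.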

Stage-specific rollout procedures (Algorithms 3-5). The three-stage rollout design reflects the natural progression of memory-augmented task solving. Algorithm 3 implements the first stage, where the agent engages in casual conversation while being gradually exposed to the contextual information $I_q$. During these $T_1$ exploratory turns, the agent must identify salient information and determine when and which long-term memory tools to invoke, including ADD, UPDATE, and DELETE, to construct an initial memory store $M$. To support informed memory decisions, the agent proactively performs memory retrieval at every step. This retrieval is not task-driven but serves as an introspective operation: it enables the agent to maintain awareness of the current LTM contents, facilitating decisions about updating or discarding stale entries and ensuring that newly stored information remains coherent with existing knowledge. Since the task query has not yet been revealed in Stage 1, the agent must rely on general cues about which information may become useful later. This encourages the formation of reusable, well-structured memory traces rather than query-specific shortcuts, laying the foundation for effective long-horizon memory management in later stages.

Algorithm 4 describes the second stage, which deliberately stresses the agent's context management capabilities. The short-term context $C$ is reset to prevent information leakage that would interfere with learning STM management, while the constructed long-term memory $M$ persists from Stage 1. Over $T_2$ turns, the agent receives semantically related but ultimately irrelevant distractor messages that could mislead downstream reasoning if left unmanaged. The agent must learn to proactively invoke FILTER to remove low-relevance content based on semantic similarity thresholds, or SUMMARY to compress accumulated context when token budgets become constrained. This stage trains robust filtering strategies that generalize beyond simple heuristics, as the agent receives learning signals from the eventual task performance in Stage 3.

Algorithm 5 presents the final integrated execution stage. Upon receiving the target query $q$, the agent must coordinate retrieval from long-term memory $M$, context management operations on $C$, and multi-step reasoning to produce a final answer $A_{\text{pred}}$. The agent may invoke RETRIEVE to fetch relevant stored facts, SUMMARY to maintain a tractable context window, and ultimately generate a structured response. Once the answer is produced or the maximum number of steps is reached, a composite reward function (Section A.2) evaluates the three-stage trajectory across multiple dimensions. This terminal reward $R(\tau)$ is assigned to the final timestep and serves as the supervision signal that propagates back through all three stages during advantage computation.


Case Study: AgeMem in Action

This section presents three representative case studies demonstrating how AgeMem enables effective unified memory management through reinforcement learning. Each case compares agent behavior before and after RL training to highlight the learned memory strategies. We use a personal learning assistant scenario where the agent helps users plan customized study programs based on their preferences and constraints.

Case 1: Long-term Memory Construction and Maintenance

This case illustrates how AgeMem learns to selectively construct, update, and maintain long-term memory across extended conversations. The agent must identify salient user information from casual dialogue and manage memory entries as new information supersedes old preferences.

Before RL training. Prior to training, the baseline agent lacks strategic memory management. It either stores all information indiscriminately or fails to recognize when stored knowledge becomes obsolete.

User: Hello! I'm a visual learner who prefers 60minute study sessions. I have Python basics but zero ML experience. I'm particularly interested in computer vision applications like face recognition.


Unified RL formulation for AgeMem. At each time step $t$, the agent observes a state $s_t \in \mathcal{S}$ composed of the conversation context (short-term memory) $C_t$, the long-term memory store $M_t$, and the task specification $\mathcal{T}$: $s_t = (C_t, M_t, \mathcal{T})$. The specification $\mathcal{T}$ includes the input query $q$, contextual information $I_q$, and (for training only) the expected answer $A_q$. This formulation enables the agent to ground its decision-making in both transient context and persistent knowledge.

Given $s_t$, the agent selects an action $a_t \in \mathcal{A}$ from a hybrid action space that includes language generation as well as memory operations. The decision is governed by a parameterized policy $\pi_\theta$, defined as $\pi_\theta(a_t \mid s_t) = P(a_t \mid s_t; \theta)$, where $\theta$ denotes the LLM parameters and $a_t \sim \pi_\theta(\cdot \mid s_t)$. For a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$, the cumulative reward is defined as:

$$
R(\tau) \;=\; \sum_{i} R_i \;-\; P_{\text{penalty}}
$$

where $R_i$ captures task performance and memory quality, and $P_{\text{penalty}}$ discourages redundant storage, excessive tool usage, and uncontrolled context expansion. The optimization objective is:

$$
\theta^{*} \;=\; \arg\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_{\theta}}\big[ R(\tau) \big]
$$

This formulation treats memory management as an integral component of the agent's policy, replacing handcrafted heuristics with a learnable mechanism.

Three-stage trajectory structure. To capture long-horizon interactions and progressively train memory capabilities, each trajectory is divided into three consecutive stages: $\tau = (\tau^{(1)}, \tau^{(2)}, \tau^{(3)})$, with a total length of $T = T_1 + T_2 + T_3$. In Stage 1, the agent engages in casual interactions and may store useful information into LTM. Stage 2 introduces distracting or irrelevant content, requiring the agent to manage its STM through selective retention and compression. Stage 3 presents a task that depends on coordinated use of both retained context and earlier accumulated LTM. A key aspect of this design is that the long-term memory $M_t$ persists across all stages, allowing early knowledge to influence later decisions. In contrast, the context $C_t$ is reset between Stages 1 and 2 to prevent information leakage across phases. The reset before Stage 2 ensures the agent cannot solve the final task via residual context, thereby forcing proper retrieval from LTM and enabling effective training of memory operations.

At each step, we collect an experience tuple $e_t = (s_t, a_t, r_t, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$, where $r_t$ is typically zero for intermediate steps and assigned after trajectory completion, and $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ denotes the log probability under the old policy $\pi_{\theta_{\text{old}}}$. This representation enables step-wise credit assignment under GRPO (Shao et al., 2024) and allows the agent to attribute long-term rewards to specific memory decisions across stages. By structuring trajectories in this staged yet continuous manner, the agent learns temporally coherent and task-adaptive memory policies essential for robust long-horizon reasoning.
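The staged trajectory structure, with persistent LTM and the context reset before Stage 2, can be sketched as a rollout skeleton (our own illustration; `agent_step` and the stub below stand in for the actual policy and environment):

```python
def three_stage_rollout(agent_step, T1, T2, T3):
    # LTM persists across all stages; the short-term context is reset
    # before Stage 2 to prevent information leakage
    ltm, trajectory = [], []
    context = []
    for stage, turns in ((1, T1), (2, T2), (3, T3)):
        if stage == 2:
            context = []  # reset C between Stage 1 and Stage 2
        for _ in range(turns):
            experience = agent_step(stage, context, ltm)
            trajectory.append(experience)
    return trajectory, context, ltm

# stub agent: appends one message per turn, stores a fact in Stage 1,
# and returns an experience with zero intermediate reward
def stub_step(stage, context, ltm):
    context.append(f"msg@stage{stage}")
    if stage == 1:
        ltm.append("fact")
    return (stage, 0.0)

traj, ctx, ltm = three_stage_rollout(stub_step, T1=2, T2=3, T3=1)
```

The terminal reward would be attached to the last tuple of `traj` after completion, then broadcast backward as described in the step-wise GRPO procedure.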

Table 1: Memory management tools in AgeMem for manipulating long-term memory (LTM) and short-term memory (STM).

After RL training.

User: Actually, I realize 60 minutes is too short. I work better with 120-minute deep focus blocks.

Case 2: Short-term Context Management Under Distraction

This case demonstrates how AgeMem learns to proactively manage short-term context when faced with irrelevant information that could interfere with task focus. The agent must recognize distractors and apply appropriate filtering or summarization strategies.

Before RL training. The baseline agent passively accumulates all conversation content in context, leading to dilution of task-relevant information and eventual context overflow.

User: I need a focused 3-day ML crash course for face recognition. By the way, I'm also exploring quantum computing, blockchain, robotics, and learning to bake sourdough bread and do latte art.


Case 3: Integrated Task Execution with Memory Coordination

This case demonstrates the complete AgeMem workflow where the agent must retrieve from longterm memory, manage short-term context, and solve a task requiring coordinated memory operations.

Before RL training. The baseline agent either fails to store information initially or cannot effectively retrieve it when needed, leading to incomplete or generic responses.

User: Based on everything I've told you about my learning style and preferences, create a personalized Day 1 study schedule with specific time blocks, topics, and resources.

Before RL training.

Unified RL formulation for AgeMem. At each time step t, the agent observes a state s_t ∈ S composed of the conversation context (short-term memory) C_t, the long-term memory store M_t, and the task specification T: s_t = (C_t, M_t, T). The specification T includes the input query q, contextual information I_q, and (for training only) the expected answer A_q. This formulation enables the agent to ground its decision-making in both transient context and persistent knowledge.

Given s_t, the agent selects an action a_t ∈ A from a hybrid action space that includes language generation as well as memory operations. The decision is governed by a parameterized policy π_θ, defined as π_θ(a_t | s_t) = P(a_t | s_t; θ), where θ denotes the LLM parameters and actions are sampled as a_t ∼ π_θ(· | s_t). For a trajectory τ = (s_1, a_1, . . . , s_T, a_T), the cumulative reward is defined as:

$$
R(\tau) = \sum_{i} w_i \cdot R_i(\tau) + P_{\text{penalty}}(\tau),
$$

where R_i captures task performance and memory quality, and P_penalty discourages redundant storage, excessive tool usage, and uncontrolled context expansion. The optimization objective is:

$$
\theta^{*} = \arg\max_{\theta} \, \mathbb{E}_{\tau \sim \pi_{\theta}} \big[ R(\tau) \big].
$$

This formulation treats memory management as an integral component of the agent's policy, replacing handcrafted heuristics with a learnable mechanism.

Three-stage trajectory structure. To capture long-horizon interactions and progressively train memory capabilities, each trajectory is divided into three consecutive stages, τ = (τ^(1), τ^(2), τ^(3)), with a total length of T = T_1 + T_2 + T_3. In Stage 1, the agent engages in casual interactions and may store useful information into LTM. Stage 2 introduces distracting or irrelevant content, requiring the agent to manage its STM through selective retention and compression. Stage 3 presents a task that depends on coordinated use of both retained context and earlier accumulated LTM. A key aspect of this design is that the long-term memory M_t persists across all stages, allowing early knowledge to influence later decisions. In contrast, the context C_t is reset between Stages 1 and 2 to prevent information leakage across phases: the agent cannot solve the final task via residual context, which forces proper retrieval from LTM and enables effective training of memory operations.

At each step, we collect an experience tuple e_t = (s_t, a_t, r_t, log π_θ_old(a_t | s_t)), where r_t is typically zero for intermediate steps and is assigned after trajectory completion, and log π_θ_old(a_t | s_t) denotes the log probability under the old policy π_θ_old. This representation enables step-wise credit assignment under GRPO (Shao et al., 2024) and allows the agent to attribute long-term rewards to specific memory decisions across stages. By structuring trajectories in this staged yet continuous manner, the agent learns temporally coherent and task-adaptive memory policies essential for robust long-horizon reasoning.
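To make the bookkeeping concrete, the experience tuple and the terminal-reward broadcast can be sketched as follows (an illustrative Python sketch; the names are ours, not the implementation's):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    state: str      # serialized (C_t, M_t, T)
    action: str     # generated text or memory tool call
    reward: float   # 0.0 for intermediate steps
    logp_old: float # log pi_theta_old(a_t | s_t)

def assign_terminal_reward(trajectory, final_reward):
    """After trajectory completion, attach the trajectory-level reward
    to every step so each memory decision receives credit."""
    for exp in trajectory:
        exp.reward = final_reward
    return trajectory

# Intermediate steps carry zero reward until the trajectory ends.
traj = [Experience("s1", "ADD(...)", 0.0, -1.2),
        Experience("s2", "answer", 0.0, -0.7)]
traj = assign_terminal_reward(traj, 0.85)
```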

Table 1: Memory management tools in AgeMem for manipulating long-term memory (LTM) and short-term memory (STM).

[Case study excerpt, after RL training — User: "Actually, I realize 60 minutes is too short. I work better with 120-minute deep focus blocks."]

Experimental Implementation

Dataset Details

We provide detailed statistics and characteristics of the five datasets used in our experiments:

ALFWorld (Shridhar et al., 2020) is an embodied AI benchmark in which agents must complete household tasks by following natural language instructions in a simulated environment. The dataset consists of several thousand training environments and multiple validation and test splits, covering six task types: pick and place, examine in light, clean and place, heat and place, cool and place, and pick two and place. These tasks require long-horizon interaction with objects, making ALFWorld well suited for evaluating planning and memory management capabilities.

SciWorld (Wang et al., 2022) is an interactive science experiment simulation environment where agents must perform multi-step experiments to answer scientific questions. The benchmark includes a diverse set of tasks spanning multiple scientific domains, such as physics, chemistry, and biology, and emphasizes procedural reasoning and hypothesis-driven exploration. Its complexity makes it suitable for testing an agent's ability to retain and retrieve relevant knowledge over extended interaction sequences.

PDDL (Chang et al., 2024) refers to a set of planning benchmarks formulated using the Planning Domain Definition Language. These benchmarks evaluate an agent's ability to solve symbolic planning problems across multiple domains by generating valid sequences of actions that achieve specified goal states. The tasks primarily test structured reasoning and the ability to maintain and utilize intermediate planning states.

BabyAI (Chevalier-Boisvert et al., 2018) is a gridworld navigation benchmark with natural language instructions. The environment contains a large collection of instruction-following tasks (levels), where agents must navigate and interact with objects to satisfy compositional language commands. Due to its sequential decision-making structure, BabyAI is commonly used to evaluate short-term context tracking and instruction grounding.

HotpotQA (Yang et al., 2018) is a multi-hop question answering dataset that requires reasoning over multiple Wikipedia paragraphs. It contains approximately 90k training questions along with validation and test splits, and each question is annotated with supporting facts. This structure makes HotpotQA particularly suitable for evaluating long-term memory storage and retrieval. In our experiments, we use HotpotQA for reinforcement learning training, as its annotated supporting facts naturally provide structured contextual information for Stage 1 supervision.

LLM-based Evaluation Details

For the Memory Quality (MQ) metric, we employ an LLM-based evaluator to assess the quality of supporting facts stored in memory by comparing predicted supporting facts with ground-truth expected facts. The evaluator uses the following prompt template:

The evaluator compares the stored memory entries (predicted supporting facts) with the ground-truth supporting facts provided in the HotpotQA dataset. The score reflects both the coverage of expected facts and the relevance of predicted facts to the question. We use Qwen-Max as the evaluator model, and each evaluation is performed independently to ensure consistency.

You are an expert judge evaluating the quality of supporting facts for question answering.

Question: [QUESTION]
Answer: [ANSWER]

Ground Truth Supporting Facts (the facts that should be identified):
Expected Supporting Facts:
- [FACT_1]
- [FACT_2]
...

Model Predicted Supporting Facts (the facts identified by the model and stored in the long-term memory):
Predicted Supporting Facts:
- [PREDICTED_FACT_1]
- [PREDICTED_FACT_2]
...

Please evaluate how well the predicted supporting facts match the ground truth expected facts:
1. Are all expected facts covered by the predictions?
2. Are the predicted facts actually relevant to answering the question?
3. Are there any irrelevant facts in the predictions?

Score on a scale of 0.0 to 1.0:
- 1.0: Perfect match - all expected facts are correctly identified, no irrelevant facts
- 0.8-0.9: Mostly correct with minor omissions or one irrelevant fact
- 0.6-0.7: Partially correct - some relevant facts identified but missing important ones
- 0.4-0.5: Some correct elements but significant errors or omissions
- 0.2-0.3: Mostly incorrect with few correct elements
- 0.0-0.1: Completely incorrect or irrelevant

Respond with only a number between 0.0 and 1.0 (e.g., "0.85").
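A minimal sketch of how such a template might be filled and the judge's numeric reply parsed (both function names are illustrative, the template is abridged, and the actual LLM call is omitted):

```python
def build_mq_prompt(question, answer, expected_facts, predicted_facts):
    """Fill an abridged version of the MQ judging template."""
    expected = "\n".join(f"- {f}" for f in expected_facts)
    predicted = "\n".join(f"- {f}" for f in predicted_facts)
    return (
        "You are an expert judge evaluating the quality of supporting facts "
        "for question answering.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Expected Supporting Facts:\n{expected}\n"
        f"Predicted Supporting Facts:\n{predicted}\n"
        'Respond with only a number between 0.0 and 1.0 (e.g., "0.85").'
    )

def parse_mq_score(reply):
    """Parse the judge's reply, clamping to [0.0, 1.0]; fail closed on junk."""
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.0
```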

Baseline Configurations

All baseline implementations follow their respective official open-source codebases to ensure fair comparison. We provide the source links and implementation details below.

LangMem (LangChain Team, 2025): We use the official implementation available at https://langchain-ai.github.io/langmem/ with default hyperparameters. LangMem employs a modular memory framework that supports multiple memory types. We configure it to use the default memory storage and retrieval mechanisms as specified in the official documentation.

A-Mem (Xu et al., 2025): We implement A-Mem following the Zettelkasten-inspired design described in the original paper, using the official codebase at https://github.com/WujiangXu/A-mem-sys/ . The system links structured knowledge units to facilitate consolidation. We use the recommended hyperparameters for memory consolidation as provided in the repository.

Mem0 (Chhikara et al., 2025): We use the official Mem0 implementation available at https://github.com/mem0ai/mem0 with the default extract-update pipeline. For the graph-based variant (Mem0_g), we enable the graph structure option and use the recommended graph construction parameters as specified in the official implementation.

AgeMem-noRL: This variant uses the same tool interface as AgeMem but without reinforcement learning. This baseline helps isolate the contribution of RL training to the overall performance.

RAG variants : For the RAG-based baselines (AgeMem-noRL-RAG and AgeMem-RAG), we replace the STM tools with a standard RAG pipeline that retrieves relevant memories at each step and appends them to the context. The retrieval is performed using cosine similarity between the current context and stored memories, following standard RAG practices. This comparison demonstrates the advantage of learned STM management over static retrieval-based approaches.
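The retrieval step used by these RAG variants can be sketched as follows (illustrative only, assuming precomputed embeddings; not the baselines' actual code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_retrieve(context_vec, memory, k=3):
    """Return the texts of the top-k stored memories, ranked by cosine
    similarity to the current context embedding."""
    ranked = sorted(memory, key=lambda m: cosine_sim(context_vec, m[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy example with 2-d embeddings standing in for a real encoder.
memory = [("prefers 120-minute focus blocks", np.array([1.0, 0.0])),
          ("likes green tea", np.array([0.0, 1.0]))]
query = np.array([0.9, 0.1])
```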

Implementation Details

Training configuration. We use the Trinity RL framework (Pan et al., 2025a) for policy optimization, implementing the step-wise GRPO algorithm as described in the method section. We use K = 8 independent rollouts per task for group normalization. The KL divergence coefficient β is set to 0.1.

Reward weights. All reward weights are set to 1/3: w_task = w_context = w_memory = 1/3. This uniform weighting ensures that all components contribute equally to the learning signal, allowing the agent to naturally balance task performance and memory management.
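The group normalization over the K rollouts of each task can be sketched as follows (a minimal sketch using the sample standard deviation; the function name is ours):

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalize terminal rewards over the K rollouts of one task
    (K = 8 in the setup above), using the sample standard deviation.
    The resulting advantage is broadcast to every step of a trajectory."""
    k = len(rewards)
    mu = sum(rewards) / k
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / (k - 1))
    return [(r - mu) / (sigma + eps) for r in rewards]

advs = group_advantages([1.0, 0.0])  # toy group of two rollouts
```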

Model settings. The maximum context length is set to 8,192 tokens, and the maximum response length is set to 2,048 tokens. When the context exceeds this limit, the agent receives a penalty, encouraging proactive use of STM management tools. All experiments are conducted on 8 NVIDIA RTX 4090 GPUs with 48GB memory each.
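A sketch of how such an overflow penalty might look (only the 8,192-token budget comes from the text; the penalty magnitude is an assumption):

```python
def context_overflow_penalty(tokens_used, max_tokens=8192, penalty=-1.0):
    """Return a negative penalty term when the context exceeds the token
    budget, encouraging proactive SUMMARY/FILTER calls before overflow.
    The penalty magnitude here is illustrative."""
    return penalty if tokens_used > max_tokens else 0.0
```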

Additional Results

Ablation Study

This section provides complementary ablation study results for Qwen3-4B-Instruct. Figure 9 shows the progressive contribution of LTM, STM, and RL components on Qwen3-4B-Instruct across three representative datasets. The results demonstrate consistent trends with Qwen2.5-7B-Instruct, validating the generalizability of our approach across different model sizes.

Reward Function Ablation on Qwen3-4B

To validate the generalizability of our multi-component reward design across different model architectures and scales, we conduct the same reward function ablation study as in the main text on Qwen3-4B-Instruct. This section provides a complete analysis parallel to the Qwen2.5-7B-Instruct results presented in the main paper.


Figure 9: Ablation study results for Qwen3-4B-Instruct. Base : No-Memory baseline; +LT : AgeMem-noRL-RAG (LTM tools only); +LT/RL : AgeMem-RAG (RL with LTM tools); +LT/ST/RL : AgeMem (full AgeMem system with RL). Green arrows indicate performance gains over the baseline.


Figure 10: Training convergence curves on Qwen3-4B-Instruct comparing All-Returns (solid line) vs. Answer-Only (dashed line) reward strategies.


Convergence Analysis

Figure 10 demonstrates the reward convergence patterns on Qwen3-4B-Instruct. As with Qwen2.5-7B-Instruct, the All-Returns strategy consistently outperforms Answer-Only throughout the training process. Several notable observations emerge:

More Stable Dynamics: The convergence curve shows noticeably smoother progression with lower variance, particularly in the later training stages (steps 70-100). This stability suggests that Qwen3's architecture may have better inductive biases for the reward learning task.

Consistent Superiority: While the absolute improvement is smaller than Qwen2.5-7B-Instruct, the All-Returns strategy maintains its advantage throughout training, validating the robustness of our reward design.

Quantitative Results

$$ R(\tau) = \sum_{i} w_i \cdot R_i(\tau) + P_{\text{penalty}}(\tau), $$

$$ \theta^{*} = \arg\max_{\theta} \, \mathbb{E}_{\tau \sim \pi_{\theta}} [ R(\tau) ]. $$

$$ \tau_k^{(q)} = \big(\tau_k^{(1)}, \tau_k^{(2)}, \tau_k^{(3)}\big), \quad k = 1, \dots, K, $$

$$ \tau_k^{(q)}=(e_1, e_2, \ldots, e_T),\quad T = T_1 + T_2 + T_3, $$

$$ A_T^{(k,q)} = \frac{r_T^{(k,q)} - \mu_{G_q}}{\sigma_{G_q} + \epsilon}, $$

$$ \begin{aligned} J(\theta) &= \mathbb{E}_{(e_t, A_t) \sim \mathcal{E}} \big[\rho_t A_t - \beta D_{\text{KL}}[\pi_{\theta} \,\|\, \pi_{\text{ref}}] \big] \\ &= \frac{1}{|\mathcal{E}|} \sum_{q=1}^{B} \sum_{k=1}^{K} \sum_{t=1}^{T_k^{(q)}} \big[\rho_t^{(k,q)} A_t^{(k,q)} - \beta D_{\text{KL}}^{(k,q)} \big], \end{aligned} $$

$$ \textsc{Retrieve}(q, k) = \text{TopK}\big(\mathcal{M}_t,\, \text{sim}(q, m_i),\, k\big), $$

$$ \text{sim}(q, m_i) = \frac{\text{enc}(q)^\top \text{enc}(m_i)}{\|\text{enc}(q)\| \, \|\text{enc}(m_i)\|}. $$

$$ m_{\text{new}} = \big(c, \text{enc}(c), \text{metadata}\big), $$

$$ \mathcal{M}_{t+1} = \mathcal{M}_t \cup \{m_{\text{new}}\}. $$

$$ C_t^{\prime} = \big(C_t \setminus \{u_i \mid i \in s\}\big) \cup \{\text{Summarize}(\{u_i\}_{i \in s})\}, $$

$$ C_t^{\prime} = \left\{ u_i \in C_t \,\middle|\, \text{sim}(c, u_i) < \theta \right\}. $$
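The two STM context operations formalized above (summarizing a span of units, and filtering units by similarity to a target) can be sketched as (illustrative Python; `summarizer` stands in for an LLM call):

```python
def summarize_span(context, span, summarizer):
    """SUMMARY: replace the units indexed by `span` with one summary
    unit produced by `summarizer` (an LLM call in practice)."""
    kept = [u for i, u in enumerate(context) if i not in span]
    return kept + [summarizer([context[i] for i in span])]

def filter_context(context, sim_fn, target, threshold):
    """FILTER: keep only units whose similarity to the filter target
    stays below the threshold, dropping the matched segments."""
    return [u for u in context if sim_fn(target, u) < threshold]
```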

$$ R_{\text{task}} = \begin{cases} S_{\text{judge}}(A_{\text{pred}}, A_{q}), & \text{if has answer}, \\ P_{\text{no-answer}}, & \text{otherwise}, \end{cases} $$

$$ R_{\text{context}} = \sum_{i=1}^{3} \alpha_i R_i, $$

$$ R_{\text{compression}} = \max\left(0,\, 1 - \frac{T_{\text{used}}}{T_{\text{max}}}\right), $$

$$ R_{\text{preventive}} = \mathbbm{1}[\text{tool invoked before overflow}], $$

$$ R_{\text{preservation}} = \mathbbm{1}_{\text{preserve}}. $$

$$ R_{\text{storage}} = \frac{N_{\text{high-quality}}}{\max(1, N_{\text{total}})}. $$

$$ R_{\text{maintenance}} = \mathbbm{1}[\text{update or delete performed}]. $$

$$ P_{\text{penalty}} = \sum_{k=1}^{2} P_k \cdot \mathbbm{1}[\text{violation}_k], $$
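Putting the reward terms together, the trajectory-level reward R(τ) can be sketched as (the uniform 1/3 weights follow the implementation details; treating penalty terms as non-positive is our assumption):

```python
def trajectory_reward(r_task, r_context, r_memory, penalties=()):
    """R(tau) = sum_i w_i * R_i(tau) + P_penalty(tau), with the uniform
    weights w = 1/3 used in the experiments. Penalty terms are assumed
    non-positive, since they discourage violations."""
    w = 1.0 / 3.0
    return w * (r_task + r_context + r_memory) + sum(penalties)

r = trajectory_reward(0.9, 0.6, 0.6, penalties=[-0.1])
```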

\caption{AgeMem Training (Part 1)}
\label{alg:main-training}
\small
\begin{algorithmic}[1]
\Require Policy $\pi_{\theta}$, reference $\pi_{\text{ref}}$, batch $\mathcal{B}$, rollouts $K$
\Ensure Trained policy $\pi_{\theta^{*}}$
\State Initialize $\theta$ and $\theta_{\text{old}} \gets \theta$
\For{each training iteration}
\State $\mathcal{E} \gets \emptyset$ \textcolor{gray}{// Init experience buffer}
\State \textcolor{blue}{// \textbf{Rollout Phase}}
\For{each task $q \in \mathcal{B}$}
\State Get context $I_q$ for task $q$
\State $M_{\text{dis}} \gets \textsc{DistractorGen}(q)$
\For{$k = 1$ to $K$}
\State $\mathcal{M} \gets \emptyset$ \textcolor{gray}{// Init LTM}
\State $\tau_k^{(1)} \gets \textsc{Stage1}(I_q, \pi_{\theta}, \theta_{\text{old}}, \mathcal{M})$
\State $C \gets \emptyset$ \textcolor{gray}{// Reset STM}
\State $\tau_k^{(2)} \gets \textsc{Stage2}(M_{\text{dis}}, \pi_{\theta}, \theta_{\text{old}}, \mathcal{M})$
\State $\tau_k^{(3)} \gets \textsc{Stage3}(q, \pi_{\theta}, \theta_{\text{old}}, \mathcal{M})$
\State $\tau_k^{(q)} \gets \tau_k^{(1)} \oplus \tau_k^{(2)} \oplus \tau_k^{(3)}$
\State $\mathcal{E} \gets \mathcal{E} \cup \tau_k^{(q)}$
\EndFor
\EndFor
\EndFor
\end{algorithmic}
\caption{AgeMem Training (Part 2)}
\label{alg:main-training-2}
\small
\begin{algorithmic}[1]
\setcounter{ALG@line}{20}
\State \textcolor{blue}{// \textbf{Advantage Computation}}
\For{each group $G_q = \{\tau_k^{(q)}\}_{k=1}^{K}$}
\State Extract rewards: $\{r_T^{(k,q)}\}_{k=1}^{K}$
\State $\mu_{G_q} \gets \frac{1}{K}\sum_{k=1}^{K} r_T^{(k,q)}$
\State $\sigma_{G_q} \gets \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}(r_T^{(k,q)} - \mu_{G_q})^2}$
\For{each trajectory $\tau_k^{(q)} = (e_1, \ldots, e_T)$}
\State $A_T^{(k,q)} \gets \frac{r_T^{(k,q)} - \mu_{G_q}}{\sigma_{G_q} + \epsilon}$
\For{$t = 1$ to $T$}
\State $A_t^{(k,q)} \gets A_T^{(k,q)}$ \textcolor{gray}{// Broadcast}
\EndFor
\EndFor
\EndFor
\State \textcolor{blue}{// \textbf{Policy Update}}
\State $J(\theta) \gets \mathbb{E}_{(e_t, A_t) \sim \mathcal{E}} [\rho_t A_t - \beta D_{\text{KL}}[\pi_{\theta}\|\pi_{\text{ref}}]]$
\State $\theta \gets \theta + \eta \nabla_{\theta} J(\theta)$
\State $\theta_{\text{old}} \gets \theta$
\State \Return $\pi_{\theta}$
\end{algorithmic}
\caption{Stage 1: LTM Construction}
\label{alg:stage1-rollout}
\begin{algorithmic}[1]
\Require Contextual information $I_q$, policy $\pi_{\theta}$, old params $\theta_{\text{old}}$, memory $\mathcal{M}$, max turn number $N_{max}$
\Ensure Stage 1 trajectory $\tau^{(1)} = (e_1^{(1)}, \ldots, e_{T_1}^{(1)})$
\State Initialize $\tau^{(1)} \gets \emptyset$ and $C \gets \emptyset$
\For{$t = 1$ to $N_{max}$}
\State Sample message $m_t \sim I_q$
\State $\mathcal{M}_{\text{ret}} \gets \textsc{Retrieve}(\mathcal{M}, m_t, k) \cup m_t$
\State $C \gets C \cup \mathcal{M}_{\text{ret}}$
\State $s_t \gets (C, \mathcal{M}, q)$
\State $a_t \sim \pi_{\theta}(\cdot \mid s_t)$

% \If{$a_t$ invokes $\textsc{Add}(c, \text{meta})$}
% \State $\mathcal{M} \gets \mathcal{M} \cup \{(c, \text{meta})\}$ \textcolor{gray}{// Store to LTM}
% \EndIf
\State Update $C$ with response from $a_t$
\State $e_t^{(1)} \gets (s_t, a_t, 0, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$
\State $\tau^{(1)} \gets \tau^{(1)} \cup \{e_t^{(1)}\}$

\State Execute memory tool calls from $a_t$ \textcolor{gray}{// Memory Management}
\If{$a_t$ outputs an answer}
\State \textbf{break} \textcolor{gray}{// End conversation}
\EndIf
\EndFor
\State \Return $\tau^{(1)}$
\end{algorithmic}
\caption{Stage 2: STM Control under Distractors}
\label{alg:stage2-rollout}
\begin{algorithmic}[1]
\Require Distractors $M_{\text{dis}}$, policy $\pi_{\theta}$, old params $\theta_{\text{old}}$, memory $\mathcal{M}$, max turn number $N_{max}$
\Ensure Stage 2 trajectory $\tau^{(2)} = (e_1^{(2)}, \ldots, e_{T_2}^{(2)})$
\State Initialize $\tau^{(2)} \gets \emptyset$ and $C \gets \emptyset$ \textcolor{gray}{// $\mathcal{M}$ persists from Stage 1}
\For{$t = 1$ to $N_{max}$}
\State $C \gets C \cup \{M_{\text{dis}}[t]\}$ \textcolor{gray}{// Inject distractor}
\State $s_t \gets (C, \mathcal{M}, q)$
\State $a_t \sim \pi_{\theta}(\cdot \mid s_t)$
\State Update $C$ with response from $a_t$
\State $e_t^{(2)} \gets (s_t, a_t, 0, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$
\State $\tau^{(2)} \gets \tau^{(2)} \cup \{e_t^{(2)}\}$
\State Execute memory tool calls from $a_t$ \textcolor{gray}{// Memory Management}
\If{$a_t$ outputs an answer}
\State \textbf{break} \textcolor{gray}{// End conversation}
\EndIf
\EndFor
\State \Return $\tau^{(2)}$
\end{algorithmic}

Table 5 reports the reward ablation results on HotpotQA with Qwen3-4B-Instruct. Compared to the Answer-Only strategy, the All-Returns reward consistently improves overall performance. In particular, it yields higher LLM-as-a-Judge scores (0.555 vs. 0.546) and substantially better memory quality (MQ: 0.605 vs. 0.415), indicating that explicitly rewarding memory-related behaviors leads to more reliable memory organization. The All-Returns strategy also encourages more active tool usage (8.67 vs. 7.21), suggesting that the agent learns to leverage memory operations more effectively when intermediate returns are optimized. This improvement comes with only a marginal increase in token consumption (2191 vs. 2164), implying that the gains are not driven by excessive context expansion but by more efficient memory utilization. Overall, these results show that incorporating memory-aware rewards significantly enhances both memory quality and task performance on Qwen3-4B-Instruct. The observed trends are consistent with those obtained on Qwen2.5-7B-Instruct, confirming the robustness of the reward design across different model backbones.

| Tool | Target | Function |
| --- | --- | --- |
| ADD | LTM | Add new knowledge to M_t |
| UPDATE | LTM | Modify entries in M_t |
| DELETE | LTM | Remove entries from M_t |
| RETRIEVE | STM | Retrieve entries from M_t to C_t |
| SUMMARY | STM | Summarize segments in C_t |
| FILTER | STM | Filter out irrelevant segments from C_t |
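The six tools above can be exposed to the agent through a standard function-calling interface; a minimal sketch follows (tool names and targets come from Table 1, while the schema fields and parameter names are illustrative assumptions):

```python
# Illustrative tool registry; field names and parameters are assumptions,
# not the paper's actual schema.
MEMORY_TOOLS = [
    {"name": "ADD", "target": "LTM",
     "description": "Add new knowledge to the long-term store",
     "parameters": {"content": "str", "metadata": "dict"}},
    {"name": "UPDATE", "target": "LTM",
     "description": "Modify an existing long-term entry",
     "parameters": {"entry_id": "str", "content": "str"}},
    {"name": "DELETE", "target": "LTM",
     "description": "Remove a long-term entry",
     "parameters": {"entry_id": "str"}},
    {"name": "RETRIEVE", "target": "STM",
     "description": "Pull top-k relevant entries into the context",
     "parameters": {"query": "str", "k": "int"}},
    {"name": "SUMMARY", "target": "STM",
     "description": "Summarize selected context segments",
     "parameters": {"segment_ids": "list"}},
    {"name": "FILTER", "target": "STM",
     "description": "Drop irrelevant context segments",
     "parameters": {"segment_ids": "list"}},
]

def tools_for(target):
    """List tool names operating on a given memory type."""
    return [t["name"] for t in MEMORY_TOOLS if t["target"] == target]
```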
| LLM Backbone | Method | ALFWorld | SciWorld | PDDL | BabyAI | HotpotQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | No-Memory | 27.16 | 13.80 | 10.15 | 50.80 | 38.36 | 28.05 |
| Qwen2.5-7B-Instruct | LangMem | 38.27 | 28.29 | 15.85 | 51.34 | 37.43 | 34.23 |
| Qwen2.5-7B-Instruct | A-Mem | 34.68 | 28.06 | 18.39 | 58.82 | 43.95 | 36.78 |
| Qwen2.5-7B-Instruct | Mem0 | 37.49 | 26.99 | 13.96 | 60.58 | 46.66 | 37.14 |
| Qwen2.5-7B-Instruct | Mem0_g | 35.34 | 30.50 | 14.86 | 58.78 | 42.06 | 36.31 |
| Qwen2.5-7B-Instruct | AgeMem-noRL | 37.90 | 28.67 | 8.87 | 46.34 | 45.36 | 33.43 |
| Qwen2.5-7B-Instruct | AgeMem (Ours) | 41.07 | 35.55 | 17.31 | 61.42 | 54.44 | 41.96 |
| Qwen3-4B-Instruct | No-Memory | 38.51 | 47.89 | 30.14 | 55.83 | 47.48 | 43.97 |
| Qwen3-4B-Instruct | LangMem | 40.89 | 50.42 | 28.42 | 53.80 | 42.70 | 43.25 |
| Qwen3-4B-Instruct | A-Mem | 34.31 | 50.14 | 34.41 | 61.35 | 48.48 | 45.74 |
| Qwen3-4B-Instruct | Mem0 | 41.17 | 51.38 | 31.72 | 60.05 | 39.16 | 44.70 |
| Qwen3-4B-Instruct | Mem0_g | 36.69 | 47.76 | 29.61 | 57.59 | 38.12 | 41.95 |
| Qwen3-4B-Instruct | AgeMem-noRL | 38.02 | 50.42 | 27.52 | 57.48 | 54.49 | 45.59 |
| Qwen3-4B-Instruct | AgeMem (Ours) | 48.97 | 59.48 | 35.07 | 72.56 | 55.49 | 54.31 |
| Tool Category | Qwen2.5-7B noRL | Qwen2.5-7B GRPO | Qwen3-4B noRL | Qwen3-4B GRPO |
| --- | --- | --- | --- | --- |
| LTM Tool Statistics | | | | |
| ADD Memory | 0.92 | 1.64 | 2.49 | 2.64 |
| UPDATE Memory | 0.00 | 0.13 | 0.13 | 0.34 |
| DELETE Memory | 0.00 | 0.08 | 0.00 | 0.22 |
| STM Tool Statistics | | | | |
| RETRIEVE Memory | 2.31 | 1.95 | 4.62 | 4.35 |
| SUMMARY Context | 1.08 | 0.82 | 0.11 | 0.96 |
| FILTER Context | 0.02 | 0.31 | 0.15 | 0.16 |
| Total Calls | 4.33 | 4.92 | 7.50 | 8.67 |
Qwen2.5-7B-Instruct:

| Strategy | J (↑) | TN (↓) | MQ (↑) | TC (-) |
| --- | --- | --- | --- | --- |
| Answer-Only | 0.509 | 2078 | 0.479 | 3.93 |
| All-Returns | 0.544 | 2117 | 0.533 | 4.92 |

Qwen3-4B-Instruct:

| Strategy | J (↑) | TN (↓) | MQ (↑) | TC (-) |
| --- | --- | --- | --- | --- |
| Answer-Only | 0.546 | 2164 | 0.415 | 7.21 |
| All-Returns | 0.555 | 2191 | 0.605 | 8.67 |


Yi Yu1,2, Liuyi Yao1,†, Yuexiang Xie1, Qingquan Tan2, Jiaqi Feng2, Yaliang Li1, and Libing Wu2,† 1Alibaba Group, 2School of Cyber Science and Engineering, Wuhan University {yui1212,tanqingquan,jiaqiFeng,wu}@whu.edu.cn {yly287738,yuexiang.xyx,yaliang.li}@alibaba-inc.com †Corresponding authors


However, existing research has predominantly treated LTM and STM as independent components. STM is commonly enhanced through retrieval-augmented generation (RAG) (Pan et al., 2025b), such as in MainRAG (Chang et al., 2025) and ReSum (Wu et al., 2025a), which expand usable context via external retrieval or periodic summarization. Although effective in some tasks, these methods rely heavily on predefined schedules or heuristic rules, potentially resulting in overlooked infrequent but critical details as well as unnecessary noise (Ma et al., 2025; Dong et al., 2025). In contrast, LTM management has progressed along separate lines, typically categorized into trigger-based (Kang et al., 2025; Wang and Chen, 2025; Wang et al., 2025c; Chhikara et al., 2025) and agent-based (Yan et al., 2025; Hu et al., 2025; Xu et al., 2025) paradigms. The former executes fixed memory operations at predefined moments, whereas the latter incorporates a specialized memory manager to determine what and how to store. Despite offering more flexibility, most approaches still depend on handcrafted rules or auxiliary expert models, limiting adaptability and increasing system complexity (Xiong et al., 2025).

As a consequence, LTM and STM are typically treated as separate and loosely coupled modules. As illustrated in Figure 1, existing architectures generally follow two patterns: (a) static STM with trigger-based LTM, or (b) static STM with agent-based LTM. In both settings, the two memory systems are optimized independently and later combined in an ad hoc way, leading to fragmented memory construction and suboptimal performance in long-horizon reasoning tasks. Thus, unifying the management of LTM and STM remains a necessary yet largely unexplored challenge.

Nevertheless, achieving unified memory management poses three fundamental challenges. (C1) Functional heterogeneity coordination: LTM and STM serve distinct yet complementary purposes: LTM determines what to store, update, or discard, while STM governs what to retrieve, summarize, or remove from the active context (Zhang et al., 2025b). The challenge lies in designing a unified mechanism that orchestrates their interplay synergistically. (C2) Training paradigm mismatch: Existing reinforcement learning (RL) frameworks adopt markedly different training strategies for the two memory types (Ma et al., 2024). LTM-focused training often leverages session-level information available prior to interaction, whereas STM training typically injects distractors to simulate long-horizon contexts (Sun et al., 2024). Moreover, standard RL assumes continuous trajectories with stable rewards, which conflicts with the inherently fragmented and discontinuous experiences produced by memory operations (Wu et al., 2025a), making end-to-end optimization particularly challenging. (C3) Practical deployment constraints: Many agent systems rely on an auxiliary expert LLM for memory control, significantly increasing inference cost and training complexity. How to integrate unified memory management directly into an agent without dependence on external expert models remains an open problem.

To address these challenges, we propose Agentic Memory (AgeMem), a unified framework that jointly manages LTM and STM, illustrated in Figure 1 (right). Unlike prior designs that treat memory as an external component, AgeMem integrates both memory types directly into the agent’s decision-making process. Through a unified tool-based interface, the LLM autonomously invokes and executes memory operations for both LTM and STM. Furthermore, we design a three-stage progressive RL strategy: the model first acquires LTM storage capabilities, then learns STM context management, and finally coordinates both forms of memory under full task settings. To address the fragmented experience issue across training stages, we design a step-wise Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which transforms cross-stage dependencies into learnable signals, thereby alleviating the challenges posed by sparse and discontinuous rewards in RL. We evaluate AgeMem on five long-context, reasoning-intensive benchmarks. Comprehensive results show that AgeMem consistently outperforms strong baselines, validating the effectiveness of unified memory management.

Our main contributions are as follows:

We propose Agentic Memory (AgeMem), a unified agentic memory framework that enables LLM-based agents to autonomously decide when, what, and how to manage both long-term and short-term memory.

We develop a three-stage progressive RL strategy equipped with a step-wise GRPO mechanism, facilitating effective end-to-end learning of unified memory management behaviors.

We conduct comprehensive evaluations across multiple models and long-horizon benchmarks, demonstrating the robustness and effectiveness of AgeMem in complex agentic tasks.

Long-term memory (LTM). Persistent LTM is crucial for LLM-based agents operating over extended horizons (Wang et al., 2025b; Li et al., 2025). Recent work has explored diverse architectural designs for modeling LTM. LangMem (LangChain Team, 2025) provides a modular framework that supports multiple memory types, while A-Mem (Xu et al., 2025) adopts a Zettelkasten-inspired design that links structured knowledge units to facilitate consolidation. Mem0 (Chhikara et al., 2025) proposes a scalable extract-update pipeline and extends it to a graph-based variant for structured reasoning, and Zep (Rasmussen et al., 2025) represents memory as a temporal knowledge graph to enable cross-session and time-aware reasoning. Although effective in organizing and retrieving information, these approaches largely rely on predefined memory structures or heuristic update rules. As memory grows, such designs commonly suffer from increased system complexity and lack adaptive, learning-based strategies for prioritization and forgetting. In contrast, our work aims to learn an adaptive memory policy that allows agents to dynamically decide what to store, update, or forget, depending on task demands and long-term utility.

Short-term memory (STM). STM in agentic LLMs primarily concerns context selection and retrieval (Wang et al., 2024; Jin et al., 2024). Retrieval-Augmented Generation (RAG) (Pan et al., 2025b; Salama et al., 2025; Kagaya et al., 2024) is the dominant paradigm, expanding usable context by injecting retrieved content into prompts. While effective, RAG does not fundamentally prevent context explosion in long-horizon settings and may introduce irrelevant or distracting information. To address this issue, ReSum (Wu et al., 2025a) periodically compresses interaction histories into compact reasoning states, allowing agents to operate beyond fixed context-window constraints. Yet its summarization schedule remains largely predefined, and aggressive compression risks discarding rare but crucial details. Our approach instead enables agents to learn when and how to retrieve, summarize, or filter context, achieving a more flexible balance between efficiency and information preservation.

Reinforcement learning for LLMs. Reinforcement learning has become an effective paradigm for improving the decision-making and reasoning capabilities of LLM-based agents (Yao et al., 2022; Jin et al., 2025; Qian et al., 2025; Chaudhari et al., 2025). Among recent advances, GRPO (Shao et al., 2024) enhances stability by optimizing policies based on the relative quality of sampled trajectories, removing the need for an explicit value function. GRPO and its variants (Gilabert et al., 2025; Wang et al., 2025a) have shown strong performance in complex reasoning tasks. However, existing RL-based systems generally treat memory as a static or external component, making them ill-suited for the discontinuous and fragmented trajectories associated with memory operations (Yan et al., 2025; Zhang et al., 2025a). In contrast, our work integrates RL directly into the memory management process, enabling unified training of both language generation and memory operations.

We propose Agentic Memory (AgeMem), a unified memory framework that enables LLM agents to autonomously manage both LTM and STM in an end-to-end manner. As illustrated in Figure 1 (right), AgeMem integrates memory management capabilities directly into the agent via a set of specialized tools, enabling the model to learn optimal strategies for unified memory management through a three-stage progressive training strategy.

Unified RL formulation for AgeMem. At each time step $t$, the agent observes a state $s_t \in \mathcal{S}$ composed of the conversation context (short-term memory) $C_t$, the long-term memory store $\mathcal{M}_t$, and the task specification $\mathcal{T}$: $s_t = (C_t, \mathcal{M}_t, \mathcal{T})$. The specification $\mathcal{T}$ includes the input query $q$, contextual information $I_q$, and (for training only) the expected answer $A_q$. This formulation enables the agent to ground its decision-making in both transient context and persistent knowledge.

Given $s_t$, the agent selects an action $a_t \in \mathcal{A}$ from a hybrid action space that includes language generation as well as memory operations. The decision is governed by a parameterized policy $\pi_{\theta}$, defined as $\pi_{\theta}(a_t \mid s_t) = P(a_t \mid s_t; \theta)$, where $\theta$ denotes the LLM parameters and $a_t \sim \pi_{\theta}(\cdot \mid s_t)$. For a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$, the cumulative reward is defined as:

$$R(\tau) = \sum_{i} R_i + P_{\text{penalty}},$$

where $R_i$ captures task performance and memory quality, and $P_{\text{penalty}}$ discourages redundant storage, excessive tool usage, and uncontrolled context expansion. The optimization objective is:

$$\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau)\right].$$

This formulation treats memory management as an integral component of the agent's policy, replacing handcrafted heuristics with a learnable mechanism.

Three-stage trajectory structure. To capture long-horizon interactions and progressively train memory capabilities, each trajectory is divided into three consecutive stages: $\tau = (\tau^{(1)}, \tau^{(2)}, \tau^{(3)})$, with a total length of $T = T_1 + T_2 + T_3$. In Stage 1, the agent engages in casual interactions and may store useful information into LTM. Stage 2 introduces distracting or irrelevant content, requiring the agent to manage its STM through selective retention and compression. Stage 3 presents a task that depends on coordinated use of both retained context and earlier accumulated LTM. A key aspect of this design is that the long-term memory $\mathcal{M}_t$ persists across all stages, allowing early knowledge to influence later decisions. In contrast, the context $C_t$ is reset between Stages 1 and 2 to prevent information leakage across phases. The reset before Stage 2 ensures the agent cannot solve the final task via residual context, thereby forcing proper retrieval from LTM and enabling effective training of memory operations.

At each step, we collect an experience tuple $e_t = (s_t, a_t, r_t, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$, where $r_t$ is typically zero for intermediate steps and assigned after trajectory completion, and $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ denotes the log-probability under the old policy $\pi_{\theta_{\text{old}}}$. This representation enables step-wise credit assignment under GRPO (Shao et al., 2024) and allows the agent to attribute long-term rewards to specific memory decisions across stages. By structuring trajectories in this staged yet continuous manner, the agent learns temporally coherent and task-adaptive memory policies essential for robust long-horizon reasoning.
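As an illustration, the state, experience tuple, and staged trajectory described above can be sketched in a few lines of Python. This is a minimal reading of the formulation, not the paper's implementation; all field and class names are our own shorthand.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class State:
    context: list        # short-term memory C_t (message list)
    ltm: list            # long-term memory store M_t
    task: dict           # task specification T = (q, I_q, A_q)

@dataclass
class Experience:
    state: State
    action: Any               # language tokens or a structured tool call
    reward: float = 0.0       # zero at intermediate steps, assigned at trajectory end
    logprob_old: float = 0.0  # log pi_old(a_t | s_t), needed later for GRPO

@dataclass
class Trajectory:
    # Three consecutive stage segments; the LTM store persists across them,
    # while the short-term context is reset between stages.
    stage1: list = field(default_factory=list)  # LTM construction
    stage2: list = field(default_factory=list)  # STM control under distractors
    stage3: list = field(default_factory=list)  # integrated task solving

    def steps(self):
        return self.stage1 + self.stage2 + self.stage3
```

Keeping the three segments separate while exposing a flat `steps()` view mirrors how a single terminal reward can later be attributed to every step of the staged trajectory.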

AgeMem exposes memory-related operations to the LLM agent through an explicit tool interface (Table 1). The agent can modify its persistent LTM using Add, Update, and Delete, while exercising fine-grained control over STM through Retrieve, Summary, and Filter. Incorporating these tools into the action space transforms memory control from an external heuristic pipeline into an intrinsic component of decision-making. This design allows the agent to adaptively manage memory according to task structure, history, and context. Implementation details are provided in Appendix A.1.
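As a concrete sketch of such an interface, the six tools could be declared as function-calling schemas. The tool names follow Table 1; the argument fields and the `dispatch` helper are illustrative assumptions, not the paper's schema.

```python
# Tool names follow Table 1; argument fields are illustrative assumptions.
MEMORY_TOOLS = [
    {"name": "add",      "args": {"content": "str"}},                      # LTM
    {"name": "update",   "args": {"memory_id": "str", "content": "str"}},  # LTM
    {"name": "delete",   "args": {"memory_id": "str"}},                    # LTM
    {"name": "retrieve", "args": {"query": "str", "k": "int"}},            # STM
    {"name": "summary",  "args": {"span": "str"}},                         # STM
    {"name": "filter",   "args": {"criteria": "str"}},                     # STM
]

LTM_TOOLS = {"add", "update", "delete"}

def dispatch(call):
    """Route a parsed tool call to the memory it primarily affects (sketch)."""
    return "ltm" if call["name"] in LTM_TOOLS else "stm"
```

Because the schemas live in the agent's action space, deciding *whether* to call a tool and *which* arguments to pass is itself policy output, which is what makes the behavior trainable end-to-end.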

To learn unified and stable memory behaviors, we propose a progressive three-stage training strategy. For each task instance $q \in \mathcal{T}$, the agent generates a complete trajectory:

$$\tau_k^{(q)} = (\tau_k^{(1)}, \tau_k^{(2)}, \tau_k^{(3)}), \quad k = 1, \ldots, K,$$

where $K$ denotes the number of independent rollouts, and each sub-trajectory $\tau_k^{(i)}$ corresponds to a specific training stage.

Stage 1 (LTM construction). The agent is exposed to contextual information $I_q$ in a casual conversational setting. The goal is to identify salient information and store it into LTM $\mathcal{M}_t$. During the interaction, the short-term context $C_t$ evolves naturally, and the agent may invoke LTM-related tools when appropriate. Formally, this stage yields a sub-trajectory $\tau_k^{(1)} = \{e_t\}_{t=1}^{T_1}$, where each experience tuple $e_t$ follows the definition in Section 3.1.

Stage 2 (STM control under distractors). The short-term context is reset, while the constructed LTM $\mathcal{M}_t$ is retained. The agent is then presented with semantically related but irrelevant or misleading distractors. The objective is to learn proactive STM control through tool-based operations, such as filtering or summarizing context, in order to suppress noise and preserve useful information. This process forms the sub-trajectory $\tau_k^{(2)} = \{e_t\}_{t=T_1+1}^{T_1+T_2}$, which emphasizes context filtering and compression capability.

Stage 3 (Integrated reasoning and memory coordination). Finally, the agent receives a formal query $q$ requiring both accurate reasoning and effective memory retrieval. The agent must retrieve relevant knowledge from $\mathcal{M}_t$, appropriately manage the context $C_t$, and generate a final answer. This stage produces $\tau_k^{(3)} = \{e_t\}_{t=T_1+T_2+1}^{T}$, which evaluates the agent's ability to coordinate long-term memory, short-term context management, and task solving in an end-to-end manner.

All three segments form a complete trajectory:

$$\tau_k^{(q)} = \tau_k^{(1)} \oplus \tau_k^{(2)} \oplus \tau_k^{(3)},$$

which is then used for policy optimization in the subsequent step-wise GRPO procedure. For a batch of $B$ tasks, we further aggregate all experiences from the $K$ independent rollouts into a unified set $\mathcal{E} = \bigcup_{q=1}^{B} \bigcup_{k=1}^{K} \{e_t \mid e_t \in \tau_k^{(q)}\}$, with a total size of $|\mathcal{E}| = B \times K \times \bar{T}$, where $\bar{T}$ denotes the average trajectory length. More detailed rollout processes are provided in Appendix A.3.

We adopt a step-wise variant of GRPO to connect long-range task rewards with memory decisions across all stages. For task $q$, let $G_q = \{\tau_1^{(q)}, \ldots, \tau_K^{(q)}\}$ denote the group of parallel rollouts. Each trajectory yields a terminal reward $r_T^{(k,q)} = R(\tau_k^{(q)})$. We compute the group-normalized advantage for the terminal step as:

$$A_T^{(k,q)} = \frac{r_T^{(k,q)} - \mu_{G_q}}{\sigma_{G_q} + \epsilon},$$

where $\mu_{G_q}$ and $\sigma_{G_q}$ are the mean and standard deviation of rewards within $G_q$, and $\epsilon$ prevents division by zero. This advantage is then broadcast to all preceding steps of the same trajectory, $A_t^{(k,q)} = A_T^{(k,q)}$, which assigns a consistent learning signal to all memory and reasoning actions along the trajectory, including those in Stage 1 and Stage 2. In doing so, the final task outcome supervises every intermediate memory decision, enabling long-range credit assignment across heterogeneous stages. We then augment the experience set with advantages: $\mathcal{E} = \bigcup_{q,k}^{B,K} \{(e_t, A_t) \mid e_t \in \tau_k^{(q)},\ A_t = A_t^{(k,q)}\}$.
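The normalize-then-broadcast rule above can be sketched in plain Python (an illustration of the credit-assignment scheme, not the training code):

```python
import statistics

def stepwise_grpo_advantages(terminal_rewards, traj_lengths, eps=1e-8):
    """Compute A_T = (r_T - mu_G) / (sigma_G + eps) for each rollout in a
    group, then broadcast it to every step of the same trajectory."""
    mu = statistics.mean(terminal_rewards)
    sigma = statistics.pstdev(terminal_rewards)  # population std over the group
    advantages = []
    for reward, length in zip(terminal_rewards, traj_lengths):
        a_terminal = (reward - mu) / (sigma + eps)
        advantages.append([a_terminal] * length)  # A_t = A_T for all steps t
    return advantages
```

For two rollouts with rewards 1.0 and 0.0, every step of the first trajectory receives an advantage near +1 and every step of the second near −1, so early memory actions in Stages 1 and 2 share credit for the final Stage 3 outcome.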

Following GRPO, we maximize the expected objective over all experiences:

$$\mathcal{J}(\theta) = \mathbb{E}_{(e_t, A_t) \sim \mathcal{E}}\left[\min\left(\rho_t^{(k,q)} A_t^{(k,q)},\ \operatorname{clip}\left(\rho_t^{(k,q)},\ 1-\epsilon,\ 1+\epsilon\right) A_t^{(k,q)}\right) - \beta\, D_{\text{KL}}^{(k,q)}\right],$$

where the importance ratio $\rho_t^{(k,q)} = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ controls the update magnitude under the new policy, $D_{\text{KL}}^{(k,q)}$ denotes the KL divergence penalty between the current policy $\pi_{\theta}$ and a fixed reference $\pi_{\text{ref}}$, and $\beta$ is a coefficient that balances exploration and training stability.

We design a composite reward that evaluates both downstream task performance and the quality of memory management. The total trajectory-level reward is defined as

$$R(\tau) = \mathbf{w}^{\top} \mathbf{R} + P_{\text{penalty}},$$

where $\mathbf{w} = [w_{\text{task}}, w_{\text{context}}, w_{\text{memory}}]^{\top}$ are tunable coefficients, and $\mathbf{R} = [R_{\text{task}}, R_{\text{context}}, R_{\text{memory}}]^{\top}$ correspond to rewards for task completion, context management, and long-term memory management. The penalty term $P_{\text{penalty}}$ captures violations such as context overflow or exceeding the interaction limit. Below, we summarize each component; precise formulas are provided in Appendix A.2.

Task completion reward $R_{\text{task}}$. This term provides the primary learning signal by assessing whether the agent solves the task correctly. We obtain a scalar score using an LLM-based judge $S_{\text{judge}}(A_{\text{pred}}, A_q) \in [0, 1]$, optionally applying a penalty when no answer is produced. This reward encourages accurate, complete task solutions and remains the dominant component to ensure alignment with task objectives.

Context management reward $R_{\text{context}}$. This component evaluates STM behavior, focusing on how effectively the agent controls the active context $C_t$. It combines three factors: (i) compression efficiency, promoting economical token usage; (ii) preventive actions, rewarding early summarization or filtering to avoid overflow; and (iii) information preservation, penalizing the loss of critical query-related content. Each factor is normalized, allowing the reward to balance context efficiency against retention of essential information.

Memory management reward $R_{\text{memory}}$. This term evaluates LTM operations. It aggregates signals for: (i) storage quality, measured as the fraction of stored entries labeled as high-quality and reusable; (ii) maintenance, rewarding meaningful update or delete operations to mitigate memory staleness; and (iii) semantic relevance, computed using an LLM-based score between retrieved memories and the query. Together, these signals incentivize selective, high-value memory construction and responsible upkeep over time.

Penalty terms $P_{\text{penalty}}$. Penalties discourage undesirable behaviors such as exceeding the maximum number of dialogue turns or triggering context overflow. Penalty coefficients are chosen so that such violations lead to a substantial reduction in the final trajectory reward, encouraging the agent to maintain safe and efficient memory practices.
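Putting the pieces together, the trajectory-level reward can be sketched as below. The uniform weights of 1.0 follow the stated experimental setup; treating the penalty coefficients as negative values that are added in follows the defaults given in Appendix A.2.

```python
def trajectory_reward(r_task, r_context, r_memory,
                      rounds_exceeded=False, context_overflow=False,
                      w=(1.0, 1.0, 1.0), p_rounds=-1.0, p_overflow=-0.5):
    """Sketch of the composite reward: weighted component sum plus
    (negative) penalty terms for constraint violations."""
    reward = w[0] * r_task + w[1] * r_context + w[2] * r_memory
    penalty = p_rounds * rounds_exceeded + p_overflow * context_overflow
    return reward + penalty
```

Because the penalty magnitudes are on the same scale as the component rewards, a single violation can erase most of a trajectory's reward, which is what pushes the agent toward preventive summarization and bounded tool use.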

Datasets. To comprehensively evaluate AgeMem, we select five widely-used datasets in LLM-based agent research: ALFWorld (Shridhar et al., 2020), SciWorld (Wang et al., 2022), PDDL (Chang et al., 2024), BabyAI (Chevalier-Boisvert et al., 2018), and HotpotQA (Yang et al., 2018). These datasets cover embodied action, game-based reasoning, and knowledge-intensive question answering, providing diverse evaluation scenarios. Since the HotpotQA dataset contains both questions and supporting facts, automatically providing Stage 1 contextual information, AgeMem is fine-tuned with RL only on the HotpotQA training set and then evaluated directly on all datasets. Detailed dataset statistics are provided in Appendix C.1.

Evaluation metrics. For the primary task completion metrics, we adopt Success Rate (SR) for ALFWorld, SciWorld, and BabyAI, Progress Rate (PR) for PDDL, and LLM-as-a-Judge (J) for HotpotQA. Additionally, we employ an LLM-based evaluator to assess the quality of stored long-term memory during knowledge reasoning, measured by Memory Quality (MQ). The prompts used for LLM-based evaluation are provided in Appendix C.2.

Baselines & LLM backbones. We compare AgeMem against four representative agent LTM systems: LangMem (LangChain Team, 2025), A-Mem (Xu et al., 2025), Mem0 (Chhikara et al., 2025), and $\text{Mem0}^{g}$ (a graph-based variant officially provided as part of Mem0). To better demonstrate the effectiveness of RL training, we also include AgeMem-noRL, which is not fine-tuned with RL. In ablation studies on STM, we compare the STM tools with a RAG approach. For the base agent models, we use Qwen2.5-7B-Instruct and Qwen3-4B-Instruct. More baseline configurations are in Appendix C.3.

Implementation details. We build agents using the AgentScope framework (Gao et al., 2025a) and fine-tune AgeMem using the Trinity framework (Pan et al., 2025a). For all reward weights in the reward function, we use uniform coefficients of 1.0 without manual tuning. Further implementation details are provided in Appendix C.4.

Comparison with counterparts. Table 2 shows that AgeMem achieves the highest average performance on both Qwen2.5-7B-Instruct (41.96%) and Qwen3-4B-Instruct (54.31%), outperforming all baselines across five datasets with relative gains of 49.59% and 23.52% over the no-memory setting, respectively. Compared to the best baselines (Mem0 and A-Mem), AgeMem improves by 4.82 and 8.57 percentage points on average. RL training contributes improvements of 8.53 and 8.72 percentage points over AgeMem-noRL, validating the three-stage progressive RL strategy.

Quality of stored long-term memories. To evaluate the quality of stored memories, we leverage the ground-truth facts provided in the HotpotQA dataset and assess the relevance between stored memories and these facts using an LLM-based evaluator. Figure 2 presents the Memory Quality (MQ) scores for different baselines. AgeMem achieves the highest memory quality on both model backbones, with MQ scores of 0.533 and 0.605, respectively. This indicates that the unified memory management framework not only improves task performance but also promotes the storage of high-quality, reusable knowledge. The comparison with baseline methods further validates that AgeMem's tool-based memory operations lead to more selective and higher-quality memory construction.

Effectiveness of STM management. We evaluate the effectiveness of STM management by measuring the prompt token count under different configurations on HotpotQA. Figure 3 shows that AgeMem successfully reduces prompt token usage compared to AgeMem-RAG, a variant that replaces the STM tools with RAG. On Qwen2.5-7B-Instruct, AgeMem uses 2,117 tokens on average, compared to 2,186 tokens for AgeMem-RAG, a reduction of 3.1%. On Qwen3-4B-Instruct, the reduction is even more pronounced: AgeMem uses 2,191 tokens versus 2,310 tokens for AgeMem-RAG, a reduction of 5.1%. These results demonstrate that the learned STM management tools effectively control context expansion, enabling more efficient token usage while maintaining task performance.

Tool usage analysis. Table 3 reports tool usage statistics before and after RL fine-tuning on HotpotQA. RL training substantially increases the use of long-term memory tools, especially Add and Update. On Qwen2.5-7B-Instruct, Add operations rise from 0.92 to 1.64, and Update operations emerge after training (0.13 vs. nearly zero before). Similar trends are observed on Qwen3-4B-Instruct, with higher frequencies of both Add and Update. For short-term memory tools, RL leads to more balanced tool usage. The frequency of Filter increases notably (e.g., from 0.02 to 0.31 on Qwen2.5), indicating proactive context control, while Retrieve remains relatively stable. Overall, these patterns suggest that RL training enables coordinated and adaptive memory management. Detailed case studies are provided in Appendix B.

LTM-STM components. To validate the contributions of individual components, we conduct ablation studies on LTM, STM, and RL training. Figure 4 presents results on three representative datasets using Qwen2.5-7B-Instruct as the backbone (results for Qwen3-4B-Instruct are provided in Appendix D.1). Adding LTM alone (+LT) yields substantial gains of +10.6%, +14.2%, and +7.4% over the baseline. Incorporating RL training (+LT/RL) further improves performance, particularly on HotpotQA (+6.3%), demonstrating the effectiveness of our reward-based optimization. The full AgeMem system (+LT/ST/RL) achieves the best results across all benchmarks, with overall improvements of +13.9%, +21.7%, and +16.1%. Notably, adding STM tools provides the most significant boost on SciWorld (+3.1%) and HotpotQA (+2.4%), validating that learned context management outperforms static RAG approaches. These progressive improvements confirm that unified memory management with end-to-end RL is essential for optimal agent performance.

Reward function. To demonstrate the effectiveness of our multi-component reward function design, we compare the full reward function (All-Returns) against a variant using only $R_{\text{task}}$ (Answer-Only). Figure 5 shows the reward convergence curves of Qwen2.5-7B-Instruct during GRPO training on HotpotQA. The full reward function leads to significantly faster convergence and higher final performance compared to the task-only variant. As detailed in Table 4, the All-Returns strategy achieves higher LLM-as-a-Judge scores (0.544 vs. 0.509) while maintaining substantially better memory quality (0.533 vs. 0.479). Notably, despite using more tokens (2,117 vs. 2,078), the All-Returns strategy achieves better overall performance, indicating that the additional context and memory operations contribute meaningfully to reasoning quality. Similar patterns are observed on Qwen3-4B-Instruct (see Appendix D.2).

In this work, we propose Agentic Memory (AgeMem), a unified memory management framework that enables LLM-based agents to jointly control long-term and short-term memory through learnable, tool-based actions. By integrating memory operations directly into the agent’s policy and training them with a progressive reinforcement learning strategy, AgeMem replaces heuristic memory pipelines with an end-to-end optimized solution. Extensive experiments across diverse long-horizon benchmarks show that AgeMem improves both task performance and memory quality while maintaining efficient context usage. These results highlight the importance of unified, agent-centric memory policies and suggest a promising direction for building scalable and adaptive LLM agents capable of long-term reasoning.

While AgeMem demonstrates strong performance across multiple settings, there remain opportunities for further extension. The current implementation adopts a fixed set of memory management tools, which provides a clear and effective abstraction but could be extended to support more fine-grained control in future work. In addition, although we evaluate our approach on several representative long-horizon benchmarks, broader coverage of tasks and environments may further strengthen the empirical understanding of the framework.

This appendix provides full technical details omitted from the main text due to space constraints. We first present precise definitions and pseudo-formulations for each memory-management tool (Appendix A.1), then give implementable formulas for the reward components used in training (Appendix A.2). Finally, we provide the complete algorithmic specification (Appendix A.3).

AgeMem exposes a small set of structured tools that the agent may invoke as part of its action $a_t$. Each tool is implemented as a deterministic or stochastic function that transforms the short-term context $C_t$, the long-term memory store $\mathcal{M}_t$, or both. Unlike traditional memory systems that rely on external heuristics or predefined schedules, AgeMem integrates these tools directly into the agent's action space, enabling the model to learn when and how to use each tool through reinforcement learning. Below we give precise operational definitions, implementation details, and the system prompts that guide tool usage.

The long-term memory store at time $t$ is $\mathcal{M}_t = \{m_i\}_{i=1}^{|\mathcal{M}_t|}$, where each memory $m_i$ contains a content string and optional metadata. The short-term context is $C_t = [u_1, u_2, \ldots, u_{n_t}]$ (a message list), and $\text{enc}(\cdot)$ denotes a text encoder that returns a dense embedding. We use cosine similarity for semantic matching throughout the framework.

The Retrieve operation enables the agent to access relevant information from long-term memory based on semantic similarity. This operation is crucial for bringing stored knowledge into the active context when needed for reasoning. The retrieval operation returns the top-$k$ most similar memories to the query $q$:

$$\mathcal{R} = \operatorname*{top\text{-}k}_{m_i \in \mathcal{M}_t}\ \text{sim}\big(\text{enc}(q), \text{enc}(m_i)\big),$$

where the similarity function is defined as:

$$\text{sim}(u, v) = \frac{u^{\top} v}{\|u\| \, \|v\|}.$$

The retrieved memories are then inserted into the short-term context $C_t$, making them available for immediate reasoning. The parameter $k$ controls the number of memories retrieved, typically set to 3–5 in our experiments to balance relevance and context size.
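A minimal sketch of this retrieval rule, assuming memories are stored as (embedding, content) pairs:

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, memories, k=3):
    """Return the contents of the top-k memories most similar to the query."""
    ranked = sorted(memories, key=lambda m: cosine(query_emb, m[0]), reverse=True)
    return [content for _, content in ranked[:k]]
```

In practice the encoder and an approximate-nearest-neighbor index would replace the brute-force sort, but the ranking criterion is the same.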

The Add operation allows the agent to store new information in long-term memory for future use. This operation is essential for accumulating knowledge across interactions and sessions. A new memory entry is created by:

$$m_{\text{new}} = \big(c,\ \text{enc}(c),\ \text{metadata}\big),$$

where $c$ is the content to be stored, $\text{enc}(c)$ is its embedding vector, and the metadata includes a timestamp, source information, and optional tags. The memory store is then updated:

$$\mathcal{M}_{t+1} = \mathcal{M}_t \cup \{m_{\text{new}}\}.$$

The agent learns to identify salient information worth storing through the reward function, which encourages storing high-quality, reusable knowledge while penalizing redundant or irrelevant entries.

Memory maintenance operations enable the agent to keep its long-term memory store current and relevant. The Update operation modifies existing memories when new information supersedes or refines previous knowledge. For an existing memory $m_i$, the update operation is defined as:

$$m_i \leftarrow \big(c',\ \text{enc}(c'),\ \text{metadata}'\big),$$

where $c'$ is the updated content and $\text{metadata}'$ reflects the modification timestamp. The Delete operation removes obsolete or incorrect memories:

$$\mathcal{M}_{t+1} = \mathcal{M}_t \setminus \{m_i\}.$$

These operations are particularly important in long-horizon tasks where information may become outdated or where the agent needs to correct earlier mistakes. The reward function encourages meaningful updates and deletions that improve memory quality over time.
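The Add, Update, and Delete semantics above can be condensed into a small store class. This is a sketch: the id scheme and metadata fields are our own choices, not the paper's implementation.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    content: str
    embedding: list                      # enc(c); any text encoder works here
    meta: dict = field(default_factory=dict)

class LTMStore:
    """Minimal long-term memory store with Add / Update / Delete semantics."""
    def __init__(self):
        self.items = {}
        self._next_id = 0

    def add(self, content, embedding):
        mid = self._next_id
        self._next_id += 1
        self.items[mid] = Memory(content, embedding, {"ts": time.time()})
        return mid

    def update(self, mid, content, embedding):
        # Replace content and embedding, refreshing the modification timestamp.
        self.items[mid] = Memory(content, embedding, {"ts": time.time()})

    def delete(self, mid):
        self.items.pop(mid, None)
```

Exposing stable memory ids to the agent is what allows Update and Delete to target a specific earlier entry rather than re-adding near-duplicates.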

The Summary operation compresses conversation history in the short-term context to prevent context overflow while preserving essential information. This operation is critical for managing long conversations that exceed context window limits. Given a subset of context indices $s$, the summary operation is defined as:

$$C_{t+1} = \big[C_t \setminus \{u_j\}_{j \in s}\big] \oplus \big[\text{Summarize}(\{u_j\}_{j \in s})\big],$$

where $\text{Summarize}(\cdot)$ is implemented by the LLM itself with a summarization system prompt. The agent can specify which messages to summarize using the `span` parameter, which can be:

"all": Summarize all non-system messages.

The summarization process uses the following system prompt to ensure high-quality compression:

The agent learns to invoke summarization proactively before context overflow occurs, balancing information preservation with efficiency.

The Filter operation removes irrelevant or redundant messages from the short-term context based on semantic similarity, helping the agent maintain a focused context by suppressing noise and distractions. Specifically, it removes messages whose similarity to a given criteria $c$ exceeds a threshold $\theta$:

$$C_{t+1} = \{u_j \in C_t \mid \text{sim}(\text{enc}(u_j), \text{enc}(c)) \le \theta\}.$$

In all experiments, we set $\theta = 0.6$ by default. The criteria $c$ can be specified by the agent (e.g., a description of what to filter out) or can be automatically derived from the current task context. This operation is particularly useful in Stage 2 of training, where distractors are introduced to test the agent's ability to filter irrelevant information.
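Following the removal rule above, Filter can be sketched over a context of (embedding, message) pairs:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def filter_context(context, criteria_emb, theta=0.6):
    """Drop messages whose similarity to the criteria embedding exceeds
    theta (0.6 is the stated default); keep the rest in original order."""
    return [msg for emb, msg in context if cosine(emb, criteria_emb) <= theta]
```

The operation is order-preserving, so surviving messages keep their conversational positions after distractors are dropped.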

Each tool is exposed via a schema specifying its function name and required arguments. The agent’s policy outputs either language tokens (for text generation) or structured tool calls (for memory operations). The agent is guided by a system prompt that defines the tool-calling interface and response format. The system prompt used in AgeMem is as follows:

This prompt structure ensures that the agent follows a consistent format for reasoning, tool invocation, and final answers, which is essential for reliable parsing and reward computation during RL training. The structured format also enables the agent to coordinate multiple memory operations within a single reasoning step, supporting efficient unified memory management.

Figure 6 and 7 present our tool schemas for short-term memory and long-term memory management, showing the exact function signatures and argument types that the agent can invoke.

This section provides implementable formulas for the reward components described in the main text. All component scores are normalized to $[0, 1]$ (unless noted) to enable stable weighting.

Overview. The overall trajectory-level reward is defined as:

$$R(\tau) = \mathbf{w}^{\top} \mathbf{R} + P_{\text{penalty}},$$

where $\mathbf{w} = [w_{\text{task}}, w_{\text{context}}, w_{\text{memory}}]^{\top}$ are tunable weights, $\mathbf{R} = [R_{\text{task}}, R_{\text{context}}, R_{\text{memory}}]^{\top}$ denote the task completion, context management, and memory management rewards respectively, and $P_{\text{penalty}}$ penalizes undesired behaviors.

Task completion reward $R_{\text{task}}$. Let the agent produce a final answer $A_{\text{pred}}$. We obtain a judge score $S_{\text{judge}}(A_{\text{pred}}, A_q) \in [0, 1]$ via an evaluator (LLM judge), where $A_q$ denotes the expected ground truth. Then the task reward $R_{\text{task}}$ is:

$$R_{\text{task}} = \begin{cases} S_{\text{judge}}(A_{\text{pred}}, A_q), & \text{if a final answer is produced}, \\ P_{\text{no\_answer}}, & \text{otherwise}, \end{cases}$$

with $P_{\text{no\_answer}} = -1.0$ by default.

Context management reward $R_{\text{context}}$. We decompose the overall context management reward into three normalized components that jointly evaluate how effectively the model maintains a compact yet information-preserving context state. Formally, we define:

$$R_{\text{context}} = \sum_{i} \alpha_i R_i,$$

where $R_i \in \{R_{\text{compression}}, R_{\text{preventive}}, R_{\text{preservation}}\}$, $\sum_i \alpha_i = 1$, and we use uniform weights $\alpha_i = 1/3$ unless otherwise specified. For compression efficiency, we evaluate the compactness of the final context $C_t$ by computing

$$R_{\text{compression}} = 1 - \frac{T_{\text{used}}}{T_{\max}},$$

where $T_{\text{used}}$ denotes the number of tokens present in the context when the final answer is generated, and $T_{\max}$ is the allowed budget. For preventive management, we define $R_{\text{preventive}}$ to assess proactive behavior:

$$R_{\text{preventive}} = \mathbb{1}\big[\text{a context-reduction tool is invoked before } T_{\text{used}} \text{ reaches } T_{\max}\big],$$

which equals 1 when the model invokes a context-reduction tool before reaching the token limit, and 0 otherwise. For information preservation, we identify a set of key tokens or phrases $K_q$ extracted from the user query $q$, such as named entities or temporal and spatial expressions. Let $\mathbb{1}_{\text{preserve}}$ indicate whether these items remain present (either directly or via a retained summary) at the time of answer generation. The preservation reward is therefore

$$R_{\text{preservation}} = \mathbb{1}_{\text{preserve}}.$$
Memory management reward RmemoryR_{\text{memory}}. The memory management reward consists of three key components that evaluate retrieval quality, storage quality, maintenance operations, and semantic relevance. We define it as:

where Rj∈{Rstorage,Rmaintenance,Rrelevance}R_{j}\in{R_{\text{storage}},R_{\text{maintenance}},R_{\text{relevance}}}, ∑jβj=1\sum_{j}\beta_{j}=1, and we use uniform weights βj=1/3\beta_{j}=1/3 unless otherwise specified. For Storage Quality, during the memory storage process in Stage 1, the agent may add NtotalN_{\text{total}} memory entries, among which Nhigh_qualityN_{\text{high_quality}} are identified as high-quality based on an LLM’s analysis of the input query qq and its expected answer AqA_{q}. The storage quality reward is defined as the proportion of high-quality memories:

This metric incentivizes the agent to store valuable information while avoiding the accumulation of redundant or low-quality memories. For Maintenance, to encourage the agent to actively maintain the memory bank, we reward update or delete operations:

This mechanism promotes dynamic memory management and timely cleanup. For Semantic Relevance, to quantify the semantic match between retrieved memories and the query, we introduce an LLM-based relevance assessment. Let $S_{\text{LLM}}(\mathcal{R}, q)$ be the semantic relevance score of the retrieved memory set $\mathcal{R}$ with respect to query $q$, normalized to the interval $[0, 1]$. The semantic relevance reward is defined as:

$$R_{\text{relevance}} = S_{\text{LLM}}(\mathcal{R}, q)$$
This component ensures that retrieved memories are semantically aligned with the current task, enhancing overall reasoning quality. Penalty terms $P_{\text{penalty}}$. We penalize major constraint violations to ensure the agent operates within specified limits:

$$P_{\text{penalty}} = \sum_{k} P_{k} \cdot \text{violation}_{k}$$
where $P_{k} \in \{P_{\text{rounds}}, P_{\text{overflow}}\}$ and $\text{violation}_{k} \in \{\mathbbm{1}[N_{\text{rounds}} > N_{\max}], \mathbbm{1}[T_{\text{used}} > T_{\max}]\}$. Here, $N_{\text{rounds}}$ denotes the number of interaction rounds, $N_{\max}$ is the maximum allowed rounds, $T_{\text{used}}$ represents the total token usage, and $T_{\max}$ is the token budget limit. The penalty coefficients are set to $P_{\text{rounds}} = -1$ and $P_{\text{overflow}} = -0.5$ by default.
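Concretely, the penalty term reduces to a coefficient-weighted sum of the two violation indicators; a minimal sketch (the function name and signature are illustrative, not from the paper):

```python
def penalty_term(n_rounds, tokens_used, n_max, t_max,
                 p_rounds=-1.0, p_overflow=-0.5):
    """Coefficient-weighted sum of constraint-violation indicators."""
    violations = [
        (p_rounds, n_rounds > n_max),       # too many interaction rounds
        (p_overflow, tokens_used > t_max),  # token budget exceeded
    ]
    return sum(p * float(v) for p, v in violations)
```

With the default coefficients, violating both constraints yields a penalty of -1.5, while staying within limits contributes nothing.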

This section provides the complete algorithmic specification of AgeMem, our unified memory management framework for LLM-based agents. The training procedure integrates three progressive stages (long-term memory construction, short-term context management under distractors, and integrated task execution) into a single end-to-end reinforcement learning loop. We present the main training algorithm using a two-column layout for compactness (Algorithms 1–2), followed by detailed rollout procedures for each stage (Algorithms 3–5).

The core training loop follows a generate-then-optimize paradigm. For each task $q$ in a training batch $\mathcal{B}$, we generate $K$ independent rollout trajectories $\{\tau_{k}^{(q)}\}_{k=1}^{K}$ using the current policy $\pi_{\theta}$. Each trajectory $\tau_{k}^{(q)} = (\tau_{k}^{(1)}, \tau_{k}^{(2)}, \tau_{k}^{(3)})$ concatenates experiences from all three stages, forming a complete episode from initial memory construction to final task completion. The agent first builds long-term memory from contextual information $I_{q}$ (Algorithm 3), then learns to filter out distracting information while maintaining useful context (Algorithm 4), and finally retrieves stored knowledge to complete the target task (Algorithm 5). All experiences are collected into a unified buffer $\mathcal{E}$ spanning multiple tasks and rollouts.

After the rollout phase, we apply group-based advantage normalization to enable fair comparison across tasks with different reward scales. For each task group $G_{q}$, terminal rewards $\{r_{T}^{(k,q)}\}_{k=1}^{K}$ are normalized to zero mean and unit variance, yielding advantages $A_{T}^{(k,q)}$ that reflect relative performance within the group. These terminal advantages are then broadcast uniformly to all timesteps within the same trajectory, establishing a consistent learning signal that connects early-stage memory decisions to final task outcomes. This step-wise GRPO mechanism enables long-range credit assignment across heterogeneous operations. The policy is then updated via gradient ascent on the expected advantage, regularized by a KL divergence term to maintain proximity to a reference policy $\pi_{\text{ref}}$ for training stability.
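The group normalization and uniform broadcasting described above can be sketched as follows (an illustrative reimplementation, not the paper's code; the `eps` guard against zero variance is an assumption):

```python
import numpy as np

def stepwise_grpo_advantages(terminal_rewards, traj_lengths, eps=1e-8):
    """Normalize terminal rewards within one task group and broadcast each
    trajectory's advantage to all of its timesteps.

    terminal_rewards: length-K sequence, one terminal reward per rollout
    traj_lengths: length-K sequence of timestep counts per trajectory
    """
    r = np.asarray(terminal_rewards, dtype=float)
    # Zero mean, unit variance within the group of K rollouts
    adv = (r - r.mean()) / (r.std() + eps)
    # Broadcast the trajectory-level advantage uniformly over its timesteps
    return [np.full(T, a) for a, T in zip(adv, traj_lengths)]
```

Because every timestep in a trajectory shares the same advantage, early Stage-1 memory operations receive the same credit as the final answer step, which is what links memory construction to task outcomes.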

The three-stage rollout design reflects the natural progression of memory-augmented task solving. Algorithm 3 implements the first stage, where the agent engages in casual conversation while being gradually exposed to the contextual information $I_{q}$. During these $T_{1}$ exploratory turns, the agent must identify salient information and determine when and which long-term memory tools to invoke (Add, Update, or Delete) to construct an initial memory store $\mathcal{M}$. To support informed memory decisions, the agent proactively performs memory retrieval at every step. This retrieval is not task-driven but serves as an introspective operation: it enables the agent to maintain awareness of the current LTM contents, facilitating decisions about updating or discarding stale entries and ensuring that newly stored information remains coherent with existing knowledge. Since the task query has not yet been revealed in Stage 1, the agent must rely on general cues about which information may become useful later. This encourages the formation of reusable, well-structured memory traces rather than query-specific shortcuts, laying the foundation for effective long-horizon memory management in later stages.

Algorithm 4 describes the second stage, which deliberately stresses the agent's context management capabilities. The short-term context $C$ is reset to prevent information leakage that would interfere with learning STM management, while the constructed long-term memory $\mathcal{M}$ persists from Stage 1. Over $T_{2}$ turns, the agent receives semantically related but ultimately irrelevant distractor messages that could mislead downstream reasoning if left unmanaged. The agent must learn to proactively invoke Filter to remove low-relevance content based on semantic similarity thresholds, or Summary to compress accumulated context when token budgets become constrained. This stage trains robust filtering strategies that generalize beyond simple heuristics, as the agent receives learning signals from the eventual task performance in Stage 3.

Algorithm 5 presents the final integrated execution stage. Upon receiving the target query $q$, the agent must coordinate retrieval from long-term memory $\mathcal{M}$, context management operations on $C$, and multi-step reasoning to produce a final answer $A_{\text{pred}}$. The agent may invoke Retrieve to fetch relevant stored facts, Summary to maintain a tractable context window, and ultimately generate a structured response. Once the answer is produced or the maximum steps are reached, a composite reward function (Section A.2) evaluates the three-stage trajectory across multiple dimensions. This terminal reward $R(\tau)$ is assigned to the final timestep and serves as the supervision signal that propagates back through all three stages during advantage computation.
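The integrated execution stage can be pictured as a simple tool-dispatch loop. The sketch below is hypothetical: the `policy`, `ltm`, and `context` objects and their methods are assumed interfaces for illustration, not the paper's implementation.

```python
def stage3_rollout(policy, query, ltm, context, max_steps=20):
    """Illustrative Stage-3 loop: at each step the policy either invokes
    a memory tool or emits a final answer A_pred."""
    tools = {
        # Fetch relevant stored facts from LTM into the working context
        "Retrieve": lambda arg: context.append(ltm.retrieve(arg)),
        # Compress the context to keep the window tractable (stubbed)
        "Summary": lambda arg: context.append("summary"),
    }
    for _ in range(max_steps):
        action, arg = policy.act(query, context)
        if action == "Answer":
            return arg  # final structured response
        tools[action](arg)
    return None  # step budget exhausted without an answer
```

In the trained system the analogous loop terminates either with an answer, which triggers the composite reward evaluation, or at the step limit, which incurs the rounds penalty.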

This section presents three representative case studies demonstrating how AgeMem enables effective unified memory management through reinforcement learning. Each case compares agent behavior before and after RL training to highlight the learned memory strategies. We use a personal learning assistant scenario where the agent helps users plan customized study programs based on their preferences and constraints.

This case illustrates how AgeMem learns to selectively construct, update, and maintain long-term memory across extended conversations. The agent must identify salient user information from casual dialogue and manage memory entries as new information supersedes old preferences.

Prior to training, the baseline agent lacks strategic memory management. It either stores all information indiscriminately or fails to recognize when stored knowledge becomes obsolete.

User: Hello! I’m a visual learner who prefers 60-minute study sessions. I have Python basics but zero ML experience. I’m particularly interested in computer vision applications like face recognition.

Assistant (Baseline):

[Later in conversation…]

User: Actually, I realize 60 minutes is too short. I work better with 120-minute deep focus blocks.

Analysis: The baseline fails to store initial preferences and cannot recognize when information needs updating.

User: I’ve been using 120-minute sessions consistently for a while now, and they’re perfect for my learning style. I’m completely settled on this duration - no more experimenting with shorter sessions.

Analysis: The trained agent strategically uses Add_memory to store initial preferences, Update_memory to modify existing information, and Delete_memory followed by Add_memory to clean up memory when historical references become obsolete, maintaining clean and current memory state.

This case demonstrates how AgeMem learns to proactively manage short-term context when faced with irrelevant information that could interfere with task focus. The agent must recognize distractors and apply appropriate filtering or summarization strategies.

The baseline agent passively accumulates all conversation content in context, leading to dilution of task-relevant information and eventual context overflow.

User: I need a focused 3-day ML crash course for face recognition. By the way, I’m also exploring quantum computing, blockchain, robotics, and learning to bake sourdough bread and do latte art.

Analysis: The baseline retains all information in context, treating distractors equally with task-relevant content. As conversation continues, the context becomes bloated with irrelevant details about quantum computing, bread-making, etc., consuming token budget without contributing to the ML planning task.

After training with Stage 2 rollouts, AgeMem learns to recognize and filter out distractors while preserving task focus. When context grows large (simulated here after several exchanges), the agent proactively applies context management tools.

[After several more exchanges, context has accumulated detailed daily schedules, tool lists, and resource links.]

User: Can you now give me the final complete plan with all details integrated?

Analysis: The trained agent strategically uses Filter_context to remove distractors early, maintaining task focus, and later applies Summary_context when context grows large, preventing overflow while preserving essential information. The baseline would have retained all content verbatim, leading to context dilution or overflow.

Analysis: The baseline produces a generic schedule that ignores the user’s stated preference for 120-minute deep focus blocks and visual learning style.

After completing AgeMem training across all three stages, the agent demonstrates integrated memory coordination: retrieving relevant user preferences from LTM, managing context efficiently, and generating personalized responses.

Analysis: The trained agent uses Retrieve_memory to access stored user preferences from LTM, then synthesizes this information with the current task to generate a highly personalized response that respects the 120-minute session duration and emphasizes visual learning resources. The integration of retrieved memory with task execution produces superior, context-aware outputs compared to the baseline’s generic approach.

These three cases demonstrate how AgeMem’s three-stage progressive training enables agents to develop sophisticated memory management strategies. Case 1 shows selective storage and maintenance of long-term knowledge through Add_memory, Update_memory, and Delete_memory. Case 2 illustrates proactive short-term context control under distraction via Filter_context and Summary_context. Case 3 demonstrates the integration of these capabilities, where Retrieve_memory enables the agent to access stored knowledge and coordinate memory systems to solve tasks effectively. In each case, the RL-trained agent significantly outperforms the baseline by learning when and how to apply memory tools, resulting in more focused, consistent, and personalized interactions.

We provide detailed statistics and characteristics of the five datasets used in our experiments.

ALFWorld (Shridhar et al., 2020) is an embodied AI benchmark in which agents must complete household tasks by following natural language instructions in a simulated environment. The dataset consists of several thousand training environments and multiple validation and test splits, covering six task types: pick and place, examine in light, clean and place, heat and place, cool and place, and pick two and place. These tasks require long-horizon interaction with objects, making ALFWorld well suited for evaluating planning and memory management capabilities.

SciWorld (Wang et al., 2022) is an interactive science experiment simulation environment where agents must perform multi-step experiments to answer scientific questions. The benchmark includes a diverse set of tasks spanning multiple scientific domains, such as physics, chemistry, and biology, and emphasizes procedural reasoning and hypothesis-driven exploration. Its complexity makes it suitable for testing an agent's ability to retain and retrieve relevant knowledge over extended interaction sequences.

PDDL (Chang et al., 2024) refers to a set of planning benchmarks formulated using the Planning Domain Definition Language. These benchmarks evaluate an agent's ability to solve symbolic planning problems across multiple domains by generating valid sequences of actions that achieve specified goal states. The tasks primarily test structured reasoning and the ability to maintain and utilize intermediate planning states.

BabyAI (Chevalier-Boisvert et al., 2018) is a grid-world navigation benchmark with natural language instructions. The environment contains a large collection of instruction-following tasks (levels), where agents must navigate and interact with objects to satisfy compositional language commands. Due to its sequential decision-making structure, BabyAI is commonly used to evaluate short-term context tracking and instruction grounding.

HotpotQA (Yang et al., 2018) is a multi-hop question answering dataset that requires reasoning over multiple Wikipedia paragraphs. It contains approximately 90k training questions along with validation and test splits, and each question is annotated with supporting facts. This structure makes HotpotQA particularly suitable for evaluating long-term memory storage and retrieval. In our experiments, we use HotpotQA for reinforcement learning training, as its annotated supporting facts naturally provide structured contextual information for Stage 1 supervision.

For the Memory Quality (MQ) metric, we employ an LLM-based evaluator to assess the quality of supporting facts stored in memory by comparing predicted supporting facts with ground-truth expected facts. The evaluator uses the following prompt template:

The evaluator compares the stored memory entries (predicted supporting facts) with the ground-truth supporting facts provided in the HotpotQA dataset. The score reflects both the coverage of expected facts and the relevance of predicted facts to the question. We use Qwen-Max as the evaluator model, and each evaluation is performed independently to ensure consistency.

For the LLM-as-a-Judge metric on HotpotQA, we use a similar approach, where Qwen-Max evaluates the correctness of the agent’s answer by comparing it with the ground-truth answer. The evaluator uses the following prompt template:

All baseline implementations follow their respective official open-source codebases to ensure fair comparison. We provide the source links and implementation details below.

LangMem (LangChain Team, 2025): We use the official implementation available at https://langchain-ai.github.io/langmem/ with default hyperparameters. LangMem employs a modular memory framework that supports multiple memory types. We configure it to use the default memory storage and retrieval mechanisms as specified in the official documentation.

A-Mem (Xu et al., 2025): We implement A-Mem following the Zettelkasten-inspired design described in the original paper, using the official codebase at https://github.com/WujiangXu/A-mem-sys/. The system links structured knowledge units to facilitate consolidation. We use the recommended hyperparameters for memory consolidation as provided in the repository.

Mem0 (Chhikara et al., 2025): We use the official Mem0 implementation available at https://github.com/mem0ai/mem0 with the default extract-update pipeline. For the graph-based variant (Mem0g), we enable the graph structure option and use the recommended graph construction parameters as specified in the official implementation.

AgeMem-noRL: This variant uses the same tool interface as AgeMem but without reinforcement learning. This baseline helps isolate the contribution of RL training to the overall performance.

RAG variants: For the RAG-based baselines (AgeMem-noRL-RAG and AgeMem-RAG), we replace the STM tools with a standard RAG pipeline that retrieves relevant memories at each step and appends them to the context. The retrieval is performed using cosine similarity between the current context and stored memories, following standard RAG practices. This comparison demonstrates the advantage of learned STM management over static retrieval-based approaches.
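The static retrieval used in the RAG variants amounts to cosine-similarity top-k selection over embedded memories; a minimal sketch (the embedding step is assumed to happen elsewhere, and `k` is an illustrative parameter):

```python
import numpy as np

def retrieve_top_k(query_vec, memory_vecs, k=3):
    """Return indices of the k stored memories most similar to the query,
    ranked by cosine similarity."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    M = memory_vecs / (np.linalg.norm(memory_vecs, axis=1, keepdims=True) + 1e-8)
    sims = M @ q                       # cosine similarity per stored memory
    return np.argsort(-sims)[:k].tolist()
```

Unlike the learned STM tools, this pipeline fires unconditionally at every step, which is exactly the behavior the comparison is designed to probe.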

Training configuration. We use the Trinity RL framework (Pan et al., 2025a) for policy optimization, implementing the step-wise GRPO algorithm as described in the method section. We use $K = 8$ independent rollouts per task for group normalization. The KL divergence coefficient $\beta$ is set to 0.1.

Reward weights. All reward weights are set to $1/3$: $w_{\text{task}} = w_{\text{context}} = w_{\text{memory}} = 1/3$. This uniform weighting ensures that all components contribute equally to the learning signal, allowing the agent to naturally balance task performance and memory management.

Model settings. The maximum context length is set to 8,192 tokens, and the maximum response length is set to 2,048 tokens. When the context exceeds this limit, the agent receives a penalty, encouraging proactive use of STM management tools. All experiments are conducted on 8 NVIDIA RTX 4090 GPUs with 48GB memory each.

This section provides complementary ablation study results for Qwen3-4B-Instruct. Figure 9 shows the progressive contribution of LTM, STM, and RL components on Qwen3-4B-Instruct across three representative datasets. The results demonstrate consistent trends with Qwen2.5-7B-Instruct, validating the generalizability of our approach across different model sizes.

To validate the generalizability of our multi-component reward design across different model architectures and scales, we conduct the same reward function ablation study as in the main text on Qwen3-4B-Instruct. This section provides a complete analysis parallel to the Qwen2.5-7B-Instruct results presented in the main paper.

Figure 10 demonstrates the reward convergence patterns on Qwen3-4B-Instruct. Similar to Qwen2.5-7B-Instruct, the All-Returns strategy consistently outperforms Answer-Only throughout the training process. Several notable observations emerge:

More Stable Dynamics: The convergence curve shows noticeably smoother progression with lower variance, particularly in the later training stages (steps 70-100). This stability suggests that Qwen3’s architecture may have better inductive biases for the reward learning task.

Consistent Superiority: While the absolute improvement is smaller than Qwen2.5-7B-Instruct, the All-Returns strategy maintains its advantage throughout training, validating the robustness of our reward design.

Table: S3.T1: Memory management tools in AgeMem for manipulating long-term memory (LTM) and short-term memory (STM).

Tool | Target | Function
Add | LTM | Add new knowledge to $\mathcal{M}_{t}$
Update | LTM | Modify entries in $\mathcal{M}_{t}$
Delete | LTM | Remove entries from $\mathcal{M}_{t}$
Retrieve | STM | Retrieve entries from $\mathcal{M}_{t}$ to $C_{t}$
Summary | STM | Summarize segments in $C_{t}$
Filter | STM | Filter out irrelevant segments from $C_{t}$
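The six tools map naturally onto a small interface over the two stores; a hypothetical sketch (the class and method names are illustrative, not AgeMem's actual API):

```python
class AgentMemory:
    """Illustrative wrapper over LTM entries (a dict) and STM segments (a list)."""
    def __init__(self):
        self.ltm = {}   # long-term memory M_t: entry id -> text
        self.stm = []   # short-term context C_t: list of segments

    # --- LTM tools ---
    def add(self, key, text):    self.ltm[key] = text
    def update(self, key, text): self.ltm[key] = text
    def delete(self, key):       self.ltm.pop(key, None)

    # --- STM tools ---
    def retrieve(self, keys):
        # Copy matching LTM entries into the working context
        self.stm.extend(self.ltm[k] for k in keys if k in self.ltm)
    def summary(self, summarize):
        # Collapse the context into a single summary segment
        self.stm = [summarize(self.stm)]
    def filter(self, keep):
        # Drop segments judged irrelevant by the predicate
        self.stm = [s for s in self.stm if keep(s)]
```

Because each operation is exposed as a tool call, the policy can decide per step which store to touch, which is what makes joint LTM/STM management learnable end to end.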

Table: S4.T2: Performance comparison across five benchmarks. The best and second-best results are marked.

LLM Backbone | Method | ALFWorld | SciWorld | PDDL | BabyAI | HotpotQA | Average
Qwen2.5-7B-Instruct | No-Memory | 27.16 | 13.80 | 10.15 | 50.80 | 38.36 | 28.05
Qwen2.5-7B-Instruct | LangMem | 38.27 | 28.29 | 15.85 | 51.34 | 37.43 | 34.23
Qwen2.5-7B-Instruct | A-Mem | 34.68 | 28.06 | 18.39 | 58.82 | 43.95 | 36.78
Qwen2.5-7B-Instruct | Mem0 | 37.49 | 26.99 | 13.96 | 60.58 | 46.66 | 37.14
Qwen2.5-7B-Instruct | Mem0g | 35.34 | 30.50 | 14.86 | 58.78 | 42.06 | 36.31
Qwen2.5-7B-Instruct | AgeMem-noRL | 37.90 | 28.67 | 8.87 | 46.34 | 45.36 | 33.43
Qwen2.5-7B-Instruct | AgeMem (Ours) | 41.07 | 35.55 | 17.31 | 61.42 | 54.44 | 41.96
Qwen3-4B-Instruct | No-Memory | 38.51 | 47.89 | 30.14 | 55.83 | 47.48 | 43.97
Qwen3-4B-Instruct | LangMem | 40.89 | 50.42 | 28.42 | 53.80 | 42.70 | 43.25
Qwen3-4B-Instruct | A-Mem | 34.31 | 50.14 | 34.41 | 61.35 | 48.48 | 45.74
Qwen3-4B-Instruct | Mem0 | 41.17 | 51.38 | 31.72 | 60.05 | 39.16 | 44.70
Qwen3-4B-Instruct | Mem0g | 36.69 | 47.76 | 29.61 | 57.59 | 38.12 | 41.95
Qwen3-4B-Instruct | AgeMem-noRL | 38.02 | 50.42 | 27.52 | 57.48 | 54.49 | 45.59
Qwen3-4B-Instruct | AgeMem (Ours) | 48.97 | 59.48 | 35.07 | 72.56 | 55.49 | 54.31

Table: S4.T3: Tool usage statistics on HotpotQA. Numbers show average calls per episode.

Tool Category | Qwen2.5-7B noRL | Qwen2.5-7B GRPO | Qwen3-4B noRL | Qwen3-4B GRPO
LTM Tool Statistics
Add Memory | 0.92 | 1.64 | 2.49 | 2.64
Update Memory | 0.00 | 0.13 | 0.13 | 0.34
Delete Memory | 0.00 | 0.08 | 0.00 | 0.22
STM Tool Statistics
Retrieve Memory | 2.31 | 1.95 | 4.62 | 4.35
Summary Context | 1.08 | 0.82 | 0.11 | 0.96
Filter Context | 0.02 | 0.31 | 0.15 | 0.16
Total Calls | 4.33 | 4.92 | 7.50 | 8.67

Table: S4.T4: Reward function ablation on HotpotQA using Qwen2.5-7B-Instruct. All-Returns vs. Answer-Only reward strategies. "TN" is the token number, and "TC" denotes the number of tool calls.

Strategy | J (↑) | TN (↓) | MQ (↑) | TC (-)
Answer-Only | 0.509 | 2078 | 0.479 | 3.93
All-Returns | 0.544 | 2117 | 0.533 | 4.92

Figure: Comparison between independent and unified memory management frameworks. (Left) Traditional framework with static STM and trigger-based LTM. (Middle) Independent framework with an additional Memory Manager controlling LTM in an agent-based manner, while STM remains static. (Right) The proposed AgeMem framework, where LTM and STM are jointly and intelligently managed via explicit tool-based operations.

Figure: Memory Quality scores for different methods on HotpotQA. Higher scores indicate better relevance between stored memories and ground-truth facts.

Figure: Ablation study on LTM, STM, and RL components (Qwen2.5-7B-Instruct). Base: No-Memory baseline; +LT: AgeMem-noRL-RAG (LTM tools only); +LT/RL: AgeMem-RAG (RL with LTM tools); +LT/ST/RL: AgeMem (full AgeMem system with RL). Green arrows indicate performance gains over the baseline.

Figure: Training convergence curves on Qwen2.5-7B-Instruct comparing All-Returns (solid line) vs. Answer-Only (dashed line) reward strategies.

Figure: Ablation study results for Qwen3-4B-Instruct. Base: No-Memory baseline; +LT: AgeMem-noRL-RAG (LTM tools only); +LT/RL: AgeMem-RAG (RL with LTM tools); +LT/ST/RL: AgeMem (full AgeMem system with RL). Green arrows indicate performance gains over the baseline.


Table: Reward function ablation on HotpotQA using Qwen3-4B-Instruct. All-Returns vs. Answer-Only reward strategies.

Strategy | J (↑) | TN (↓) | MQ (↑) | TC (-)
Answer-Only | 0.546 | 2164 | 0.415 | 7.21
All-Returns | 0.555 | 2191 | 0.605 | 8.67


References


[chang2024agentboard] Chang, Ma, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2024). Agentboard: An analytical evaluation board of multi-turn llm agents. Advances in neural information processing systems.

[xiong2025memory] Xiong, Zidi, Lin, Yuping, Xie, Wenya, He, Pengfei, Liu, Zirui, Tang, Jiliang, Lakkaraju, Himabindu, Xiang, Zhen. (2025). How memory management impacts llm agents: An empirical study of experience-following behavior. arXiv preprint arXiv:2505.16067.

[goodyear2025effect] Goodyear, Lyle, Guo, Rachel, Johari, Ramesh. (2025). The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games. arXiv preprint arXiv:2506.15624.

[liskavets2025prompt] Liskavets, Barys, Ushakov, Maxim, Roy, Shuvendu, Klibanov, Mark, Etemad, Ali, Luke, Shane K. (2025). Prompt compression with context-aware sentence encoding for fast and improved llm inference. Proceedings of the AAAI Conference on Artificial Intelligence.

[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[jiang2024long] Jiang, Xun, Li, Feng, Zhao, Han, Qiu, Jiahao, Wang, Jiaying, Shao, Jun, Xu, Shihao, Zhang, Shu, Chen, Weiling, Tang, Xavier, others. (2024). Long term memory: The foundation of ai self-evolution. arXiv preprint arXiv:2410.15665.

[wu2025human] Wu, Yaxiong, Liang, Sheng, Zhang, Chen, Wang, Yichao, Zhang, Yongyue, Guo, Huifeng, Tang, Ruiming, Liu, Yong. (2025). From human memory to ai memory: A survey on memory mechanisms in the era of llms. arXiv preprint arXiv:2504.15965.

[gao2025efficient] Gao, Pengyu, Zhao, Jinming, Chen, Xinyue, Yilin, Long. (2025). An Efficient Context-Dependent Memory Framework for LLM-Centric Agents. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track).

[chang2025main] Chang, Chia-Yuan, Jiang, Zhimeng, Rakesh, Vineeth, Pan, Menghai, Yeh, Chin-Chia Michael, Wang, Guanchu, Hu, Mingzhi, Xu, Zhichao, Zheng, Yan, Das, Mahashweta, others. (2025). Main-rag: Multi-agent filtering retrieval-augmented generation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[wu2025resum] Wu, Xixi, Li, Kuan, Zhao, Yida, Zhang, Liwen, Ou, Litu, Yin, Huifeng, Zhang, Zhongwang, Yu, Xinmiao, Zhang, Dingchu, Jiang, Yong, others. (2025). ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization. arXiv preprint arXiv:2509.13313.

[ma2025should] Ma, Qianou, Peng, Weirui, Yang, Chenyang, Shen, Hua, Koedinger, Ken, Wu, Tongshuang. (2025). What should we engineer in prompts? training humans in requirement-driven llm use. ACM Transactions on Computer-Human Interaction.

[dong2025survey] Dong, Yihong, Jiang, Xue, Qian, Jiaru, Wang, Tian, Zhang, Kechi, Jin, Zhi, Li, Ge. (2025). A survey on code generation with llm-based agents. arXiv preprint arXiv:2508.00083.

[kang2025memory] Kang, Jiazheng, Ji, Mingming, Zhao, Zhe, Bai, Ting. (2025). Memory OS of AI Agent. arXiv preprint arXiv:2506.06326.

[wang2025mirix] Wang, Yu, Chen, Xi. (2025). Mirix: Multi-agent memory system for llm-based agents. arXiv preprint arXiv:2507.07957.

[wang2025inducing] Wang, Zora Zhiruo, Gandhi, Apurva, Neubig, Graham, Fried, Daniel. (2025). Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821.

[chhikara2025mem0] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[yan2025memory] Yan, Sikuan, Yang, Xiufeng, Huang, Zuchao, Nie, Ercong, Ding, Zifeng, Li, Zonggen, Ma, Xiaowen, Kersting, Kristian, Pan, Jeff Z, et al. (2025). Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828.

[hu2025evaluating] Hu, Yuanzhe, Wang, Yu, McAuley, Julian. (2025). Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257.

[xu2025mem] Xu, Wujiang, Liang, Zujie, Mei, Kai, Gao, Hang, Tan, Juntao, Zhang, Yongfeng. (2025). A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.

[pan2025memory] Pan, Zhuoshi, Wu, Qianhui, Jiang, Huiqiang, Luo, Xufang, Cheng, Hao, Li, Dongsheng, Yang, Yuqing, Lin, Chin-Yew, Zhao, H Vicky, Qiu, Lili, others. (2025). On memory construction and retrieval for personalized conversational agents. arXiv preprint arXiv:2502.05589.

[zhang2025survey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2025). A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems.

[sun2024llm] Sun, Chuanneng, Huang, Songjun, Pompili, Dario. (2024). Llm-based multi-agent reinforcement learning: Current and future directions. arXiv preprint arXiv:2405.11106.

[ma2024coevolving] Ma, Hao, Hu, Tianyi, Pu, Zhiqiang, Boyin, Liu, Ai, Xiaolin, Liang, Yanyan, Chen, Min. (2024). Coevolving with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems.

[shao2024deepseekmath] Shao, Zhihong, Wang, Peiyi, Zhu, Qihao, Xu, Runxin, Song, Junxiao, Bi, Xiao, Zhang, Haowei, Zhang, Mingchuan, Li, YK, Wu, Yang, others. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[langmem2025] LangChain Team. (2025). LangMem SDK for Agent Long-Term Memory.

[rasmussen2025zep] Rasmussen, Preston, Paliychuk, Pavlo, Beauvais, Travis, Ryan, Jack, Chalef, Daniel. (2025). Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.

[wang2025karma] Wang, Zixuan, Yu, Bo, Zhao, Junzhe, Sun, Wenhao, Hou, Sai, Liang, Shuai, Hu, Xing, Han, Yinhe, Gan, Yiming. (2025). Karma: Augmenting embodied ai agents with long-and-short term memory systems. 2025 IEEE International Conference on Robotics and Automation (ICRA).

[li2025hello] Li, Hao, Yang, Chenghao, Zhang, An, Deng, Yang, Wang, Xiang, Chua, Tat-Seng. (2025). Hello again! llm-powered personalized agent for long-term dialogue. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[wang2024agent] Wang, Zora Zhiruo, Mao, Jiayuan, Fried, Daniel, Neubig, Graham. (2024). Agent workflow memory. arXiv preprint arXiv:2409.07429.

[jin2024llm] Jin, Hongye, Han, Xiaotian, Yang, Jingfeng, Jiang, Zhimeng, Liu, Zirui, Chang, Chia-Yuan, Chen, Huiyuan, Hu, Xia. (2024). Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325.

[salama2025meminsight] Salama, Rana, Cai, Jason, Yuan, Michelle, Currey, Anna, Sunkara, Monica, Zhang, Yi, Benajiba, Yassine. (2025). Meminsight: Autonomous memory augmentation for llm agents. arXiv preprint arXiv:2503.21760.

[kagaya2024rap] Kagaya, Tomoyuki, Yuan, Thong Jing, Lou, Yuxuan, Karlekar, Jayashree, Pranata, Sugiri, Kinose, Akira, Oguri, Koki, Wick, Felix, You, Yang. (2024). Rap: Retrieval-augmented planning with contextual memory for multimodal llm agents. arXiv preprint arXiv:2402.03610.

[yao2022react] Yao, Shunyu, Zhao, Jeffrey, Yu, Dian, Du, Nan, Shafran, Izhak, Narasimhan, Karthik R, Cao, Yuan. (2022). React: Synergizing reasoning and acting in language models. The eleventh international conference on learning representations.

[jin2025search] Jin, Bowen, Zeng, Hansi, Yue, Zhenrui, Yoon, Jinsung, Arik, Sercan, Wang, Dong, Zamani, Hamed, Han, Jiawei. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

[qian2025toolrl] Qian, Cheng, Acikgoz, Emre Can, He, Qi, Wang, Hongru, Chen, Xiusi, Hakkani-Tür, Dilek, Tur, Gokhan, Ji, Heng. (2025). Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958.

[zhang2025memory] Zhang, Yuxiang, Shu, Jiangming, Ma, Ye, Lin, Xueyuan, Wu, Shangxi, Sang, Jitao. (2025). Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2510.12635.

[gilabert2025terminology] Gilabert, Javier Garcia, Escolano, Carlos, Liao, Xixian, Melero, Maite. (2025). Terminology-Constrained Translation from Monolingual Data using GRPO. Proceedings of the Tenth Conference on Machine Translation.

[wang2025grpo] Wang, Hongcheng, Huang, Yinuo, Wang, Sukai, Ren, Guanghui, Dong, Hao. (2025). GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training. arXiv preprint arXiv:2509.24494.

[shridhar2020alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, Bisk, Yonatan, Trischler, Adam, Hausknecht, Matthew. (2020). Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

[wang2022scienceworld] Wang, Ruoyao, Jansen, Peter, Côté, Marc-Alexandre, Ammanabrolu, Prithviraj. (2022). Scienceworld: Is your agent smarter than a 5th grader?. arXiv preprint arXiv:2203.07540.

[chevalier2018babyai] Chevalier-Boisvert, Maxime, Bahdanau, Dzmitry, Lahlou, Salem, Willems, Lucas, Saharia, Chitwan, Nguyen, Thien Huu, Bengio, Yoshua. (2018). Babyai: A platform to study the sample efficiency of grounded language learning. arXiv preprint arXiv:1810.08272.

[yang2018hotpotqa] Yang, Zhilin, Qi, Peng, Zhang, Saizheng, Bengio, Yoshua, Cohen, William, Salakhutdinov, Ruslan, Manning, Christopher D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 conference on empirical methods in natural language processing.

[gao2025agentscope] Gao, Dawei, Li, Zitao, Xie, Yuexiang, Kuang, Weirui, Yao, Liuyi, Qian, Bingchen, Ma, Zhijian, Cui, Yue, Luo, Haohao, Li, Shen, others. (2025). AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications. arXiv preprint arXiv:2508.16279.

[pan2025trinity] Pan, Xuchen, Chen, Yanxi, Chen, Yushuo, Sun, Yuchang, Chen, Daoyuan, Zhang, Wenhao, Xie, Yuexiang, Huang, Yilun, Zhang, Yilei, Gao, Dawei, others. (2025). Trinity-rft: A general-purpose and unified framework for reinforcement fine-tuning of large language models. arXiv preprint arXiv:2505.17826.

[chaudhari2025rlhf] Chaudhari, Shreyas, Aggarwal, Pranjal, Murahari, Vishvak, Rajpurohit, Tanmay, Kalyan, Ashwin, Narasimhan, Karthik, Deshpande, Ameet, Castro da Silva, Bruno. (2025). Rlhf deciphered: A critical analysis of reinforcement learning from human feedback for llms. ACM Computing Surveys.
