
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured memory operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).

Introduction

Large Language Models (LLMs) have shown remarkable ability in understanding and generating natural language, making them central to recent advances in AI (OpenAI et al., 2024; Qwen et al., 2025). Yet, they remain fundamentally stateless (Yu et al., 2025; Fan et al., 2025; Goodyear et al., 2025): their memory is bounded by a finite context window and any information that falls outside this window is forgotten, preventing them from maintaining knowledge across long conversations or evolving tasks (Wang et al., 2024; Fei et al., 2023).

One early effort is the Tensor Brain framework, which uses a bilayer tensor network with index and representation layers to model episodic, semantic, and working memory (Tresp et al., 2023). Recent studies augment LLMs with explicit external memory modules (Zhang et al., 2024), most of which adopt the retrieval-augmented generation (RAG) paradigm (Pan et al., 2025; Salama et al., 2025), appending retrieved memory entries to the model's input prompt. While this extends access to past information, it also creates a fundamental retrieval challenge: heuristics may return too few entries, omitting crucial context, or too many, flooding the model with irrelevant information and degrading performance (Liu et al., 2023). In this paradigm, retrieved memories are passed to the LLM without meaningful filtering or prioritization, forcing the model to reason over both relevant and irrelevant content, which makes it prone to distraction by noise. Humans, by contrast, retrieve broadly but then filter, integrating only the most useful pieces to maintain coherent, evolving knowledge.

Equally important is the challenge of memory management: deciding what to remember, update, or discard. Some systems (Packer et al., 2023; Modarressi et al., 2024; Xiong et al., 2025) adopt CRUD-style operations, namely create, read, update, and delete, adapted from databases (Martin, 1983). A more recent work (AIOS Foundation, 2024) augments this paradigm with a search operator, while Mem0 (Chhikara et al., 2025) investigates the operator set {ADD, UPDATE, DELETE, NOOP}. We adopt this setting, as it provides a minimal yet expressive framework for modeling memory dynamics. Existing approaches mainly rely on vanilla LLMs to choose operations from in-context instructions without any learning signal tied to correctness (Packer et al., 2023; Chhikara et al., 2025). Even simple cases can fail. Figure 1, a simplified example drawn from a LoCoMo conversation (Maharana et al., 2024), shows how a user says 'I adopted a dog named Buddy' and later adds 'I adopted another dog named Scout'. A vanilla system misinterprets this as a contradiction, issuing DELETE + ADD and overwriting the original memory. A trained agent instead consolidates with an UPDATE: 'Andrew adopted two dogs, Buddy and Scout.' Appendix A.1 provides a real dialogue trace illustrating this case in practice.

Figure 1: Comparison of Memory-R1 and a vanilla LLM memory system. (Left) In a multi-session dialogue, the user mentions adopting two dogs across sessions. (Middle) The vanilla Memory Manager misinterprets this as a contradiction and issues DELETE+ADD, fragmenting memory. (Right) The RL-trained Memory Manager issues a single UPDATE to consolidate the fact, while the Answer Agent distills 60 retrieved memories down to the relevant one ('Andrew adopted 2 dogs named Buddy and Scout') and correctly answers '2 dogs.'

These challenges of retrieving and managing memory remain largely unsolved. Supervised fine-tuning provides limited help because it is impractical to label every memory operation or retrieval decision. Reinforcement learning (RL), in contrast, has recently shown strong potential for aligning LLM behavior with high-level objectives, including tool use (Qian et al., 2025; Wang et al., 2025), web navigation (Wei et al., 2025), and search optimization (Jin et al., 2025; Song et al., 2025). Building on this success, we argue that RL is the missing ingredient for adaptive memory in LLM agents. By optimizing outcome-based rewards, models can learn when to add, update, delete, or retain information, and how to use retrieved memories for reasoning.

In this paper, we present Memory-R1, an RL fine-tuned, memory-augmented LLM framework with two specialized agents: (1) a Memory Manager that performs structured memory operations to maintain and evolve the memory bank, and (2) an Answer Agent that applies a Memory Distillation policy to filter memories retrieved via Retrieval-Augmented Generation (RAG) and reason over the selected entries to produce answers. Both agents are fine-tuned using PPO (Schulman et al., 2017) or GRPO (Shao et al., 2024), achieving strong performance with as few as 152 question-answer pairs. On the LoCoMo benchmark (Maharana et al., 2024), Memory-R1 delivers substantial gains over the most competitive baseline, Mem0 (Chhikara et al., 2025). With the LLaMA-3.1-8B-Instruct backbone, Memory-R1 (GRPO) achieves relative improvements of 28% in F1, 34% in BLEU-1, and 30% in LLM-as-a-Judge. These results set a new state of the art on LoCoMo and underscore Memory-R1's ability to deliver large performance gains with minimal supervision.

Our contributions are summarized as follows: (1) We introduce Memory-R1, the first RL framework for memory-augmented LLMs, consisting of a Memory Manager that performs structured memory operations and an Answer Agent that filters and reasons over memories retrieved via RAG. (2) We develop a data-efficient fine-tuning strategy using PPO and GRPO that enables Memory-R1 to achieve strong performance with as few as 152 question-answer pairs, demonstrating that large memory improvements can be achieved with minimal supervision. (3) We provide an in-depth analysis of RL choices, model size, and memory design, offering actionable insights for building the next generation of memory-aware, reasoning-capable LLM agents.

Memory-Augmented LLM-based Agents

LLMs have emerged as powerful general-purpose reasoners, capable of engaging in multi-turn dialogues, decomposing tasks into actionable steps, and leveraging prior context to guide decision making (Brown et al., 2020; Chowdhery et al., 2022; OpenAI et al., 2024). However, their reliance on fixed-length context windows limits their ability to retain information over extended interactions. To overcome this, recent work augments LLM agents with external memory modules, enabling long-horizon reasoning and persistent knowledge accumulation through selective storage, retrieval, and updating of information. Several approaches illustrate this trend. LoCoMo (Maharana et al., 2024) introduces a benchmark to evaluate agents' ability to retrieve and reason over temporally distant conversational history. ReadAgent (Lee et al., 2024) proposes a human-inspired reading agent that uses gist-based memory for reasoning over very long contexts. MemoryBank (Zhong et al., 2024) proposes a compositional memory controller for lifelong agent memory. MemGPT (Packer et al., 2023) introduces working and long-term buffers with scheduling policies. For a broader perspective, we refer readers to the recent survey on memory systems in AI agents (Du et al., 2025). While most existing approaches rely on static memory designs, our work instead develops a learnable memory system trained with reinforcement learning.

LLM and Reinforcement Learning

The intersection of LLMs and RL has received increasing attention as researchers seek to move beyond static supervised fine-tuning and enable models to learn from dynamic, interactive feedback. Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) is a foundational method used to align LLM outputs with human preferences. Recent works extend RL to more structured decision-making tasks for LLMs. For instance, Toolformer (Schick et al., 2023) and ReAct-style agents (Yao et al., 2023) frame tool use as an RL problem, where the LLM learns when to query external tools or APIs. Search-R1 (Jin et al., 2025) trains LLMs to issue web search queries using RL to maximize final answer correctness. Similarly, the Trial-and-Error approach (Song et al., 2024) optimizes agents to select better reasoning paths. These approaches demonstrate that RL can improve complex behavior sequences in LLMs. However, memory management and utilization in LLMs remain underexplored in the RL setting. Existing memory-augmented LLM systems (Chhikara et al., 2025; Packer et al., 2023) typically rely on heuristics to control memory operations, lacking adaptability and long-term optimization. Our work, Memory-R1, is among the first to frame memory operation selection and the utilization of relevant memories as an RL problem.

Method

We present Memory-R1, a reinforcement learning framework for multi-session dialogue tasks, where each dialogue contains multiple sessions (separate interactions occurring at different times) and each session consists of several turns (a back-and-forth exchange between two users). Answering a question always requires synthesizing information spread across sessions, posing a strong challenge for long-horizon memory management and reasoning. Figure 2 illustrates the overall pipeline. At each dialogue turn, the LLM extracts and summarizes information worth remembering, then retrieves related entries from the memory bank as part of the Retrieval-Augmented Generation (RAG) framework. The Memory Manager decides whether to ADD, UPDATE, DELETE, or NOOP, thereby maintaining and evolving the memory state. For question answering, the Answer Agent applies a memory distillation policy over retrieved memories to filter noise and reason over the most relevant content. Both agents are fine-tuned with PPO or GRPO, enabling outcome-driven learning of memory operations and selective utilization. Further implementation details, such as model hyperparameters, optimization schedule, and training setup, are provided in Appendix D.
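The two-stage loop can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`manage_memory`, `distill`) and a dict-based memory bank, not the authors' implementation:

```python
def manage_memory(bank, op, content, key=None):
    """Stage 1: apply one of {ADD, UPDATE, DELETE, NOOP} to the memory bank."""
    if op == "ADD":
        bank[len(bank)] = content
    elif op == "UPDATE" and key in bank:
        bank[key] = content          # consolidate in place, don't fragment
    elif op == "DELETE" and key in bank:
        del bank[key]
    return bank                      # NOOP leaves the bank unchanged

def distill(retrieved, is_relevant):
    """Stage 2: memory distillation -- keep only entries judged relevant."""
    return [m for m in retrieved if is_relevant(m)]

bank = {}
manage_memory(bank, "ADD", "Andrew adopted a dog named Buddy.")
# A later session adds Scout: an UPDATE consolidates rather than DELETE+ADD.
manage_memory(bank, "UPDATE", "Andrew adopted 2 dogs: Buddy and Scout.", key=0)
```

In the actual framework, the operation, the rewritten content, and the relevance judgments are all produced by RL-trained LLM agents rather than hand-written rules.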

RL Fine-tuning for Memory Manager

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Task Formulation The Memory Manager maintains the memory bank by selecting one of ADD, UPDATE, DELETE, or NOOP for each new piece of information extracted from a dialogue, outputting both the operation and the updated content m′. Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn with information relevant to downstream QA. The goal is to learn which operation produces a memory state that enables the Answer Agent to answer correctly. Formally, the Memory Manager is modeled as a policy π_θ that takes extracted information x and retrieved memories M_old as input, and outputs an operation o with content m′:

$$
(o, m') \sim \pi_\theta(\cdot \mid x, M_{\text{old}})
$$

where x is the new information and M_old the current memory bank. The data construction details are provided in Appendix B.2.
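For illustration, the manager's raw generation must be parsed into the (o, m′) pair before it can be applied. The `OP: … | CONTENT: …` output format below is an assumption made for this sketch, not the paper's actual prompt schema:

```python
VALID_OPS = {"ADD", "UPDATE", "DELETE", "NOOP"}

def parse_manager_output(text):
    """Split a generated string into (operation, updated content m')."""
    op_part, _, content_part = text.partition("|")
    op = op_part.replace("OP:", "").strip().upper()
    if op not in VALID_OPS:
        raise ValueError(f"unknown operation: {op!r}")
    m_prime = content_part.replace("CONTENT:", "").strip()
    return op, m_prime
```

A validation step like this also gives a natural place to reject malformed generations during RL rollouts.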

PPO for Memory Manager We fine-tune the Memory Manager with Proximal Policy Optimization (PPO; Schulman et al., 2017). Given candidate memory x and memory bank M_old, the manager samples an operation o and updated content m′ from policy π_θ, applies it to the memory bank, and forwards the result to the frozen Answer Agent. Answer correctness provides a scalar reward r, from which we estimate an advantage A. The clipped PPO objective is:

$$
\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}\Big[\min\big(\rho_\theta A,\ \mathrm{clip}(\rho_\theta,\ 1-\epsilon,\ 1+\epsilon)\,A\big)\Big]
$$

where ρ_θ = π_θ(o, m′ | x, M_old) / π_old(o, m′ | x, M_old) is the importance ratio, A is the advantage estimated from the answer-based reward r, and ϵ is the clipping threshold for stable updates.
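The clipped surrogate is easy to state concretely. The sketch below computes it for a single sampled action from scalar log-probabilities, a deliberate simplification of the token-level objective used in practice:

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped PPO surrogate: min(rho*A, clip(rho, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)             # importance ratio rho_theta
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The `min` takes the pessimistic bound, so large policy updates stop earning extra objective once the ratio leaves the [1-ϵ, 1+ϵ] band.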

GRPO for Memory Manager We also train the Memory Manager with Group Relative Policy Optimization (GRPO; Shao et al., 2024), which samples a group of G candidate actions per state and computes their relative advantages. This formulation avoids an explicit value function while maintaining PPO-style stability. For a state s = (x, M_old), the GRPO objective is:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\big(\rho_\theta^{(i)} A_i,\ \mathrm{clip}(\rho_\theta^{(i)},\ 1-\epsilon,\ 1+\epsilon)\,A_i\big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big]
$$

where each candidate i yields reward r_i; A_i = (r_i − mean(r)) / std(r), with r = {r_1, ..., r_G}, is its standardized group-relative advantage; and ρ_θ^(i) is the per-action importance ratio. The KL term regularizes updates to prevent policy drift away from the reference policy π_ref.
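The group-relative advantage amounts to standardizing rewards within each group of G rollouts for the same state; a minimal sketch with plain Python floats (the zero-variance guard is our assumption):

```python
def group_relative_advantages(rewards):
    """Standardize rewards within one group of G candidate actions."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    std = std if std > 0 else 1.0   # guard: all-equal rewards give zero advantage
    return [(r - mean) / std for r in rewards]
```

With binary QA rewards, this simply pushes probability toward the group members whose memory operations led to a correct answer and away from the rest.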

Reward Design for Memory Manager We use an outcome-driven reward: the Memory Manager's operations are judged by their effect on downstream QA. After applying operation o with proposed content m′, the updated memory bank is passed to the frozen Answer Agent, and the reward is based on answer correctness:

$$
r = \mathbb{1}\{\, y_{\text{pred}} = y_{\text{gold}} \,\}
$$

where y_pred is the predicted answer and y_gold the ground truth. This exact-match signal requires no manual labels, remains scalable, and is sufficient to teach effective memory operations.
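A sketch of this reward function; the whitespace and case normalization is our assumption (the paper specifies exact match):

```python
def exact_match_reward(y_pred, y_gold):
    """Binary outcome reward: 1.0 if normalized answers match exactly, else 0.0."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if norm(y_pred) == norm(y_gold) else 0.0
```

Because the reward is computed against the gold QA answer rather than per-operation labels, the same training data supervises every memory operation that influenced the answer.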

Task Formulation

Large Language Models (LLMs) have shown remarkable ability in understanding and generating natural language, making them central to recent advances in AI (OpenAI et al., 2024; Qwen et al., 2025). Yet, they remain fundamentally stateless (Yu et al., 2025; Fan et al., 2025; Goodyear et al., 2025): their memory is bounded by a finite context window and any information that falls outside this window is forgotten, preventing them from maintaining knowledge across long conversations or evolving tasks (Wang et al., 2024; Fei et al., 2023).

One early effort is the Tensor Brain framework, which uses a bilayer tensor network with index and representation layers to model episodic, semantic, and working memory (Tresp et al., 2023). Recent studies augment LLMs with explicit external memory modules (Zhang et al., 2024), most of which adopt the retrieval-augmented generation (RAG) paradigm (Pan et al., 2025; Salama et al., 2025), appending retrieved memory entries to the model's input prompt. While this extends access to past information, it also creates a fundamental retrieval challenge: heuristics may return too few entries, omitting crucial context, or too many, flooding the model with irrelevant information and degrading performance (Liu et al., 2023). In this paradigm, retrieved memories are passed to the LLM without meaningful filtering or prioritization, forcing the model to reason over both relevant and irrelevant content, which makes it prone to distraction by noise. Humans, by contrast, retrieve broadly but then filter, integrating only the most useful pieces to maintain coherent, evolving knowledge.

Equally important is the challenge of memory management : deciding what to remember, update, or discard. Some systems (Packer et al., 2023; Modarressi et al., 2024; Xiong et al., 2025) adopt CRUD-style operations, namely create, read, update, and delete, which are adapted from databases (Martin, 1983). A more recent work (AIOS Foundation, 2024) augments this paradigm with a search operator, while Mem0 (Chhikara et al., 2025) investigates the operator set { ADD , UPDATE , DELETE , NOOP }. We adopt this setting, as it provides a minimal yet expressive framework for modeling memory dynamics. Existing approaches mainly rely on vanilla LLMs to choose operations from in-context instructions without any learning signal tied to correctness (Packer et al., 2023; Chhikara et al., 2025). Even simple cases can fail. Figure 1, a simplified example drawn from a LoCoMo conversation (Maharana et al., 2024), shows how a user says 'I adopted a dog named Buddy' and later adds 'I

Figure 1: Comparison of Memory-R1 and a vanilla LLM memory system. (Left) In a multi-session dialogue, the user mentions adopting two dogs across sessions. (Middle) The vanilla Memory Manager misinterprets this as a contradiction and issues DELETE+ADD, fragmenting memory. (Right) The RL-trained Memory Manager issues a single UPDATE to consolidate the fact, while the Answer Agent distills 60 retrieved memories down to the relevant one ('Andrew adopted 2 dogs named Buddy and Scout') and correctly answers '2 dogs.'

Figure 1: Comparison of Memory-R1 and a vanilla LLM memory system. (Left) In a multi-session dialogue, the user mentions adopting two dogs across sessions. (Middle) The vanilla Memory Manager misinterprets this as a contradiction and issues DELETE+ADD, fragmenting memory. (Right) The RL-trained Memory Manager issues a single UPDATE to consolidate the fact, while the Answer Agent distills 60 retrieved memories down to the relevant one ('Andrew adopted 2 dogs named Buddy and Scout') and correctly answers '2 dogs.'

adopted another dog named Scout' . A vanilla system misinterprets this as a contradiction, issuing DELETE + ADD and overwriting the original memory. A trained agent instead consolidates with an UPDATE : 'Andrew adopted two dogs, Buddy and Scout. ' Appendix A.1 provides a real dialogue trace illustrating this case in practice.

These challenges of retrieving and managing memory remain largely unsolved. Supervised finetuning provides limited help because it is impractical to label every memory operation or retrieval decision. Reinforcement learning (RL), in contrast, has recently shown strong potential for aligning LLM behavior with high-level objectives, including tool use (Qian et al., 2025; Wang et al., 2025), web navigation (Wei et al., 2025), and search optimization (Jin et al., 2025; Song et al., 2025). Building on this success, we argue that RL is the missing ingredient for adaptive memory in LLM agents. By optimizing outcome-based rewards, models can learn when to add, update, delete, or retain information and how to use retrieved memories for reasoning.

In this paper, we present Memory-R1, an RL fine-tuned, memory-augmented LLM framework with two specialized agents: (1) a Memory Manager that performs structured memory operations to maintain and evolve the memory bank, and (2) an Answer Agent that applies a Memory Distillation policy to filter memories retrieved via

Retrieval-Augmented Generation (RAG) and reason over the selected entries to produce answers. Both agents are fine-tuned using PPO (Schulman et al., 2017) or GRPO (Shao et al., 2024), achieving strong performance with as few as 152 question-answer pairs. On the LoCoMo benchmark (Maharana et al., 2024), Memory-R1 delivers substantial gains over the most competitive baseline, Mem0 (Chhikara et al., 2025). Using the LLaMA-3.1-8B-Instruct backbone, Memory-R1GRPO achieves relative improvements of 28% in F1, 34% in BLEU-1, and 30% in LLM-as-a-Judge. These improvements set a new state of the art on LoCoMo and underscore Memory-R1's ability to achieve large performance gains with minimal supervision, highlighting its efficiency.

Our contributions are summarized as follows: (1) We introduce Memory-R1, the first RL framework for memory-augmented LLMs, consisting of a Memory Manager to perform structured memory operations and an Answer Agent to filter and reason over memories retrieved via RAG. (2) We develop a data-efficient fine-tuning strategy using PPO and GRPO that enables Memory-R1 to achieve strong performance with as few as 152 question-answer pairs, demonstrating that large memory improvements can be achieved with minimal supervision. (3) We provide in-depth analysis of RL choices, model size, and memory design, offering actionable insights for building the next generation of

memory-aware, reasoning-capable LLM agents.

PPO for Memory Manager

Task Formulation The Memory Manager maintains the memory bank by selecting one of ADD , UPDATE , DELETE , NOOP for each new piece of information extracted from a dialogue, outputting both the operation and updated content m ′ . Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn with information relevant to downstream QA. The goal is to learn which operation produces a memory state that enables the

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Answer Agent to answer correctly. Formally, the Memory Manager is modeled as a policy π θ that takes extracted information x and retrieved memories M old as input, and outputs an operation o with content m ′ :

$$

$$

where x is the new information and M old the current memory bank. The data construction details are provided in Appendix B.2.

PPO for Memory Manager We fine-tune the Memory Manager with Proximal Policy Optimization (PPO; Schulman et al., 2017). Given candidate memory x and memory bank M old, the manager samples an operation o and updated content m ′ from policy π θ , applies it to the memory bank, and forwards the result to the frozen Answer Agent. Answer correctness provides a scalar reward r , from which we estimate an advantage A . The clipped PPO objective is:

$$

$$

where ρ θ = π θ ( o,m ′ | x, M old ) π old ( o,m ′ | x, M old ) is the importance ratio, A is the advantage estimated from the answerbased reward r , and ϵ is the clipping threshold for stable updates.

GRPO for Memory Manager We also train the Memory Manager with Group Relative Policy Optimization (GRPO; Shao et al., 2024), which samples a group of G candidate actions per state and computes their relative advantages. This formulation avoids an explicit value function while maintaining PPO-style stability. For a state s = ( x, M old ) , the GRPO objective is:

$$

$$

where each candidate i yields reward r i , A i = r i -mean ( r ) std ( r ) , r = { r 1 , . . . , r G } , is its standardized group-relative advantage, and ρ ( i ) θ is the per-action importance ratio. The KL term regularizes updates to prevent policy drift away from the reference π ref .

Reward Design for Memory Manager We use an outcome-driven reward: the Memory Manager's operations are judged by their effect on downstream QA. After applying operation o with proposed content m ′ , the updated memory bank is passed to the frozen Answer Agent, and the reward is based on answer correctness:

$$

$$

where y pred is the predicted answer and y gold the ground truth. This exact-match signal requires no manual labels, remains scalable, and is sufficient to teach effective memory operations.

GRPO for Memory Manager

Task Formulation The Memory Manager maintains the memory bank by selecting one of ADD , UPDATE , DELETE , NOOP for each new piece of information extracted from a dialogue, outputting both the operation and updated content m ′ . Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn with information relevant to downstream QA. The goal is to learn which operation produces a memory state that enables the

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Answer Agent to answer correctly. Formally, the Memory Manager is modeled as a policy π θ that takes extracted information x and retrieved memories M old as input, and outputs an operation o with content m ′ :

$$

$$

where x is the new information and M old the current memory bank. The data construction details are provided in Appendix B.2.

PPO for Memory Manager We fine-tune the Memory Manager with Proximal Policy Optimization (PPO; Schulman et al., 2017). Given candidate memory x and memory bank M old, the manager samples an operation o and updated content m ′ from policy π θ , applies it to the memory bank, and forwards the result to the frozen Answer Agent. Answer correctness provides a scalar reward r , from which we estimate an advantage A . The clipped PPO objective is:

$$

$$

where ρ θ = π θ ( o,m ′ | x, M old ) π old ( o,m ′ | x, M old ) is the importance ratio, A is the advantage estimated from the answerbased reward r , and ϵ is the clipping threshold for stable updates.

GRPO for Memory Manager We also train the Memory Manager with Group Relative Policy Optimization (GRPO; Shao et al., 2024), which samples a group of G candidate actions per state and computes their relative advantages. This formulation avoids an explicit value function while maintaining PPO-style stability. For a state s = ( x, M old ) , the GRPO objective is:

$$

$$

where each candidate i yields reward r i , A i = r i -mean ( r ) std ( r ) , r = { r 1 , . . . , r G } , is its standardized group-relative advantage, and ρ ( i ) θ is the per-action importance ratio. The KL term regularizes updates to prevent policy drift away from the reference π ref .

Reward Design for Memory Manager We use an outcome-driven reward: the Memory Manager's operations are judged by their effect on downstream QA. After applying operation o with proposed content m ′ , the updated memory bank is passed to the frozen Answer Agent, and the reward is based on answer correctness:

$$

$$

where y pred is the predicted answer and y gold the ground truth. This exact-match signal requires no manual labels, remains scalable, and is sufficient to teach effective memory operations.

PPO for Memory Manager

Task Formulation The Memory Manager maintains the memory bank by selecting one of ADD , UPDATE , DELETE , NOOP for each new piece of information extracted from a dialogue, outputting both the operation and updated content m ′ . Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn with information relevant to downstream QA. The goal is to learn which operation produces a memory state that enables the

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Answer Agent to answer correctly. Formally, the Memory Manager is modeled as a policy π θ that takes extracted information x and retrieved memories M old as input, and outputs an operation o with content m ′ :

$$

$$

where x is the new information and M old the current memory bank. The data construction details are provided in Appendix B.2.

PPO for Memory Manager We fine-tune the Memory Manager with Proximal Policy Optimization (PPO; Schulman et al., 2017). Given candidate memory x and memory bank M old, the manager samples an operation o and updated content m ′ from policy π θ , applies it to the memory bank, and forwards the result to the frozen Answer Agent. Answer correctness provides a scalar reward r , from which we estimate an advantage A . The clipped PPO objective is:

$$

$$

where ρ θ = π θ ( o,m ′ | x, M old ) π old ( o,m ′ | x, M old ) is the importance ratio, A is the advantage estimated from the answerbased reward r , and ϵ is the clipping threshold for stable updates.

GRPO for Memory Manager We also train the Memory Manager with Group Relative Policy Optimization (GRPO; Shao et al., 2024), which samples a group of G candidate actions per state and computes their relative advantages. This formulation avoids an explicit value function while maintaining PPO-style stability. For a state s = ( x, M old ) , the GRPO objective is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(\rho^{(i)}_\theta A_i,\; \operatorname{clip}(\rho^{(i)}_\theta, 1-\epsilon, 1+\epsilon)\, A_i\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]
$$

where each candidate $i$ yields reward $r_i$, $A_i = \frac{r_i - \operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}$ with $\mathbf{r} = \{r_1, \ldots, r_G\}$ is its standardized group-relative advantage, and $\rho^{(i)}_\theta$ is the per-action importance ratio. The KL term regularizes updates to prevent policy drift away from the reference $\pi_{\text{ref}}$.
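The group-relative standardization is a small computation and can be sketched directly; the function name and the zero-variance guard are our assumptions.

```python
def group_relative_advantages(rewards):
    """Standardize rewards within one group: A_i = (r_i - mean) / std."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5 or 1.0  # guard: all-equal rewards would divide by zero
    return [(r - mean) / std for r in rewards]

# With binary answer rewards, correct candidates get positive advantages
# and incorrect ones get negative advantages, relative to the group:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed relative to the group's own mean, no learned value function is needed, which is the stability property the text attributes to GRPO.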

Reward Design for Memory Manager We use an outcome-driven reward: the Memory Manager's operations are judged by their effect on downstream QA. After applying operation o with proposed content m ′ , the updated memory bank is passed to the frozen Answer Agent, and the reward is based on answer correctness:

$$
r = \begin{cases} 1 & \text{if } y_{\text{pred}} = y_{\text{gold}} \\ 0 & \text{otherwise} \end{cases}
$$

where y pred is the predicted answer and y gold the ground truth. This exact-match signal requires no manual labels, remains scalable, and is sufficient to teach effective memory operations.
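A minimal sketch of this exact-match reward follows. The whitespace/lowercase normalization is our simplifying assumption; the paper specifies only exact match against the gold answer.

```python
def exact_match_reward(y_pred, y_gold):
    """Binary outcome reward: 1.0 if normalized answers match, else 0.0."""
    def norm(s):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(s.lower().split())
    return 1.0 if norm(y_pred) == norm(y_gold) else 0.0
```

The same scalar serves as the PPO reward and as the per-candidate reward that GRPO standardizes within a group.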


RL Fine-Tuning for Answer Agent

Task Formulation The Answer Agent leverages the memory bank maintained by the Memory Manager to answer questions in multi-session dialogues. Following Chhikara et al. (2025), 60 candidate memories are retrieved for each question via similarity-based RAG, and the agent performs memory distillation to select the most relevant entries before generating an answer.
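The retrieval step can be sketched as a cosine-similarity top-k search. The embedding vectors, function names, and toy dimensionality below are illustrative assumptions; the paper only states that 60 candidates are retrieved via similarity-based RAG.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, memory_vecs, k=60):
    """Return ids of the k memories most similar to the query embedding."""
    ranked = sorted(memory_vecs,
                    key=lambda mid: cosine(query_vec, memory_vecs[mid]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings; in practice these would come from a sentence encoder.
mems = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.7, 0.7]}
top2 = retrieve([1.0, 0.1], mems, k=2)
```

The Answer Agent then distills this candidate set rather than consuming all k entries, which is the filtering step the text contrasts with vanilla RAG.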

We model the agent as a policy π θ mapping the question q and retrieved set M ret to an answer y :

$$
y \sim \pi_\theta(\cdot \mid q, \mathcal{M}_{\text{ret}})
$$

PPO for Answer Agent We fine-tune the Answer Agent using the same PPO algorithm as in Section 3.1. The agent takes the question q and retrieved memories M ret and generates an answer y . The objective mirrors Equation (2), applied to the generated sequence. The importance ratio is:

$$
\rho_\theta = \frac{\pi_\theta(y \mid q, \mathcal{M}_{\text{ret}})}{\pi_{\text{old}}(y \mid q, \mathcal{M}_{\text{ret}})}
$$

where y is the generated answer. Advantages derive from final answer quality (e.g., exact match), and clipping ensures stable updates.

GRPO for Answer Agent We also fine-tune the Answer Agent with GRPO, following the formulation in Section 3.1. For each $(q, \mathcal{M}_{\text{ret}})$, the policy samples $G$ candidate answers $\{y_i\}_{i=1}^{G}$. Their exact-match rewards against $y_{\text{gt}}$ are normalized into group-relative advantages. GRPO uses the same importance ratio as PPO, computed per candidate, and stabilizes training without a value function by comparing candidates within each group.

Reward Design for Answer Agent We use the Exact Match (EM) score between the generated answer y pred and ground truth y gold as the reward. This design directly ties the reward to the correctness of the final answer, encouraging the agent to select and reason over memories in a way that yields accurate outputs rather than optimizing for intermediate steps.

Introduction

Large Language Models (LLMs) have shown remarkable ability in understanding and generating natural language, making them central to recent advances in AI (OpenAI et al., 2024; Qwen et al., 2025). Yet, they remain fundamentally stateless (Yu et al., 2025; Fan et al., 2025; Goodyear et al., 2025): their memory is bounded by a finite context window and any information that falls outside this window is forgotten, preventing them from maintaining knowledge across long conversations or evolving tasks (Wang et al., 2024; Fei et al., 2023).

One early effort is the Tensor Brain framework, which uses a bilayer tensor network with index and representation layers to model episodic, semantic, and working memory (Tresp et al., 2023). Recent studies augment LLMs with explicit external memory modules (Zhang et al., 2024), most of which adopt the retrieval-augmented generation (RAG) paradigm (Pan et al., 2025; Salama et al., 2025), appending retrieved memory entries to the model's input prompt. While this extends access to past information, it also creates a fundamental retrieval challenge: heuristics may return too few entries, omitting crucial context, or too many, flooding the model with irrelevant information and degrading performance (Liu et al., 2023). In this paradigm, retrieved memories are passed to the LLM without meaningful filtering or prioritization, forcing the model to reason over both relevant and irrelevant content, which makes it prone to distraction by noise. Humans, by contrast, retrieve broadly but then filter, integrating only the most useful pieces to maintain coherent, evolving knowledge.

Equally important is the challenge of memory management: deciding what to remember, update, or discard. Some systems (Packer et al., 2023; Modarressi et al., 2024; Xiong et al., 2025) adopt CRUD-style operations, namely create, read, update, and delete, which are adapted from databases (Martin, 1983). A more recent work (AIOS Foundation, 2024) augments this paradigm with a search operator, while Mem0 (Chhikara et al., 2025) investigates the operator set { ADD, UPDATE, DELETE, NOOP }. We adopt this setting, as it provides a minimal yet expressive framework for modeling memory dynamics. Existing approaches mainly rely on vanilla LLMs to choose operations from in-context instructions without any learning signal tied to correctness (Packer et al., 2023; Chhikara et al., 2025). Even simple cases can fail. Figure 1, a simplified example drawn from a LoCoMo conversation (Maharana et al., 2024), shows how a user says 'I adopted a dog named Buddy' and later adds 'I adopted another dog named Scout'.

Figure 1: Comparison of Memory-R1 and a vanilla LLM memory system. (Left) In a multi-session dialogue, the user mentions adopting two dogs across sessions. (Middle) The vanilla Memory Manager misinterprets this as a contradiction and issues DELETE+ADD, fragmenting memory. (Right) The RL-trained Memory Manager issues a single UPDATE to consolidate the fact, while the Answer Agent distills 60 retrieved memories down to the relevant one ('Andrew adopted 2 dogs named Buddy and Scout') and correctly answers '2 dogs.'

A vanilla system misinterprets this as a contradiction, issuing DELETE + ADD and overwriting the original memory. A trained agent instead consolidates with an UPDATE: 'Andrew adopted two dogs, Buddy and Scout.' Appendix A.1 provides a real dialogue trace illustrating this case in practice.

These challenges of retrieving and managing memory remain largely unsolved. Supervised fine-tuning provides limited help because it is impractical to label every memory operation or retrieval decision. Reinforcement learning (RL), in contrast, has recently shown strong potential for aligning LLM behavior with high-level objectives, including tool use (Qian et al., 2025; Wang et al., 2025), web navigation (Wei et al., 2025), and search optimization (Jin et al., 2025; Song et al., 2025). Building on this success, we argue that RL is the missing ingredient for adaptive memory in LLM agents. By optimizing outcome-based rewards, models can learn when to add, update, delete, or retain information and how to use retrieved memories for reasoning.

In this paper, we present Memory-R1, an RL fine-tuned, memory-augmented LLM framework with two specialized agents: (1) a Memory Manager that performs structured memory operations to maintain and evolve the memory bank, and (2) an Answer Agent that applies a Memory Distillation policy to filter memories retrieved via

Retrieval-Augmented Generation (RAG) and reason over the selected entries to produce answers. Both agents are fine-tuned using PPO (Schulman et al., 2017) or GRPO (Shao et al., 2024), achieving strong performance with as few as 152 question-answer pairs. On the LoCoMo benchmark (Maharana et al., 2024), Memory-R1 delivers substantial gains over the most competitive baseline, Mem0 (Chhikara et al., 2025). Using the LLaMA-3.1-8B-Instruct backbone, Memory-R1 (GRPO) achieves relative improvements of 28% in F1, 34% in BLEU-1, and 30% in LLM-as-a-Judge. These improvements set a new state of the art on LoCoMo and underscore Memory-R1's ability to achieve large performance gains with minimal supervision.

Our contributions are summarized as follows: (1) We introduce Memory-R1, the first RL framework for memory-augmented LLMs, consisting of a Memory Manager to perform structured memory operations and an Answer Agent to filter and reason over memories retrieved via RAG. (2) We develop a data-efficient fine-tuning strategy using PPO and GRPO that enables Memory-R1 to achieve strong performance with as few as 152 question-answer pairs, demonstrating that large memory improvements can be achieved with minimal supervision. (3) We provide in-depth analysis of RL choices, model size, and memory design, offering actionable insights for building the next generation of memory-aware, reasoning-capable LLM agents.


Experiments

Experimental Setup

Dataset and Model We evaluate Memory-R1 on three benchmarks: LoCoMo (Maharana et al., 2024), MSC (Packer et al., 2023), and LongMemEval (Wu et al., 2024). LoCoMo contains long multi-session dialogues (about 600 turns, 26k tokens) with QA pairs covering single-hop, multi-hop, open-domain, and temporal reasoning. Following prior work (Chhikara et al., 2025), we exclude the adversarial subset and use a 1:1:8 train/validation/test split (152/81/1307 questions). Models are trained only on LoCoMo and evaluated zero-shot on MSC and LongMemEval. We use LLaMA-3.1-8B-Instruct and Qwen-2.5-Instruct backbones (3B, 7B, 14B). Dataset construction details are provided in Appendix B.

Evaluation Metrics We evaluate performance using three metrics: token-level F1 (F1), BLEU-1 (B1), and LLM-as-a-Judge (J). F1 and B1 measure lexical overlap with ground-truth answers, while J uses a separate LLM to assess semantic correctness, relevance, completeness, and contextual appropriateness. Implementation details for LLM-as-a-Judge are provided in Appendix C.
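Token-level F1 can be computed as below. Whitespace tokenization and lowercasing are simplifying assumptions; evaluation scripts typically also strip punctuation and articles.

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-level F1: harmonic mean of precision and recall over tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

This is the metric under which verbose but correct answers are penalized, a point the reward-design analysis later relies on.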

Baselines To evaluate the effectiveness of Memory-R1, we compare it against several established baselines for multi-session dialogue reasoning: (1) LoCoMo (Maharana et al., 2024), a RAG-style framework that converts entire dialogues into chunks and retrieves relevant segments for answering questions, serving as the benchmark baseline for long-range, multi-session conversation reasoning; (2) A-Mem (Xu et al., 2025), a dynamic agentic memory system that creates, links, and updates structured memories to enhance reasoning across sessions; (3) Mem0 (Chhikara et al., 2025), a modular memory system with explicit in-context memory operations designed for scalable deployment; (4) MemoryOS (Kang et al., 2025), a system-level framework that treats memory as an operating system abstraction for LLMs, providing unified mechanisms for memory read, write, and management across sessions to support long-horizon reasoning; (5) Memory-SFT, a supervised fine-tuning variant of our framework that isolates the effect of RL: it uses the same architecture and training data as Memory-R1 but replaces RL optimization with behavior cloning from GPT-5-generated trajectories.

For a fair comparison, we re-implemented all baselines using both the LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct models as backbones, with temperature set to 0 and a maximum token limit of 2048. This consistent setup ensures reproducibility and allows us to assess how each method performs across different model architectures.


We conduct ablation studies to assess the contribution of each component in Memory-R1, isolating the effects of the Memory Manager, the Answer Agent, and the Memory Distillation mechanism. We also compare the training dynamics of PPO and GRPO.

Effect of Memory Manager We compare the full Memory-R1 pipeline with an ablated variant without RL fine-tuning of the Memory Manager, both using LLaMA-3.1-8B-Instruct. As shown in Figure 5 (a,d), removing the RL-fine-tuned Memory Manager consistently degrades performance. Under PPO, F1, BLEU-1, and LLM-as-a-Judge drop from 41.0, 32.9, and 57.5 to 34.5, 28.1, and 49.0, respectively. Under GRPO, the corresponding scores decrease to 37.5, 30.6, and 52.9. These results confirm that outcome-driven RL enables more effective memory operations than scripted control.

Effect of Answer Agent Figure 5 (b,d) shows that RL fine-tuning the Answer Agent substantially improves answer quality. Without the Memory-R1 Answer Agent, PPO achieves F1, BLEU-1, and J scores of 32.5, 24.6, and 59.4, while GRPO reaches 33.0, 24.9, and 59.9. With the full pipeline, PPO improves to 41.0, 32.9, and 57.5, and GRPO further increases performance to 45.0, 37.5, and 62.7. This demonstrates that reward-driven fine-tuning enhances answer quality beyond static retrieval. A case study is provided in Appendix A.2.

Effect of Memory Distillation We evaluate memory distillation by comparing Answer Agents trained with and without distillation (Figure 5 (c,d)). With distillation enabled, PPO improves from 39.3, 30.9, and 57.4 to 41.0, 32.9, and 57.5 on F1, BLEU-1, and J, respectively. GRPO shows larger gains, increasing from 41.0, 34.4, and 60.1 to 45.0, 37.5, and 62.7. These results indicate that filtering irrelevant memories reduces noise and improves reasoning.

RL-Fine-Tuned Answer Agent Gains More with Stronger Memory Manager We test whether Answer Agent gains depend on Memory Manager quality. Figure 6 compares PPO/GRPO agents with a LLaMA-3.1-8B manager versus a stronger GPT-4o-mini manager. Improvements are larger with the stronger manager (F1: +10.10 vs. +19.72; BLEU-1: +10.81 vs. +18.19; J: +5.05 vs. +15.76), showing that Memory-R1 compounds benefits and the Answer Agent scales with memory quality.

Comparison of RL Policies We compare PPO and GRPO for training the Answer Agent, using exact match against ground-truth answers as the reward signal. As shown in Figure 7, GRPO exhibits faster initial convergence, likely due to its grouped return normalization providing stronger early guidance. However, as training progresses, both methods steadily improve and ultimately reach comparable final reward levels.

Figure 6: Performance gains of Answer Agent increase when paired with stronger Memory Managers, showing compounding benefits from higher memory quality.

Table 2: Reward Design Choice Comparison. PPO with J-based reward achieves higher J scores but lower F1 and B1 due to verbose outputs, while the EM-based reward yields balanced performance across metrics.

Reward Design Analysis We experimented with different reward models for fine-tuning the Answer Agent. As shown in Table 2, using the LLM-as-a-Judge value as reward leads to the highest J score (63.58), but performs poorly on F1 and BLEU-1. This is because the reward encourages longer, descriptive answers, which misaligns with string-overlap metrics. For example, when asked 'Did John and James study together?', the EM-based model outputs 'Yes', while the LLM-as-a-Judge-based model produces 'Yes, John and James studied together, as they were part of the same online programming group, as implied by the memories above.' Although both are semantically correct, the latter is penalized under F1 and BLEU-1. This makes direct comparison with baselines difficult, since responses are no longer length-controlled. To avoid bias from relying on a single metric, we adopt the EM reward, which yields balanced improvements across all three metrics.

Figure 7: Training reward curves for PPO and GRPO on the Answer Agent using exact match as the reward. GRPO converges faster initially, and both reach similar final rewards.


Figure 8: Accuracy and latency comparison across different inference pipelines: Base, Base + Reranker, and Memory-R1 (GRPO).


Comparison of Learned Memory Distillation and Reranking We compare learned memory distillation in Memory-R1 with reranker-based pipelines in terms of accuracy and inference latency across three settings: Base, Base + Reranker, and Memory-R1 with a GRPO-trained Answer Agent (Figure 8). While reranking provides modest accuracy gains, it incurs substantial latency overhead. In contrast, Memory-R1 achieves higher accuracy with lower median and tail latency, demonstrating a more favorable accuracy-latency trade-off. Additional analyses are provided in Appendix G.



Implementation Details

We fine-tune Memory-R1 on LLaMA-3.1-8B-Instruct and Qwen-2.5-3B, 7B, and 14B-Instruct models to evaluate robustness across architectures. Experiments are primarily conducted on 4 NVIDIA H100 GPUs (80GB each), except for Qwen-2.5-14B, which requires 8 GPUs. The total batch size is 128 with a micro-batch size of 2 per GPU. The maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively.

Prompts for memory operations and memory-augmented answer generation are adapted from Chhikara et al. (2025). Reinforcement learning fine-tuning is performed using PPO and GRPO within the VERL framework (Sheng et al., 2025). For PPO, actor and critic networks are jointly trained with learning rates of $1 \times 10^{-6}$ and $1 \times 10^{-5}$, respectively, using a constant warmup schedule. GRPO updates only the actor via grouped return normalization.

During RL training, we use a decoding temperature of τ = 1.0 to encourage exploration and collect diverse reward signals, which helps stabilize policy learning. For validation and testing, greedy decoding (τ = 0) is applied to ensure deterministic outputs and consistent metric evaluation.
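The contrast between exploratory sampling during training and deterministic evaluation can be sketched as follows; the toy logits and function name are illustrative assumptions, not from the Memory-R1 codebase.

```python
import math
import random

def sample_token(logits, tau):
    """Pick a token index: greedy when tau == 0, temperature-sampled otherwise."""
    if tau == 0:
        # Greedy decoding: deterministic argmax, used for validation/testing.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature sampling: tau = 1.0 keeps the softmax distribution intact,
    # preserving exploration and diverse reward signals during RL training.
    weights = [math.exp(l / tau) for l in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, tau=0))  # always index 0 (the argmax)
```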


Main Results

Table 1 reports the performance of Memory-R1 across LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct models on the LoCoMo benchmark, covering diverse question types including single-hop,

Table 1: Evaluation results of Memory-R1 and baselines across LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct on the LoCoMo benchmark dataset. Models are evaluated on F1, BLEU-1 (B1), and LLM-as-a-Judge (J) across Single-Hop, Multi-Hop, Open-Domain, and Temporal questions. Higher values indicate better performance. The best results are marked in bold.

Figure 3: Scalability of Memory-R1 across model sizes (Qwen-2.5-3B, 7B, 14B-Instruct). Both PPO- and GRPO-tuned variants consistently outperform the base models across F1, BLEU-1 (B1), and LLM-as-a-Judge (J) metrics, showing strong scaling behavior.


Figure 4: Generalization analysis of Memory-R1 across three benchmarks (LoCoMo, MSC, and LongMemEval), using LLaMA-3.1-8B-Instruct (left) and Qwen-2.5-7B-Instruct (right) as backbones.


multi-hop, open-domain, and temporal reasoning. We evaluate two variants of Memory-R1, one fine-tuned with PPO and another with GRPO, and benchmark them against leading memory-augmented baselines, including LoCoMo (RAG), A-Mem, Mem0, MemoryOS, and Memory-SFT.

Across both model families, Memory-R1 consistently achieves new state-of-the-art performance. On LLaMA-3.1-8B, Memory-R1-GRPO delivers the strongest overall performance, improving F1 by 28.5%, B1 by 34.0%, and J by 30.2% relative to the strongest baseline, MemoryOS. Similarly, Memory-R1-PPO also yields substantial improvements, raising overall F1, B1, and J scores by 17.2%, 17.6%, and 19.4%, respectively. When applied to Qwen-2.5-7B-Instruct, Memory-R1-GRPO again emerges as the top performer, surpassing MemoryOS by margins of 24.5% (F1), 24.1% (B1), and 20.0% (J). PPO remains competitive, delivering strong gains over all non-RL baselines. Notably, while Memory-SFT benefits from guidance by a powerful teacher model (GPT-5), our reinforcement learning approach still outperforms it, highlighting the effectiveness of outcome-driven optimization over purely supervised imitation.


Generalization and Scalability

We further investigate the robustness of Memory-R1 across model scales and datasets. Figure 3 shows results on the Qwen-2.5 family (3B, 7B, 14B). Memory-R1 consistently outperforms the base model at every scale, with PPO and GRPO

Figure 5: Ablation analysis of Memory-R1. Each subfigure shows the effect of removing one component: (a) Memory Manager, (b) Answer Agent, (c) Memory Distillation, and (d) the full pipeline. Performance drops in all ablations, demonstrating that each component contributes to the final results. Grey dashed lines indicate the baseline pipeline without RL fine-tuning.


delivering clear gains in F1, BLEU-1, and J scores. These improvements persist as models scale, demonstrating that reinforcement learning remains effective in teaching LLMs memory management regardless of backbone capacity.

To evaluate cross-task generalization, we apply the pipeline fine-tuned only on LoCoMo directly to two additional benchmarks: MSC and LongMemEval. As shown in Figure 4, Memory-R1 with both PPO and GRPO continues to achieve consistent improvements across all three datasets and metrics, despite never being trained on MSC or LongMemEval. This zero-shot transfer highlights the robustness of Memory-R1 and shows its ability to generalize beyond its training distribution. The gains extend across single-hop, multi-hop, open-domain, and temporal questions, demonstrating Memory-R1 as a generalizable framework for adaptive, memory-augmented LLMs capable of long-horizon reasoning. Detailed results on LoCoMo, MSC, and LongMemEval, with type-level breakdowns, are provided in Appendix F.

Ablation Studies

We conduct ablation studies to assess the contribution of each component in Memory-R1, isolating the effects of the Memory Manager, the Answer Agent, and the Memory Distillation mechanism. We also compare the training dynamics of PPO and GRPO.

Effect of Memory Manager We compare the full Memory-R1 pipeline with an ablated variant without RL fine-tuning of the Memory Manager, both using LLaMA-3.1-8B-Instruct. As shown in Figure 5 (a,d), removing the RL-fine-tuned Memory Manager consistently degrades performance. Under PPO, F1, BLEU-1, and LLM-as-a-Judge drop from 41.0, 32.9, and 57.5 to 34.5, 28.1, and 49.0, respectively. Under GRPO, the corresponding scores decrease to 37.5, 30.6, and 52.9. These results confirm that outcome-driven RL enables more effective memory operations than scripted control.

Effect of Answer Agent Figure 5 (b,d) shows that RL fine-tuning the Answer Agent substantially improves answer quality. Without the Memory-R1 Answer Agent, PPO achieves F1, BLEU-1, and J scores of 32.5, 24.6, and 59.4, while GRPO reaches 33.0, 24.9, and 59.9. With the full pipeline, PPO improves to 41.0, 32.9, and 57.5, and GRPO further increases performance to 45.0, 37.5, and 62.7. This demonstrates that reward-driven fine-tuning enhances answer quality beyond static retrieval. A case study is provided in Appendix A.2.

Effect of Memory Distillation We evaluate memory distillation by comparing Answer Agents trained with and without distillation (Figure 5 (c,d)). With distillation enabled, PPO improves from 39.3, 30.9, and 57.4 to 41.0, 32.9, and 57.5 on F1, BLEU-1, and J, respectively. GRPO shows larger gains, increasing from 41.0, 34.4, and 60.1 to 45.0, 37.5, and 62.7. These results indicate that filtering irrelevant memories reduces noise and improves reasoning.

RL-Fine-Tuned Answer Agent Gains More with Stronger Memory Manager We test whether Answer Agent gains depend on Memory Manager quality. Figure 6 compares PPO/GRPO agents with a LLaMA-3.1-8B manager versus a stronger GPT-4o-mini manager. Improvements are larger with the stronger manager (F1: +10.10 vs. +19.72; BLEU-1: +10.81 vs. +18.19; J: +5.05 vs. +15.76), showing that Memory-R1 compounds benefits and the Answer Agent scales with memory quality.

Comparison of RL Policies We compare PPO and GRPO for training the Answer Agent, using exact match against ground-truth answers as the reward signal. As shown in Figure 7, GRPO exhibits faster initial convergence, likely due to its grouped return normalization providing stronger early guidance. However, as training progresses, both methods steadily improve and ultimately reach comparable final reward levels.

Figure 6: Performance gains of the Answer Agent increase when paired with stronger Memory Managers, showing compounding benefits from higher memory quality.

Table 2: Reward Design Choice Comparison. PPO with the J-based reward achieves higher J scores but lower F1 and B1 due to verbose outputs, while the EM-based reward yields balanced performance across metrics.

Reward Design Analysis We experimented with different reward models for fine-tuning the Answer Agent. As shown in Table 2, using the LLM-as-a-Judge value as reward leads to the highest J score (63.58), but performs poorly on F1 and BLEU-1. This is because the reward encourages longer, descriptive answers, which misaligns with string-overlap metrics. For example, when asked 'Did John and James study together?', the EM-based model outputs 'Yes', while the LLM-as-a-Judge-based model produces 'Yes, John and James studied together, as they were part of the same online programming group, as implied by the memories above.' Although both are semantically correct, the latter is penalized under F1 and BLEU-1. This makes direct comparison with baselines difficult, since responses are no longer length-controlled. To avoid bias from relying on a single metric, we adopt the EM reward, which yields balanced improvements across all three metrics.

Figure 7: Training reward curves for PPO and GRPO on the Answer Agent using exact match as the reward. GRPO converges faster initially, and both reach similar final rewards.


Figure 8: Accuracy and latency comparison across different inference pipelines: Base, Base + Reranker, and Memory-R1 (GRPO).


Comparison of Learned Memory Distillation and Reranking We compare learned memory distillation in Memory-R1 with reranker-based pipelines in terms of accuracy and inference latency across three settings: Base, Base + Reranker, and Memory-R1 with a GRPO-trained Answer Agent (Figure 8). While reranking provides modest accuracy gains, it incurs substantial latency overhead. In contrast, Memory-R1 achieves higher accuracy with lower median and tail latency, demonstrating a more favorable accuracy-latency trade-off. Additional analyses are provided in Appendix G.

Memory Manager Prompt

For training the Memory Manager, we use a detailed prompt that instructs the model how to perform four memory operations: ADD, UPDATE, DELETE, and NOOP. The full prompt spans multiple figures for readability.
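The four operations can be sketched as a simple dispatcher over a key-value memory bank; the dict-based representation and function name below are illustrative assumptions, not the paper's actual implementation.

```python
# A toy dispatcher for the four memory operations described in the prompt.
def apply_operation(bank, op, key, content=None):
    """Apply one of ADD / UPDATE / DELETE / NOOP to a memory bank in place."""
    if op == "ADD":
        bank[key] = content            # insert a new memory entry
    elif op == "UPDATE":
        bank[key] = content            # overwrite with the revised content
    elif op == "DELETE":
        bank.pop(key, None)            # drop the entry if present
    elif op == "NOOP":
        pass                           # leave the bank unchanged
    else:
        raise ValueError("unknown operation: %s" % op)
    return bank

bank = apply_operation({}, "ADD", "m1", "John adopted a dog named Buddy.")
bank = apply_operation(bank, "UPDATE", "m1",
                       "John adopted two dogs, Buddy and Scout.")
print(bank["m1"])  # John adopted two dogs, Buddy and Scout.
```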

Answer Agent Prompt

We provide the full prompt used to instruct the Answer Agent in our case study. This prompt defines the reasoning process, memory selection criteria, and formatting requirements for the model's responses. Figure 11 shows the complete instructions, context, and representative retrieved memories.

Memory Distillation Case Study

Memories selected as relevant:

Answer: beach

Discussion: The original model consumed all retrieved memories indiscriminately and defaulted to 'mountains,' likely influenced by irrelevant mentions of mountaineering. In contrast, Memory-R1 filtered out distractors, surfaced only beach-related memories, and generated the correct answer. This case highlights how Memory Distillation helps the model discard noise, focus on true signals, and improve factual accuracy.

Memory Manager

Task Formulation The Memory Manager maintains the memory bank by selecting one of ADD, UPDATE, DELETE, or NOOP for each new piece of information extracted from a dialogue, outputting both the operation and the updated content m′. Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn with information relevant to downstream QA. The goal is to learn which operation produces a memory state that enables the

Figure 2: Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL-fine-tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.


Answer Agent to answer correctly. Formally, the Memory Manager is modeled as a policy $\pi_\theta$ that takes extracted information x and retrieved memories $M_{\text{old}}$ as input, and outputs an operation o with content m′:

$$
(o, m') \sim \pi_\theta(\cdot \mid x, M_{\text{old}}),
$$

where x is the new information and $M_{\text{old}}$ the current memory bank. The data construction details are provided in Appendix B.2.

PPO for Memory Manager We fine-tune the Memory Manager with Proximal Policy Optimization (PPO; Schulman et al., 2017). Given candidate memory x and memory bank $M_{\text{old}}$, the manager samples an operation o and updated content m′ from policy $\pi_\theta$, applies it to the memory bank, and forwards the result to the frozen Answer Agent. Answer correctness provides a scalar reward r, from which we estimate an advantage A. The clipped PPO objective is:

$$
\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(\rho_\theta A,\; \mathrm{clip}\left(\rho_\theta,\, 1-\epsilon,\, 1+\epsilon\right) A\right)\right],
$$

where $\rho_\theta = \frac{\pi_\theta(o, m' \mid x, M_{\text{old}})}{\pi_{\text{old}}(o, m' \mid x, M_{\text{old}})}$ is the importance ratio, A is the advantage estimated from the answer-based reward r, and ϵ is the clipping threshold for stable updates.
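The clipped surrogate for a single sampled (operation, content) action can be sketched as below, assuming per-action log-probabilities from the current and old policies are available; names and values are illustrative, not from the Memory-R1 codebase.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate for one action (a quantity to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio rho_theta
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)                   # pessimistic lower bound

# With advantage +1, a ratio of 2.0 is clipped to 1 + eps = 1.2,
# capping how far a single update can push the policy.
print(ppo_clip_objective(math.log(2.0), 0.0, 1.0))
```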

GRPO for Memory Manager We also train the Memory Manager with Group Relative Policy Optimization (GRPO; Shao et al., 2024), which samples a group of G candidate actions per state and computes their relative advantages. This formulation avoids an explicit value function while maintaining PPO-style stability. For a state $s = (x, M_{\text{old}})$, the GRPO objective is:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_\theta^{(i)} A_i,\; \mathrm{clip}\left(\rho_\theta^{(i)},\, 1-\epsilon,\, 1+\epsilon\right) A_i\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),
$$

where each candidate i yields reward $r_i$, $A_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}$ with $\mathbf{r} = \{r_1, \ldots, r_G\}$ is its standardized group-relative advantage, and $\rho_\theta^{(i)}$ is the per-action importance ratio. The KL term regularizes updates to prevent policy drift away from the reference $\pi_{\text{ref}}$.
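The group-relative advantage is per-group standardization of rewards; a minimal sketch with the standard library (the zero-variance guard is an addition of this sketch, not specified by the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Standardize a group of G rewards against the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # guard: zero-variance groups get A_i = 0
    return [(r - mu) / sigma for r in rewards]

# Two correct and two incorrect answers in a group of G = 4:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```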

Reward Design for Memory Manager We use an outcome-driven reward: the Memory Manager's operations are judged by their effect on downstream QA. After applying operation o with proposed content m′, the updated memory bank is passed to the frozen Answer Agent, and the reward is based on answer correctness:

$$
r = \mathbb{1}\left[y_{\text{pred}} = y_{\text{gold}}\right],
$$

where $y_{\text{pred}}$ is the predicted answer and $y_{\text{gold}}$ the ground truth. This exact-match signal requires no manual labels, remains scalable, and is sufficient to teach effective memory operations.
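The exact-match reward reduces to a string comparison; the light normalization below (lowercasing, whitespace collapse) is an assumption of this sketch, since the exact normalization is not specified here.

```python
def em_reward(y_pred, y_gold):
    """Return 1.0 if the prediction exactly matches the gold answer, else 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(y_pred) == norm(y_gold) else 0.0

print(em_reward("Yes", "yes"))             # 1.0
print(em_reward("Yes, they did.", "Yes"))  # 0.0: verbose answers get no credit
```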

Latency Analysis

We provide a detailed latency analysis to better understand the efficiency characteristics of Memory-R1 and its individual components. All latency results are reported using median (p50) and tail

Table 3: Extended evaluation of Memory-R1 with Qwen-2.5 model family as backbones on the LoCoMo benchmark. Results are reported across question types (Single-Hop, Multi-Hop, Open-Domain, Temporal) and overall performance. Best scores are highlighted in bold.

Table 4: Extended evaluation of Memory-R1 on the LongMemEval benchmark using LLaMA and Qwen backbones. Each cell shows F1/B1/J for a given model-method combination, reported with one decimal precision. Task types are abbreviated as: SSU = Single-Session-User, SSP = Single-Session-Preference, OD = Open Domain, MS = Multi-Session, KU = Knowledge Update, TR = Temporal Reasoning, and O = Overall. The best value for each metric (F1, B1, J) within a task row is highlighted in bold.

Table 5: Overall results on the LongMemEval benchmark. We report the mean scores across all six evaluation dimensions. The best results are marked in bold.

(p95) inference time, measured across three components of the pipeline: the Memory Manager, Memory Search, and the Answer Agent. We compare the base model, PPO-trained variants, and GRPO-trained variants on both LLaMA-3.1-8B and Qwen-2.5-7B backbones.

Overall Trends Across both model families, Memory-R1 does not introduce prohibitive latency overhead despite incorporating explicit memory management and reasoning components. In many cases, GRPO-trained variants achieve lower tail latency than both the base model and PPO variants, indicating that reinforcement learning can improve not only accuracy but also inference efficiency.

Memory Manager Latency For the Memory Manager component, latency remains relatively stable across Base, PPO, and GRPO variants. On LLaMA-3.1-8B, median latency ranges narrowly between 1.98 s and 2.17 s, with p95 latency around 3.4-3.6 s. Similar behavior is observed on Qwen-2.5-7B, where p50 latency stays below 1.4 s across all variants. These results suggest that RL fine-tuning does not materially increase the computational cost of memory operation selection.

Memory Search Latency Memory Search exhibits consistently low latency across all settings. On both backbones, median latency remains below 0.35 s, and p95 latency remains under 0.65 s. Differences between Base, PPO, and GRPO variants are minimal, indicating that improvements in downstream accuracy are not driven by more expensive retrieval operations.
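The p50/p95 summaries used throughout this analysis can be computed with the standard library; the sample timings below are made-up illustrations, not measured values.

```python
from statistics import quantiles

def latency_summary(samples):
    """Median (p50) and tail (p95) latency from raw per-request timings."""
    qs = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": qs[49], "p95": qs[94]}

# Mostly-fast requests with a slow tail (seconds):
timings = [0.25] * 90 + [0.60] * 10
summary = latency_summary(timings)
print(summary["p50"] < summary["p95"])  # True: the tail dominates p95
```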


Conclusion

We presented Memory-R1, a reinforcement learning framework that enables LLM-based agents to effectively manage and utilize external memory. Unlike heuristic pipelines, Memory-R1 learns memory operations as well as how to distill and use memories for answering. With only 152 training examples, it achieves state-of-the-art results on LoCoMo, scales across model sizes, and generalizes to MSC and LongMemEval without retraining. Ablation studies confirm that reinforcement learning improves every component of the system. Overall, Memory-R1 highlights reinforcement learning as a promising direction for adaptive and agentic memory in LLMs.

Limitations

Our evaluation focuses on dialogue-centric datasets. While these benchmarks cover a wide range of reasoning types, extending Memory-R1 to multimodal data may introduce challenges beyond the scope of this work. Additionally, we train the Memory Manager and Answer Agent separately to ensure stability under sparse rewards. This separation is necessary but makes the process less straightforward. An end-to-end multi-agent reinforcement learning approach could simplify training and enable richer coordination, which we view as a promising direction for future work.

Case Study of Behavior of Agents before and after Fine-tuning

From In-context Memory Manager to RL fine-tuned Memory Manager

To demonstrate how RL fine-tuning improves memory operations, we present two real representative examples. In the first case, the user initially mentions adopting a dog named Buddy, and later states that they have adopted another dog named Scout.

From Vanilla LLM to Memory-Distilled RL Answer Agent

To illustrate how RL fine-tuned Answer Agent with Memory Distillation improves answer accuracy, we compare the original model's output with the RL fine-tuned model on a representative example from LoCoMo. The prompt provided to the model is shown in Figure 11.

Question: Does John live close to a beach or the mountains?

Dataset Details

Test Data

LoCoMo. LoCoMo (Maharana et al., 2024) is a benchmark of long-term multi-session dialogues, with conversations averaging 300 turns and 9k tokens, spanning up to 35 sessions. It serves as our primary experimental dataset, on which we conduct and report detailed results.

MSC. We further evaluate on the Multi-Session Chat (MSC) dataset (Xu et al., 2021), which contains open-domain dialogues spanning multiple sessions. Following MemGPT (Packer et al., 2023), we use a modified version of MSC tailored to the memory-augmented evaluation setting, where questions depend on information distributed across earlier sessions. This dataset tests whether models can maintain continuity across temporally separated interactions.

LongMemEval. We also evaluate on LongMemEval (Wu et al., 2024), a benchmark designed to test long-term memory capabilities of LLMs. It covers diverse tasks including factual recall, temporal reasoning, and entity tracking, with questions requiring integration of information from long and sparse contexts. LongMemEval complements LoCoMo and MSC by emphasizing broader generalization beyond dialogue-centric settings.

Training Data

We construct separate training datasets for the Memory Manager and the Answer Agent from the LoCoMo multi-turn dialogues. The LoCoMo dataset is publicly released under the CC BY-NC 4.0 license. We slightly modify it for dialogue segmentation to fit our reinforcement learning pipeline, while preserving its original license terms and using it solely for non-commercial research purposes. All other datasets used in this paper (MSC and LongMemEval) are publicly available research benchmarks and are used in accordance with their respective licenses.

Memory Manager Training Data. For each dialogue turn $t$, GPT-4o-mini builds a temporal memory bank from the preceding 24 turns. The current turn $t$ is fused with this snapshot to form the input. Unlike supervised annotation of memory operations, we do not provide explicit labels (ADD, UPDATE, DELETE, NOOP). Instead, the Memory Manager is optimized via reinforcement learning, where the correctness of the downstream Answer Agent's answer provides the learning signal. The full procedure is given in Algorithm 1.
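The sliding-window construction above can be sketched as follows. This is a minimal illustration, not the released pipeline: the `summarize` callable (played by GPT-4o-mini in the paper) and the example dictionary layout are assumptions for exposition.

```python
from collections import deque

WINDOW = 24  # preceding turns used to build the temporal memory snapshot


def build_manager_examples(dialogue, summarize):
    """Pair each dialogue turn with a memory snapshot of the previous WINDOW turns.

    `dialogue` is a list of turn strings; `summarize` maps a turn to a memory
    entry. No operation labels (ADD/UPDATE/DELETE/NOOP) are attached: the RL
    reward comes from the downstream Answer Agent instead.
    """
    window = deque(maxlen=WINDOW)
    examples = []
    for turn in dialogue:
        snapshot = [summarize(t) for t in window]  # temporal memory bank
        examples.append({"memory_bank": snapshot, "turn": turn})
        window.append(turn)
    return examples
```

Each resulting example is the fused (memory snapshot, current turn) input the Memory Manager is trained on.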

Answer Agent Training Data. For each question $q$ in LoCoMo, we retrieve 60 candidate memories using retrieval-augmented generation (RAG) over the temporal memory bank. The retrieved set, paired with the question and gold answer, serves as the training input for the Answer Agent, which learns to distill the relevant entries and generate concise, correct responses.

Algorithm 1 Data Construction for Memory-R1 Training

Prompts

In developing our Memory Manager prompt, answer generation agent prompt, and LLM-as-a-Judge prompt, we adapt elements from the prompts released by prior work (Packer et al., 2023; Chhikara et al., 2025).

Memory Manager Prompt

For training the Memory Manager, we use a detailed prompt that instructs the model how to perform four memory operations: ADD , UPDATE , DELETE , and NOOP . The full prompt spans multiple figures for readability.

Answer Agent Prompt

We provide the full prompt used to instruct the Answer Agent in our case study. This prompt defines the reasoning process, memory selection criteria, and formatting requirements for the model's responses. Figure 11 shows the complete instructions, context, and representative retrieved memories.

LLM-as-a-Judge (J) Prompt

For evaluating the correctness of generated answers, we employ an LLM-as-a-Judge prompt. The judge model is asked to label each answer as CORRECT or WRONG based on comparison with the gold answer. The complete prompt template is shown in Figure 12.
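Since the judge is asked to return its label in JSON, the evaluation script needs to extract that label robustly. The sketch below is one plausible parser, not the paper's released code; the fallback scan for a bare CORRECT/WRONG token is our own assumption.

```python
import json
import re


def parse_judge_label(judge_output: str) -> str:
    """Extract the CORRECT/WRONG label from a judge model's response.

    The prompt requests a one-sentence explanation followed by a JSON
    object with the key "label"; if the JSON is malformed, we fall back
    to scanning for the bare word (the prompt forbids emitting both).
    """
    match = re.search(r"\{.*\}", judge_output, re.DOTALL)
    if match:
        try:
            label = str(json.loads(match.group(0)).get("label", "")).upper()
            if label in ("CORRECT", "WRONG"):
                return label
        except json.JSONDecodeError:
            pass
    return "CORRECT" if "CORRECT" in judge_output.upper() else "WRONG"
```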

Memory Manager Prompt (Part 1): Overview and ADD/UPDATE Instruction You are a smart memory manager which controls the memory of a system. You can perform four operations: (1) add into the memory, (2) update the memory, (3) delete from the memory, and (4) no change. Based on the above four operations, the memory will change. Compare newly retrieved facts with the existing memory. For each new fact, decide whether to: -ADD: Add it to the memory as a new element -UPDATE: Update an existing memory element -DELETE: Delete an existing memory element -NONE: Make no change (if the fact is already present or irrelevant) 1. Add: If the retrieved facts contain new information not present in the memory, then you have to add it by generating a new ID in the id field. -Example: Old Memory: [ {"id" : "0", "text" : "User is a software engineer"} ] Retrieved facts: ["Name is John"] New Memory: { "memory" : [ {"id" : "0", "text" : "User is a software engineer", "event" : "NONE"}, {"id" : "1", "text" : "Name is John", "event" : "ADD"} ] } 2. Update: If the retrieved facts contain information that is already present in the memory but the information is totally different, then you have to update it. If the retrieved fact contains information that conveys the same thing as the memory, keep the version with more detail. Example (a) - if the memory contains "User likes to play cricket" and the retrieved fact is "Loves to play cricket with friends", then update the memory with the retrieved fact. Example (b) - if the memory contains "Likes cheese pizza" and the retrieved fact is "Loves cheese pizza", then do NOT update it because they convey the same information. Important: When updating, keep the same ID and preserve old_memory. 
-Example: Old Memory: [ {"id" : "0", "text" : "I really like cheese pizza"}, {"id" : "2", "text" : "User likes to play cricket"} ] Retrieved facts: ["Loves chicken pizza", "Loves to play cricket with friends"] New Memory: { "memory" : [ {"id" : "0", "text" : "Loves cheese and chicken pizza", "event" : "UPDATE", "old_memory" : "I really like cheese pizza"}, {"id" : "2", "text" : "Loves to play cricket with friends", "event" : "UPDATE", "old_memory" : "User likes to play cricket"} ] }

Figure 9: Memory Manager Prompt (Part 1): Overview and ADD/UPDATE operation instruction.

Memory Manager Prompt (Part 2): DELETE/NO_OPERATION Instructions 3. Delete: If the retrieved facts contain information that contradicts the memory, delete it. When deleting, return the same IDs -do not generate new IDs. -Example: Old Memory: [ {"id" : "1", "text" : "Loves cheese pizza"} ] Retrieved facts: ["Dislikes cheese pizza"] New Memory: { "memory" : [ {"id" : "1", "text" : "Loves cheese pizza", "event" : "DELETE"} ] } 4. No Change: If the retrieved facts are already present, make no change. -Example: Old Memory: [ {"id" : "0", "text" : "Name is John"} ] Retrieved facts: ["Name is John"] New Memory: { "memory" : [ {"id" : "0", "text" : "Name is John", "event" : "NONE"} ] }
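Applying the manager's JSON output to the memory bank follows directly from the event semantics in the prompt above. This is a hedged sketch of that bookkeeping, assuming the bank is a simple id-to-text mapping; the function name and representation are ours.

```python
def apply_events(memory, events):
    """Apply Memory Manager decisions to a memory bank.

    `memory` maps id -> text; `events` is the "memory" list from the
    manager's JSON output, each item carrying an "event" field of
    ADD / UPDATE / DELETE / NONE as specified in the prompt.
    """
    for item in events:
        op, mid = item["event"], item["id"]
        if op == "ADD":
            memory[mid] = item["text"]       # new id generated by the manager
        elif op == "UPDATE":
            memory[mid] = item["text"]       # same id, revised text
        elif op == "DELETE":
            memory.pop(mid, None)            # same id, entry removed
        # NONE: leave the entry untouched
    return memory
```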

Algorithm 2 Data Construction for Answer Agent Training

Implementation Details

We fine-tune Memory-R1 on LLaMA-3.1-8B-Instruct and Qwen-2.5-3B, 7B, and 14B-Instruct models to evaluate robustness across architectures. Experiments are primarily conducted on 4 NVIDIA H100 GPUs (80GB each), except for Qwen-2.5-14B, which requires 8 GPUs. The total batch size is 128 with a micro-batch size of 2 per GPU. The maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively.

Prompts for memory operations and memory-augmented answer generation are adapted from Chhikara et al. (2025). Reinforcement learning fine-tuning is performed using PPO and GRPO within the VERL framework (Sheng et al., 2025). For PPO, actor and critic networks are jointly trained with learning rates of $1 \times 10^{-6}$ and $1 \times 10^{-5}$, respectively, using a constant warmup schedule. GRPO updates only the actor via grouped return normalization.

During RL training, we use a decoding temperature of $\tau = 1.0$ to encourage exploration and collect diverse reward signals, which helps stabilize policy learning. For validation and testing, greedy decoding ($\tau = 0$) is applied to ensure deterministic outputs and consistent metric evaluation.

Algorithms

The overall Memory-R1 pipeline contains two complementary procedures, outlined in Algorithm 3 and Algorithm 4. Algorithm 3 (Memory Bank Construction) governs how the system incrementally builds and refines the external memory bank as new dialogue turns arrive. For each dialogue input, an LLM extracts key information, retrieves semantically related entries from the memory bank via retrieval-augmented generation (RAG), and invokes the RL fine-tuned Memory Manager to classify the update action as one of {ADD, UPDATE, DELETE, NOOP}. Depending on the chosen action, the memory store is updated accordingly: either inserting a new entry, merging information into an existing one, pruning contradictory content, or leaving the memory unchanged.

Figure 11: Prompt and retrieved memories used in the case study, showing all instructions, context, and memory entries provided to the model.
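The memory bank construction loop can be sketched as below. This is a simplified illustration of Algorithm 3 under our own assumptions: `extract`, `retrieve`, and `manager` stand in for the LLM extractor, the RAG search, and the RL fine-tuned Memory Manager policy, and the integer-id bank representation is hypothetical.

```python
def construct_memory_bank(turns, extract, retrieve, manager):
    """One pass of memory bank construction (Algorithm 3, sketched).

    `extract` pulls key facts from a turn, `retrieve` returns related
    entries from the bank, and `manager` returns a decision
    (operation, entry_id, new_text) for each fact.
    """
    bank = {}
    next_id = 0
    for turn in turns:
        for fact in extract(turn):
            related = retrieve(fact, bank)
            op, mid, text = manager(fact, related)
            if op == "ADD":
                bank[str(next_id)] = text        # insert a new entry
                next_id += 1
            elif op == "UPDATE":
                bank[mid] = text                 # merge into an existing entry
            elif op == "DELETE":
                bank.pop(mid, None)              # prune contradictory content
            # NOOP: leave the bank unchanged
    return bank
```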

Algorithm 4 (Memory-augmented Answer Generation) describes how the system leverages the constructed memory bank to generate answers. Given an incoming question, the model retrieves the top-k relevant memory candidates, concatenates them with the question to form a memory-augmented prompt, and applies the Answer Agent's Memory Distillation policy to filter for the most relevant facts. The distilled memory context, along with the query, is then passed to the Answer Agent to produce the final response, which is added to the answer set. Together, these algorithms enable Memory-R1 to jointly manage memory and generate memory-augmented answers.
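The answering procedure reduces to a short retrieve-distill-answer pipeline, sketched here under stated assumptions: `retrieve_top_k`, `distill`, and `answer` are hypothetical stand-ins for the RAG search, the Answer Agent's Memory Distillation step, and its final generation.

```python
def answer_question(question, bank, retrieve_top_k, distill, answer, k=60):
    """Memory-augmented answering (Algorithm 4, sketched).

    Retrieve up to k candidate memories, filter them down to the most
    relevant entries, then generate the answer from the distilled context.
    """
    candidates = retrieve_top_k(question, bank, k)  # up to 60 in the paper
    distilled = distill(question, candidates)       # keep only relevant facts
    return answer(question, distilled)
```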

Training in Memory-R1 is performed in two stages, with the Memory Manager and Answer Agent optimized separately. When training the Memory Manager, the Answer Agent is frozen and used only to provide outcome-based rewards: the Manager's operations are reinforced if the resulting memory state improves the Answer Agent's ability to answer correctly. Conversely, when training the Answer Agent, the Memory Manager is fixed to ensure a stable memory input. Algorithm 5 illustrates this process for the Memory Manager, where dialogue turns are processed sequentially, candidate operations are sampled, the memory bank is updated, and policy gradients (via PPO or GRPO) are applied based on downstream answer correctness. This decoupled setup avoids attribution ambiguity while still allowing both components to co-adapt over alternating training phases.

LLM-as-a-Judge Prompt Template Your task is to label an answer to a question as 'CORRECT' or 'WRONG'. You will be given the following data: (1) a question (posed by one user to another user), (2) a 'gold' (ground truth) answer, (3) a generated answer, which you will score as CORRECT or WRONG. The point of the question is to ask about something one user should know about the other user based on their prior conversations. The gold answer will usually be a concise and short answer that includes the referenced topic, for example: Question: Do you remember what I got the last time I went to Hawaii? Gold answer: A shell necklace The generated answer might be longer, but you should be generous with your grading - as long as it touches on the same topic as the gold answer, it should be counted as CORRECT. For time-related questions, the gold answer will be a specific date, month, or year. The generated answer might include relative references (e.g., "last Tuesday"), but you should be generous - if it refers to the same time period as the gold answer, mark it CORRECT, even if the format differs (e.g., "May 7th" vs. "7 May"). Now it's time for the real question: Question: {question} Gold answer: {gold_answer} Generated answer: {generated_answer} First, provide a short (one sentence) explanation of your reasoning, then finish with CORRECT or WRONG. Do NOT include both CORRECT and WRONG in your response, or it will break the evaluation script. Return the label in JSON format with the key as "label".

Figure 12: LLM-as-a-Judge prompt used to evaluate model answers. The judge model labels each generated answer as CORRECT or WRONG based on comparison with the gold answer, with explicit instructions for handling time references and topic matching.

1 Ludwig Maximilian University of Munich, 2 Technical University of Munich, 3 University of Cambridge, 4 University of Hong Kong. s.yan@campus.lmu.de, cognitive.yunpu@gmail.com


Large Language Models (LLMs) have shown remarkable ability in understanding and generating natural language, making them central to recent advances in AI openai2024gpt4technicalreport; qwen2025qwen25technicalreport. Yet, they remain fundamentally stateless yu2025stateful; fan2025if; goodyear2025effect: each incoming query is processed independently of other interactions, because LLMs are constrained by a finite context window that prevents them from retaining and leveraging information across long conversations or evolving tasks wang2024adaptingllmsefficientcontext; fei2023extendingcontextwindowlarge.

One early and influential effort in addressing this statelessness is the Tensor Brain framework, which introduced a memory retrieval mechanism similar to retrieval-augmented generation (RAG) tresp2023tensor. In addition to retrieval, it highlights the cognitive importance of semantic and episodic memory, particularly in perception and reasoning. Building on this idea, subsequent work augments LLMs with external memory modules zhang2024survey. Most adopt the RAG paradigm pan2025memory; salama2025meminsightautonomousmemoryaugmentation, appending retrieved memory entries to the model’s input prompt. While this extends access to past information, it also creates a fundamental retrieval challenge: heuristics may return too few entries, omitting crucial context, or too many, flooding the model with irrelevant information and degrading performance liu2023lostmiddlelanguagemodels. In this paradigm, retrieved memories are passed to the LLM without meaningful filtering or prioritization, forcing the model to reason over both relevant and irrelevant content, which makes it prone to distraction by noise. Humans, by contrast, retrieve broadly but then filter selectively, integrating only the most useful pieces to maintain coherent, evolving knowledge.

Equally important is the challenge of memory management: deciding what to remember, update, or discard. Some systems packer2023memgpt; modarressi2024memllm; xiong2025memory adopt CRUD-style operations—create, read, update, and delete—which were originally introduced in database systems martin1983database; sulemani2021crud. A more recent work aios2024sdk augments this paradigm with a search operator, while Mem0 chhikara2025mem0 investigates the operator set {ADD, UPDATE, DELETE, NOOP}. We adopt this setting, as it provides a minimal yet expressive framework for modeling memory dynamics. However, existing approaches mainly rely on vanilla LLMs to choose operations from in-context instructions without any learning signal tied to correctness packer2023memgpt; chhikara2025mem0. Even simple cases can fail. Figure 1, a simplified example drawn from a LOCOMO conversation (maharana-etal-2024-evaluating), shows how a user first says “I adopted a dog named Buddy” and later adds “I adopted another dog named Scout”. A vanilla system misinterprets this as a contradiction, issuing DELETE+ADD and overwriting the original memory. A trained agent, however, consolidates with an UPDATE: “Andrew adopted two dogs, Buddy and Scout.” Appendix A.1 provides a real dialogue trace illustrating this case in practice.

These challenges of retrieving and managing memory remain largely unsolved. Supervised fine-tuning provides limited help because it is impractical to label every memory operation or retrieval decision. Reinforcement learning (RL), by contrast, has recently shown strong potential for aligning LLM behavior with high-level objectives, including tool use qian2025toolrlrewardtoollearning; wang2025actingreasoningmoreteaching, web navigation wei2025webagentr1trainingwebagents, and search optimization jin2025search; song2025r1. Building on this success, we argue that RL is the missing ingredient for adaptive memory in LLM agents. By optimizing outcome-based rewards, models can learn when to add, update, delete, or retain information and how to use retrieved memories for reasoning.

In this paper, we present Memory-R1, an RL fine-tuned, memory-augmented LLM framework with two specialized agents: (1) a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, and taking no operation, to maintain and evolve the memory bank, and (2) an Answer Agent that applies a Memory Distillation policy to filter memories retrieved via Retrieval‑Augmented Generation (RAG) and reason over the selected entries to produce answers. Both agents are fine-tuned using PPO schulman2017proximal or GRPO shao2024deepseekmath, achieving strong performance with as few as 152 question–answer pairs. On the LOCOMO benchmark (maharana-etal-2024-evaluating), Memory-R1 delivers substantial gains over the most competitive baseline, Mem0 chhikara2025mem0. Using the LLaMA-3.1-8B-Instruct backbone, Memory-R1-GRPO achieves relative improvements of 48% in F1, 69% in BLEU-1, and 37% in LLM-as-a-Judge. These improvements set a new state of the art on LOCOMO and underscore Memory‑R1’s ability to achieve large performance gains with minimal supervision, highlighting its efficiency.

Our contributions are summarized as follows:

We introduce Memory-R1, the first RL framework for memory-augmented LLMs, consisting of a Memory Manager to perform structured memory operations and an Answer Agent to filter and reason over memories retrieved via RAG.

We develop a data-efficient fine-tuning strategy using PPO and GRPO that enables Memory-R1 to achieve strong performance with as few as 152 question–answer pairs, demonstrating that large memory improvements can be achieved with minimal supervision.

We provide in-depth analysis of RL choices, model size, and memory design, offering actionable insights for building the next generation of memory-aware, reasoning-capable LLM agents.

LLMs have emerged as powerful general-purpose reasoners, capable of engaging in multi-turn dialogues, decomposing tasks into actionable steps, and leveraging prior context to guide decision making brown2020languagemodelsfewshotlearners; chowdhery2022palmscalinglanguagemodeling; openai2024gpt4technicalreport. Building on these capabilities, LLM-based agents have been proposed to solve complex tasks in interactive settings, such as conversational assistance thoppilan2022lamdalanguagemodelsdialog; ouyang2022traininglanguagemodelsfollow, tool-augmented reasoning schick2023toolformerlanguagemodelsteach; yao2023reactsynergizingreasoningacting, and autonomous decision making in structured or partially observable environments park2023generativeagentsinteractivesimulacra; wang2023voyageropenendedembodiedagent; shinn2023reflexion.

To address the limitations of fixed-length context windows and short-term memory, recent works have explored augmenting LLM agents with external memory modules. These memory-augmented agents aim to support long-horizon reasoning and persistent knowledge accumulation by selectively storing, retrieving, and updating relevant information. Notable approaches include LOCOMO maharana2024evaluating, a benchmark that evaluates agents’ ability to retrieve and reason over temporally distant conversational history; ReadAgent lee2024readagent, which incorporates retrieval mechanisms for memory-grounded dialogue; MemoryBank zhong2024memorybank, a compositional memory controller for lifelong agent memory; MemGPT packer2023memgpt, which introduces working and long-term memory buffers with memory scheduling policies; and A-Mem xu2025amemagenticmemoryllm, a hybrid architecture that combines dynamic memory access with reinforcement learning (RL). These systems represent key steps toward building LLM agents with scalable, interpretable, and persistent memory capabilities. For a comprehensive taxonomy of memory systems in AI agents, we refer readers to the recent survey du2025rethinkingmemoryaitaxonomy.

The intersection of LLMs and RL has received increasing attention as researchers seek to move beyond static supervised fine-tuning and enable models to learn from dynamic, interactive feedback. Reinforcement Learning from Human Feedback (RLHF) ouyang2022traininglanguagemodelsfollow is a foundational method used to align LLM outputs with human preferences. Recent works extend RL to more structured decision-making tasks for LLMs. For instance, Toolformer schick2023toolformerlanguagemodelsteach and ReAct-style agents yao2023reactsynergizingreasoningacting frame tool use as an RL problem, where the LLM learns when to query external tools or APIs. Search-R1 jin2025search trains LLMs to issue web search queries using RL to maximize final answer correctness. Similarly, the Trial and Error approach song2024trialerrorexplorationbasedtrajectory optimizes agents to select better reasoning paths. These approaches demonstrate that RL can improve complex behavior sequences in LLMs. However, memory management and utilization in LLMs remain underexplored in the RL setting. Existing memory-augmented LLM systems chhikara2025mem0; packer2023memgpt; lee2024readagent typically rely on heuristics to control memory operations, lacking adaptability and long-term optimization. Our work, Memory-R1, is among the first to frame memory operation selection and the utilization of relevant memories as an RL problem.

In this section, we describe the Memory‑R1 approach for multi‑session dialogue tasks, where each dialogue contains multiple sessions (separate interactions occurring at different times) and each session consists of several turns (a back‑and‑forth exchange between two users). Answering a question often requires synthesizing information spread across these sessions, posing a strong challenge for long‑horizon memory management and reasoning. We first present the overall pipeline (Section 3.1), followed by RL fine‑tuning of the Memory Manager (Section 3.2) and the Answer Agent (Section 3.3).

Figure 2 outlines the Memory-R1 pipeline. At each dialogue turn, an LLM identifies information worth remembering and summarizes it into a concise fact. It then triggers a retrieval step using Retrieval-Augmented Generation (RAG) to locate related entries in the memory bank. The Memory Manager then decides whether to ADD, UPDATE, DELETE, or NOOP, maintaining and evolving the memory state.

For question answering, the user’s query retrieves up to 60 candidate memories, following the setup of chhikara2025mem0. The Answer Agent applies a Memory Distillation policy to filter the most relevant entries and generate an answer. Both agents are fine-tuned with PPO and GRPO, enabling outcome-driven learning of memory operations and selective utilization.

The Memory Manager maintains and updates the memory bank by selecting the appropriate operation from {ADD, UPDATE, DELETE, NOOP} for each new piece of information extracted from a dialogue. It outputs both the operation and the updated memory content $m'$.

Training uses (i) a partially constructed memory bank and (ii) a new dialogue turn containing information relevant to downstream question answering. The Memory Manager learns which operation produces a memory state that enables the Answer Agent to answer correctly. Training data is drawn from the LOCOMO dataset to ensure realistic multi-session dialogue settings (details in Appendix B).

As shown in Algorithm 1, the Memory Manager is formalized as a policy $\pi_\theta$ that takes the extracted information $x$ and retrieved memories $\mathcal{M}_{\text{old}}$ as input, and outputs an operation $o \in \{\text{ADD}, \text{UPDATE}, \text{DELETE}, \text{NOOP}\}$ with associated content $m'$:

$$(o, m') = \pi_\theta(x, \mathcal{M}_{\text{old}}),$$

where $x$ is the extracted information and $\mathcal{M}_{\text{old}}$ the current memory bank.

We fine-tune the Memory Manager using Proximal Policy Optimization (PPO, schulman2017proximal). Given a candidate memory $x$ and an existing memory bank $\mathcal{M}_{\text{old}}$, the Memory Manager outputs a memory operation $o$ along with content $m'$. These actions are sampled from the current policy $\pi_\theta$ and applied to update the memory bank, which is then passed to the frozen Answer Agent. The correctness of the answer provides the reward signal. PPO optimizes a clipped surrogate objective to stabilize training:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(\rho_\theta A,\ \operatorname{clip}(\rho_\theta, 1-\epsilon, 1+\epsilon)\, A\right)\right],$$

where $\rho_\theta = \frac{\pi_\theta(o, m' \mid x, \mathcal{M}_{\text{old}})}{\pi_{\text{old}}(o, m' \mid x, \mathcal{M}_{\text{old}})}$ is the importance ratio between the new and old policies, and $A$ is the advantage computed from the improvement in the Answer Agent's final answer accuracy. The clipping range $[1-\epsilon,\ 1+\epsilon]$ constrains policy updates and ensures stable learning.

We also apply Group Relative Policy Optimization (GRPO, shao2024deepseekmath) as an alternative to PPO for training the Memory Manager. GRPO samples a group of $G$ candidate actions and assigns relative advantages within the group, eliminating the need for a learned value function while preserving PPO-style stability.

Given a state $s = (x, \mathcal{M}_{\text{old}})$, the old policy $\pi_{\text{old}}$ generates $G$ candidate actions. The GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_\theta^{(i)} A^{(i)},\ \operatorname{clip}(\rho_\theta^{(i)}, 1-\epsilon, 1+\epsilon)\, A^{(i)}\right)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right],$$

where $\rho_\theta^{(i)}$ follows the same importance ratio definition as in PPO, now applied per sampled action. The group-relative advantage is computed by standardizing the QA rewards of all sampled actions:

$$A^{(i)} = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.$$

The KL term $\mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$ constrains updates, preventing drift from the reference policy.
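The group-relative standardization of QA rewards amounts to a z-score within each sampled group. A minimal sketch (the epsilon guard against zero-variance groups is our own assumption):

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize a group's QA rewards into GRPO advantages.

    Each advantage is (reward - group mean) / group std, so actions
    better than their group's average get positive advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```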

We adopt an outcome-driven reward design for the Memory Manager. At each step, the Memory Manager selects an operation $o$ and proposes new content $m'$ for the memory bank. Rather than scoring individual edits, we evaluate the updated memory bank by its impact on downstream question answering: the Answer Agent can only produce the correct answer if the Memory Manager's operations were effective. The reward is defined as:

$$R = \begin{cases} 1 & \text{if } y_{\text{pred}} = y_{\text{gold}}, \\ 0 & \text{otherwise}, \end{cases}$$

where $y_{\text{pred}}$ is the Answer Agent's prediction after the update and $y_{\text{gold}}$ is the ground-truth answer. This outcome-based signal eliminates the need for manual memory relevance labels and keeps supervision simple and scalable. Freezing the Answer Agent during training avoids attribution ambiguity, and despite the indirect signal, exact-match rewards alone are sufficient to teach the Memory Manager effective memory operations.

The Answer Agent leverages a memory bank curated by the Memory Manager to handle questions in long-horizon, multi-session dialogues. Following chhikara2025mem0, for each question, 60 candidate memories are retrieved via a similarity-based RAG search using the question as the query. The agent then performs memory distillation: it filters this retrieved memory set to surface the most relevant entries and generates the final answer conditioned on these distilled memories:

$$y = \pi_\phi(q, \mathcal{M}_{\text{ret}}),$$

where $q$ is the current question and $\mathcal{M}_{\text{ret}}$ is the subset of the memory bank retrieved by RAG.

We fine-tune the Answer Agent using the same PPO framework as in Section 3.1. While the optimization is identical, the task differs: the Answer Agent takes the current question $q$ and retrieved memories $\mathcal{M}_{\text{ret}}$ and generates an answer $y$ token by token.

The PPO objective mirrors Equation (2), applied over the generated answer sequence. The importance ratio is:

$$\rho_\theta = \frac{\pi_\theta(y \mid q, \mathcal{M}_{\text{ret}})}{\pi_{\text{old}}(y \mid q, \mathcal{M}_{\text{ret}})},$$

where $y$ is the full generated answer. Advantages are computed from final answer quality (e.g., exact match with the reference), and clipping keeps updates within a trust region for stability.

We also fine-tune the Answer Agent with GRPO, following the formulation in Section 3.1. For each input $(q, \mathcal{M}_{\text{ret}})$, the policy samples $G$ candidate answers $\{y_i\}_{i=1}^{G}$. Their Exact Match scores against the ground-truth answer $y_{\text{gt}}$ are normalized into group-relative advantages. GRPO reuses the same importance ratio definition as PPO, applied per candidate answer. By comparing candidates within each group, GRPO provides stable gradient signals without a learned value function, improving sample efficiency and robustness during RL fine-tuning.

We adopt a simple yet effective reward function for the Answer Agent, using the Exact Match (EM) score between the generated answer and the ground-truth answer as the primary reward signal:

$$R_{\text{answer}} = \text{EM}(y_{\text{pred}}, y_{\text{gold}}) = \mathbb{1}\left[y_{\text{pred}} = y_{\text{gold}}\right].$$

This design directly ties the reward to the correctness of the final answer, encouraging the agent to select and reason over memories in a way that yields accurate outputs rather than optimizing for intermediate steps.
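A binary EM reward of this kind can be sketched as follows. The lowercasing, punctuation stripping, and whitespace normalization are our own assumptions; the paper does not specify its exact string preprocessing.

```python
import string


def exact_match_reward(pred: str, gold: str) -> int:
    """Binary Exact Match reward: 1 if the normalized strings agree, else 0.

    Normalization (lowercase, strip punctuation, collapse whitespace) is
    an illustrative assumption, not the paper's released implementation.
    """
    def norm(s: str) -> str:
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())

    return int(norm(pred) == norm(gold))
```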

We evaluate our approach on the LOCOMO benchmark maharana2024evaluating. LOCOMO consists of two components: (i) multi-turn dialogues (about 600 turns per dialogue, averaging 26,000 tokens) and (ii) question–answer (QA) pairs grounded in those dialogues. Each dialogue has roughly 200 questions spanning single‑hop, multi‑hop, open‑domain, and temporal reasoning. We exclude the adversarial subset because it lacks ground‑truth answers. LOCOMO is widely used in memory‑augmented LLM research chhikara2025mem0; xu2025amemagenticmemoryllm, as many questions require drawing information from distant or fragmented dialogue turns—making it an ideal testbed for evaluating memory management and utilization. The dataset comprises 10 multi-turn dialogues. We adopt a 1:1:8 train/validation/test split: the first dialogue with 152 questions for training, the second dialogue with 81 questions for validation, and eight dialogues with 1,307 questions for testing, providing broad evaluation coverage while keeping training supervision minimal. A detailed description of the data construction process is provided in Appendix B. All experiments are conducted with LLaMA-3.1‑8B‑Instruct and Qwen2.5‑7B‑Instruct, ensuring consistency and comparability across evaluations.

We assess performance using three complementary metrics: Token‑level F1 Score (F1), measuring overlap between predicted and ground‑truth answers; BLEU‑1 (B1), capturing unigram‑level lexical similarity; and LLM‑as‑a‑Judge (J), which employs a separate LLM to evaluate factual accuracy, relevance, completeness, and contextual appropriateness. While F1 and B1 provide straightforward string‑based scores, J captures deeper semantic correctness and offers a more human‑aligned assessment. The implementation details and script for LLM‑as‑a‑Judge are provided in Appendix C.
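Token-level F1 counts the token overlap between prediction and reference. A minimal sketch, assuming whitespace tokenization and lowercasing (the benchmark's exact tokenizer may differ):

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a ground-truth answer."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared token counts
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```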

To evaluate the effectiveness of Memory-R1, we compare it against several established baselines for multi-session dialogue reasoning: (1) LOCOMO maharana2024evaluating, the benchmark framework designed to assess LLMs’ ability to retrieve and reason over information from long-range, multi-session conversations; (2) Zep rasmussen2025zep, a retrieval-based agent that introduces structured memory access strategies for complex, temporally extended queries; (3) A-Mem xu2025amemagenticmemoryllm, a dynamic agentic memory system that creates, links, and updates structured memories to enhance reasoning across sessions; (4) LangMem langmem2024, an open-source memory framework that links memory chains across sessions to support long-context reasoning; (5) Mem0 chhikara2025mem0, a modular memory system with explicit in-context memory operations designed for scalable deployment.

For a fair comparison, we re‑implemented all baselines using both the LLaMA‑3.1‑8B‑Instruct and Qwen‑2.5‑7B‑Instruct models as backbones, with temperature set to 0 and a maximum token limit of 2048. This consistent setup ensures reproducibility and allows us to assess how each method performs across different model architectures.

We fine-tune Memory-R1 using both the LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct models to evaluate the framework’s robustness across different architectures. The prompts for memory operations and memory-augmented answer generation are adapted from chhikara2025mem0, with details provided in Appendix C. For PPO experiments, we train both actor and critic networks, while for GRPO runs, only the actor is trained, since GRPO uses grouped return normalization in place of a critic. Both PPO and GRPO training are implemented using the VERL framework sheng2025hybridflow. Experiments are performed on 4 NVIDIA H100 GPUs (80GB each) with a total batch size of 128 and a micro-batch size of 2 per GPU. The maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively. The actor and critic (for PPO) are optimized with learning rates of 1×10⁻⁶ and 1×10⁻⁵, respectively, using a constant warmup schedule. During RL training, we use a decoding temperature of τ = 1.0 to encourage exploration and capture a wider range of reward signals, which helps stabilize policy learning. For validation and testing, we switch to greedy decoding (τ = 0) to eliminate randomness and ensure consistent metric tracking. Each configuration and model variant is evaluated three times, and we report the mean score to reduce variance and provide a more reliable estimate of performance.

Table 1 reports the performance of Memory-R1 across LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct models on the LOCOMO benchmark, covering diverse question types including Single Hop, Multi-Hop, Open Domain, and Temporal reasoning. We evaluate two variants of Memory-R1: one fine-tuned with PPO and another with GRPO, and benchmark them against leading memory-augmented baselines, including LOCOMO, LangMem, A-Mem, and Mem0.

Across both model families, Memory-R1 consistently sets a new state of the art. On LLaMA-3.1-8B, Memory-R1-GRPO delivers the strongest overall performance, improving F1 by 48.3%, B1 by 68.9%, and J by 37.1% over the strongest baseline Mem0. Similarly, Memory-R1-PPO also yields substantial improvements, raising overall F1, B1, and J scores by 35.3%, 47.9%, and 26.5%, respectively.

Notably, the benefits of Memory-R1 transfer robustly to a different backbone. When applied to Qwen-2.5-7B-Instruct, Memory-R1-GRPO again emerges as the top performer, surpassing Mem0 by margins of 41.5% (F1), 57.3% (B1), and 33.8% (J). PPO remains competitive, delivering strong gains over all non-RL baselines.

These consistent improvements across two distinct LLM backbones demonstrate that reinforcement learning (RL) is highly effective for teaching LLMs to manage and leverage long-term memory, independent of model architecture. Moreover, the gains are broad-based, spanning single-hop, multi-hop, open-domain, and temporal questions, highlighting Memory-R1 as a generalizable framework for building adaptive, memory-augmented LLMs capable of long-horizon reasoning.

We conduct ablation studies to examine the contribution of each component in Memory‑R1, isolating the effects of the Memory Manager, the Answer Agent, and the Memory Distillation mechanism. We also compare the training dynamics of PPO and GRPO.

To assess the benefit of our RL‑fine‑tuned Memory Manager, we compare it against the in‑context memory manager, with both variants built on the LLaMA‑3.1‑8B‑Instruct base model. As shown in Table 2, RL training delivers consistent gains: PPO improves overall F1 to 32.55, B1 to 24.60, and J to 59.37; GRPO further improves F1 to 33.05, B1 to 24.91, and J to 59.91. These results confirm that outcome‑based RL enables the Memory Manager to execute more accurate and effective operations, reinforcing our central claim that memory control should be learned rather than scripted.

As shown in Table 3, RL significantly improves the performance of the vanilla LLM when used as the Answer Agent (baseline: F1 26.73, B1 20.54, J 47.82). PPO fine‑tuning raises scores to F1 41.05, B1 32.91, and J 57.54, while GRPO delivers even greater gains, reaching F1 45.02, B1 37.51, and J 62.74. These results demonstrate that reward‑driven fine‑tuning substantially elevates answer quality beyond what static retrieval can achieve. A detailed case study comparing the RL‑fine‑tuned Answer Agent to the vanilla LLM is provided in Appendix A.2.

We evaluate the impact of the proposed memory distillation mechanism by comparing the GRPO‑fine-tuned Answer Agent with and without distillation (Table 4). Without distillation, the agent consumes all retrieved memories; with distillation, it filters for the most relevant ones before answering. Memory distillation consistently improves performance: F1 rises from 40.95 to 45.02, BLEU‑1 from 34.37 to 37.51, and LLM‑as‑a‑Judge from 60.14 to 62.74. These gains show that filtering out irrelevant entries reduces noise and enables the agent to reason more effectively over high‑quality, distilled context.

We examine whether the Answer Agent’s improvements from RL fine‑tuning and Memory Distillation depend on the quality of the Memory Manager. Figure 3 compares base, PPO‑fine-tuned, and GRPO‑fine-tuned Answer Agents paired with either a LLaMA-3.1-8B‑Instruct Memory Manager (Figure 3a) or a stronger GPT‑4o‑mini Memory Manager (Figure 3b). Our RL fine‑tuned Answer Agent yields larger gains with the stronger manager: F1 improves by 10.10 points with the LLaMA Memory Manager vs. 19.72 with the GPT‑4o‑mini Memory Manager; BLEU‑1 by 10.81 vs. 18.19; and LLM‑as‑a‑Judge by 5.05 vs. 15.76. This demonstrates that Memory‑R1’s benefits compound with a stronger upstream Memory Manager, showing the Answer Agent scales effectively with memory quality.

We compare PPO and GRPO for training the Answer Agent, using exact match against ground‑truth answers as the reward signal. As shown in Figure 4, GRPO exhibits faster initial convergence, likely due to its grouped return normalization providing stronger early guidance. However, as training progresses, both methods steadily improve and ultimately reach comparable final reward levels.

We introduced Memory‑R1, a reinforcement learning (RL) framework that enables large language model agents to manage and utilize external memory effectively. By training two specialized agents, one for memory operations and another for selective reasoning over retrieved memories, our approach overcomes the limitations of static, heuristic‑driven memory pipelines. On the LOCOMO benchmark, Memory‑R1 outperforms strong baselines across F1, BLEU, and LLM‑as‑a‑Judge scores, achieving these gains with as few as 152 training examples. Ablation studies further confirm that RL enhances every component of the system. Memory‑R1 opens new directions for research on compositional memory architectures, long‑term knowledge retention, and more agentic LLM behavior, underscoring RL as a path to adaptive memory capabilities.

To demonstrate how RL fine-tuning improves memory operations, we present two representative real-world examples. In the first case, the user initially mentions adopting a dog named Buddy, and later states that they have adopted another dog named Scout.

Dialogue Excerpts:

• Audrey: That’s great news! What’s the pup’s name? • Andrew: It took us a while to decide, but we ended up going with ’Scout’ for our pup - it seemed perfect for their adventurous spirit.

Vanilla Memory Manager Actions (failed):

Operations Issued: • DELETE: "Andrew adopted a new dog from a shelter and named him Buddy because he is his buddy." • DELETE: "Andrew enjoys taking walks with his newly adopted dog, Buddy, as a way to spend quality time together and get fresh air." • ADD: "Andrew adopted a new dog and named the pup ’Scout’ because the name seemed perfect for their adventurous spirit."

RL Fine-Tuned Memory Manager Actions (succeeded):

Operations Issued: • UPDATE: "Andrew adopted a new dog from a shelter and named him Buddy because he is his buddy." -> "Andrew adopted a dog from a shelter and named him Buddy because he is his buddy, and later adopted another dog named Scout."

Discussion: The vanilla LLM-based memory manager misinterpreted the second adoption as a contradiction because it saw two separate statements: “adopted a dog named Buddy” and “adopted another dog named Scout”, and assumed that the differing dog names implied the new statement replaced the old one. As a result, it deleted the first dog (“Buddy”) and added the second (“Scout”), leaving the memory fragmented and inaccurate. In contrast, the RL fine-tuned Memory Manager recognized the events as complementary and consolidated them with a single UPDATE, maintaining a coherent memory bank and enabling the system to answer correctly. This case illustrates how reinforcement learning enhances the Agent to understand the nuance between ADD, UPDATE, and DELETE, allowing the model to handle overlapping or related information more intelligently.

To further demonstrate how RL fine-tuning improves memory operations, we present another real-world example from a user-agent dialogue involving Joanna’s feelings about pets and her allergies, particularly toward turtles.

• Joanna: I wish I wasn’t allergic! I would get two turtles today if I could! I found out recently I’m allergic to cockroaches as well, so who knows if I’ll ever get a pet. • Nate: Sorry! Maybe there are other animals you could consider! In the meantime though, I’ll be sure to send you pics of my turtles so you can still watch them grow without getting too close.

Vanilla Memory Manager Actions (failed):

Operations Issued: • DELETE: "Joanna is allergic to most reptiles and animals with fur, which causes her face to get puffy and itchy. Despite this, she finds other ways to be happy." • DELETE: "Joanna finds turtles peaceful and believes they bring calm and joy." • DELETE: "Joanna thinks having pets must be a wonderful experience." • ADD: "Joanna is allergic to turtles and cockroaches, which affects her ability to have pets."

Discussion: The vanilla Memory Manager misinterpreted the new allergy information in two aspects. First, it viewed the difference between “allergic to most reptiles and animals with fur” and “allergic to turtles and cockroaches” as a contradiction in allergy type—failing to recognize the latter as a more specific elaboration rather than a conflicting statement. Second, it interpreted Joanna’s expressed fondness for turtles as incompatible with her allergy to them, incorrectly assuming that emotional attachment and physical limitations cannot coexist. As a result, it issued a series of DELETE operations followed by a single ADD, discarding valuable emotional context such as Joanna’s admiration for turtles and her general enthusiasm toward pets. In contrast, the RL fine-tuned Memory Manager recognized that these pieces of information were complementary: Joanna likes turtles but cannot keep them due to her allergies. It updated the relevant memories accordingly using targeted UPDATE operations, preserving both factual accuracy and emotional nuance. This case demonstrates how reinforcement learning equips the model to reason about overlapping and evolving information more intelligently, favoring memory consolidation over fragmentation.

To illustrate how the RL fine-tuned Answer Agent with Memory Distillation improves answer accuracy, we compare the original model’s output with that of the RL fine-tuned model on a representative example from LOCOMO. The prompt provided to the model is shown in Figure 7.

Question: Does John live close to a beach or the mountains?

Memory-R1 Output (After Fine-Tuning):

Memories selected as relevant: • 8:30 pm on 1 January, 2023: John has a nostalgic memory of having a film camera as a kid and taking many pictures at the beach. • 1:24 pm on 25 May, 2023: John shared a picture of his family at the beach, expressing a sense of making a difference and commitment to continue their efforts. Answer: beach

Discussion: The original model consumed all retrieved memories indiscriminately and defaulted to “mountains,” likely influenced by irrelevant mentions of mountaineering. In contrast, Memory-R1 filtered out distractors, surfaced only beach-related memories, and generated the correct answer. This case highlights how Memory Distillation helps the model discard noise, focus on true signals, and improve factual accuracy.

To train Memory-R1, we construct separate training data for the Memory Manager and the Answer Agent from the LOCOMO multi-turn dialogues.

Memory Manager Training Data. For every dialogue turn t, GPT‑4o‑mini builds a temporal memory bank from the preceding 50 turns. The current turn t is then fused with this snapshot, and we annotate the correct memory operation (ADD, UPDATE, DELETE, or NOOP). Each annotated tuple (turn, temporal memory bank, QA) serves as a supervised signal for the Memory Manager to learn how to incrementally update the memory state. The details can be found in Algorithm 2.

Answer Agent Training Data. For each question q in LOCOMO, we retrieve 60 candidate memories via retrieval‑augmented generation (RAG) over the temporal memory bank. The retrieved set, paired with the question and its gold answer, becomes the training input for the Answer Agent, which learns to distill the relevant entries and generate a concise, correct response.
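As a rough sketch of this retrieval step, the snippet below scores memories against a query using bag-of-words cosine similarity. This is a hypothetical stand-in for the actual embedding-based retriever, and the function names (`cosine`, `top_k`) are ours.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def top_k(query: str, memories: list[str], k: int = 60) -> list[str]:
    """Return the k memories most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(memories, key=lambda m: cosine(q, Counter(m.lower().split())), reverse=True)
    return ranked[:k]
```

In the actual pipeline, the 60 retrieved candidates (30 per participant, see Algorithm 2 in the Appendix) are paired with the question and gold answer to form one training tuple.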

In developing our Memory Manager prompt, answer generation agent prompt, and LLM-as-a-Judge prompt, we adapt elements from the prompts released by prior work packer2023memgpt; chhikara2025mem0.

For training the Memory Manager, we use a detailed prompt that instructs the model how to perform four memory operations: ADD, UPDATE, DELETE, and NOOP. The full prompt spans multiple figures for readability.

We provide the full prompt used to instruct the Answer Agent in our case study. This prompt defines the reasoning process, memory selection criteria, and formatting requirements for the model’s responses. Figure 7 shows the complete instructions, context, and representative retrieved memories.

The overall Memory-R1 pipeline contains two complementary procedures, outlined in Algorithm 4 and Algorithm 5. Algorithm 4 (Memory Bank Construction) governs how the system incrementally builds and refines the external memory bank as new dialogue turns arrive. For each dialogue input, an LLM extracts key information, retrieves semantically related entries from the memory bank via retrieval-augmented generation (RAG), and invokes the RL fine-tuned Memory Manager to classify the update action as one of {ADD, UPDATE, DELETE, NOOP}. Depending on the chosen action, the memory store is updated accordingly—either inserting a new entry, merging information into an existing one, pruning contradictory content, or leaving the memory unchanged.
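The operation dispatch described above can be sketched as follows; the string-join merge is only a placeholder for the LLM-generated consolidation in the real UPDATE step, and the function name is ours.

```python
def apply_operation(memory: set[str], op: str, fact: str, related: set[str]) -> set[str]:
    """Apply one Memory Manager action {ADD, UPDATE, DELETE, NOOP} to the memory bank."""
    if op == "ADD":        # insert the newly extracted fact
        return memory | {fact}
    if op == "UPDATE":     # merge the fact into the related entries (placeholder merge)
        merged = "; ".join(sorted(related) + [fact])
        return (memory - related) | {merged}
    if op == "DELETE":     # prune the contradicted entries
        return memory - related
    return memory          # NOOP: leave the bank unchanged
```

Note that UPDATE both removes the superseded entries and inserts the consolidated one, which is what distinguishes it from a DELETE followed by an unrelated ADD.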

Algorithm 5 (Memory-augmented Answer Generation) describes how the system leverages the constructed memory bank to generate answers. Given an incoming question, the model retrieves the top‑k relevant memory candidates, concatenates them with the question to form a memory-augmented prompt, and applies the Answer Agent’s Memory Distillation policy to filter for the most relevant facts. The distilled memory context, along with the query, is then passed to the Answer Agent to produce the final response, which is added to the answer set. Together, these algorithms enable Memory-R1 to jointly manage memory and generate memory-augmented answers.
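The prompt assembly in this stage amounts to simple concatenation; a sketch with our own helper name (the distillation and answering themselves are performed by the Answer Agent LLM):

```python
def build_prompt(instruction: str, question: str, memories: list[str]) -> str:
    """Concatenate instruction, retrieved memories, and question into one prompt."""
    lines = [instruction, "Retrieved memories:"]
    lines += [f"- {m}" for m in memories]       # one bullet per retrieved memory
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

The Answer Agent receives this prompt, emits the distilled memory subset, and then generates the final answer conditioned on it.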

Table: S4.T1: Evaluation results of Memory-R1 and baselines across LLaMA‑3.1‑8B‑Instruct and Qwen‑2.5‑7B‑Instruct on the LOCOMO benchmark dataset. Models are evaluated on F1, BLEU‑1 (B1), and LLM‑as‑a‑Judge (J) across Single Hop, Multi‑Hop, Open Domain, and Temporal questions. Higher is better. The best results are marked in Bold.

Model | Method | Single Hop (F1/B1/J) | Multi-Hop (F1/B1/J) | Open Domain (F1/B1/J) | Temporal (F1/B1/J) | Overall (F1/B1/J)
LLaMA-3.1-8B-Instruct | LOCOMO | 12.25/9.77/13.81 | 13.69/10.96/20.48 | 11.59/8.30/15.96 | 9.38/8.15/4.65 | 11.41/8.71/13.62
LLaMA-3.1-8B-Instruct | Zep | 30.15/17.15/52.38 | 15.04/11.56/33.33 | 26.67/18.44/45.36 | 3.49/2.68/27.58 | 22.60/15.05/42.80
LLaMA-3.1-8B-Instruct | A-Mem | 21.62/16.93/44.76 | 13.82/11.45/34.93 | 34.67/29.13/49.38 | 25.77/22.14/36.43 | 29.20/24.40/44.76
LLaMA-3.1-8B-Instruct | LangMem | 22.40/15.21/47.26 | 18.65/16.03/39.81 | 31.62/23.85/48.38 | 27.75/21.53/30.94 | 28.34/21.31/44.18
LLaMA-3.1-8B-Instruct | Mem0 | 27.29/18.63/43.93 | 18.59/13.86/37.35 | 34.03/24.77/52.27 | 26.90/21.06/31.40 | 30.41/22.22/45.68
LLaMA-3.1-8B-Instruct | Memory-R1-PPO | 32.52/24.47/53.56 | 26.86/23.47/42.17 | 45.30/39.18/64.10 | 41.57/26.11/47.67 | 41.05/32.91/57.54
LLaMA-3.1-8B-Instruct | Memory-R1-GRPO | 35.73/27.70/59.83 | 35.65/30.77/53.01 | 47.42/41.24/68.78 | 49.86/38.27/51.55 | 45.02/37.51/62.74
Qwen-2.5-7B-Instruct | LOCOMO | 9.57/7.00/15.06 | 11.84/10.02/19.28 | 8.67/6.52/12.79 | 8.35/8.74/5.43 | 8.97/7.27/12.17
Qwen-2.5-7B-Instruct | Zep | 31.02/21.39/42.85 | 20.42/15.76/23.81 | 25.25/21.34/42.26 | 8.94/8.42/29.31 | 23.22/18.78/38.99
Qwen-2.5-7B-Instruct | A-Mem | 18.96/12.86/40.78 | 14.73/12.66/31.32 | 30.58/26.14/46.90 | 23.67/20.67/28.68 | 26.08/21.78/40.78
Qwen-2.5-7B-Instruct | LangMem | 22.84/16.98/43.64 | 18.98/16.89/44.38 | 32.47/25.98/50.45 | 26.62/20.93/23.08 | 28.69/22.76/43.42
Qwen-2.5-7B-Instruct | Mem0 | 24.96/18.05/61.92 | 20.31/15.82/48.19 | 32.74/25.27/65.20 | 33.16/26.28/38.76 | 30.61/23.55/53.30
Qwen-2.5-7B-Instruct | Memory-R1-PPO | 34.22/23.61/57.74 | 32.87/29.48/53.01 | 44.78/38.72/66.99 | 42.88/30.30/42.25 | 41.72/33.70/59.53
Qwen-2.5-7B-Instruct | Memory-R1-GRPO | 33.64/26.06/62.34 | 23.55/20.71/40.96 | 46.86/40.92/67.81 | 47.75/38.49/49.61 | 43.14/36.44/61.51

Table: S4.T2: Performance of the LLaMA‑3.1‑8B‑Instruct model and its PPO- and GRPO-fine‑tuned variants as the Memory Manager, with the answer agent fixed to LLaMA‑3.1‑8B‑Instruct.

Method | F1↑ | B1↑ | J↑
LLaMA3.1-8B | 26.73 | 20.54 | 47.82
LLaMA3.1-8B + PPO | 32.55 | 24.60 | 59.37
LLaMA3.1-8B + GRPO | 33.05 | 24.91 | 59.91

Table: S4.T4: Performance comparison of GRPO fine-tuned Answer Agent with and without Memory Distillation policy using the LLaMA‑3.1‑8B‑Instruct model.

Method | F1↑ | B1↑ | J↑
GRPO w/o Memory Distillation | 40.95 | 34.37 | 60.14
GRPO w/ Memory Distillation | 45.02 | 37.51 | 62.74

Comparison of Memory‑R1 and a vanilla LLM memory system. (Left) In a multi‑session dialogue, the user mentions the adoption of two dogs in separate sessions. (Middle) The vanilla Memory Manager misinterprets the second adoption as a contradiction and issues DELETE+ADD, fragmenting memory. (Right) The RL‑trained Memory Manager issues a single UPDATE to consolidate the memory, and the Answer Agent applies Memory Distillation: from 60 memories retrieved via RAG (e.g., <Memory 1> “Andrew adopted 2 dogs named Buddy and Scout”; <Memory 2> “Andrew feels a bit jealous of Audrey’s dogs”; etc.), the Answer Agent first filters for the memories that are truly useful for answering the question (here, <Memory 1>), then reasons over the selected entry to produce the correct answer (“2 dogs”).

Overview of the Memory-R1 framework. Stage 1 (blue) constructs and updates the memory bank via the RL‑fine‑tuned Memory Manager, which chooses operations {ADD, UPDATE, DELETE, NOOP} for each new dialogue turn. Stage 2 (green) answers user questions via the Answer Agent, which applies a Memory Distillation policy to reason over retrieved memories.

Performance gains of Answer Agent variants (Base/PPO fine-tuned/GRPO fine-tuned) when paired with different Memory Managers: (a) LLaMA 3.1-8B‑Instruct and (b) the stronger GPT‑4o‑mini.

Training reward curves for PPO and GRPO on the Answer Agent using exact match as the reward. GRPO converges faster initially, and both reach similar final rewards.

$$ (o,m^{\prime})\sim\pi_{\theta}(\cdot\mid x,\mathcal{M}_{\text{old}}) $$ \tag{S3.E1}

$$ \mathcal{J}(\theta)=\mathbb{E}\left[\min\left(\rho_{\theta}A,\;\text{clip}(\rho_{\theta},1-\epsilon,1+\epsilon)A\right)\right], $$ \tag{S3.E2}
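Per sample, the clipped surrogate in Eq. (S3.E2) reduces to a min over the unclipped and clipped terms. A minimal sketch, with epsilon defaulting to the common value 0.2 (the exact value is not stated here) and a function name of our own:

```python
def ppo_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(rho, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)
```

For A > 0 the objective caps the gain from pushing the ratio above 1 + eps; for A < 0 it caps the benefit of pushing it below 1 - eps, keeping policy updates conservative.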

$$ A_{i}=\frac{r_{i}-\text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})},\qquad\mathbf{r}=\{r_{1},\ldots,r_{G}\}. $$ \tag{S3.E4}
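Eq. (S3.E4) normalizes each rollout's reward by the statistics of its group, which is what lets GRPO dispense with a critic. A dependency-free sketch (function name ours; the zero-std guard is our assumption):

```python
import math

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantages: (r_i - mean(r)) / std(r), with a zero-std guard."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / std if std > 0 else 0.0 for r in rewards]
```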

$$ R_{answer}=\mathrm{EM}(y_{\text{pred}},y_{\text{gold}}) $$ \tag{S3.E5}
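With binary exact match as the reward, the reward function is essentially a one-liner; the whitespace and case normalization shown here is our assumption, not a detail stated in the paper.

```python
def em_reward(pred: str, gold: str) -> float:
    """1.0 if the prediction exactly matches the gold answer after light normalization."""
    norm = lambda s: " ".join(s.lower().split())  # lowercase, collapse whitespace
    return 1.0 if norm(pred) == norm(gold) else 0.0
```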

Extended Results and Type-Level Analysis

Tables 3 and 4 provide detailed type-level evaluation on the LoCoMo and LongMemEval benchmarks. On LoCoMo (Table 3), Memory-R1 achieves consistent improvements across all reasoning types, with the largest gains on multi-hop and temporal questions, confirming its ability to maintain and integrate long-range information.

On LongMemEval (Table 4), improvements are most pronounced in multi-session scenarios where continuity across temporally distant interactions is critical. Memory-R1 shows substantial gains on tasks requiring factual recall (SSU) and temporal reasoning (TR), while also yielding steady improvements in knowledge update (KU) and open-domain QA. Across reasoning types, GRPO generally outperforms PPO, particularly in scenarios involving reasoning over multiple or noisy memory entries.

In addition to type-level analysis, Table 5 reports overall performance on the LongMemEval benchmark, including all baseline methods as well as Memory-R1 variants. Importantly, Memory-R1 is fine-tuned only on the LoCoMo dataset and evaluated on LongMemEval without any additional training. Despite this zero-shot transfer setting, Memory-R1-GRPO outperforms all baseline systems across both LLaMA-3.1-8B and Qwen-2.5-7B backbones. Together, these results complement the main findings in Section 4, further reinforcing that Memory-R1 generalizes robustly across reasoning types, model families, and benchmark tasks.

Latency Analysis

We provide a detailed latency analysis to better understand the efficiency characteristics of Memory-R1 and its individual components. All latency results are reported using median (p50) and tail (p95) inference time, measured across three components of the pipeline: the Memory Manager, Memory Search, and the Answer Agent. We compare the base model, PPO-trained variants, and GRPO-trained variants on both LLaMA-3.1-8B and Qwen-2.5-7B backbones.

Table 3: Extended evaluation of Memory-R1 with Qwen-2.5 model family as backbones on the LoCoMo benchmark. Results are reported across question types (Single-Hop, Multi-Hop, Open-Domain, Temporal) and overall performance. Best scores are highlighted in bold.

Table 4: Extended evaluation of Memory-R1 on the LongMemEval benchmark using LLaMA and Qwen backbones. Each cell shows F1/B1/J for a given model-method combination, reported with one decimal precision. Task types are abbreviated as: SSU = Single-Session-User, SSP = Single-Session-Preference, OD = Open Domain, MS = Multi-Session, KU = Knowledge Update, TR = Temporal Reasoning, and O = Overall. The best value for each metric (F1, B1, J) within a task row is highlighted in bold.

Table 5: Overall results on the LongMemEval benchmark. We report the mean scores across all six evaluation dimensions. The best results are marked in bold.

Overall Trends Across both model families, Memory-R1 does not introduce prohibitive latency overhead despite incorporating explicit memory management and reasoning components. In many cases, GRPO-trained variants achieve lower tail latency than both the base model and PPO variants, indicating that reinforcement learning can improve not only accuracy but also inference efficiency.

Memory Manager Latency For the Memory Manager component, latency remains relatively stable across Base, PPO, and GRPO variants. On LLaMA-3.1-8B, median latency ranges narrowly between 1.98 s and 2.17 s, with p95 latency around 3.4–3.6 s. Similar behavior is observed on Qwen-2.5-7B, where p50 latency stays below 1.4 s across all variants. These results suggest that RL fine-tuning does not materially increase the computational cost of memory operation selection.

Memory Search Latency Memory Search exhibits consistently low latency across all settings. On both backbones, median latency remains below 0.35 s, and p95 latency remains under 0.65 s. Differences between Base, PPO, and GRPO variants are minimal, indicating that improvements in downstream accuracy are not driven by more expensive retrieval operations.
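The p50/p95 statistics quoted above can be computed with a nearest-rank percentile; a small sketch (real measurements may instead use interpolating variants such as `numpy.percentile`):

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: q = 50 gives p50 (the median), q = 95 gives p95."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)  # 1-based rank -> 0-based index
    return ordered[idx]
```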



$$ \rho_{\theta}(q,\mathcal{M}_{\text{ret}})=\frac{\pi_{\theta}(y\mid q,\mathcal{M}_{\text{ret}})}{\pi_{\text{old}}(y\mid q,\mathcal{M}_{\text{ret}})}, $$

\caption{Data Construction for Memory-R1 Training}
\label{alg:data_construction_algo1}
\begin{algorithmic}[1]
\State \textbf{Input:} LoCoMo multi-turn dialogues $\mathcal{D}$
\State \textbf{Output:} Training tuples for the Memory Manager
$(\text{dialogue turn}, \text{temporal memory bank}, \text{QA})$

\For{each dialogue $d \in \mathcal{D}$}
\For{each turn $t$ in $d$}
\State Build a \textbf{temporal memory bank} using the previous 50 turns with GPT-4o-mini
\State Combine (i) the temporal memory bank, (ii) the current turn $t$, and (iii) any QA pairs linked to $t$
\State Store the combined package as a single training tuple
\EndFor
\EndFor
\end{algorithmic}
\caption{Data Construction for Answer Agent Training}
\label{alg:answer_agent_data_construction}
\begin{algorithmic}[1]
\State \textbf{Input:} LoCoMo multi-turn dialogues $\mathcal{D}$, trained Memory Manager
\State \textbf{Output:} Training tuples for the Answer Agent
$(\text{question}, \text{retrieved memories}, \text{gold answer})$

\For{each dialogue $d \in \mathcal{D}$}
\State Use the Memory Manager to maintain an up-to-date memory bank across the turns of $d$
\For{each question $q$ in $d$}
\State Use the question $q$ as a query to retrieve the top 30 most relevant candidate memories for each participant from the memory bank
\State Pair (i) the question $q$, (ii) the 60 retrieved memories, and (iii) the gold answer $a_{\text{gold}}$
\State Store the triplet as a single training tuple for Answer Agent fine-tuning
\EndFor
\EndFor
\end{algorithmic}
\small
\caption{Memory Bank Construction via Memory Manager}
\label{alg:memory_construction}
\begin{algorithmic}[1]

\State \textbf{Input:} Multi-turn dialogue $D = \{ t_1, t_2, \dots, t_n \}$; Initial empty memory bank $M$
\State \textbf{Output:} Updated memory bank $M$

\Procedure{ConstructMemoryBank}{$D, M$}

\For{each dialogue turn $t_i \in D$}
\State Extract key info: $f_i \gets \text{LLMExtract}(t_i)$
\State Retrieve memories:$M_{old} \gets \text{TopK}(f_i, M)$
\State Determine operation:
\State $o_i \gets \text{MemoryManager}(f_i, M_{old})$ where $o_i \in \{\texttt{ADD}, \texttt{UPDATE}, \texttt{DELETE}, \texttt{NOOP}\}$
\If{$o_i = \texttt{ADD}$}
\State $M \gets M \cup \{ f_i \}$
\ElsIf{$o_i = \texttt{UPDATE}$}
\State $M_{tmp} \gets \text{Merge}(M_{old}, f_i)$
\State $M \gets (M \setminus M_{old}) \cup M_{tmp}$
\ElsIf{$o_i = \texttt{DELETE}$}
\State $M \gets M \setminus M_{old}$
\ElsIf{$o_i = \texttt{NOOP}$}
\State $M \gets M$
\EndIf

\EndFor

\State \textbf{return} $M$

\EndProcedure

\end{algorithmic}
\small
\caption{Memory-augmented Generation via Answer Agent}
\label{alg:answer_generation}
\begin{algorithmic}[1]

\State \textbf{Input:} Question set $Q = \{q_1, q_2, \ldots, q_m\}$; Memory bank $M$; Generation instruction text $t$
\State \textbf{Output:} Answer set $\hat{A}$

\Procedure{GenerateAnswers}{$Q, M, t$}
\State $\hat{A} \gets \{\}$
\For{each question $q_i \in Q$}
\State $ M_{ret} \gets \text{TopK}(q_i, M)$
\State $p_i \gets \text{Concat}(t, q_i,M_{ret})$ \Comment{$p_i$ is the memory augmented prompt}
\State $ M_{distill}, \hat{a}_i \gets \text{AnswerAgent}(p_i)$
\State $\hat{A} \gets \hat{A} \cup \{\hat{a}_i\}$
\EndFor
\State \textbf{return} $\hat{A}$
\EndProcedure

\end{algorithmic}
\small
\caption{Memory-R1 Pipeline for Memory Manager}
\label{alg:memory_manager_rl_pipeline}
\begin{algorithmic}[1]

\State \textbf{Input:} Dataset $D$ of tuples: dialogue turns $ds$, question-answer pairs $(q_i, a_i)$; Temp memory bank $M$;
Memory Manager LLM $\mathcal{L}_{m}$; Answer LLM $\mathcal{L}_{a}$; Reward Function $\mathcal{F}$; Generation instruction text $t$

\State \textbf{Output:} Fine-tuned Memory Manager LLM $\mathcal{L}_{m}$
\Procedure{TrainMemoryManager}{$D, \mathcal{L}_{m}, \mathcal{L}_{a}, \mathcal{F}$}
\For{each tuple $(ds, q_i, a_i) \in D$}
\State $ M \gets \{\} $
\For{$d_i \in ds$}
\State Facts Extraction: $f_i \gets \text{LLMExtract}(d_i)$
\State Memory Retrieval: $M_{ret} \gets \text{TopK}(f_i, M)$
\State Determine operation: $o_i \sim \mathcal{L}_{m}(f_i, M_{ret})$
\If{$o_i = \texttt{ADD}$}
\State $M \gets M \cup \{ f_i \}$
\ElsIf{$o_i = \texttt{UPDATE}$}
\State $M_{tmp} \gets \text{Merge}(M_{ret}, f_i)$
\State $M \gets (M \setminus M_{ret}) \cup M_{tmp}$
\ElsIf{$o_i = \texttt{DELETE}$}
\State $M \gets M \setminus M_{ret}$
\ElsIf{$o_i = \texttt{NOOP}$}
\State $M \gets M$
\EndIf
\EndFor
\State Get Context: $C_{ret} \gets \text{TopK}(q_i, M)$
\State Update Prompt: $p_i \gets \text{Concat}(t, q_i,C_{ret})$
\State Get Response: $r_i \sim \mathcal{L}_{a}(p_i)$
\State Policy Update: $\mathcal{L}_{m} \gets \text{RL}_{step}(\mathcal{L}_{m}, \mathcal{F}, a_i, r_i)$,
\State where RL $ \in \{PPO, GRPO\}$

\EndFor
\State \textbf{return} $\mathcal{L}_{m}$
\EndProcedure

\end{algorithmic}


| Model | Method | Single-Hop (F1/B1/J) | Multi-Hop (F1/B1/J) | Open-Domain (F1/B1/J) | Temporal (F1/B1/J) | Overall (F1/B1/J) |
|---|---|---|---|---|---|---|
| LLaMA-3.1-8B | LoCoMo (RAG) | 12.25/9.77/13.81 | 13.69/10.96/20.48 | 11.59/8.30/15.96 | 9.38/8.15/4.65 | 11.41/8.71/13.62 |
| | A-Mem | 21.62/16.93/44.76 | 13.82/11.45/34.93 | 34.67/29.13/49.38 | 25.77/22.14/36.43 | 29.20/24.40/44.76 |
| | Mem0 | 27.29/18.63/43.93 | 18.59/13.86/37.35 | 34.03/24.77/52.27 | 26.90/21.06/31.40 | 30.41/22.22/45.68 |
| | MemoryOS | 31.89/23.05/52.72 | 13.80/12.78/31.33 | 40.74/33.67/57.36 | 28.74/21.44/23.64 | 35.04/27.99/48.20 |
| | Memory-SFT | 34.64/23.73/56.90 | 20.80/16.26/37.35 | 46.47/37.35/63.27 | 47.18/34.58/54.65 | 42.81/32.98/58.76 |
| | Memory-R1-PPO | 32.52/24.47/53.56 | 26.86/23.47/42.17 | 45.30/39.18/64.10 | 41.57/26.11/47.67 | 41.05/32.91/57.54 |
| | Memory-R1-GRPO | 35.73/27.70/59.83 | 35.65/30.77/53.01 | 47.42/41.24/68.78 | 49.86/38.27/51.55 | 45.02/37.51/62.74 |
| Qwen-2.5-7B | LoCoMo (RAG) | 9.57/7.00/15.06 | 11.84/10.02/19.28 | 8.67/6.52/12.79 | 8.35/8.74/5.43 | 8.97/7.27/12.17 |
| | A-Mem | 18.96/12.86/40.78 | 14.73/12.66/31.32 | 30.58/26.14/46.90 | 23.67/20.67/28.68 | 26.08/21.78/40.78 |
| | Mem0 | 24.96/18.05/61.92 | 20.31/15.82/48.19 | 32.74/25.27/65.20 | 33.16/26.28/38.76 | 30.61/23.55/53.30 |
| | MemoryOS | 29.55/22.59/48.12 | 21.03/18.41/38.55 | 40.85/36.26/63.14 | 26.26/19.70/24.81 | 34.64/29.36/51.26 |
| | Memory-SFT | 27.81/20.25/57.74 | 24.62/22.28/46.99 | 43.33/34.06/66.85 | 44.41/34.32/52.71 | 39.51/30.84/61.13 |
| | Memory-R1-PPO | 34.22/23.61/57.74 | 32.87/29.48/53.01 | 44.78/38.72/66.99 | 42.88/30.30/42.25 | 41.72/33.70/59.53 |
| | Memory-R1-GRPO | 33.64/26.06/62.34 | 23.55/20.71/40.96 | 46.86/40.92/67.81 | 47.75/38.49/49.61 | 43.14/36.44/61.51 |
| Method | F1 ↑ | B1 ↑ | J ↑ |
|---|---|---|---|
| PPO (J-based reward model) | 33.69 | 23.36 | 63.58 |
| PPO (EM-based reward model) | 41.05 | 32.91 | 57.54 |
| Model | Method | Single-Hop (F1/B1/J) | Multi-Hop (F1/B1/J) | Open-Domain (F1/B1/J) | Temporal (F1/B1/J) | Overall (F1/B1/J) |
|---|---|---|---|---|---|---|
| Qwen-2.5-3B | BASE | – | – | – | – | – |
| | PPO | – | – | – | – | – |
| | GRPO | – | – | – | – | – |
| Qwen-2.5-7B | BASE | – | – | – | – | – |
| | PPO | – | – | – | – | – |
| | GRPO | 33.98/25.54/58.16 | 25.50/21.63/46.99 | 44.72/48.99/64.65 | 43.54/35.52/39.92 | 41.31/34.74/57.46 |
| Qwen-2.5-14B | BASE | 34.60/26.82/55.23 | 24.24/21.45/38.55 | 39.79/34.53/56.95 | 34.98/29.39/33.33 | 36.91/31.28/50.80 |
| | PPO | 37.59/31.92/63.18 | 28.21/24.57/50.60 | 48.46/42.44/72.76 | 43.78/34.73/50.78 | 44.26/37.86/65.26 |
| | GRPO | 38.32/30.64/63.18 | 22.71/20.40/42.17 | 46.70/41.70/67.13 | 50.50/36.60/60.47 | 44.40/37.32/63.50 |

("–" marks cells whose values were garbled in the source text and could not be reliably recovered.)
| Task (F1/B1/J) | LLaMA-3.1-8B BASE | LLaMA-3.1-8B PPO | LLaMA-3.1-8B GRPO | Qwen-2.5-7B BASE | Qwen-2.5-7B PPO | Qwen-2.5-7B GRPO |
|---|---|---|---|---|---|---|
| SSU | 61.9/53.2/80.0 | 78.9/75.6/87.1 | 76.0/70.3/87.1 | 64.4/54.6/90.0 | 70.8/65.5/80.0 | 80.9/76.3/91.4 |
| SSP | 7.4/0.1/46.7 | 9.6/1.6/50.0 | 11.5/3.8/63.3 | 13.9/2.5/53.3 | 14.9/2.2/66.7 | 12.6/2.0/66.7 |
| OD | 17.9/16.6/19.6 | 30.6/24.6/33.9 | 31.2/25.3/33.9 | 14.1/14.4/16.1 | 23.5/21.1/19.6 | 26.8/23.0/26.8 |
| MS | 20.8/19.6/33.1 | 43.1/43.6/54.1 | 50.0/48.1/57.9 | 30.2/26.9/54.9 | 32.4/35.1/36.1 | 51.7/48.5/63.2 |
| KU | 36.0/27.9/51.3 | 46.4/43.1/55.1 | 38.5/35.5/52.6 | 40.5/33.5/59.0 | 52.3/48.2/65.4 | 54.4/51.3/65.4 |
| TR | 34.0/23.1/42.1 | 37.0/29.2/49.6 | 41.5/30.3/45.1 | 36.5/24.5/44.4 | 38.1/26.3/38.4 | 35.1/25.8/41.4 |
| O | 31.3/25.0/44.2 | 43.6/39.5/55.2 | 45.2/39.3/55.4 | 35.5/28.3/53.2 | 40.3/35.5/47.4 | 46.7/41.1/57.8 |
| Base Model | Method | Overall F1 ↑ | Overall B1 ↑ | Overall J ↑ |
|---|---|---|---|---|
| LLaMA-3.1-8B | LoCoMo (RAG) | 20.55 | 15.17 | 21.0 |
| LLaMA-3.1-8B | A-Mem | 38.36 | 33.3 | 54.2 |
| LLaMA-3.1-8B | Mem0 | 31.41 | 21.69 | 41.2 |
| LLaMA-3.1-8B | Memory-SFT | 43.89 | 36.72 | 54.8 |
| LLaMA-3.1-8B | Memory-R1-PPO | 43.6 | 39.5 | 55.2 |
| LLaMA-3.1-8B | Memory-R1-GRPO | 45.2 | 39.3 | 55.4 |
| Qwen-2.5-7B | LoCoMo (RAG) | 18.27 | 14.57 | 22.2 |
| Qwen-2.5-7B | A-Mem | 41.55 | 36.58 | 54.8 |
| Qwen-2.5-7B | Mem0 | 38.44 | 34.53 | 46.8 |
| Qwen-2.5-7B | Memory-SFT | 43.16 | 35.04 | 54.8 |
| Qwen-2.5-7B | Memory-R1-PPO | 40.3 | 35.5 | 47.4 |
| Qwen-2.5-7B | Memory-R1-GRPO | 46.7 | 41.1 | 57.8 |
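Throughout these tables, F1 and B1 (BLEU-1) are token-overlap metrics between a predicted and a gold answer, while J is an LLM-as-a-Judge score. A minimal sketch of the two overlap metrics, assuming standard definitions and plain whitespace tokenization rather than the authors' exact evaluation script:

```python
import math
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def bleu1(prediction: str, reference: str) -> float:
    """Unigram BLEU (BLEU-1): clipped unigram precision with a brevity penalty."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(gold)).values())
    precision = clipped / len(pred)
    brevity = 1.0 if len(pred) >= len(gold) else math.exp(1 - len(gold) / len(pred))
    return brevity * precision
```

For example, `token_f1("the cat sat", "the cat")` is 0.8 while `bleu1` on the same pair is 2/3; production evaluation pipelines typically add normalization steps such as punctuation stripping and article removal on top of this.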

Figure 13: Latency-accuracy comparison across pipeline components on LLaMA-3.1-8B-Instruct. Points show median (p50) and tail (p95) latency versus accuracy (F1, BLEU-1, and LLM-as-a-Judge) for the base model and RL-trained variants (PPO, GRPO).

References

[xu2021beyond] Xu, Jing, Szlam, Arthur, Weston, Jason. (2021). Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567.

[kang2025memory] Kang, Jiazheng, Ji, Mingming, Zhao, Zhe, Bai, Ting. (2025). Memory OS of AI Agent. arXiv preprint arXiv:2506.06326.

[wu2024longmemeval] Wu, Di, Wang, Hongwei, Yu, Wenhao, Zhang, Yuwei, Chang, Kai-Wei, Yu, Dong. (2024). Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.

[tresp2023tensor] Tresp, Volker, Sharifzadeh, Sahand, Li, Hang, Konopatzki, Dario, Ma, Yunpu. (2023). The tensor brain: A unified theory of perception, memory, and semantic decoding. Neural Computation.

[ma2018holistic] Ma, Yunpu, Hildebrandt, Marcel, Tresp, Volker, Baier, Stephan. (2018). Holistic Representations for Memorization and Inference.. UAI.

[tresp2017tensor] Tresp, Volker, Ma, Yunpu. (2017). The tensor memory hypothesis. arXiv preprint arXiv:1708.02918.

[tresp2015learning] Tresp, Volker, Esteban, Cristóbal. (2015). Learning with memory embeddings. arXiv preprint arXiv:1511.07972.

[ma2019embedding] Ma, Yunpu, Tresp, Volker, Daxberger, Erik A. (2019). Embedding models for episodic knowledge graphs. Journal of Web Semantics.

[tresp2017embedding] Tresp, Volker, Ma, Yunpu, Baier, Stephan, Yang, Yinchong. (2017). Embedding learning for declarative memories. European Semantic Web Conference.

[aios2024sdk] AIOS Foundation. (2024). AIOS Agent SDK: Memory API.

[martin1983database] Martin, James. (1983). Managing the Data-base Environment.

[modarressi2024memllm] Modarressi, Ali, others. (2024). MemLLM: Finetuning LLMs to use an explicit read-write memory. arXiv preprint arXiv:2404.11672.

[sheng2025hybridflow] Sheng, Guangming, Zhang, Chi, Ye, Zilingfeng, Wu, Xibin, Zhang, Wang, Zhang, Ru, Peng, Yanghua, Lin, Haibin, Wu, Chuan. (2025). Hybridflow: A flexible and efficient rlhf framework. Proceedings of the Twentieth European Conference on Computer Systems.

[shao2024deepseekmath] Shao, Zhihong, Wang, Peiyi, Zhu, Qihao, Xu, Runxin, Song, Junxiao, Bi, Xiao, Zhang, Haowei, Zhang, Mingchuan, Li, YK, others. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[schulman2017proximal] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, Klimov, Oleg. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[xu2025amemagenticmemoryllm] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.

[shi2023largelanguagemodelseasily] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, Denny Zhou. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context.

[liu2023lostmiddlelanguagemodels] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. (2023). Lost in the Middle: How Language Models Use Long Contexts.

[achiam2023gpt] Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[qwen2025qwen25technicalreport] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu. (2025). Qwen2.5 Technical Report.

[ouyang2022traininglanguagemodelsfollow] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback.

[yi2024surveyrecentadvancesllmbased] Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, Ying Shen. (2024). A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems.

[wang2025actingreasoningmoreteaching] Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji. (2025). Acting Less is Reasoning More! Teaching Model to Act Efficiently.

[wei2025webagentr1trainingwebagents] Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li. (2025). WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning.

[qian2025toolrlrewardtoollearning] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji. (2025). ToolRL: Reward is All Tool Learning Needs.

[xiong2025memory] Xiong, Zidi, Lin, Yuping, Xie, Wenya, He, Pengfei, Tang, Jiliang, Lakkaraju, Himabindu, Xiang, Zhen. (2025). How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior. arXiv preprint arXiv:2505.16067.

[packer2023memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G., Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint.

[salama2025meminsightautonomousmemoryaugmentation] Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, Yassine Benajiba. (2025). MemInsight: Autonomous Memory Augmentation for LLM Agents.

[pan2025memory] Pan, Zhuoshi, Wu, Qianhui, Jiang, Huiqiang, Luo, Xufang, Cheng, Hao, Li, Dongsheng, Yang, Yuqing, Lin, Chin-Yew, Zhao, H Vicky, Qiu, Lili, others. (2025). On memory construction and retrieval for personalized conversational agents. arXiv preprint arXiv:2502.05589.

[zhang2024survey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2024). A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems.

[fei2023extendingcontextwindowlarge] Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, Wei Han. (2023). Extending Context Window of Large Language Models via Semantic Compression.

[wang2024adaptingllmsefficientcontext] Cangqing Wang, Yutian Yang, Ruisi Li, Dan Sun, Ruicong Cai, Yuzhu Zhang, Chengqian Fu, Lillian Floyd. (2024). Adapting LLMs for Efficient Context Processing through Soft Prompt Compression.

[fan2025if] Fan, Siqi, Huang, Xiusheng, Yao, Yiqun, Fang, Xuezhi, Liu, Kang, Han, Peng, Shang, Shuo, Sun, Aixin, Wang, Yequan. (2025). If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs. arXiv preprint arXiv:2503.23514.

[goodyear2025effect] Goodyear, Lyle, Guo, Rachel, Johari, Ramesh. (2025). The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games. arXiv preprint arXiv:2506.15624.

[yu2025stateful] Yu, Lingfan, Lin, Jinkun, Li, Jinyang. (2025). Stateful large language model serving with pensieve. Proceedings of the Twentieth European Conference on Computer Systems.

[song2024trialerrorexplorationbasedtrajectory] Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin. (2024). Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents.

[du2025rethinkingmemoryaitaxonomy] Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan. (2025). Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions.

[brown2020languagemodelsfewshotlearners] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. (2020). Language Models are Few-Shot Learners.

[chowdhery2022palmscalinglanguagemodeling] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel. (2022). PaLM: Scaling Language Modeling with Pathways.

[openai2024gpt4technicalreport] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, 
Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph. (2024). GPT-4 Technical Report.

[thoppilan2022lamdalanguagemodelsdialog] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le. (2022). LaMDA: Language Models for Dialog Applications.

[ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems.

[schick2023toolformerlanguagemodelsteach] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.

[yao2023reactsynergizingreasoningacting] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.

[park2023generativeagentsinteractivesimulacra] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior.

[wang2023voyageropenendedembodiedagent] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.

[shinn2023reflexion] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

[song2025r1] Song, Huatong, Jiang, Jinhao, Min, Yingqian, Chen, Jie, Chen, Zhipeng, Zhao, Wayne Xin, Fang, Lei, Wen, Ji-Rong. (2025). R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592.

[guo2025deepseek] Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[jin2025search] Jin, Bowen, Zeng, Hansi, Yue, Zhenrui, Yoon, Jinsung, Arik, Sercan, Wang, Dong, Zamani, Hamed, Han, Jiawei. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

[maharana2024evaluating] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.

[chhikara2025mem0] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[rasmussen2025zep] Rasmussen, Preston, Paliychuk, Pavlo, Beauvais, Travis, Ryan, Jack, Chalef, Daniel. (2025). Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.

[lee2024human] Lee, Kuang-Huei, Chen, Xinyun, Furuta, Hiroki, Canny, John, Fischer, Ian. (2024). A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.

[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[langmem2024] LangChain. (2024). LangMem: Modular memory for agentic systems.

[c:22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (2017). Attention Is All You Need.
