
MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents

Xingbo Du, Loka Li, Duzhen Zhang, Le Song

Abstract

Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. From this observation, we build memory retrieval as an autonomous, accurate, and compatible agent system, named MemR3, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that renders the answering process transparent and tracks the evidence collection process. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark demonstrate that MemR3 surpasses strong baselines on LLM-as-a-Judge score; in particular, it improves existing retrievers across four categories, with overall improvements on RAG (+7.29%) and Zep (+1.94%) using the GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.

Introduction

With recent advances in large language model (LLM) agents, memory systems have become the focus of storing and retrieving long-term, personalized memories. They can typically be categorized into two groups: 1) Parametric methods (Wang et al., 2024; Fang et al., 2025b), which encode memories implicitly into model parameters; these handle specific knowledge well but struggle with scalability and continual updates, as modifying parameters to incorporate new memories often risks catastrophic forgetting and requires expensive fine-tuning. 2) Non-parametric methods (Xu et al., 2025; LangChain Team, 2025; Chhikara et al., 2025; Rasmussen et al., 2025), in contrast, store explicit external information, enabling flexible retrieval and continual augmentation without altering model parameters. However, they typically rely on heuristic retrieval strategies, which can lead to noisy recall, heavy retrieval loads, and increasing latency as the memory store grows.

Emails: { Xingbo.Du, Longkang.Li, Duzhen.Zhang, Le.Song } @mbzuai.ac.ae

1 Mohamed bin Zayed University of Artificial Intelligence. Correspondence to: Le Song <Le.Song@mbzuai.ac.ae>.

Orthogonal to these works, this paper constructs an agentic memory system, MemR3, i.e., a Memory Retrieval system with Reflective Reasoning, to improve retrieval quality and efficiency. Specifically, the system is built with LangGraph (Inc., 2025), with a router node selecting among three optional nodes: 1) the retrieve node, built on existing memory systems, which can retrieve multiple times with updated retrieval queries; 2) the reflect node, which iteratively reasons over the currently acquired evidence and the gaps between the question and that evidence; 3) the answer node, which produces the final response using the acquired information. Within all nodes, the system maintains a global evidence-gap tracker to update the acquired (evidence) and missing (gap) information.

The system has three core advantages: 1) Accuracy and efficiency. By tracking the evidence and gap, and dynamically routing between retrieval and reflection, MemR3 minimizes unnecessary lookups and reduces noise, resulting in faster, more accurate answers. 2) Plug-and-play usage. As a controller independent of the underlying retriever or memory storage, MemR3 can be easily integrated into memory systems, improving retrieval quality without architectural changes. 3) Transparency and explainability. Since MemR3 maintains an explicit evidence-gap state over the course of an interaction, it can expose which memories support a given answer and which pieces of information were still missing at each step, providing a human-readable trace of the agent's decision process. We compare MemR3, the Full-Context setting (which uses all available memories), and the commonly adopted retrieve-then-answer paradigm from a high-level perspective in Fig. 1. The contributions of this work are threefold:

(1) A specialized closed-loop retrieval controller for long-term conversational memory. We propose MemR3, an autonomous controller that wraps existing memory stores and

Figure 1. Illustration of three memory-usage paradigms. Full-Context overloads the LLM with all memories and answers incorrectly; Retrieve-then-Answer retrieves relevant snippets but still miscalculates. In contrast, MemR 3 iteratively retrieves and reflects using an evidence-gap tracker (Acts 0-3), refines the query about Buddy's adoption date, and produces the correct answer (3 months).


turns standard retrieve-then-answer pipelines into a closed-loop process with explicit actions (retrieve / reflect / answer) and simple early-stopping rules. This instantiates the general LLM-as-controller idea specifically for non-parametric, long-horizon conversational memory.

(2) Evidence-gap state abstraction for explainable retrieval. MemR3 maintains a global evidence-gap state (E, G) that summarizes what has been reliably established in memory and what information remains missing. This state drives query refinement and stopping, and can be surfaced as a human-readable trace of the agent's progress. We further formalize this abstraction via an abstract requirement space and prove basic monotonicity and completeness properties, which we later use to interpret empirical behaviors.

(3) Empirical study across memory systems. We integrate MemR3 with both chunk-based RAG and a graph-based backend (Zep) on the LoCoMo benchmark and compare it with recent memory systems and agentic retrievers. Across backends and question types, MemR3 consistently improves LLM-as-a-Judge scores over its underlying retrievers.

Memory for LLM Agents

Prior work on non-parametric agent memory systems spans a wide range of directions, including memory management and utilization (Du et al., 2025), by storing structured (Rasmussen et al., 2025) or unstructured (Zhong et al., 2024) external knowledge. Specifically, production-oriented agents such as MemGPT (Packer et al., 2023) introduce an OS-style hierarchical memory system that allows the model to page information between context and external storage, and SCM (Wang et al., 2023) provides a controller-based memory stream that retrieves and summarizes past information only when necessary. Additionally, Zep (Rasmussen et al., 2025) builds a temporal knowledge graph that unifies and retrieves evolving conversational and business data. A-Mem (Xu et al., 2025) creates a self-organizing, Zettelkasten-style memory that links and evolves over time. Mem0 (Chhikara et al., 2025) extracts and manages persistent conversational facts with optional graph-structured memory. MIRIX (Wang & Chen, 2025) offers a multimodal, multi-agent memory system with six specialized memory types. LightMem (Fang et al., 2025a) proposes a lightweight and efficient memory system inspired by the Atkinson-Shiffrin model. Another related approach, Reflexion (Shinn et al., 2023), improves language agents via verbal reinforcement across episodes, storing natural-language reflections to guide future trials.

In this paper, we explicitly limit our scope to long-term conversational memory. Existing parametric approaches (Wang et al., 2024; Fang et al., 2025b), KV-cache-based mechanisms (Zhong et al., 2024; Eyuboglu et al., 2025), and streaming multi-task memory benchmarks (Wei et al., 2025) are out of scope for this work. Orthogonal to existing storage, MemR 3 is an autonomous retrieval controller that uses a global evidence-gap tracker to route different actions, enabling closed-loop retrieval.

Agentic Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) established the modern retrieve-then-answer paradigm; subsequent work explored stronger retrievers (Karpukhin et al., 2020; Izacard & Grave, 2021). Beyond RAG, recent work, such as Self-RAG (Asai et al., 2024), Reflexion (Shinn et al., 2023), ReAct (Yao et al., 2022), and FAIR-RAG (Asl et al., 2025), has shown that letting a language model (LM) decide when to retrieve, when to reflect, and when to answer can substantially improve multi-step reasoning and factuality in tool-augmented settings. MemR3 follows this general 'LLM-as-controller' paradigm but applies it specifically to long-term conversational memory over non-parametric stores. Concretely, we adopt the idea of multi-step retrieval and self-reflection from these frameworks, but i) move the controller outside the base LM as a LangGraph program, ii) maintain an explicit evidence-gap state that separates verified memories from remaining uncertainties, and iii) interface this state with different memory backends (e.g., RAG and Zep (Rasmussen et al., 2025)) commonly used in long-horizon dialogue agents. Our goal is not to replace these frameworks, but to provide a specialized retrieval controller that can be plugged into existing memory systems.


Problem Formulation and Preliminaries

We consider a long-horizon LLM agent that interacts with a user, forming a memory store $M = \{m_i\}_{i=1}^{N}$, where each memory item $m_i$ may correspond to a dialogue utterance, personal fact, structured record, or event, often accompanied by metadata such as timestamps or speakers. Given a user query q, a retriever is applied to retrieve a set of memory snippets S that are useful for generating the final answer. Then, given a designed prompt template p, the goal is to produce an answer w:

$$
w = \mathrm{LLM}(p, q, S),
$$

which should be as accurate (consistent with all relevant memories in M), efficient (requiring minimal retrieval cycles and low latency), and robust (stable under noisy, redundant, or incomplete memory stores) as possible.

Existing memory systems have made substantial progress on the memory storage M, but typically follow an open-loop pipeline: 1) apply a single retrieval pass; 2) feed the selected memories S into a generator to produce w. This approach lacks adaptivity: retrieval does not incorporate intermediate reasoning, and the system never represents which information remains missing. This leads to both under-retrieval (insufficient evidence) and over-retrieval (long, noisy contexts).
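For contrast, the open-loop pipeline described above can be sketched in a few lines of Python; `retriever` and `llm` are hypothetical placeholders for any backend, not the paper's implementation:

```python
def retrieve_then_answer(query, retriever, llm, top_k=5):
    """Standard open-loop pipeline: one retrieval pass, then answer.

    `retriever` and `llm` are pluggable callables standing in for any
    backend; note there is no representation of missing information
    and no second retrieval pass.
    """
    snippets = retriever(query, top_k=top_k)
    prompt = f"Question: {query}\nMemories: {snippets}\nAnswer:"
    return llm(prompt)
```

Whatever the first retrieval pass misses, the generator never sees; this is exactly the adaptivity gap that the closed-loop design targets.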

MemR 3 addresses these limitations by treating retrieval as an autonomous sequential decision process with explicit modeling of both acquired evidence and remaining gaps.

System Overview

MemR 3 is implemented as a directed agent graph comprising three operational nodes (Retrieve, Reflect, Answer) and one control node (Router) using LangGraph (Inc., 2025) (an open-source framework for building stateful, multi-agent workflows as graphs of interacting nodes). The agent maintains a mutable internal state

$$
\mathrm{state} = (q, S, E, G, k),
$$

where q and S are the aforementioned original user query and retrieved snippets, respectively. E is the accumulated evidence relevant to q and G is the remaining missing information (the 'gap') between q and E . Moreover, we maintain the iteration index k to control early stopping.

At each iteration k , the router chooses an action in { retrieve , reflect , answer } , which determines the next node in the computation graph. The pipeline is shown in Fig. 2. This transforms the classical retrievethen-answer pipeline into a closed-loop controller that can repeatedly refine retrieval queries, integrate new evidence, and stop early once the information gap is resolved.
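A minimal sketch of this closed-loop control, using a plain-Python stand-in for the LangGraph program (the state fields follow the paper; the node and router implementations here are illustrative, not the authors' code):

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """Mutable agent state: query q, snippets S, evidence E, gap G, iteration k."""
    query: str
    snippets: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    gap: list = field(default_factory=list)
    k: int = 0

def run_controller(state, router, nodes, n_max=5):
    """Closed-loop control: route among retrieve/reflect/answer until
    the answer node fires or the iteration budget is exhausted."""
    while state.k < n_max:
        action = router(state)          # one of "retrieve", "reflect", "answer"
        result = nodes[action](state)   # each node mutates the state
        state.k += 1
        if action == "answer":
            return result
    return nodes["answer"](state)       # budget exhausted: forced answer
```

In the actual system the router and nodes are LLM-backed LangGraph nodes; the loop above only captures the control flow, including early stopping once the gap is resolved.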

Global Evidence-Gap Tracker

A core design principle of MemR 3 is to explicitly maintain and update two state variables: the evidence E and the gap G . These variables summarize what the agent currently knows and what it still needs to know to answer the question.

At iteration k, the evidence $E_k$ and gaps $G_k$ are updated according to the retrieved snippets $S_{k-1}$ (from the retrieve node) or the reflective reasoning $F_{k-1}$ (from the reflect node), together with the previous evidence $E_{k-1}$ and gaps $G_{k-1}$ from iteration k-1:

$$
(E_k, G_k) = \mathrm{LLM}\left(p_k, q, E_{k-1}, G_{k-1}, X_{k-1}\right),
\qquad
X_{k-1} =
\begin{cases}
S_{k-1}, & a_k = \texttt{retrieve}, \\
F_{k-1}, & a_k = \texttt{reflect},
\end{cases}
$$

where $p_k$ is the prompt template at iteration k. Additionally, $a_k$ is the action at iteration k, which will be introduced in Sec. 3.4. Note that we explicitly specify in $p_k$ that $E_k$ must not contain any information in $G_k$, keeping evidence and gaps decoupled. An example illustrating the evidence-gap tracker is shown in Fig. 3.
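The update above can be mimicked with a thin wrapper that delegates the (E, G) rewrite to a pluggable LLM call and then enforces the stated decoupling invariant; the helper names are our own:

```python
def update_tracker(evidence, gap, new_info, llm_update):
    """One evidence-gap update step.

    `llm_update` stands in for the prompted LLM mapping
    (E_{k-1}, G_{k-1}, new input) -> (E_k, G_k). We additionally
    enforce the decoupling invariant that no gap item also appears
    in the evidence, mirroring the constraint stated in prompt p_k.
    """
    new_evidence, new_gap = llm_update(evidence, gap, new_info)
    new_gap = [g for g in new_gap if g not in new_evidence]  # E_k ∩ G_k = ∅
    return new_evidence, new_gap
```

Enforcing the invariant outside the LLM call is a defensive choice in this sketch: even if the model's output overlaps, the tracker state stays decoupled.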

Through the evidence-gap tracker, MemR 3 maintains a structured and transparent internal state that continuously refines the agent's understanding of both i) what has already been established as relevant evidence, and ii) what missing information still prevents a complete and faithful answer. This

Figure 2. Pipeline of MemR 3 . MemR 3 transforms retrieval into a closed-loop process: a router dynamically switches between Retrieve, Reflect, and Answer nodes while a global evidence-gap tracker maintains what is known and what is still missing. This enables iterative query refinement, targeted retrieval, and early stopping, making MemR 3 an autonomous, backend-agnostic retrieval controller.


Figure 3. Example of the evidence-gap tracker for a specific query. At each step, the agent maintains an explicit summary of the evidence established and the information still missing. This state can be presented directly to users as a human-readable explanation of the agent's progress in answering the query.


explicit decoupling enables MemR3 to reason under partial observability: as long as $G_k \neq \emptyset$, the agent recognizes that its current knowledge is insufficient and can proactively issue a refined retrieval query to close the remaining gap. Conversely, when $G_k$ becomes empty, the router detects that the agent has accumulated adequate evidence and can safely transition to the answer node.

Beyond guiding retrieval, the evidence-gap representation also makes the agent's behavior more transparent. At any iteration k , the pair ( E k , G k ) can be surfaced as a structured explanation of i) which memories the agent currently treats as relevant evidence and ii) which unresolved questions or missing details are preventing a confident answer. This trace provides users and developers with a faithful view of how the agent arrived at its final answer and why additional retrieval steps were taken (or not). In the following, we display an informal theorem that indicates the properties of the idealized evidence-gap tracker.

Theorem 3.1 ([Informal] Monotonicity, soundness, and completeness of the idealized evidence-gap tracker). Under an idealized requirement space $R(q)$ for a specific query q, the evidence-gap tracker in MemR3 is monotone (evidence never decreases and gaps never increase), sound (every supported requirement eventually enters the evidence set), and complete (if every requirement $r \in R(q)$ is supported by some memory, the ideal gap eventually becomes empty).

Formally, in Appendix B we define the abstract requirement space $R(q)$ and characterize the tracker as a set-valued update on $R(q)$, proving fundamental soundness, monotonicity, and completeness properties (Theorem B.4), which we later use in Sec. 4.3 to interpret empirical phenomena such as why some questions cannot be fully resolved even after exhausting the iteration budget.
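Schematically, the idealized properties can be written as follows (a sketch consistent with the informal statement above, not the exact formulation of Theorem B.4 in Appendix B):

```latex
% E*_k: requirements in R(q) established as evidence by iteration k
% G*_k = R(q) \ E*_k: remaining ideal gap
\begin{aligned}
&\text{Monotonicity:} && E^\star_{k-1} \subseteq E^\star_k,
  \qquad G^\star_{k} \subseteq G^\star_{k-1}; \\
&\text{Soundness:} && \exists\, m \in \textstyle\bigcup_{j \le k} S_j,\; m \models r
  \;\Longrightarrow\; r \in E^\star_{k'} \text{ for some } k' \ge k; \\
&\text{Completeness:} && \bigl(\forall r \in R(q),\; \exists\, m \in M,\; m \models r\bigr)
  \;\Longrightarrow\; \exists K,\; G^\star_K = \emptyset.
\end{aligned}
```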

LangGraph Nodes

We explicitly define several nodes in the LangGraph framework, including start, end, generate, router, retrieve, reflect, and answer. Specifically, start is always followed by retrieve, and end is reached after answer. generate is an LLM generation node, which was already introduced in Eq. 3. In the following, we further introduce the router node and the three action nodes.

Router. At each iteration, the router, an autonomous sequential controller, inspects the current state and selects an action from { retrieve, reflect, answer }. Each action $a_k$ is accompanied by a textual generation:

$$
(a_k, y_k) = \mathrm{LLM}\left(p_k, q, E_{k-1}, G_{k-1}\right),
\qquad
y_k \in \{\Delta q_k, f_k, w_k\},
$$

where $\Delta q_k$ is a refinement query, $f_k$ is reasoning content, and $w_k$ is a draft answer; these are utilized by the downstream action nodes. To ensure stability, the router applies three deterministic constraints: 1) a maximum iteration budget $n_{\max}$ that forces an answer action once the budget is exhausted, 2) a reflect-streak capacity $n_{\mathrm{cap}}$ that forces

Retrieve.
Reflect.
Answer.
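The router's deterministic constraints can be sketched as follows; the budget rule matches the text, while the behavior forced by the reflect-streak cap is truncated in our copy of the paper, so forcing a retrieval there is an assumption:

```python
def constrained_action(proposed, k, reflect_streak, n_max=5, n_cap=2):
    """Deterministic overrides on the router's proposed action.

    The iteration-budget rule follows the paper; what the
    reflect-streak capacity n_cap forces is an assumption here
    (we break long reflect streaks by forcing a retrieval).
    """
    if k >= n_max - 1:
        return "answer"  # budget exhausted: must answer now
    if proposed == "reflect" and reflect_streak >= n_cap:
        return "retrieve"  # assumed behavior: stop pure-reflection loops
    return proposed
```

Keeping these overrides deterministic (outside the LLM) guarantees termination regardless of what the model proposes.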

Discussion

In this work, we introduce MemR3, an autonomous memory-retrieval controller that transforms standard retrieve-then-answer pipelines into a closed-loop process via a LangGraph-based sequential decision-making framework. By explicitly maintaining what is known and what remains unknown using an evidence-gap tracker, MemR3 can iteratively refine queries, balance retrieval and reflection, and terminate early once sufficient evidence has been gathered. Our experiments on the LoCoMo benchmark show that MemR3 consistently improves LLM-as-a-Judge scores over strong memory baselines, while incurring only modest token and latency overhead and remaining compatible with heterogeneous backends. Beyond these concrete gains, MemR3 offers an explainable abstraction for reasoning under partial observability in long-horizon agent settings.

However, we acknowledge some limitations for future work: 1) MemR 3 requires an existing retriever or memory structure, and particularly, the performance greatly depends on the retriever or memory structure. 2) The routing structure could lead to token waste for answering simple questions. 3) MemR 3 is currently not designed for multi-modal memories like images or audio.

Discussion on Efficiency

Although MemR3 introduces extra routing steps, it maintains low overhead via 1) Compact evidence and gap summaries: only short summaries are repeatedly fed into the router. 2) Masked retrieval: each retrieval call yields genuinely new information. 3) Small iteration budgets: typically, most questions can be answered within a single iteration, and complicated questions that require multiple iterations are constrained by a small maximum iteration budget. These design choices ensure that MemR3 improves retrieval quality without large increases in retrieved tokens.
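The masked-retrieval idea can be sketched as a wrapper around any ranked retriever; the over-fetch heuristic and the function signature are our assumptions, not the paper's implementation:

```python
def masked_retrieve(retriever, query, seen_ids, n_chk=5):
    """Retrieve up to n_chk snippets while masking anything already
    surfaced in an earlier iteration, so each call yields genuinely
    new information.

    `retriever` is any callable returning ranked (id, text) pairs;
    over-fetching by len(seen_ids) is a simple heuristic to leave
    room for masked-out duplicates.
    """
    fresh = []
    for sid, text in retriever(query, top_k=n_chk + len(seen_ids)):
        if sid in seen_ids:
            continue  # mask previously retrieved snippet
        fresh.append((sid, text))
        seen_ids.add(sid)
        if len(fresh) == n_chk:
            break
    return fresh
```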

Advantages.
Advantages.

Experiments

The experiments are conducted on a machine with an AMD EPYC 7713P 64-core processor, an A100-SXM4-80GB GPU, and 512GB of RAM. Each experiment of MemR3 is repeated three times, and we report the average scores. Code is available at https://github.com/Leagein/memr3.

Experimental Protocols

Datasets. In line with baselines (Xu et al., 2025; Chhikara et al., 2025), we employ the LoCoMo (Maharana et al., 2024) dataset as a fundamental benchmark. LoCoMo has a total of 10 conversations, with questions spanning five categories: 1) multi-hop, 2) temporal, 3) open-domain, 4) single-hop, and 5) adversarial. We exclude the last, 'adversarial', category, following existing work (Chhikara et al., 2025; Wang & Chen, 2025), since it is designed to test whether unanswerable questions can be identified. Each conversation has approximately 600 dialogue turns with 26k tokens and 200 questions on average.

Metrics. We adopt the LLM-as-a-Judge (J) score to evaluate answer quality, following Chhikara et al. (2025); Wang & Chen (2025). Compared with surface-level measures such as F1 or BLEU-1 (Xu et al., 2025; Soni et al., 2024), this metric avoids relying on simple lexical overlap and instead captures semantic alignment. Specifically, GPT-4.1 (OpenAI, 2025) is employed to judge whether the answer is correct given the original question and the generated answer, following the prompt by Chhikara et al. (2025).

Baselines. We select four groups of advanced methods as baselines: 1) memory systems, including A-Mem (Xu et al., 2025), LangMem (LangChain Team, 2025), and Mem0 (Chhikara et al., 2025); 2) agentic retrievers, such as Self-RAG (Asai et al., 2024); we also design a RAG-CoT-RAG (RCR) pipeline beyond ReAct (Yao et al., 2022) as a strong agentic-retriever baseline combining both RAG (Lewis et al., 2020) and Chain-of-Thought (CoT) (Wei et al., 2022); 3) backend baselines, including chunk-based (RAG (Lewis et al., 2020)) and graph-based (Zep (Rasmussen et al., 2025)) memory storage, demonstrating the plug-in capability of MemR3 across different retriever backends; 4) Moreover,

Table 1. LLM-as-a-Judge scores (%, higher is better) for each question category in the LoCoMo (Maharana et al., 2024) dataset. The best results using each LLM backend, except Full-Context, are in bold .

'Full-Context' is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on J score (Chhikara et al., 2025; Wang & Chen, 2025). More detailed introduction of these baselines is shown in Appendix C.1.

Other Settings. Other experimental settings and protocols are shown in Appendix C.2.

LLM Backend. We reviewed recent work and found that GPT-4o-mini (OpenAI, 2024b) is the most frequently used backend, as it is inexpensive and performs well. Since some work (Wang & Chen, 2025) also includes GPT-4.1-mini (OpenAI, 2025), we adopt both as our LLM backends. In our main results, MemR3 runs at temperature 0.



Main Results

Overall. Table 1 reports LLM-as-a-Judge (J) scores across four LoCoMo categories. Across both LLM backends and memory backbones, MemR3 consistently outperforms its underlying retrievers (RAG and Zep) and achieves strong overall J scores. Under GPT-4o-mini, MemR3 lifts the overall score of Zep from 74.62% to 76.26%, and RAG from 75.54% to 81.55%, with the latter even outperforming the Full-Context baseline (76.32%). With GPT-4.1-mini, we see the same pattern: MemR3 improves Zep from 78.94% to 80.88% and RAG from 79.46% to 86.75%, making the RAG-backed variant the strongest retrieval-based system and narrowing the gap to Full-Context (89.00%). As expected, methods instantiated with GPT-4.1-mini are consistently stronger than their GPT-4o-mini counterparts. Full-Context also benefits substantially from the stronger LLM, but under GPT-4o-mini it lags behind the best retrieval-based systems, especially on temporal and open-domain questions. Overall, these results indicate that closed-loop retrieval with an explicit evidence-gap state yields gains largely orthogonal to the choice of LLM or memory backend, and that MemR3 particularly benefits from backends that expose relatively raw snippets (RAG) rather than heavily compressed structures (Zep).

Multi-hop. Multi-hop questions require chaining multiple pieces of evidence and, therefore, directly test our reflective controller. Under GPT-4o-mini , MemR 3 improves both backbones on this category: the multi-hop J score rises from 68.79% to 71.39% on RAG and from 67.38% to 69.39% on Zep, bringing both close to the Full-Context score (72.34%). With GPT-4.1-mini , the gains are more pronounced: MemR 3 boosts RAG from 73.05% to 81.20% and Zep from 72.34% to 77.78%, outperforming all other baselines and approaching the Full-Context upper bound (86.43%). These consistent gains suggest that explicitly tracking evidence and gaps helps the agent coordinate multiple distant memories via iterative retrieval, rather than relying on a single heuristic pass.

Temporal. Temporal questions stress the model's ability to reason about the ordering and dating of events over long horizons, where both under- and over-retrieval can be harmful. Here, MemR3 delivers some of its largest relative improvements. For GPT-4o-mini, the temporal J score of RAG jumps from 65.11% to 76.22%, outperforming both the original RAG and the Zep baseline (73.83%), while MemR3 with a Zep backbone preserves Zep's strong temporal accuracy (73.83%). Full-Context performs notably worse in this regime (58.88%), indicating that simply supplying all dialogue turns can hinder temporal reasoning under a weaker backbone. With GPT-4.1-mini, MemR3 again significantly strengthens temporal reasoning: RAG improves from 73.52% to 82.14%, and Zep from 77.26% to 77.78%, making the RAG-backed MemR3 the best retrieval-based system and closing much of the remaining gap to Full-Context (86.82%). These findings support our design goal that explicitly modeling 'what is already known' versus 'what is still missing' helps the agent align and compare temporal relations more robustly.

Open-Domain. Open-domain questions are less tied to the user's personal timeline and often require retrieving diverse background knowledge, which makes retrieval harder to trigger and steer. Despite this, MemR 3 consistently improves over its backbones. Under GPT-4o-mini , MemR 3 increases the open-domain J score of RAG from 58.33% to 61.11% and that of Zep from 63.54% to 67.01%, with the Zep-backed variant achieving the best performance among all methods in this block, surpassing Full-Context (59.38%). With GPT-4.1-mini , the gains become even larger: MemR 3 lifts RAG from 62.50% to 71.53% and Zep from 64.58% to 69.79%, nearly matching the Full-Context baseline (71.88%) and again outperforming all other baselines. We attribute these improvements to the router's ability to interleave retrieval with reflection: when initial evidence is noisy or off-topic, MemR 3 uses the gap representation to reformulate queries and pull in more targeted external knowledge rather than committing to an early, brittle answer.

Single-hop. Single-hop questions can often be answered from a single relevant memory snippet, so the potential headroom is smaller, but MemR 3 still yields consistent gains. With GPT-4o-mini , MemR 3 raises the single-hop J score from 78.67% to 80.60% on Zep and from 83.86% to 89.44% on RAG, with the latter surpassing the Full-Context baseline (86.39%). Under GPT-4.1-mini , MemR 3 improves Zep from 83.49% to 84.42% and RAG from 85.90% to 92.17%, making the RAG-backed variant the strongest method overall aside from Full-Context (93.73%). Together with the iteration-count analysis in Sec. 4.3, these results suggest that the router often learns to terminate early on straightforward single-hop queries, gaining accuracy primarily through better evidence selection rather than additional reasoning depth, and thus adding little overhead in tokens or latency.

Table 2. Ablation studies. Best results are in bold .

  • MH = Multi-hop; OD = Open-domain; SH = Single-hop.

Other Experiments

We ablate various hyperparameters and modules to evaluate their impact in MemR 3 with the RAG retriever. During these experiments, we utilize GPT-4o-mini as a consistent LLM backend.

Ablation Studies. We first examine the contribution of the main design choices in MemR3 by progressively removing them while keeping the RAG retriever and all hyperparameters fixed. As shown in Table 2, disabling masking for previously retrieved snippets (w/o mask) results in the largest degradation, reducing the overall J score from 81.55% to 68.54% and harming every category. This confirms that repeatedly surfacing the same memories wastes budget and fails to close the remaining gaps. Removing the refinement query $\Delta q_k$ (w/o $\Delta q_k$) has a milder effect: temporal and open-domain performance changes little, but multi-hop and single-hop scores decline significantly, indicating that tailoring retrieval queries from the current evidence-gap state is particularly beneficial for these question types. Disabling the reflect node (w/o reflect) similarly reduces performance (from 81.55% to 76.65%), with notable drops on multi-hop and single-hop questions, highlighting the value of interleaving reasoning-only steps with retrieval. Note that in Table 2, the raw retrieved snippets are only visible to the vanilla RAG.

Effect of n_chk and n_max. We first choose a nominal configuration for MemR3 (with a RAG retriever) by initially setting the number of chunks per iteration n_chk = 3 and the max iteration budget n_max = 5. In Fig. 4a, we fix n_max = 5 and ablate over n_chk ∈ {1, 3, 5, 7, 9}. In Fig. 4b, we fix n_chk = 3 and ablate over n_max ∈ {1, 2, 3, 4, 5}. Considering both the LLM-as-a-Judge score and token consumption, we eventually choose n_chk = 5 and n_max = 5 in all main experiments.

Iteration count. We further inspect how often MemR3 actually uses multiple retrieve/reflect/answer iterations when n_chk = 5 and n_max = 5 (Fig. 5). Overall, most questions are answered after a single iteration, and this effect is particularly strong for single-hop questions. An exception is open-domain questions, for which 58 of 96 require continuous retrieval or reflection until the maximum number of

Figure 4. LLM-as-a-Judge score (%) with different a) number of chunks per iteration and b) max iterations.


iterations is reached, highlighting the inherent challenges and uncertainty in these questions. Additionally, only a small fraction of questions terminate at intermediate depths (2-4 iterations), suggesting that MemR 3 either becomes confident early or uses the whole iteration budget when the gap remains non-empty.

We observe that this distribution arises from two regimes. On the one hand, straightforward questions require only a single piece of evidence and can be resolved in a single iteration, consistent with intuition. From the perspective of the idealized tracker in Appendix B, these are precisely the queries for which every requirement $r \in R(q)$ is supported by some retrieved memory item $m \in \bigcup_{j \le k} S_j$ with $m \models r$, so the completeness condition in Theorem B.4 is satisfied and the ideal gap $G^\star_k$ becomes empty.

On the other hand, some challenging questions are inherently underspecified given the stored memories, so the gap cannot be fully closed even if the agent continues to refine its query. For example, for the question 'When did Melanie paint a sunrise?', the correct answer in our setup is simply '2022' (the year). MemR3 quickly finds this year at the first iteration based on the evidence 'Melanie painted the lake sunrise image last year (2022).'. However, under the idealized abstraction, the requirement set $R(q)$ implicitly includes an exact date predicate (year-month-day), and no memory item $m \in \bigcup_{j \le K} S_j$ satisfies $m \models r$ for that finer-grained requirement. Thus, the precondition of Theorem B.4(3) is violated, and $G^\star_k$ never becomes empty; the practical tracker mirrors this by continuing to search for the missing specificity until it hits the maximum iteration budget. In such cases, the additional token consumption is primarily due to a mismatch between the question's granularity and the available memory, rather than a failure of the agent.

Ablation Studies. We ablate the effect of $n_{\text{chk}}$ (the number of chunks per iteration) and $n_{\max}$ (the maximum number of iterations), as reported in Fig. 4, as well as the number of iterations actually used, as reported in Fig. 5.

With recent advances in large language model (LLM) agents, memory systems have become the focus of storing and retrieving long-term, personalized memories. They can typically be categorized into two groups: 1) Parametric methods (Wang et al., 2024; Fang et al., 2025b) that encode memories implicitly into model parameters; these handle specific knowledge well but struggle with scalability and continual updates, as modifying parameters to incorporate new memories often risks catastrophic forgetting and requires expensive fine-tuning. 2) Non-parametric methods (Xu et al., 2025; LangChain Team, 2025; Chhikara et al., 2025; Rasmussen et al., 2025), in contrast, store explicit external information, enabling flexible retrieval and continual augmentation without altering model parameters. However, they typically rely on heuristic retrieval strategies, which can lead to noisy recall, heavy retrieval overhead, and increasing latency as the memory store grows.

Orthogonal to these works, this paper constructs an agentic memory system, MemR3, i.e., a Memory Retrieval system with Reflective Reasoning, to improve retrieval quality and efficiency. The system is built with LangGraph (LangChain Inc., 2025), with a router node selecting among three nodes: 1) the retrieve node, built on an existing memory system, which can retrieve multiple times with updated retrieval queries; 2) the reflect node, which iteratively reasons over the currently acquired evidence and the gaps between the question and that evidence; 3) the answer node, which produces the final response using the acquired information. Across all nodes, the system maintains a global evidence-gap tracker to update the acquired (evidence) and missing (gap) information.
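At a high level, the routing loop can be sketched in plain Python (a minimal illustration rather than the actual LangGraph implementation; the state fields and node callables here are placeholders):

```python
def run_memr3(question, retrieve_fn, reflect_fn, answer_fn, route_fn, n_max=5):
    """Minimal sketch of the MemR3 control loop: at each iteration the
    router picks retrieve / reflect / answer until an answer is produced
    or the iteration budget n_max is exhausted."""
    state = {"question": question, "evidence": [], "gaps": [question],
             "snippets": [], "answer": None}
    for _ in range(n_max):
        action = route_fn(state)                              # router node
        if action == "retrieve":
            state["snippets"] = retrieve_fn(state)            # query the memory store
        elif action == "reflect":
            state["evidence"], state["gaps"] = reflect_fn(state)  # update tracker
        else:                                                 # "answer"
            state["answer"] = answer_fn(state)
            break
    if state["answer"] is None:                               # budget exhausted
        state["answer"] = answer_fn(state)
    return state
```

A toy router that retrieves first, reflects while gaps remain, and then answers reproduces the intended retrieve, reflect, answer cycle.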

The system has three core advantages: 1) Accuracy and efficiency. By tracking evidence and gaps, and dynamically routing between retrieval and reflection, MemR3 minimizes unnecessary lookups and reduces noise, resulting in faster, more accurate answers. 2) Plug-and-play usage. As a controller independent of the underlying retriever or memory storage, MemR3 can be easily integrated into existing memory systems, improving retrieval quality without architectural changes. 3) Transparency and explainability. Since MemR3 maintains an explicit evidence-gap state over the course of an interaction, it can expose which memories support a given answer and which pieces of information were still missing at each step, providing a human-readable trace of the agent's decision process. We compare MemR3, the Full-Context setting (which uses all available memories), and the commonly adopted retrieve-then-answer paradigm from a high-level perspective in Fig. 1. The contributions of this work are threefold:

(1) A specialized closed-loop retrieval controller for long-term conversational memory. We propose MemR3, an autonomous controller that wraps existing memory stores and

Figure 1. Illustration of three memory-usage paradigms. Full-Context overloads the LLM with all memories and answers incorrectly; Retrieve-then-Answer retrieves relevant snippets but still miscalculates. In contrast, MemR 3 iteratively retrieves and reflects using an evidence-gap tracker (Acts 0-3), refines the query about Buddy's adoption date, and produces the correct answer (3 months).


turns standard retrieve-then-answer pipelines into a closed-loop process with explicit actions (retrieve / reflect / answer) and simple early-stopping rules. This instantiates the general LLM-as-controller idea specifically for non-parametric, long-horizon conversational memory.

(2) Evidence-gap state abstraction for explainable retrieval. MemR3 maintains a global evidence-gap state (E, G) that summarizes what has been reliably established in memory and what information remains missing. This state drives query refinement and stopping, and can be surfaced as a human-readable trace of the agent's progress. We further formalize this abstraction via an abstract requirement space and prove basic monotonicity and completeness properties, which we later use to interpret empirical behaviors.

(3) Empirical study across memory systems. We integrate MemR3 with both chunk-based RAG and a graph-based backend (Zep) on the LoCoMo benchmark and compare it with recent memory systems and agentic retrievers. Across backends and question types, MemR3 consistently improves LLM-as-a-Judge scores over its underlying retrievers.

Revisiting the Evaluation Protocols of LoCoMo

During our reproduction of the baselines, we identified a latent ambiguity in the LoCoMo dataset's category indexing. Specifically, the mapping between numerical IDs and

Figure 5. Number of questions requiring different numbers of iterations before final answers, across four categories.


semantic categories (e.g., Multi-hop vs. Single-hop) implies a non-trivial alignment challenge. We observed that this ambiguity has led to category misalignment in several recent studies (Chhikara et al., 2025; Wang & Chen, 2025), potentially skewing the granular analysis of agent capabilities.

To ensure a rigorous and fair comparison, we recalibrate the evaluation protocols for all baselines. In Table 1, we report the performance based on the corrected alignment, where the alignment can be induced by the number of questions in each category. We believe this clarification contributes to a more accurate understanding of the current SOTA landscape. Details of the dataset realignment are illustrated in Appendix C.3.

Conclusion

In this work, we introduce MemR3, an autonomous memory-retrieval controller that transforms standard retrieve-then-answer pipelines into a closed-loop process via a LangGraph-based sequential decision-making framework. By explicitly maintaining what is known and what remains unknown using an evidence-gap tracker, MemR3 can iteratively refine queries, balance retrieval and reflection, and terminate early once sufficient evidence has been gathered. Our experiments on the LoCoMo benchmark show that MemR3 consistently improves LLM-as-a-Judge scores over strong memory baselines, while incurring only modest token and latency overhead and remaining compatible with heterogeneous backends. Beyond these concrete gains, MemR3 offers an explainable abstraction for reasoning under partial observability in long-horizon agent settings.

However, we acknowledge some limitations for future work: 1) MemR 3 requires an existing retriever or memory structure, and particularly, the performance greatly depends on the retriever or memory structure. 2) The routing structure could lead to token waste for answering simple questions. 3) MemR 3 is currently not designed for multi-modal memories like images or audio.

Accessibility

Software and Data

Acknowledgements

The experiments are conducted on a machine with an AMD EPYC 7713P 64-core processor, an A100-SXM4-80GB GPU, and 512GB of RAM. Each experiment of MemR3 is repeated three times to report the average scores. Code is available at https://github.com/Leagein/memr3.

Impact Statement

Prompts

System prompt of \texttt{generate}.

The system prompt is defined as follows, where the 'decision directive' encodes the maximum iteration budget, the reflect-streak capacity, and the retrieval-opportunity check introduced in Sec. 3.4. By default, the 'decision directive' is a textual instruction: choose 'reflect' if you need to think about the evidence and gaps; choose 'answer' ONLY when evidence is solid and no gaps are noted; choose 'retrieve' otherwise. However, when the maximum iteration budget is reached, the 'decision directive' is set to 'answer' to stop early. When the reflect streak reaches its maximum capacity, the 'decision directive' is set to 'retrieve' to avoid repeated ineffective reflection. When no useful retrieval remains, the 'decision directive' is set to 'reflect' to avoid repeated ineffective retrieval. Through these constraints, the agent avoids unbounded ineffective actions and maintains stability.

User prompt of \texttt{generate}.

Apart from the system prompt, the user prompt is responsible for feeding additional information to the LLM. Specifically, at the k-th iteration, 'question' is the original question q; 'evidence block' and 'gap block' are the evidence E_k and gaps G_k introduced in Sec. 3.3; 'raw block' is the retrieved raw snippets S_k in Eq. 5; 'reasoning block' is the reasoning content F_k in Sec. 3.4; and 'last query' is the refined query Δq_k introduced in Sec. 3.4, which enables the new query to differ from the prior one. Note that these fields can be left empty if the corresponding information is not present.
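As an illustration, the assembly of the user prompt from these fields can be sketched as follows (the function and section labels are hypothetical, not the exact prompt template; empty fields are omitted, as described above):

```python
def build_user_prompt(question, evidence=None, gaps=None, raw=None,
                      reasoning=None, last_query=None):
    """Assemble the per-iteration user prompt from the tracker state;
    any field left empty is simply omitted from the prompt."""
    parts = [f"Question: {question}"]
    if evidence:
        parts.append("Evidence:\n" + "\n".join(f"- {e}" for e in evidence))
    if gaps:
        parts.append("Gaps:\n" + "\n".join(f"- {g}" for g in gaps))
    if raw:
        parts.append("Retrieved snippets:\n" + "\n".join(raw))
    if reasoning:
        parts.append(f"Reasoning:\n{reasoning}")
    if last_query:
        parts.append(f"Prior query: {last_query}")
    return "\n\n".join(parts)
```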

Figure: template of the user prompt, including the { reasoning block } and 'Prior Query { last query }' fields.

Formalizing the Evidence-Gap Tracker

A central component of MemR 3 is the evidence-gap tracker introduced in Sec. 3.3, which maintains an evolving summary of i) what information has been reliably established from memory and ii) what information is still missing to answer the query. While the practical implementation of this tracker is based on LLM-generated summaries, we introduce an idealized formal abstraction that clarifies its intended behavior, enables principled analysis, and provides a foundation for studying correctness and robustness. This abstraction does not assume perfect extraction; rather, the LLM acts as a stochastic approximator to the idealized tracker.

Definition B.1 (Idealized Requirement Space). For a user query q, we define a finite set of atomic information requirements R(q), which specify the minimal facts needed to fully answer the query:

$$
R(q) = \{\, r_1, r_2, \ldots, r_n \,\}.
$$

For example, for the question 'How many months passed between events A and B?', the requirement set can be

$$
R(q) = \{\, \text{date}(A),\ \text{date}(B) \,\}.
$$

Each requirement r ∈ R(q) is associated with a symbolic predicate (e.g., a timestamp, entity attribute, or event relation), and R(q) provides the semantic target against which retrieved memories are judged.

Definition B.2 (Memory-Support Relation). Let M be the memory store and S_k ⊆ M denote the snippets retrieved at iteration k. We define a relation m ⊨ r to indicate that memory item m ∈ M contains sufficient information to support requirement r ∈ R(q). Formally, m ⊨ r holds if the textual content of m contains a minimal witness (e.g., a timestamp, entity mention, or explicit assertion) matching the predicate corresponding to r. The matching criterion may be implemented via deterministic pattern rules or LLM-based semantic matching; our analysis is agnostic to this choice.
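A minimal deterministic instantiation of the support relation, assuming requirements are represented as regular-expression patterns whose match is the "minimal witness" (an illustrative choice, not the paper's implementation):

```python
import re

def supports(memory_item: str, requirement_pattern: str) -> bool:
    """Deterministic pattern-rule version of m |= r: the memory item
    supports the requirement if it contains a matching witness."""
    return re.search(requirement_pattern, memory_item) is not None

# Example requirement: a timestamp predicate asking for a 4-digit year.
YEAR = r"\b(19|20)\d{2}\b"
```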

Definition B.3 (Idealized Evidence-Gap Update Rule). At iteration k, the idealized tracker maintains two sets: i) the evidence E_k ⊆ R(q) and ii) the gaps G_k = R(q) \ E_k. Given newly retrieved snippets S_k, the ideal updates are

$$
E_k = E_{k-1} \cup \{\, r \in R(q) \,:\, \exists\, m \in S_k,\ m \models r \,\}, \qquad G_k = R(q) \setminus E_k .
$$

In this abstraction, the tracker monotonically accumulates verified requirements and removes corresponding gaps, providing a clean characterization of the desired system behavior independent of noise.
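The idealized update rule can be written as a short function (a direct transcription of Definition B.3, with a caller-supplied `supports` predicate standing in for the relation m ⊨ r):

```python
def ideal_update(evidence, requirements, new_snippets, supports):
    """One step of the idealized tracker (Definition B.3): add every
    requirement witnessed by a newly retrieved snippet to the evidence,
    and recompute the gap as the complement."""
    gained = {r for r in requirements
              if any(supports(m, r) for m in new_snippets)}
    evidence = evidence | gained          # monotone accumulation
    gaps = requirements - evidence        # G_k = R(q) \ E_k
    return evidence, gaps
```

Running two updates on a toy requirement set shows the monotonicity and completeness behavior analyzed below: evidence only grows, and the gap empties once every requirement is supported.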

Practical Instantiation via LLM Summaries

In MemR3, the tracker is instantiated through LLM-generated summaries:

$$
(E_k, G_k) = \mathrm{LLM}\big(q,\ E_{k-1},\ G_{k-1},\ S_k\big),
$$

where the prompt explicitly instructs the model to: (i) extract concise factual bullets relevant to q, (ii) enumerate missing information blocking a complete answer, and (iii) avoid hallucinations or speculative inference. Thus, (E_k, G_k) serves as a stochastic approximation to the idealized (E⋆_k, G⋆_k):

$$
E_k \approx E_k^\star, \qquad G_k \approx G_k^\star ,
$$

with deviations arising from LLM extraction noise. This perspective reconciles the formal update rule with the prompt-driven practical implementation.

Correctness Properties under Idealized Extraction

Although the practical instantiation lacks deterministic guarantees, the idealized tracker in Definition B.3 satisfies several intuitive properties essential for closed-loop retrieval.

Theorem B.4 (Properties of the Idealized Tracker). Assume that for all k and all r ∈ R(q), we have r ∈ E⋆_k if and only if there exists some m ∈ ⋃_{j≤k} S_j such that m ⊨ r. Then the following hold: (1) Monotonicity: E⋆_{k-1} ⊆ E⋆_k and G⋆_k ⊆ G⋆_{k-1} for all k. (2) Soundness: every r ∈ E⋆_k is supported by some retrieved memory item. (3) Completeness: if every r ∈ R(q) is supported by some m ∈ ⋃_{j≤k} S_j, then G⋆_k = ∅.

Proof. (1) By Definition B.3 (applied to the idealized sets),

$$
E_k^\star = E_{k-1}^\star \cup \{\, r \in R(q) \,:\, \exists\, m \in S_k,\ m \models r \,\} \supseteq E_{k-1}^\star ,
$$

and therefore

$$
G_k^\star = R(q) \setminus E_k^\star \subseteq R(q) \setminus E_{k-1}^\star = G_{k-1}^\star .
$$

(2) is the 'only if' direction of the assumption. (3) If every r ∈ R(q) is supported by some m ∈ ⋃_{j≤k} S_j, the assumption yields E⋆_k = R(q), hence G⋆_k = R(q) \ E⋆_k = ∅.

These properties characterize the target behavior that the LLM-based tracker implementation aims to approximate.

Robustness Considerations

Since real LLMs introduce extraction noise, the practical tracker may deviate from the idealized (E⋆_k, G⋆_k), for example, through false negatives (missing evidence), false positives (hallucinated evidence), or unstable gap estimates. In the main text (Sec. 3.3 and Sec. 4.3), we study these effects empirically by injecting noisy or contradictory memories and measuring their impact on routing decisions and final answer quality. The formal abstraction above serves as the reference model against which these robustness behaviors are interpreted.

Approximation Bias of the LLM Tracker

The abstraction in this section assumes access to an ideal tracker that updates (E⋆_k, G⋆_k) exactly according to the requirement-support relation m ⊨ r. In practice, MemR3 uses an LLM-generated tracker (E_k, G_k), which only approximates this ideal update. This introduces several forms of approximation bias: i) Coverage bias (false negatives): supported requirements r ∈ R(q) that are omitted from E_k; ii) Hallucination bias (false positives): requirements r that appear in E_k even though no retrieved memory item supports them; iii) Granularity bias: cases where the tracker records a coarser fact (e.g., a year) but the requirement space R(q) contains a finer predicate (e.g., an exact date), so the ideal requirement is never fully satisfied.
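The first two biases can be quantified directly as set differences between the approximate and ideal evidence sets (an illustrative sketch, not part of the system itself):

```python
def tracker_biases(approx_evidence, ideal_evidence):
    """Deviation of the LLM tracker from the ideal one, as set differences:
    coverage bias = supported requirements the tracker omitted (false negatives),
    hallucination bias = requirements it recorded without support (false positives)."""
    false_negatives = ideal_evidence - approx_evidence
    false_positives = approx_evidence - ideal_evidence
    return false_negatives, false_positives
```

Granularity bias is not a set difference over a fixed requirement space; it concerns a mismatch between the predicate granularity in R(q) and the facts actually recorded, as the toy example below illustrates.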

Toy example of the granularity bias

The 'Melanie painted a sunrise' case in Sec. 4.3 provides a concrete illustration of granularity bias. The question asks 'When did Melanie paint a sunrise?', and in our setup the correct answer is the year 2022. Under the ideal abstraction, however, the requirement space R(q) implicitly contains a fine-grained predicate r_date corresponding to the full year-month-day of the painting event, while the memory store only contains a coarse statement such as 'Melanie painted the lake sunrise image last year (2022).'

In the ideal tracker, no memory item m satisfies m ⊨ r_date, so the precondition of Theorem B.4's completeness clause is violated and the ideal gap G⋆_k never becomes empty. The practical LLM tracker mirrors this behavior: it quickly recovers the year 2022 as evidence, but continues to treat the exact date as a remaining gap, eventually hitting the iteration budget without fully closing G_k. This example shows that some apparent 'failures' of the approximate tracker are in fact structural: they arise from a mismatch between the granularity of R(q) and the information actually present in the memory store.

Experimental Settings

Baselines

We select four groups of advanced methods as baselines: 1) memory systems, including A-Mem (Xu et al., 2025), LangMem (LangChain Team, 2025), and Mem0 (Chhikara et al., 2025); 2) agentic retrievers, such as Self-RAG (Asai et al., 2024); we also design a RAG-CoT-RAG (RCR) pipeline as a strong agentic-retriever baseline combining RAG (Lewis et al., 2020) and Chain-of-Thought (CoT) (Wei et al., 2022); 3) backend baselines, including chunk-based (RAG (Lewis et al., 2020)) and graph-based (Zep (Rasmussen et al., 2025)) memory storage, demonstrating the plug-in capability of MemR3 across different retriever backends; 4) Full-Context, which is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on the J score (Chhikara et al., 2025; Wang & Chen, 2025). A more detailed introduction of these baselines is given in Appendix C.1.

We divide the baselines into four groups: memory systems, agentic retrievers, backend baselines, and Full-Context.

Memory systems

In this group, we consider recent advanced memory systems, including A-Mem (Xu et al., 2025), LangMem (LangChain Team, 2025), and Mem0 (Chhikara et al., 2025), to demonstrate the comprehensively strong capability of MemR3 from a memory-control perspective.

A-Mem (Xu et al., 2025) 1 . A-Mem is an agent memory module that turns interactions into atomic notes and links them into a Zettelkasten-style graph using embeddings plus LLM-based linking.

LangMem (LangChain Team, 2025). LangMem is LangChain's persistent memory layer that extracts key facts from dialogues and stores them in a vector store (e.g., FAISS/Chroma) for later retrieval.

Mem0 (Chhikara et al., 2025) 2 . Mem0 is an open-source memory system that enables an LLM to incrementally summarize, deduplicate, and store factual snippets, with an optional graph-based memory extension.

Agentic Retrievers

In this group, we examine the agentic structures underlying memory retrieval to demonstrate the strong retrieval performance of MemR3, and in particular the advantage of its agentic structure. To this end, we include Self-RAG (Asai et al., 2024) and design a strong heuristic baseline, RAG-CoT-RAG (RCR), which combines RAG and CoT (Wei et al., 2022).

Self-RAG (Asai et al., 2024). A model-driven retrieval controller where the LLM decides, at each step, whether to answer or issue a refined retrieval query. Unlike MemR3, retrieval decisions in Self-RAG are implicit in the model's chain-of-thought, without explicit state tracking. We adapt their original code and prompts to our task.

RAG-CoT-RAG (RCR) . We design a strong heuristic baseline that extends beyond ReAct (Yao et al., 2022) by performing one initial retrieval (Lewis et al., 2020), a CoT (Wei et al., 2022) step to identify missing information, and a second retrieval using a refined query. It provides multi-step retrieval but lacks an explicit evidence-gap state or a general controller.
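The RCR pipeline can be sketched in a few lines (the `retrieve`, `cot`, and `answer` callables are placeholders for the actual components, not our implementation):

```python
def rag_cot_rag(question, retrieve, cot, answer):
    """Sketch of the RCR baseline: one initial retrieval, one CoT step
    that names the missing information as a refined query, and a second
    retrieval with that refined query before answering."""
    snippets = retrieve(question)              # initial retrieval
    refined_query = cot(question, snippets)    # CoT identifies what is missing
    snippets += retrieve(refined_query)        # second, refined retrieval
    return answer(question, snippets)
```

Note the fixed two-shot structure: unlike MemR3, there is no evidence-gap state and no controller deciding whether further retrieval or reflection is needed.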

Backend Baselines

In this group, we incorporate vanilla RAG (Lewis et al., 2020) and Zep (Rasmussen et al., 2025) as retriever backends for MemR3 to demonstrate the advantages of its plug-in design. The former is a chunk-based method while the latter is graph-based, together covering most types of existing memory systems.

Vanilla RAG (Lewis et al., 2020). Vanilla RAG retrieves the top-k relevant snippets for the query once and provides a direct answer, without iterative retrieval or reasoning-based refinement. The other retrieval settings (n_chk, chunk size, etc.) are the same as those in MemR3.

Zep (Rasmussen et al., 2025). Zep is a hosted memory service that builds a time-aware knowledge graph over conversations and metadata to support fast semantic and temporal queries. We use their original implementation.

1 https://github.com/WujiangXu/A-mem

Table 3. The alignment of the orders and categories in the LoCoMo dataset.

Full-Context

Lastly, we include Full-Context as a strong baseline, which provides the model with the entire conversation or memory buffer without retrieval, serving as an upper-bound reference that is unconstrained by retrieval errors or missing information.

Other protocols.

For all chunk-based methods, i.e., RAG (Lewis et al., 2020), Self-RAG (Asai et al., 2024), RAG-CoT-RAG, and MemR3 (RAG retriever), we use text-embedding-3-large (OpenAI, 2024a) as the embedding model and a re-ranking strategy (Reimers & Gurevych, 2019) (ms-marco-MiniLM-L-12-v2) to search for relevant memories rather than merely similar ones. The chunk size is selected from {128, 256, 512, 1024} using the GPT-4o-mini backend with n_max = 1 and n_chk = 1, and we ultimately choose 256. This chunk size is also in line with Mem0 (Chhikara et al., 2025).
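For illustration, fixed-size chunking can be sketched as below (whitespace tokenization is a simplification; the actual tokenizer may differ, and `chunk_size=256` matches the setting selected above):

```python
def chunk_text(text, chunk_size=256):
    """Split a conversation into fixed-size chunks of at most
    chunk_size tokens each (whitespace tokens for simplicity)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```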

Re-alignment of LoCoMo dataset

Misalignment in existing works. Although the correct order of the different categories is not explicitly reported in LoCoMo (Maharana et al., 2024), we can infer it from the number of questions in each category. The correct alignment is shown in Table 3. We believe this clarification could benefit the LLM memory community.

Repeated questions in LoCoMo dataset. Note that the numbers of single-hop and adversarial questions are 841 and 446 in the original LoCoMo, while our count yields 830 and 445, due to 12 repeated questions. Specifically, the first of these is repeated in both the single-hop and adversarial categories in the 2nd conversation (we remove the copy in the adversarial category), while the remaining 11 questions are repeated within the single-hop category in the 8th conversation.

Table 4. Repeated experiments of MemR 3 in the main results in Table 1.


Figure 6. Average token consumption of the retrieved snippets (left y-axis) and LLM-as-a-Judge (J) Score (right y-axis) of RAG, MemR 3 , and Full-Context across four categories.


Experimental Results

Repeated Experiments.

For the LoCoMo dataset, we show the repeated experiments of MemR 3 in Table 4.

Token Consumption

In Fig. 6, we compare the average token consumption of the retrieved snippets and the J score of RAG, MemR3, and Full-Context across four categories. The number of chunks per retrieval for both RAG and MemR3 is set to n_chk = 5, with n_max = 2 for MemR3. We observe that MemR3 outperforms RAG across all four categories with only a few additional tokens. While Full-Context consumes significantly more tokens than MemR3, it surpasses MemR3 only on multi-hop questions.

Input: query q, previous snippets S_{k-1}, iteration k, budgets n_max, n_cap, current reflect-streak length n_streak.
Output: action a_k.
if k ≥ n_max then
    a_k = answer ▷ Max iteration budget.
else if S_{k-1} = ∅ then
    a_k = reflect ▷ No retrieved snippets.
else if n_streak ≥ n_cap then
    a_k = retrieve ▷ Max reflect streak.
else
    pass ▷ Keep the generated action.
end if
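A minimal Python transcription of these guardrails (names are illustrative; the function overrides the LLM-generated action only when a budget constraint fires):

```python
def apply_decision_directive(generated_action, k, prev_snippets,
                             n_max, n_cap, n_streak):
    """Guardrails of the routing algorithm above: stop at the iteration
    budget, reflect when nothing was retrieved, and retrieve after too
    many consecutive reflections; otherwise keep the generated action."""
    if k >= n_max:
        return "answer"          # max iteration budget
    if not prev_snippets:
        return "reflect"         # no retrieved snippets
    if n_streak >= n_cap:
        return "retrieve"        # max reflect streak
    return generated_action      # keep the generated action
```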
Table 1. LLM-as-a-Judge score (%) on LoCoMo across four categories; improvements of MemR3 over its backbone retriever are shown in parentheses.

| LLM | Method | Multi-Hop | Temporal | Open-Domain | Single-Hop | Overall |
|---|---|---|---|---|---|---|
| GPT-4o-mini | A-Mem (Xu et al., 2025) | 61.70 | 64.49 | 40.62 | 76.63 | 69.06 |
| | LangMem (LangChain Team, 2025) | 62.23 | 23.43 | 47.92 | 71.12 | 58.10 |
| | Mem0 (Chhikara et al., 2025) | 67.13 | 55.51 | 51.15 | 72.93 | 66.88 |
| | Self-RAG (Asai et al., 2024) | 69.15 | 64.80 | 34.38 | 88.31 | 76.46 |
| | RAG-CoT-RAG | 71.28 | 71.03 | 42.71 | 86.99 | 77.96 |
| | Zep (Rasmussen et al., 2025) | 67.38 | 73.83 | 63.54 | 78.67 | 74.62 |
| | MemR3 (ours, Zep backbone) | 69.39 (+2.01) | 73.83 (+0.00) | 67.01 (+3.47) | 80.60 (+1.93) | 76.26 (+1.64) |
| | RAG (Lewis et al., 2020) | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| | MemR3 (ours, RAG backbone) | 71.39 (+2.60) | 76.22 (+11.11) | 61.11 (+2.78) | 89.44 (+5.58) | 81.55 (+6.01) |
| | Full-Context | 72.34 | 58.88 | 59.38 | 86.39 | 76.32 |
| GPT-4.1-mini | A-Mem (Xu et al., 2025) | 71.99 | 74.77 | 58.33 | 79.88 | 76.00 |
| | LangMem (LangChain Team, 2025) | 74.47 | 61.06 | 67.71 | 86.92 | 78.05 |
| | Mem0 (Chhikara et al., 2025) | 62.41 | 57.32 | 44.79 | 66.47 | 62.47 |
| | Self-RAG (Asai et al., 2024) | 75.89 | 75.08 | 54.17 | 90.12 | 82.08 |
| | RAG-CoT-RAG | 80.85 | 81.62 | 62.50 | 90.12 | 84.89 |
| | Zep (Rasmussen et al., 2025) | 72.34 | 77.26 | 64.58 | 83.49 | 78.94 |
| | MemR3 (ours, Zep backbone) | 77.78 (+5.44) | 77.78 (+0.52) | 69.79 (+5.21) | 84.42 (+0.93) | 80.88 (+1.94) |
| | RAG (Lewis et al., 2020) | 73.05 | 73.52 | 62.50 | 85.90 | 79.46 |
| | MemR3 (ours, RAG backbone) | 81.20 (+8.15) | 82.14 (+8.62) | 71.53 (+9.03) | 92.17 (+6.27) | 86.75 (+7.29) |
| | Full-Context | 86.43 | 86.82 | 71.88 | 93.73 | 89.00 |
| Method | Multi-Hop | Temporal | Open-Domain | Single-Hop | Overall |
|---|---|---|---|---|---|
| RAG | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| MemR3 | 71.39 | 76.22 | 61.11 | 89.44 | 81.55 |
| w/o mask | 62.41 | 68.54 | 55.21 | 72.17 | 68.54 |
| w/o Δq_k | 66.67 | 75.08 | 60.42 | 83.37 | 77.11 |
| w/o reflect | 65.25 | 73.83 | 61.46 | 83.37 | 76.65 |
| Category | Multi-Hop | Temporal | Open-Domain | Single-Hop | Adversarial |
|---|---|---|---|---|---|
| Order | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 |
| # Questions | 282 | 321 | 96 | 830 | 445 |
| LLM | Retriever | Run | Multi-Hop | Temporal | Open-Domain | Single-Hop | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini | Zep | 1 | 68.09 | 73.52 | 68.75 | 80.72 | 76.13 |
| | | 2 | 69.86 | 72.59 | 67.71 | 80.36 | 76.00 |
| | | 3 | 70.21 | 75.39 | 64.58 | 80.72 | 76.65 |
| | | mean | 69.39 | 73.83 | 67.01 | 80.60 | 76.26 |
| | RAG | 1 | 71.63 | 77.26 | 61.46 | 89.28 | 81.75 |
| | | 2 | 70.21 | 76.01 | 59.38 | 89.40 | 81.16 |
| | | 3 | 72.34 | 75.39 | 62.50 | 89.64 | 81.75 |
| | | mean ± std | 71.39 ± 1.08 | 76.22 ± 0.95 | 61.11 ± 1.59 | 89.44 ± 0.18 | 81.56 ± 0.34 |
| GPT-4.1-mini | Zep | 1 | 78.72 | 78.50 | 72.92 | 84.34 | 81.36 |
| | | 2 | 75.89 | 77.26 | 68.75 | 84.58 | 80.44 |
| | | 3 | 78.72 | 77.57 | 67.71 | 84.34 | 80.84 |
| | | mean | 77.78 | 77.78 | 69.79 | 84.42 | 80.88 |
| | RAG | 1 | 81.56 | 83.18 | 69.79 | 91.93 | 86.79 |
| | | 2 | 82.62 | 80.69 | 75.00 | 92.65 | 87.18 |
| | | 3 | 79.43 | 82.55 | 69.79 | 91.93 | 86.27 |
| | | mean ± std | 81.20 ± 1.62 | 82.14 ± 1.29 | 71.53 ± 3.01 | 92.17 ± 0.42 | 86.75 ± 0.46 |


Prior work on non-parametric agent memory systems spans a wide range of designs for memory management and utilization (Du et al., 2025), storing structured (Rasmussen et al., 2025) or unstructured (Zhong et al., 2024) external knowledge. Production-oriented agents such as MemGPT (Packer et al., 2023) introduce an OS-style hierarchical memory system that allows the model to page information between context and external storage, and SCM (Wang et al., 2023) provides a controller-based memory stream that retrieves and summarizes past information only when necessary. Additionally, Zep (Rasmussen et al., 2025) builds a temporal knowledge graph that unifies and retrieves evolving conversational and business data. A-Mem (Xu et al., 2025) creates a self-organizing, Zettelkasten-style memory that links and evolves over time. Mem0 (Chhikara et al., 2025) extracts and manages persistent conversational facts with an optional graph-structured memory. MIRIX (Wang et al., 2025) offers a multimodal, multi-agent memory system with six specialized memory types. LightMem (Fang et al., 2025a) proposes a lightweight and efficient memory system inspired by the Atkinson-Shiffrin model. Another related approach, Reflexion (Shinn et al., 2023), improves language agents through verbal reinforcement across episodes, storing natural-language reflections to guide future trials.

In this paper, we explicitly limit our scope to long-term conversational memory. Existing parametric approaches (wang2024wise; fang2025alphaedit), KV-cache–based mechanisms (zhong2024memorybank; eyuboglu2025cartridges), and streaming multi-task memory benchmarks (wei2025evo) are out of scope for this work. Orthogonal to existing storage, MemR3 is an autonomous retrieval controller that uses a global evidence–gap tracker to route different actions, enabling closed-loop retrieval.

Retrieval-Augmented Generation (RAG) (lewis2020retrieval) established the modern retrieve-then-answer paradigm; subsequent work explored stronger retrievers (karpukhin2020dense; izacard2021leveraging). Beyond vanilla RAG, recent work, such as Self-RAG (asai2024self), Reflexion (shinn2023reflexion), ReAct (yao2022react), and FAIR-RAG (asl2025fair), has shown that letting a language model (LM) decide when to retrieve, when to reflect, and when to answer can substantially improve multi-step reasoning and factuality in tool-augmented settings. MemR3 follows this general “LLM-as-controller” paradigm but applies it specifically to long-term conversational memory over non-parametric stores. Concretely, we adopt the idea of multi-step retrieval and self-reflection from these frameworks, but i) move the controller outside the base LM as a LangGraph program, ii) maintain an explicit evidence–gap state that separates verified memories from remaining uncertainties, and iii) interface this state with different memory backends (e.g., RAG and Zep (rasmussen2025zep)) commonly used in long-horizon dialogue agents. Our goal is not to replace these frameworks, but to provide a specialized retrieval controller that can be plugged into existing memory systems.

In this section, we first formulate the problem and provide preliminaries in Sec. 3.1, and then give a system overview of MemR3 in Sec. 3.2. Additionally, we describe the two core components that enable accurate and efficient retrieval: the global evidence-gap tracker (Sec. 3.3) and the router (Sec. 3.4).

We consider a long-horizon LLM agent that interacts with a user, forming a memory store $\mathcal{M}=\{m_i\}_{i=1}^{N}$, where each memory item $m_i$ may correspond to a dialogue utterance, personal fact, structured record, or event, often accompanied by metadata such as timestamps or speakers. Given a user query $q$, a retriever is applied to retrieve a set of memory snippets $\mathcal{S}$ that are useful for generating the final answer. Then, given a designed prompt template $p$, the goal is to produce an answer $w$:

$$w = \mathrm{LLM}\big(p(q, \mathcal{S})\big),$$

which is accurate (consistent with all relevant memories in ℳ\mathcal{M}), efficient (requiring minimal retrieval cycles and low latency), and robust (stable under noisy, redundant, or incomplete memory stores) as much as possible.

Existing memory systems have done substantial work on the memory storage $\mathcal{M}$, but typically follow an open-loop pipeline: 1) apply a single retrieval pass; 2) feed the selected memories $\mathcal{S}$ into a generator to produce the answer $w$. This approach lacks adaptivity: retrieval does not incorporate intermediate reasoning, and the system never represents which information remains missing. This leads to both under-retrieval (insufficient evidence) and over-retrieval (long, noisy contexts).

MemR3 addresses these limitations by treating retrieval as an autonomous sequential decision process with explicit modeling of both acquired evidence and remaining gaps.

MemR3 is implemented as a directed agent graph comprising three operational nodes (Retrieve, Reflect, Answer) and one control node (Router) using LangGraph (langchain2025langgraph), an open-source framework for building stateful, multi-agent workflows as graphs of interacting nodes. The agent maintains a mutable internal state

$$(q,\ \mathcal{S},\ \mathcal{E},\ \mathcal{G},\ k),$$

where $q$ and $\mathcal{S}$ are the aforementioned original user query and retrieved snippets, respectively. $\mathcal{E}$ is the accumulated evidence relevant to $q$ and $\mathcal{G}$ is the remaining missing information (the “gap”) between $q$ and $\mathcal{E}$. Moreover, we maintain the iteration index $k$ to control early stopping.

At each iteration $k$, the router chooses an action in $\{\texttt{retrieve},\ \texttt{reflect},\ \texttt{answer}\}$, which determines the next node in the computation graph. The pipeline is shown in Fig. 2. This transforms the classical retrieve-then-answer pipeline into a closed-loop controller that can repeatedly refine retrieval queries, integrate new evidence, and stop early once the information gap is resolved.
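To make the control flow concrete, the loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's LangGraph implementation: `AgentState` mirrors the state tuple above, and `router`, `retrieve_node`, `reflect_node`, and `answer_node` stand in for the LLM-backed nodes.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Mutable internal state threaded through the graph: (q, S, E, G, k)."""
    query: str
    snippets: list = field(default_factory=list)   # retrieved snippets S
    evidence: list = field(default_factory=list)   # accumulated evidence E
    gaps: list = field(default_factory=list)       # remaining gaps G
    k: int = 0                                     # iteration index

def run_closed_loop(state, router, retrieve_node, reflect_node, answer_node,
                    n_max=5):
    """Route among retrieve / reflect / answer until an answer is produced."""
    while True:
        # The iteration budget forces an answer once exhausted (Sec. 3.4).
        action = "answer" if state.k >= n_max else router(state)
        if action == "retrieve":
            retrieve_node(state)
        elif action == "reflect":
            reflect_node(state)
        else:
            return answer_node(state)
        state.k += 1
```

A router that answers only once the gap set is empty turns this loop into exactly the closed-loop behavior described above: retrieval and reflection repeat until $\mathcal{G}$ is resolved or the budget runs out.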

A core design principle of MemR3 is to explicitly maintain and update two state variables: the evidence $\mathcal{E}$ and the gap $\mathcal{G}$. These variables summarize what the agent currently knows and what it still needs to know to answer the question.

At iteration $k$, the evidence $\mathcal{E}_k$ and gaps $\mathcal{G}_k$ are updated according to the retrieved snippets $\mathcal{S}_{k-1}$ (from the retrieve node) or the reflective reasoning $\mathcal{F}_{k-1}$ (from the reflect node), together with the evidence $\mathcal{E}_{k-1}$ and gaps $\mathcal{G}_{k-1}$ from iteration $k-1$:

$$(\mathcal{E}_k,\ \mathcal{G}_k,\ a_k) = \mathrm{LLM}\big(p_k;\ q,\ \mathcal{E}_{k-1},\ \mathcal{G}_{k-1},\ \mathcal{S}_{k-1}\ \text{or}\ \mathcal{F}_{k-1}\big),$$

where $p_k$ is the prompt template at iteration $k$. Additionally, $a_k$ is the action at iteration $k$, which will be introduced in Sec. 3.4. Note that we explicitly state in $p_k$ that $\mathcal{E}_k$ must not contain any information listed in $\mathcal{G}_k$, keeping evidence and gaps decoupled. An example illustrating the evidence-gap tracker is shown in Fig. 3.

Through the evidence-gap tracker, MemR3 maintains a structured and transparent internal state that continuously refines the agent's understanding of both i) what has already been established as relevant evidence, and ii) what missing information still prevents a complete and faithful answer. This explicit decoupling enables MemR3 to reason under partial observability: as long as $\mathcal{G}_k\neq\varnothing$, the agent recognizes that its current knowledge is insufficient and can proactively issue a refined retrieval query to close the remaining gap. Conversely, when $\mathcal{G}_k$ becomes empty, the router detects that the agent has accumulated adequate evidence and can safely transition to the answer node.

Beyond guiding retrieval, the evidence-gap representation also makes the agent's behavior more transparent. At any iteration $k$, the pair $(\mathcal{E}_k,\mathcal{G}_k)$ can be surfaced as a structured explanation of i) which memories the agent currently treats as relevant evidence and ii) which unresolved questions or missing details are preventing a confident answer. This trace provides users and developers with a faithful view of how the agent arrived at its final answer and why additional retrieval steps were taken (or not). In the following, we state an informal theorem that indicates the properties of the idealized evidence-gap tracker.

Under an idealized requirement space $R(q)$ for a specific query $q$, the evidence-gap tracker in MemR3 is monotone (evidence never decreases and gaps never increase), sound (every supported requirement eventually enters the evidence set), and complete (if every requirement $r\in R(q)$ is supported by some memory, the ideal gap eventually becomes empty).

Formally, in Appendix B we define the abstract requirement space $R(q)$ and characterize the tracker as a set-valued update on $R(q)$, proving fundamental monotonicity, soundness, and completeness properties (Theorem B.4), which we later use in Sec. 4.3 to interpret empirical phenomena such as why some questions cannot be fully resolved even after exhausting the iteration budget.

We explicitly define several nodes in the LangGraph framework, including start, end, generate, router, retrieve, reflect, and answer. Specifically, start is always followed by retrieve, and end is reached after answer. generate is an LLM generation node, already introduced in Eq. 3. In the following, we further introduce the router node and three action nodes.

Router. At each iteration, the router, an autonomous sequential controller, uses the current state to select an action from $\{\texttt{retrieve},\texttt{reflect},\texttt{answer}\}$. Each action $a_k$ is accompanied by a textual generation:

$$x_k = \begin{cases} \Delta q_k, & a_k = \texttt{retrieve},\\ f_k, & a_k = \texttt{reflect},\\ w_k, & a_k = \texttt{answer}, \end{cases}$$

where $\Delta q_k$ is a refinement query, $f_k$ is reasoning content, and $w_k$ is a draft answer, which are utilized in the downstream action nodes. To ensure stability, the router applies three deterministic constraints: 1) a maximum iteration budget $n_{\max}$ that forces an answer action once the budget is exhausted; 2) a reflect-streak capacity $n_{\mathrm{cap}}$ that forces a retrieve action when too many reflections have occurred consecutively; and 3) a retrieval-opportunity check that switches the action to reflect whenever the retrieval stage returns no snippets. The router's algorithm is shown in Alg. 1.

These lightweight rules stabilize the decision process while preserving flexibility. We further introduce the detailed implementation of these constraints when introducing the system prompt in Appendix A.1.
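The three guardrails can be sketched as a deterministic layer applied on top of whatever action the LLM router proposes. The function and argument names below are illustrative, and the default values for $n_{\max}$ and $n_{\mathrm{cap}}$ are assumptions rather than the paper's tuned settings.

```python
def apply_router_constraints(proposed, k, reflect_streak, last_retrieval_empty,
                             n_max=5, n_cap=2):
    """Deterministic guardrails applied to the LLM router's proposed action.

    1) force `answer` once the iteration budget n_max is exhausted;
    2) force `retrieve` after n_cap consecutive reflections;
    3) switch to `reflect` when the last retrieval returned no snippets.
    """
    if k >= n_max:
        return "answer"
    if proposed == "reflect" and reflect_streak >= n_cap:
        return "retrieve"
    if proposed == "retrieve" and last_retrieval_empty:
        return "reflect"
    return proposed
```

Because these checks are ordinary control flow rather than LLM calls, they cost no tokens and guarantee termination regardless of how the stochastic router behaves.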

Given a generated refinement $\Delta q_k$, the retrieve node constructs $q_k^{\mathrm{ret}} = q \oplus \Delta q_k$, where $\oplus$ denotes textual combination and $q$ is the original query, and then fetches new memory snippets:

$$\mathcal{S}_k = \mathrm{Retrieve}\big(q_k^{\mathrm{ret}},\ \mathcal{M}\big),$$

Snippets $\mathcal{S}_k$ are used independently for the next generation, without history accumulation. Moreover, retrieved snippets are masked to prevent re-selection.

A major benefit of MemR3 is that it treats all concrete retrievers as plug-in modules. Any retriever, e.g., vector search, graph memory, hybrid stores, or future systems, can be integrated into MemR3 as long as they return textual snippets, optionally with stable identifiers that can be masked once used. This abstraction ensures MemR3 remains lightweight, portable, and compatible.
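This plug-in abstraction can be sketched as a thin wrapper around any backend search function. The wrapper below is illustrative, assuming the backend returns `(snippet_id, snippet_text)` pairs; any vector, graph, or hybrid store satisfying that contract would slot in unchanged.

```python
class MaskedRetriever:
    """Wraps any snippet-returning backend and masks already-used snippet IDs."""

    def __init__(self, backend_search):
        # backend_search: query -> list of (snippet_id, snippet_text) pairs
        self.backend_search = backend_search
        self.used_ids = set()

    def retrieve(self, q, delta_q, top_k=3):
        q_ret = f"{q} {delta_q}".strip()  # q ⊕ Δq_k as textual combination
        hits = [(i, t) for i, t in self.backend_search(q_ret)
                if i not in self.used_ids][:top_k]
        self.used_ids.update(i for i, _ in hits)  # mask against re-selection
        return [t for _, t in hits]
```

Masking at the wrapper level keeps the backend stateless: repeated calls with refined queries surface only genuinely new snippets, which is what keeps the iteration budget productive.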

The reflect node incorporates the reasoning process $\mathcal{F}_{k-1}$ and invokes the router to update $(\mathcal{E}_k, \mathcal{G}_k, a_k)$ in Eq. 3, where evidence and gaps can be re-summarized.

Once the router selects answer, the final answer is generated from the original query $q$, the draft answer $w_k$, and the evidence $\mathcal{E}_k$, using the prompt $p_w$ from rasmussen2025zep:

$$w = \mathrm{LLM}\big(p_w;\ q,\ w_k,\ \mathcal{E}_k\big),$$

The answer LLM is instructed to avoid hallucinations and remain faithful to evidence.

Although MemR3 introduces extra routing steps, it maintains low overhead via 1) compact evidence and gap summaries: only short summaries are repeatedly fed into the router; 2) masked retrieval: each retrieval call yields genuinely new information; 3) small iteration budgets: most questions can be answered within a single iteration, and the complicated questions that require multiple iterations are bounded by a small maximum iteration budget. These design choices ensure that MemR3 improves retrieval quality without large increases in retrieved tokens.

The experiments are conducted on a machine with an AMD EPYC 7713P 64-core processor, an A100-SXM4-80GB GPU, and 512GB of RAM. Each experiment of MemR3 is repeated three times to report the average scores. Code available: https://github.com/Leagein/memr3.

In line with baselines (xu2025amem; chhikara2025mem0), we employ the LoCoMo (maharana2024evaluating) dataset as a fundamental benchmark. LoCoMo contains a total of 10 conversations spanning five question categories: 1) multi-hop, 2) temporal, 3) open-domain, 4) single-hop, and 5) adversarial. We exclude the last ‘adversarial’ category, following existing work (chhikara2025mem0; wang2025mirix), since it only tests whether unanswerable questions can be identified. Each conversation has approximately 600 dialogue turns with 26k tokens and 200 questions on average.

Metrics. We adopt the LLM-as-a-Judge (J) score to evaluate answer quality following chhikara2025mem0; wang2025mirix. Compared with surface-level measures such as F1 or BLEU-1 (xu2025amem; 10738994), this metric avoids reliance on simple lexical overlap and instead captures semantic alignment. Specifically, GPT-4.1 (openai2025gpt41) is employed to judge whether the answer is correct according to the original question and the generated answer, following the prompt by chhikara2025mem0.

Baselines. We select four groups of advanced methods as baselines: 1) memory systems, including A-Mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0); 2) agentic retrievers, such as Self-RAG (asai2024self); we also design a RAG-CoT-RAG (RCR) pipeline beyond ReAct (yao2022react) as a strong agentic retriever baseline combining both RAG (lewis2020retrieval) and Chain-of-Thought (CoT) (wei2022chain); 3) backend baselines, including chunk-based (RAG (lewis2020retrieval)) and graph-based (Zep (rasmussen2025zep)) memory storage, demonstrating the plug-in capability of MemR3 across different retriever backends; 4) ‘Full-Context’, which is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on the J score (chhikara2025mem0; wang2025mirix). A more detailed introduction of these baselines is given in Appendix C.1.

Other Settings. Other experimental settings and protocols are shown in Appendix C.2.

LLM Backend. Reviewing recent work, we find that GPT-4o-mini (openai2024gpt4omini) is the most frequently used backend, as it is inexpensive and performs well. Since some work (wang2025mirix) also includes GPT-4.1-mini (openai2025gpt41), we adopt both as our LLM backends. In our main results, MemR3 is run at temperature 0.

Overall. Table 1 reports LLM-as-a-Judge (J) scores across four LoCoMo categories. Across both LLM backends and memory backbones, MemR3 consistently outperforms its underlying retrievers (RAG and Zep) and achieves strong overall J scores. Under GPT-4o-mini, MemR3 lifts the overall score of Zep from 74.62% to 76.26%, and RAG from 75.54% to 81.55%, with the latter even outperforming the Full-Context baseline (76.32%). With GPT-4.1-mini, we see the same pattern: MemR3 improves Zep from 78.94% to 80.88% and RAG from 79.46% to 86.75%, making the RAG-backed variant the strongest retrieval-based system and narrowing the gap to Full-Context (89.00%). As expected, methods instantiated with GPT-4.1-mini are consistently stronger than their GPT-4o-mini counterparts. Full-Context also benefits substantially from the stronger LLM, but under GPT-4o-mini it lags behind the best retrieval-based systems, especially on temporal and open-domain questions. Overall, these results indicate that closed-loop retrieval with an explicit evidence–gap state yields gains primarily orthogonal to the choice of LLM or memory backend, and that MemR3 particularly benefits from backends that expose relatively raw snippets (RAG) rather than heavily compressed structures (Zep).

Multi-hop. Multi-hop questions require chaining multiple pieces of evidence and, therefore, directly test our reflective controller. Under GPT-4o-mini, MemR3 improves both backbones on this category: the multi-hop J score rises from 68.79% to 71.39% on RAG and from 67.38% to 69.39% on Zep, bringing both close to the Full-Context score (72.34%). With GPT-4.1-mini, the gains are more pronounced: MemR3 boosts RAG from 73.05% to 81.20% and Zep from 72.34% to 77.78%, outperforming all other baselines and approaching the Full-Context upper bound (86.43%). These consistent gains suggest that explicitly tracking evidence and gaps helps the agent coordinate multiple distant memories via iterative retrieval, rather than relying on a single heuristic pass.

Temporal. Temporal questions stress the model’s ability to reason about ordering and dating of events over long horizons, where both under- and over-retrieval can be harmful. Here, MemR3 delivers some of its largest relative improvements. For GPT-4o-mini, the temporal J score of RAG jumps from 65.11% to 76.22%, outperforming both the original RAG and the Zep baseline (73.83%), while MemR3 with a Zep backbone preserves Zep’s strong temporal accuracy (73.83%). Full-Context performs notably worse in this regime (58.88%), indicating that simply supplying all dialogue turns can hinder temporal reasoning under a weaker backbone. With GPT-4.1-mini, MemR3 again significantly strengthens temporal reasoning: RAG improves from 73.52% to 82.14%, and Zep from 77.26% to 77.78%, making the RAG-backed MemR3 the best retrieval-based system and closing much of the remaining gap to Full-Context (86.82%). These findings support our design goal that explicitly modeling “what is already known” versus “what is still missing” helps the agent align and compare temporal relations more robustly.

Open-Domain. Open-domain questions are less tied to the user’s personal timeline and often require retrieving diverse background knowledge, which makes retrieval harder to trigger and steer. Despite this, MemR3 consistently improves over its backbones. Under GPT-4o-mini, MemR3 increases the open-domain J score of RAG from 58.33% to 61.11% and that of Zep from 63.54% to 67.01%, with the Zep-backed variant achieving the best performance among all methods in this block, surpassing Full-Context (59.38%). With GPT-4.1-mini, the gains become even larger: MemR3 lifts RAG from 62.50% to 71.53% and Zep from 64.58% to 69.79%, nearly matching the Full-Context baseline (71.88%) and again outperforming all other baselines. We attribute these improvements to the router’s ability to interleave retrieval with reflection: when initial evidence is noisy or off-topic, MemR3 uses the gap representation to reformulate queries and pull in more targeted external knowledge rather than committing to an early, brittle answer.

Single-hop. Single-hop questions can often be answered from a single relevant memory snippet, so the potential headroom is smaller, but MemR3 still yields consistent gains. With GPT-4o-mini, MemR3 raises the single-hop J score from 78.67% to 80.60% on Zep and from 83.86% to 89.44% on RAG, with the latter surpassing the Full-Context baseline (86.39%). Under GPT-4.1-mini, MemR3 improves Zep from 83.49% to 84.42% and RAG from 85.90% to 92.17%, making the RAG-backed variant the strongest method overall aside from Full-Context (93.73%). Together with the iteration-count analysis in Sec. 4.3, these results suggest that the router often learns to terminate early on straightforward single-hop queries, gaining accuracy primarily through better evidence selection rather than additional reasoning depth, and thus adding little overhead in tokens or latency.

We ablate various hyperparameters and modules to evaluate their impact in MemR3 with the RAG retriever. During these experiments, we utilize GPT-4o-mini as a consistent LLM backend.

MH = Multi-hop; OD = Open-domain; SH = Single-hop.

We first examine the contribution of the main design choices in MemR3 by progressively removing them while keeping the RAG retriever and all hyperparameters fixed. As shown in Table 2, disabling masking for previously retrieved snippets (w/o mask) results in the largest degradation, reducing the overall J score from 81.55% to 68.54% and harming every category. This confirms that repeatedly surfacing the same memories wastes budget and fails to effectively close the remaining gaps. Removing the refinement query $\Delta q_k$ (w/o $\Delta q_k$) has a milder effect: temporal and open-domain performance change little, but multi-hop and single-hop scores decline significantly, indicating that tailoring retrieval queries from the current evidence-gap state is particularly beneficial for these categories. Disabling the reflect node (w/o reflect) similarly reduces performance (from 81.55% to 76.65%), with notable drops on multi-hop and single-hop questions, highlighting the value of interleaving reasoning-only steps with retrieval. Note that in Table 2, the raw retrieved snippets are only visible to the vanilla RAG.

We first choose a nominal configuration for MemR3 (with a RAG retriever) by provisionally setting the number of chunks per iteration $n_{\text{chk}}=3$ and the maximum iteration budget $n_{\max}=5$. In Fig. 4(a), we fix $n_{\max}=5$ and ablate $n_{\text{chk}}\in\{1,3,5,7,9\}$. In Fig. 4(b), we fix $n_{\text{chk}}=3$ and ablate $n_{\max}\in\{1,2,3,4,5\}$. Considering both the LLM-as-a-Judge score and token consumption, we eventually choose $n_{\text{chk}}=5$ and $n_{\max}=5$ in all main experiments.

We further inspect how often MemR3 actually uses multiple retrieve/reflect/answer iterations when $n_{\text{chk}}=5$ and $n_{\max}=5$ (Fig. 5). Overall, most questions are answered after a single iteration, and this effect is particularly strong for single-hop questions. An exception is open-domain questions, for which 58 of 96 require continuous retrieval or reflection until the maximum number of iterations is reached, highlighting the inherent challenges and uncertainty in these questions. Additionally, only a small fraction of questions terminate at intermediate depths (2–4 iterations), suggesting that MemR3 either becomes confident early or uses the whole iteration budget when the gap remains non-empty.

We observe that this distribution arises from two regimes. On the one hand, straightforward questions require only a single piece of evidence and can be resolved in a single iteration, consistent with intuition. From the perspective of the idealized tracker in Appendix B, these are precisely the queries for which every requirement $r\in R(q)$ is supported by some retrieved memory item $m\in\bigcup_{j\le k}S_j$ with $m\models r$, so the completeness condition in Theorem B.4 is satisfied and the ideal gap $G_k^{\star}$ becomes empty.

On the other hand, some challenging questions are inherently underspecified given the stored memories, so the gap cannot be fully closed even if the agent continues to refine its query. For example, for the question “When did Melanie paint a sunrise?”, the correct answer in our setup is simply “2022” (the year). MemR3 quickly finds this year at the first iteration based on the evidence “Melanie painted the lake sunrise image last year (2022).”. However, under the idealized abstraction, the requirement set $R(q)$ implicitly includes an exact date predicate (year–month–day), and no memory item $m\in\bigcup_{j\le K}S_j$ satisfies $m\models r$ for that finer-grained requirement. Thus, the precondition of Theorem B.4(3) is violated, and $G_k^{\star}$ never becomes empty; the practical tracker mirrors this by continuing to search for the missing specificity until it hits the maximum iteration budget. In such cases, the additional token consumption is primarily due to a mismatch between the question’s granularity and the available memory, rather than a failure of the agent.

During our reproduction of the baselines, we identified a latent ambiguity in the LoCoMo dataset’s category indexing. Specifically, the mapping between numerical IDs and semantic categories (e.g., Multi-hop vs. Single-hop) implies a non-trivial alignment challenge. We observed that this ambiguity has led to category misalignment in several recent studies (chhikara2025mem0; wang2025mirix), potentially skewing the granular analysis of agent capabilities.

To ensure a rigorous and fair comparison, we recalibrate the evaluation protocols for all baselines. In Table 1, we report the performance based on the corrected alignment, where the alignment can be induced by the number of questions in each category. We believe this clarification contributes to a more accurate understanding of the current SOTA landscape. Details of the dataset realignment are illustrated in Appendix C.3.

In this work, we introduce MemR3, an autonomous memory-retrieval controller that transforms standard retrieve-then-answer pipelines into a closed-loop process via a LangGraph-based sequential decision-making framework. By explicitly maintaining what is known and what remains unknown using an evidence-gap tracker, MemR3 can iteratively refine queries, balance retrieval and reflection, and terminate early once sufficient evidence has been gathered. Our experiments on the LoCoMo benchmark show that MemR3 consistently improves LLM-as-a-Judge scores over strong memory baselines, while incurring only modest token and latency overhead and remaining compatible with heterogeneous backends. Beyond these concrete gains, MemR3 offers an explainable abstraction for reasoning under partial observability in long-horizon agent settings.

However, we acknowledge some limitations for future work: 1) MemR3 requires an existing retriever or memory structure, and its performance depends heavily on their quality. 2) The routing structure can waste tokens when answering simple questions. 3) MemR3 is currently not designed for multi-modal memories such as images or audio.

The system prompt is defined as follows, where the “decision_directive” encodes the maximum iteration budget, reflect-streak capacity, and retrieval-opportunity check introduced in Sec. 3.4. Generally, “decision_directive” is a textual instruction: choose “reflect” if you need to think about the evidence and gaps; choose “answer” ONLY when evidence is solid and no gaps are noted; choose “retrieve” otherwise. However, when the maximum iteration budget is reached, “decision_directive” is set to “answer” to stop early. When the reflect streak reaches its capacity, “decision_directive” is set to “retrieve” to avoid repeated ineffective reflection. When no useful retrieval remains, “decision_directive” is set to “reflect” to avoid repeated ineffective retrieval. Through these constraints, the agent avoids infinite ineffective actions and maintains stability.

Apart from the system prompt, the user prompt is responsible for feeding additional information to the LLM. Specifically, at iteration $k$, “question” is the original question $q$; “evidence_block” and “gap_block” are the evidence $\mathcal{E}_k$ and gaps $\mathcal{G}_k$ introduced in Sec. 3.3; “raw_block” is the retrieved raw snippets $\mathcal{S}_k$ in Eq. 5; “reasoning_block” is the reasoning content $\mathcal{F}_k$ in Sec. 3.4; and “last_query” is the refined query $\Delta q_k$ introduced in Sec. 3.4, which encourages the new query to differ from the prior one. Note that these fields can be left empty if the corresponding information is not present.
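A minimal sketch of how these fields might be assembled into the user prompt. The field names follow the text above; the concrete layout and helper function are assumptions for illustration, not the paper's exact template.

```python
def build_user_prompt(question, evidence=(), gaps=(), raw=(),
                      reasoning="", last_query=""):
    """Assemble the per-iteration user prompt; empty fields are omitted."""
    def block(label, content):
        # Emit a labeled block only when the field carries content.
        return f"{label}:\n{content}\n\n" if content else ""
    return (block("question", question)
            + block("evidence_block", "\n".join(f"- {e}" for e in evidence))
            + block("gap_block", "\n".join(f"- {g}" for g in gaps))
            + block("raw_block", "\n".join(raw))
            + block("reasoning_block", reasoning)
            + block("last_query", last_query)).strip()
```

Omitting empty blocks keeps the per-iteration prompt compact, which matches the overhead analysis in Sec. 3: only short, populated summaries are repeatedly fed to the router.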

A central component of MemR3 is the evidence-gap tracker introduced in Sec. 3.3, which maintains an evolving summary of i) what information has been reliably established from memory and ii) what information is still missing to answer the query. While the practical implementation of this tracker is based on LLM-generated summaries, we introduce an idealized formal abstraction that clarifies its intended behavior, enables principled analysis, and provides a foundation for studying correctness and robustness. This abstraction does not assume perfect extraction; rather, the LLM acts as a stochastic approximator to the idealized tracker.

For a user query $q$, we define a finite set of atomic information requirements $R(q) = \{r_1, r_2, \ldots, r_{n_q}\}$, which specify the minimal facts needed to fully answer the query.

For example, for the question “How many months passed between events $A$ and $B$?”, the requirement set can be $R(q) = \{\text{timestamp of }A,\ \text{timestamp of }B\}$.

Each requirement $r\in R(q)$ is associated with a symbolic predicate (e.g., a timestamp, entity attribute, or event relation), and $R(q)$ provides the semantic target against which retrieved memories are judged.

Let $\mathcal{M}$ be the memory store and $S_k\subseteq\mathcal{M}$ denote the snippets retrieved at iteration $k$. We define a relation $m\models r$ to indicate that memory item $m\in\mathcal{M}$ contains sufficient information to support requirement $r\in R(q)$. Formally, $m\models r$ holds if the textual content of $m$ contains a minimal witness (e.g., a timestamp, entity mention, or explicit assertion) matching the predicate corresponding to $r$. The matching criterion may be implemented via deterministic pattern rules or LLM-based semantic matching; our analysis is agnostic to this choice.

At iteration $k$, the idealized tracker maintains two sets: i) the evidence $E_k\subseteq R(q)$ and ii) the gaps $G_k = R(q)\setminus E_k$. Given newly retrieved snippets $S_k$, the ideal updates are

$$E_k^{\star} = E_{k-1}^{\star} \cup \{\, r \in R(q) : \exists\, m \in S_k,\ m \models r \,\}, \qquad G_k^{\star} = R(q) \setminus E_k^{\star}.$$

In this abstraction, the tracker monotonically accumulates verified requirements and removes corresponding gaps, providing a clean characterization of the desired system behavior independent of noise.
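Since the idealized update is a pure set operation, it can be sketched directly; `supports(m, r)` below stands in for the relation $m \models r$, and the memory items are modeled as simple sets for illustration.

```python
def ideal_update(R_q, E_prev, S_k, supports):
    """One idealized tracker step (evidence grows, gap shrinks):

    E_k = E_{k-1} ∪ {r in R(q) : some m in S_k supports r}
    G_k = R(q) minus E_k
    """
    E_k = set(E_prev) | {r for r in R_q if any(supports(m, r) for m in S_k)}
    return E_k, set(R_q) - E_k
```

Because the update only ever unions new requirements into the evidence set, monotonicity ($E_{k-1}^{\star}\subseteq E_k^{\star}$, $G_k^{\star}\subseteq G_{k-1}^{\star}$) holds by construction, which is exactly the property Theorem B.4 formalizes.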

In MemR3, the tracker is instantiated through LLM-generated summaries:

$$(E_k,\ G_k) = \mathrm{LLM}\big(p_k;\ q,\ E_{k-1},\ G_{k-1},\ S_k\big),$$

where the prompt explicitly instructs the model to: (i) extract concise factual bullets relevant to $q$, (ii) enumerate missing information blocking a complete answer, and (iii) avoid hallucinations or speculative inference. Thus, $(E_k, G_k)$ serves as a stochastic approximation to the idealized $(E_k^{\star}, G_k^{\star})$:

$$(E_k,\ G_k) \approx (E_k^{\star},\ G_k^{\star}),$$

with deviations arising from LLM extraction noise. This perspective reconciles the formal update rule with the prompt-driven practical implementation.

Although the practical instantiation lacks deterministic guarantees, the idealized tracker in Definition B.3 satisfies several intuitive properties essential for closed-loop retrieval.

Assume that for all $k$ and all $r\in R(q)$, we have $r\in E_k^{\star}$ if and only if there exists some $m\in\bigcup_{j\le k}S_j$ such that $m\models r$. Then the following hold:

(1) Monotonicity: $E_{k-1}^{\star}\subseteq E_k^{\star}$ and $G_k^{\star}\subseteq G_{k-1}^{\star}$ for all $k\ge 1$.

(2) Soundness: If $m\models r$ for some retrieved memory $m\in S_k$, then $r\in E_k^{\star}$.

(3) Completeness: If every requirement $r\in R(q)$ is supported by some $m\in\bigcup_{j\le K}S_j$ with $m\models r$, then $G_K^{\star}=\varnothing$.

(1) By Definition B.3, $E_k^{\star} = E_{k-1}^{\star}\cup\{r\in R(q) : \exists\, m\in S_k,\ m\models r\} \supseteq E_{k-1}^{\star}$; taking complements in $R(q)$ yields $G_k^{\star} = R(q)\setminus E_k^{\star} \subseteq R(q)\setminus E_{k-1}^{\star} = G_{k-1}^{\star}$.

(2) Immediate from the update rule: if $m\models r$ for some $m\in S_k$, then $r$ is included in $E_k^{\star}$.

(3) If every $r\in R(q)$ is supported by some $m\in\bigcup_{j\le K}S_j$ with $m\models r$, then repeated application of the update rule ensures that each such $r$ is eventually added to $E_K^{\star}$. Hence $E_K^{\star}=R(q)$ and therefore $G_K^{\star}=R(q)\setminus E_K^{\star}=\varnothing$. ∎

These properties characterize the target behavior that the LLM-based tracker implementation aims to approximate.

Since real LLMs introduce extraction noise, the practical tracker may deviate from the idealized $(E_k^{\star}, G_k^{\star})$, for example, through false negatives (missing evidence), false positives (hallucinated evidence), or unstable gap estimates. In the main text (Sec. 3.3 and Sec. 4.3), we study these effects empirically by injecting noisy or contradictory memories and measuring their impact on routing decisions and final answer quality. The formal abstraction above serves as the reference model against which these robustness behaviors are interpreted.

The abstraction in this section assumes access to an ideal tracker that updates $(E_k^{\star}, G_k^{\star})$ exactly according to the requirement–support relation $m\models r$. In practice, MemR3 uses an LLM-generated tracker $(\mathcal{E}_k, \mathcal{G}_k)$, which only approximates this ideal update. This introduces several forms of approximation bias: i) coverage bias (false negatives): supported requirements $r\in R(q)$ that are omitted from $\mathcal{E}_k$; ii) hallucination bias (false positives): requirements $r$ that appear in $\mathcal{E}_k$ even though no retrieved memory item supports them; iii) granularity bias: cases where the tracker records a coarser fact (e.g., a year) but the requirement space $R(q)$ contains a finer predicate (e.g., an exact date), so the ideal requirement is never fully satisfied.

The “Melanie painted a sunrise” case in Sec. 4.3 provides a concrete illustration of granularity bias. The question asks “When did Melanie paint a sunrise?”, and in our setup the correct answer is the year 2022. Under the ideal abstraction, however, the requirement space $R(q)$ implicitly contains a fine-grained predicate $r_{\text{date}}$ corresponding to the full year–month–day of the painting event. The memory store only contains a coarse statement such as “Melanie painted the lake sunrise image last year (2022).”

In the ideal tracker, no memory item $m$ satisfies $m \models r_{\text{date}}$, so the precondition of Theorem B.4’s completeness clause is violated and the ideal gap $\mathcal{G}_{k}$ never becomes empty. The practical LLM tracker mirrors this behavior: it quickly recovers the year 2022 as evidence but continues to treat the exact date as a remaining gap, eventually hitting the iteration budget without fully closing $\mathcal{G}_{k}$. This example shows that some apparent “failures” of the approximate tracker are in fact structural: they arise from a mismatch between the granularity of $R(q)$ and the information actually present in the memory store.
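The structural failure above is easy to reproduce under the ideal update rule. The sketch below uses toy string predicates and a hand-written support relation (MemR3 itself uses LLM-based semantic matching, and the predicate names are illustrative): the coarse year requirement is satisfied on the first update, but no memory supports the exact-date predicate, so the ideal gap never empties no matter how many iterations run.

```python
# Toy reproduction of the granularity-bias case, assuming the ideal
# evidence-gap update of Appendix B with a hand-written support relation.
R_q = {"year(paint_sunrise)", "exact_date(paint_sunrise)"}  # requirement space R(q)
memory = ["Melanie painted the lake sunrise image last year (2022)."]

def supports(m: str, r: str) -> bool:
    # only the coarse year-level predicate has a witness in the store
    return r == "year(paint_sunrise)" and "2022" in m

E, G = set(), set(R_q)
for _ in range(3):  # iterate the ideal update; no new memories ever arrive
    E = E | {r for r in R_q if any(supports(m, r) for m in memory)}
    G = R_q - E

# The year is recovered as evidence, but the exact-date gap never closes.
```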

We select four groups of advanced methods as baselines: 1) memory systems, including A-Mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0); 2) agentic retrievers, such as Self-RAG (asai2024self), alongside a RAG-CoT-RAG (RCR) pipeline that we design as a strong agentic-retriever baseline combining RAG (lewis2020retrieval) with Chain-of-Thought (CoT) (wei2022chain); 3) backend baselines, including chunk-based (RAG (lewis2020retrieval)) and graph-based (Zep (rasmussen2025zep)) memory storage, which demonstrate the plug-in capability of MemR3 across different retriever backends; and 4) Full-Context, which is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on the J score (chhikara2025mem0; wang2025mirix). A more detailed introduction of these baselines is given in Appendix C.1.

We divide our baselines into four groups: memory systems, agentic retrievers, backend baselines, and Full-Context.

In this group, we consider recent advanced memory systems, including A-Mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0), to demonstrate the broadly strong capability of MemR3 from a memory-control perspective.

A-Mem (xu2025amem) (https://github.com/WujiangXu/A-mem). A-Mem is an agent memory module that turns interactions into atomic notes and links them into a Zettelkasten-style graph using embeddings plus LLM-based linking.

LangMem (langmem_blog2025). LangMem is LangChain’s persistent memory layer that extracts key facts from dialogues and stores them in a vector store (e.g., FAISS/Chroma) for later retrieval.

Mem0 (chhikara2025mem0) (https://github.com/mem0ai/mem0). Mem0 is an open-source memory system that enables an LLM to incrementally summarize, deduplicate, and store factual snippets, with an optional graph-based memory extension.

In this group, we examine the agentic structures underlying memory retrieval in order to isolate the advantage of MemR3’s agentic design. To this end, we include Self-RAG (asai2024self) and design a strong heuristic baseline, RAG-CoT-RAG (RCR), which combines RAG and CoT (wei2022chain).

Self-RAG (asai2024self). A model-driven retrieval controller where the LLM decides, at each step, whether to answer or issue a refined retrieval query. Unlike MemR3, retrieval decisions in Self-RAG are implicit in the model’s chain-of-thought, without explicit state tracking. We adapt their original code and prompts to our task.

RAG-CoT-RAG (RCR). We design a strong heuristic baseline that extends beyond ReAct (yao2022react) by performing one initial retrieval (lewis2020retrieval), a CoT (wei2022chain) step to identify missing information, and a second retrieval using a refined query. It provides multi-step retrieval but lacks an explicit evidence-gap state or a general controller.

In this group, we incorporate vanilla RAG (lewis2020retrieval) and Zep (rasmussen2025zep) as retriever backends for MemR3 to demonstrate the advantages of its plug-in design. The former is chunk-based and the latter graph-based, together covering most types of existing memory systems.

Vanilla RAG (lewis2020retrieval). Vanilla RAG retrieves the top-$k$ relevant snippets for the query in a single pass and produces a direct answer, without iterative retrieval or reasoning-based refinement. The remaining retrieval settings ($n_{\text{chk}}$, chunk size, etc.) are the same as in MemR3.

Zep (rasmussen2025zep). Zep is a hosted memory service that builds a time-aware knowledge graph over conversations and metadata to support fast semantic and temporal queries. We use their original implementation.

Lastly, we include Full-Context as a strong baseline, which provides the model with the entire conversation or memory buffer without retrieval, serving as an upper-bound reference that is unconstrained by retrieval errors or missing information.

For all chunk-based methods, i.e., RAG (lewis2020retrieval), Self-RAG (asai2024self), RAG-CoT-RAG, and MemR3 (RAG retriever), we use text-embedding-3-large (openai2024embeddinglarge3) as the embedding model and a cross-encoder re-ranking strategy (reimers2019sentence) (ms-marco-MiniLM-L-12-v2) to surface relevant memories rather than merely similar ones. The chunk size is selected from {128, 256, 512, 1024} using the GPT-4o-mini backend with $n_{\text{max}}=1$ and $n_{\text{chk}}=1$, and we ultimately choose 256. This chunk size is also in line with Mem0 (chhikara2025mem0).
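For concreteness, the chunk → embed → top-$k$ → re-rank path described above can be sketched as follows. The dense embedder and cross-encoder are stubbed with toy bag-of-words scorers so the sketch stays self-contained and runnable; the actual system uses text-embedding-3-large and the ms-marco-MiniLM-L-12-v2 cross-encoder, and the helper names (`chunk`, `retrieve`) are illustrative, not MemR3’s API.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 256) -> list[str]:
    """Split whitespace-tokenized text into ~size-token chunks (the paper uses 256)."""
    toks = text.split()
    return [" ".join(toks[i:i + size]) for i in range(0, len(toks), size)]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())   # toy stand-in for a dense embedder

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query: str, c: str) -> float:
    return cosine(embed(query), embed(c))  # toy stand-in for the cross-encoder

def retrieve(query: str, memory: str, k: int = 1, size: int = 8) -> list[str]:
    candidates = chunk(memory, size)       # tiny chunks for the toy example
    qv = embed(query)
    # embedding similarity proposes 2k candidates; the re-ranker picks the final k
    top = sorted(candidates, key=lambda c: cosine(qv, embed(c)), reverse=True)[:2 * k]
    return sorted(top, key=lambda c: rerank(query, c), reverse=True)[:k]

memory = ("Gina won a medal at the dance contest last spring and "
          "Buddy the dog was adopted three months after that event")
snippets = retrieve("gina dance contest medal", memory, k=1)
```

The two-stage design mirrors the stated goal: the embedder supplies cheap recall, while the re-ranker promotes chunks that are relevant to the question rather than merely lexically similar.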

Although the correct order of the different categories is not explicitly reported in LoCoMo (maharana2024evaluating), we can infer it from the number of questions in each category. The correct alignment is shown in Table 3. We believe this clarification could benefit the LLM memory community.

Note that the original LoCoMo reports 841 single-hop and 446 adversarial questions, whereas our count yields 830 and 445 due to 12 repeated questions. Specifically, the question below appears in both the single-hop and adversarial categories of the 2nd conversation (we remove the copy in the adversarial category), while the remaining 11 questions are duplicated within the single-hop category of the 8th conversation.

What did Gina receive from a dance contest? (conversation 2, question 62), (conversation 2, question 96)
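The duplicate check described above amounts to grouping questions by conversation and normalized text, then flagging keys that occur more than once. A minimal sketch (the field names `conv`, `category`, and `text` are illustrative, not the LoCoMo schema):

```python
from collections import defaultdict

def find_repeats(questions):
    """questions: iterable of dicts with 'conv', 'category', 'text' fields."""
    seen = defaultdict(list)
    for q in questions:
        # normalize text so trivial whitespace/case differences still match
        seen[(q["conv"], q["text"].strip().lower())].append(q["category"])
    return {key: cats for key, cats in seen.items() if len(cats) > 1}

qs = [
    {"conv": 2, "category": "single-hop",  "text": "What did Gina receive from a dance contest?"},
    {"conv": 2, "category": "adversarial", "text": "What did Gina receive from a dance contest?"},
    {"conv": 2, "category": "temporal",    "text": "When was Buddy adopted?"},
]
repeats = find_repeats(qs)
# one repeated question, appearing in both the single-hop and adversarial categories
```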

Table: S4.T1: LLM-as-a-Judge scores (%, higher is better) for each question category in the LoCoMo (maharana2024evaluating) dataset. The best results using each LLM backend, except Full-Context, are in bold.

| LLM | Method | 1. Multi-Hop | 2. Temporal | 3. Open-Domain | 4. Single-Hop | Overall |
|---|---|---|---|---|---|---|
| GPT-4o-mini | A-Mem (xu2025amem) | 61.70 | 64.49 | 40.62 | 76.63 | 69.06 |
| | LangMem (langmem_blog2025) | 62.23 | 23.43 | 47.92 | 71.12 | 58.10 |
| | Mem0 (chhikara2025mem0) | 67.13 | 55.51 | 51.15 | 72.93 | 66.88 |
| | Self-RAG (asai2024self) | 69.15 | 64.80 | 34.38 | 88.31 | 76.46 |
| | RAG-CoT-RAG | 71.28 | 71.03 | 42.71 | 86.99 | 77.96 |
| | Zep (rasmussen2025zep) | 67.38 | 73.83 | 63.54 | 78.67 | 74.62 |
| | MemR3 (ours, Zep backbone) | 69.39 (+2.01) | 73.83 (+0.00) | **67.01** (+3.47) | 80.60 (+1.93) | 76.26 (+1.64) |
| | RAG (lewis2020retrieval) | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| | MemR3 (ours, RAG backbone) | **71.39** (+2.60) | **76.22** (+11.11) | 61.11 (+2.78) | **89.44** (+5.58) | **81.55** (+6.01) |
| | Full-Context | 72.34 | 58.88 | 59.38 | 86.39 | 76.32 |
| GPT-4.1-mini | A-Mem (xu2025amem) | 71.99 | 74.77 | 58.33 | 79.88 | 76.00 |
| | LangMem (langmem_blog2025) | 74.47 | 61.06 | 67.71 | 86.92 | 78.05 |
| | Mem0 (chhikara2025mem0) | 62.41 | 57.32 | 44.79 | 66.47 | 62.47 |
| | Self-RAG (asai2024self) | 75.89 | 75.08 | 54.17 | 90.12 | 82.08 |
| | RAG-CoT-RAG | 80.85 | 81.62 | 62.50 | 90.12 | 84.89 |
| | Zep (rasmussen2025zep) | 72.34 | 77.26 | 64.58 | 83.49 | 78.94 |
| | MemR3 (ours, Zep backbone) | 77.78 (+5.44) | 77.78 (+0.52) | 69.79 (+5.21) | 84.42 (+0.93) | 80.88 (+1.94) |
| | RAG (lewis2020retrieval) | 73.05 | 73.52 | 62.50 | 85.90 | 79.46 |
| | MemR3 (ours, RAG backbone) | **81.20** (+8.15) | **82.14** (+8.62) | **71.53** (+9.03) | **92.17** (+6.27) | **86.75** (+7.29) |
| | Full-Context | 86.43 | 86.82 | 71.88 | 93.73 | 89.00 |

Table: S4.T2: Ablation studies. Best results are in bold.

| Method | Multi-Hop | Temporal | Open-Domain | Single-Hop | Overall |
|---|---|---|---|---|---|
| RAG | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| MemR3 | **71.39** | **76.22** | 61.11 | **89.44** | **81.55** |
| w/o mask | 62.41 | 68.54 | 55.21 | 72.17 | 68.54 |
| w/o $\Delta q_{k}$ | 66.67 | 75.08 | 60.42 | 83.37 | 77.11 |
| w/o reflect | 65.25 | 73.83 | **61.46** | 83.37 | 76.65 |

Table: A3.T3: The alignment of the orders and categories in LoCoMo dataset.

| Category | Multi-Hop | Temporal | Open-Domain | Single-Hop | Adversarial |
|---|---|---|---|---|---|
| Order | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 |
| # Questions | 282 | 321 | 96 | 830 | 445 |

Figure: Illustration of three memory-usage paradigms. Full-Context overloads the LLM with all memories and answers incorrectly; Retrieve-then-Answer retrieves relevant snippets but still miscalculates. In contrast, MemR3 iteratively retrieves and reflects using an evidence–gap tracker (Acts 0–3), refines the query about Buddy’s adoption date, and produces the correct answer (3 months).

Figure: Pipeline of MemR3. MemR3 transforms retrieval into a closed-loop process: a router dynamically switches between Retrieve, Reflect, and Answer nodes while a global evidence–gap tracker maintains what is known and what is still missing. This enables iterative query refinement, targeted retrieval, and early stopping, making MemR3 an autonomous, backend-agnostic retrieval controller.

Figure: Example of the evidence-gap tracker for a specific query. At each step, the agent maintains an explicit summary of the evidence established and the information still missing. This state can be presented directly to users as a human-readable explanation of the agent’s progress in answering the query.


Figure: Number of questions requiring different numbers of iterations before final answers, across four categories.

Figure: Average token consumption of the retrieved snippets (left y-axis) and LLM-as-a-Judge (J) score (right y-axis) of RAG, MemR3, and Full-Context across four categories.

$$ \begin{split}\mathcal{S}&\leftarrow\texttt{Retrieve}(q,\mathcal{M}),\\ w&\leftarrow\texttt{LLM}(q,\mathcal{S},p),\end{split} $$ \tag{S3.E1}

$$ s=(q,\mathcal{S},\mathcal{E},\mathcal{G},k), $$ \tag{S3.E2}

$$ \mathcal{E}_{k},\mathcal{G}_{k},a_{k}=\texttt{LLM}(q,\mathcal{S}_{k-1},\mathcal{F}_{k-1},\mathcal{E}_{k-1},\mathcal{G}_{k-1},p_{k}), $$ \tag{S3.E3}

$$ a_{k}\in\{(\texttt{retrieve},\Delta q_{k}),\,(\texttt{reflect},f_{k}),\,(\texttt{answer},w_{k})\}, $$ \tag{S3.E4}

$$ \mathcal{S}_{k}=\texttt{Retrieve}(q_{k}^{\mathrm{ret}},\,\mathcal{M}\setminus\mathcal{M}^{\text{ret}}_{k-1}),\qquad\mathcal{M}^{\text{ret}}_{k}=\mathcal{M}^{\text{ret}}_{k-1}\cup\mathcal{S}_{k}. $$ \tag{S3.E5}
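The deduplicated retrieval step in Eq. (5), which searches only the not-yet-retrieved part of the store and then folds the new snippets into the seen set, can be sketched in a few lines. Here memories are keyed by id and `search` stands in for any backend retriever (RAG, Zep, etc.); the toy keyword-overlap backend and all names are illustrative.

```python
def dedup_retrieve(query, memory, seen_ids, search, n_chk=1):
    """One step of Eq. (5): search only M \\ M^ret_{k-1}, then grow M^ret_k."""
    pool = {mid: m for mid, m in memory.items() if mid not in seen_ids}
    hits = search(query, pool, n_chk)          # ids of the top-n_chk snippets
    seen_ids = seen_ids | set(hits)            # M^ret_k = M^ret_{k-1} ∪ S_k
    return [memory[h] for h in hits], seen_ids

def search(query, pool, n):
    # toy backend: rank candidates by keyword overlap with the query
    q = set(query.lower().split())
    ranked = sorted(pool, key=lambda mid: len(q & set(pool[mid].lower().split())),
                    reverse=True)
    return ranked[:n]

memory = {0: "Buddy was adopted in June", 1: "Buddy the dog loves the park"}
s1, seen = dedup_retrieve("when was Buddy adopted", memory, set(), search)
s2, seen = dedup_retrieve("when was Buddy adopted", memory, seen, search)
# the second call is forced onto fresh memories, so it returns a different snippet
```

Excluding $\mathcal{M}^{\text{ret}}_{k-1}$ is what prevents the controller from repeatedly retrieving the same top-ranked snippet across iterations.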

$$ R(q)=\{r_{1},r_{2},\dots,r_{m}\}. $$ \tag{A2.E7}

$$ R(q)=\{\text{date}(A),\,\text{date}(B)\}. $$ \tag{A2.E8}

$$ E_{k}^{\star}=E_{k-1}^{\star}\cup\big\{\,r\in R(q)\;\big|\;\exists m\in S_{k},\;m\models r\,\big\},\qquad G_{k}^{\star}=R(q)\setminus E_{k}^{\star}. $$ \tag{A2.E9}

$$ (E_{k},G_{k})\approx(E_{k}^{\star},G_{k}^{\star}), $$ \tag{A2.E11}

Theorem. Theorem 3.1 ([Informal] Monotonicity, soundness, and completeness of the idealized evidence-gap tracker). Under an idealized requirement space $R(q)$ for a specific query $q$, the evidence-gap tracker in MemR3 is monotone (evidence never decreases and gaps never increase), sound (every supported requirement eventually enters the evidence set), and complete (if every requirement $r \in R(q)$ is supported by some memory, the ideal gap eventually becomes empty).

Definition. Definition B.1 (Idealized Requirement Space). For a user query $q$, we define a finite set of atomic information requirements, which specify the minimal facts needed to fully answer the query: $R(q)=\{r_{1},r_{2},\dots,r_{m}\}$. (7)

Definition. Definition B.2 (Memory-Support Relation). Let $\mathcal{M}$ be the memory store and $S_{k}\subseteq\mathcal{M}$ denote the snippets retrieved at iteration $k$. We define a relation $m\models r$ to indicate that memory item $m\in\mathcal{M}$ contains sufficient information to support requirement $r\in R(q)$. Formally, $m\models r$ holds if the textual content of $m$ contains a minimal witness (e.g., a timestamp, entity mention, or explicit assertion) matching the predicate corresponding to $r$. The matching criterion may be implemented via deterministic pattern rules or LLM-based semantic matching; our analysis is agnostic to this choice.

Definition. Definition B.3 (Idealized Evidence-Gap Update Rule). At iteration $k$, the idealized tracker maintains two sets: i) the evidence $E_{k}\subseteq R(q)$ and ii) the gaps $G_{k}=R(q)\setminus E_{k}$. Given newly retrieved snippets $S_{k}$, the ideal updates are $E_{k}^{\star}=E_{k-1}^{\star}\cup\{\,r\in R(q)\mid\exists m\in S_{k},\;m\models r\,\}$ and $G_{k}^{\star}=R(q)\setminus E_{k}^{\star}$. (9)

Theorem. Theorem B.4 (Properties of the Idealized Tracker). Assume that for all $k$ and all $r\in R(q)$, we have $r\in E_{k}^{\star}$ if and only if there exists some $m\in\bigcup_{j\le k}S_{j}$ such that $m\models r$. Then the following hold: 1. Monotonicity: $E_{k-1}^{\star}\subseteq E_{k}^{\star}$ and $G_{k}^{\star}\subseteq G_{k-1}^{\star}$ for all $k\ge 1$. 2. Soundness: if $m\models r$ for some retrieved memory $m\in S_{k}$, then $r\in E_{k}^{\star}$. 3. Completeness at convergence: if every requirement $r\in R(q)$ is supported by some $m\in\bigcup_{j\le K}S_{j}$ with $m\models r$, then $E_{K}^{\star}=R(q)$ and hence $G_{K}^{\star}=\varnothing$.
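Under the exact-witness reading of Definition B.2, all three properties can be checked mechanically on a toy instance. The requirements, witness strings, and snippet stream below are illustrative; substring matching stands in for the $m \models r$ relation.

```python
# Toy check of Theorem B.4: evidence grows monotonically and the gap
# empties once every requirement has a supporting memory.
R_q = {"date(A)", "date(B)"}
witness = {"date(A)": "May 5", "date(B)": "June 9"}     # minimal witnesses per requirement
streams = [["Event A happened on May 5."],              # snippets S_1
           ["Event B happened on June 9."]]             # snippets S_2

E, G = set(), set(R_q)
for S_k in streams:
    # ideal update rule of Definition B.3, with substring matching as m |= r
    E_new = E | {r for r in R_q if any(witness[r] in m for m in S_k)}
    G_new = R_q - E_new
    assert E <= E_new and G_new <= G    # monotonicity (item 1)
    E, G = E_new, G_new

assert E == R_q and G == set()          # completeness at convergence (item 3)
```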



Proof. (1) By Definition B.3, $E_{k}^{\star}=E_{k-1}^{\star}\cup\{\,r\in R(q)\mid\exists m\in S_{k},\;m\models r\,\}$, so $E_{k-1}^{\star}\subseteq E_{k}^{\star}$. Since $G_{k}^{\star}=R(q)\setminus E_{k}^{\star}$ and $E_{k-1}^{\star}\subseteq E_{k}^{\star}$, we obtain $G_{k}^{\star}\subseteq G_{k-1}^{\star}$. (2) If $m\models r$ for some $m\in S_{k}$, then by Definition B.3 we have $r\in\{r'\in R(q)\mid\exists m'\in S_{k},\;m'\models r'\}\subseteq E_{k}^{\star}$. (3) If every $r\in R(q)$ is supported by some $m\in\bigcup_{j\le K}S_{j}$ with $m\models r$, then repeated application of the update rule ensures that each such $r$ is eventually added to $E_{K}^{\star}$. Hence $E_{K}^{\star}=R(q)$ and therefore $G_{K}^{\star}=R(q)\setminus E_{K}^{\star}=\varnothing$. ∎

Algorithm: Router policy in MemR3.
Input: query $q$, previous snippets $\mathcal{S}_{k-1}$, iteration $k$, budgets $n_{\text{max}}$, $n_{\text{cap}}$, current reflect-streak length $n_{\text{streak}}$.
Output: action $a_k$.
1: if $k \ge n_{\text{max}}$ then
2:   $a_k = \texttt{answer}$  ▷ Max iteration budget.
3: else if $\mathcal{S}_{k-1} = \emptyset$ then
4:   $a_k = \texttt{reflect}$  ▷ No retrieved snippets.
5: else if $n_{\text{streak}} \ge n_{\text{cap}}$ then
6:   $a_k = \texttt{retrieve}$  ▷ Max reflect streak.
7: else
8:   pass  ▷ Keep the generated action.
9: end if
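The guardrails in the router policy reduce to a few lines of control flow: budget checks can override the action proposed by the router LLM. A Python sketch follows; the budget defaults ($n_{\text{max}}=5$, $n_{\text{cap}}=2$) are illustrative placeholders, not the paper’s settings.

```python
def route(proposed: str, k: int, has_snippets: bool, n_streak: int,
          n_max: int = 5, n_cap: int = 2) -> str:
    """Override the LLM-proposed action when a budget check fires."""
    if k >= n_max:           # max iteration budget reached: must answer
        return "answer"
    if not has_snippets:     # S_{k-1} is empty: reflect before retrieving again
        return "reflect"
    if n_streak >= n_cap:    # too many consecutive reflects: force retrieval
        return "retrieve"
    return proposed          # otherwise keep the generated action
```

For example, `route("retrieve", k=5, has_snippets=True, n_streak=0)` forces `"answer"` because the iteration budget has run out, regardless of what the router LLM proposed.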
Table: LLM-as-a-Judge scores (%) of MemR3 over three independent runs, with per-category mean ± std.

| LLM | Retriever | Run | 1. Multi-Hop | 2. Temporal | 3. Open-Domain | 4. Single-Hop | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini | Zep | 1 | 68.09 | 73.52 | 68.75 | 80.72 | 76.13 |
| | | 2 | 69.86 | 72.59 | 67.71 | 80.36 | 76.00 |
| | | 3 | 70.21 | 75.39 | 64.58 | 80.72 | 76.65 |
| | | mean ± std | 69.39 ± 1.14 | 73.83 ± 1.43 | 67.01 ± 2.17 | 80.60 ± 0.21 | 76.26 ± 0.34 |
| | RAG | 1 | 71.63 | 77.26 | 61.46 | 89.28 | 81.75 |
| | | 2 | 70.21 | 76.01 | 59.38 | 89.40 | 81.16 |
| | | 3 | 72.34 | 75.39 | 62.50 | 89.64 | 81.75 |
| | | mean ± std | 71.39 ± 1.08 | 76.22 ± 0.95 | 61.11 ± 1.59 | 89.44 ± 0.18 | 81.55 ± 0.34 |
| GPT-4.1-mini | Zep | 1 | 78.72 | 78.50 | 72.92 | 84.34 | 81.36 |
| | | 2 | 75.89 | 77.26 | 68.75 | 84.58 | 80.44 |
| | | 3 | 78.72 | 77.57 | 67.71 | 84.34 | 80.84 |
| | | mean ± std | 77.78 ± 1.63 | 77.78 ± 0.65 | 69.79 ± 2.76 | 84.42 ± 0.14 | 80.88 ± 0.46 |
| | RAG | 1 | 81.56 | 83.18 | 69.79 | 91.93 | 86.79 |
| | | 2 | 82.62 | 80.69 | 75.00 | 92.65 | 87.18 |
| | | 3 | 79.43 | 82.55 | 69.79 | 91.93 | 86.27 |
| | | mean ± std | 81.20 ± 1.62 | 82.14 ± 1.29 | 71.53 ± 3.01 | 92.17 ± 0.42 | 86.75 ± 0.46 |


References

[maharana2024evaluating] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.

[xu2025amem] Xu, Wujiang, Liang, Zujie, Mei, Kai, Gao, Hang, Tan, Juntao, Zhang, Yongfeng. (2025). A-Mem: Agentic Memory for LLM Agents. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[chhikara2025mem0] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[rasmussen2025zep] Rasmussen, Preston, Paliychuk, Pavlo, Beauvais, Travis, Ryan, Jack, Chalef, Daniel. (2025). Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.

[lewis2020retrieval] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, Küttler, Heinrich, others. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

[langmem_blog2025] LangChain Team. (2025). LangMem SDK for Long-term Agent Memory. LangChain Blog.

[hurst2024gpt] Hurst, Aaron, Lerer, Adam, Goucher, Adam P, Perelman, Adam, Ramesh, Aditya, Clark, Aidan, Ostrow, AJ, Welihinda, Akila, Hayes, Alan, Radford, Alec, others. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[wang2024wise] Wang, Peng, Li, Zexi, Zhang, Ningyu, Xu, Ziwen, Yao, Yunzhi, Jiang, Yong, Xie, Pengjun, Huang, Fei, Chen, Huajun. (2024). Wise: Rethinking the knowledge memory for lifelong model editing of large language models. Advances in Neural Information Processing Systems.

[fang2025alphaedit] Fang, Junfeng, Jiang, Houcheng, Wang, Kun, Ma, Yunshan, Shi, Jie, Wang, Xiang, He, Xiangnan, Chua, Tat-Seng. (2025). Alphaedit: Null-space constrained model editing for language models. The Thirteenth International Conference on Learning Representations.

[langchain2025langgraph] LangChain Inc.. (2025). Langgraph: Build Resilient Language Agents as Graphs.

[du2025rethinking] Du, Yiming, Huang, Wenyu, Zheng, Danna, Wang, Zhaowei, Montella, Sebastien, Lapata, Mirella, Wong, Kam-Fai, Pan, Jeff Z. (2025). Rethinking memory in ai: Taxonomy, operations, topics, and future directions. arXiv preprint arXiv:2505.00675.

[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[packer2023memgpt] Packer, Charles, Wooders, Sarah, Lin, Kevin, Fang, Vivian, Patil, Shishir G, Stoica, Ion, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.

[wang2023enhancing] Wang, Bing, Liang, Xinnian, Yang, Jian, Huang, Hui, Wu, Shuangzhi, Wu, Peihao, Lu, Lu, Ma, Zejun, Li, Zhoujun. (2023). Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343.

[wang2025mirix] Wang, Yu, Chen, Xi. (2025). Mirix: Multi-agent memory system for llm-based agents. arXiv preprint arXiv:2507.07957.

[eyuboglu2025cartridges] Eyuboglu, Sabri, Ehrlich, Ryan, Arora, Simran, Guha, Neel, Zinsley, Dylan, Liu, Emily, Tennien, Will, Rudra, Atri, Zou, James, Mirhoseini, Azalia, others. (2025). Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv preprint arXiv:2506.06266.

[karpukhin2020dense] Karpukhin, Vladimir, Oguz, Barlas, Min, Sewon, Lewis, Patrick SH, Wu, Ledell, Edunov, Sergey, Chen, Danqi, Yih, Wen-tau. (2020). Dense Passage Retrieval for Open-Domain Question Answering.. EMNLP (1).

[izacard2021leveraging] Izacard, Gautier, Grave, Edouard. (2021). Leveraging passage retrieval with generative models for open domain question answering. Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume.

[asai2024self] Asai, Akari, Wu, Zeqiu, Wang, Yizhong, Sil, Avirup, Hajishirzi, Hannaneh. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. The Twelfth International Conference on Learning Representations.

[openai2024gpt4omini] OpenAI. (2024). GPT-4o mini: advancing cost-efficient intelligence.

[openai2025gpt41] OpenAI. (2025). Introducing GPT-4.1 in the API.

[fang2025lightmem] Fang, Jizhan, Deng, Xinle, Xu, Haoming, Jiang, Ziyan, Tang, Yuqi, Xu, Ziwen, Deng, Shumin, Yao, Yunzhi, Wang, Mengru, Qiao, Shuofei, others. (2025). Lightmem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866.

[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems.

[10738994] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems. doi:10.1109/ICEECT61758.2024.10738994.

[reimers2019sentence] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

[openai2024embeddinglarge3] OpenAI. (2024). text-embedding-3-large.

[yao2022react] Yao, Shunyu, Zhao, Jeffrey, Yu, Dian, Du, Nan, Shafran, Izhak, Narasimhan, Karthik R, Cao, Yuan. (2022). React: Synergizing reasoning and acting in language models. The eleventh international conference on learning representations.

[wei2025evo] Wei, Tianxin, Sachdeva, Noveen, Coleman, Benjamin, He, Zhankui, Bei, Yuanchen, Ning, Xuying, Ai, Mengting, Li, Yunzhe, He, Jingrui, Chi, Ed H, others. (2025). Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv preprint arXiv:2511.20857.

[asl2025fair] Asl, Mohammad Aghajani, Asgari-Bidhendi, Majid, Minaei-Bidgoli, Behrooz. (2025). FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation. arXiv preprint arXiv:2510.22344.