In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Abstract
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs' cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.

Zhen Tan 1 * , Jun Yan 2 , I-Hung Hsu 2 , Rujun Han 2 , Zifeng Wang 2 , Long T. Le 2 , Yiwen Song 2 , Yanfei Chen 2 , Hamid Palangi 2 , George Lee 2 , Anand Iyer 3 , Tianlong Chen 4 , Huan Liu 1 , Chen-Yu Lee 2 and Tomas Pfister 2 1 Arizona State University, 2 Google Cloud AI Research, 3 Google Cloud AI, 4 UNC Chapel Hill
Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in engaging in open-ended dialogue (Lee et al., 2023; Mendonça et al., 2024), yet their inherent statelessness poses a significant challenge for maintaining coherent, personalized conversations over time (Chen et al., 2024; Li et al., 2024b; Tseng et al., 2024), which are crucial across various real-world applications (e.g., customer service (Kolasani, 2023), virtual assistants (Guan et al., 2024), and education platforms (Wen et al., 2024; Zhang et al., 2024d)). As illustrated in Figure 1, effective personalization requires not only understanding the immediate context but also recalling relevant information from the user's previous interactions (Dong et al., 2024; Whittaker et al., 2002; Williams and Hollan, 1981). The inability of current LLMs to naturally retain and recall information from past

Figure 1 | An illustration of a personalized healthcare agent. Key information about a user's allergy and previous symptoms mentioned in the past sessions is needed to provide a more informed response in the current session.
Corresponding author(s): ztan36@asu.edu, {junyann, chenyulee}@google.com
- This work was done while Zhen Tan was a Research Intern at Google Cloud AI Research.
interactions beyond their context windows sparked the development of external memory mechanisms for LLMs (Li et al., 2025; Ong et al., 2025; Zhang et al., 2024c). These memory systems serve as crucial components in personalized dialogue agents, enabling them to maintain consistent personality traits, remember user preferences, and build upon previous interactions.
While external memory mechanisms represent a significant step towards enabling persistent dialogue, current approaches suffer from two critical limitations. First, existing systems digest information at a pre-defined granularity, such as turn, session, or time interval boundaries, which may not align with the inherent semantic structure of the conversation (e.g., topic shifts). This rigid approach can lead to fragmented or incomplete memory representations, hindering the LLM's ability to retrieve, utilize, and update relevant information effectively (Pan et al., 2025; Wu et al., 2025). Second, these systems rely on fixed retrievers (Li et al., 2025; Zhong et al., 2024), which struggle to adapt to the diverse retrieval demands of varying dialogue domains and individual user interaction patterns. Moreover, the expense associated with collecting labeled data for training personalized retrievers presents a substantial barrier to widespread adoption and scalability.
To address these limitations, we propose a novel Reflective Memory Management (RMM) mechanism to provide a more adaptable and granular approach to long-term dialogue memory. Our framework incorporates two key innovations. Prospective Reflection tackles the issue of fixed granularity by summarizing dialogue histories into decomposed topics, effectively integrating fragmented conversational segments into cohesive memory structures. This approach optimizes memory organization for future retrieval, allowing the LLM to access relevant information more effectively regardless of the original turn or session boundaries. Complementing this, Retrospective Reflection addresses the challenge of fixed retrievers by leveraging unsupervised attribution signals generated during the LLM's response generation to reflect on past retrieval. This allows for online refinement of the retriever as the conversation progresses, enabling the system to adapt to diverse dialogue domains and individual user interaction patterns without the need for costly labeled data.
By integrating these two reflective mechanisms, our approach enables LLMs to maintain a more nuanced and adaptable memory, leading to more coherent, personalized, and engaging dialogues. Experiments on MSC and LongMemEval benchmarks show that RMM achieves more than 5% improvement over the strongest baseline across memory retrieval and response generation metrics.
Our contributions are as follows: (1) We propose RMM as a novel memory management mechanism that employs topic-based memory management optimized for future retrieval and leverages attribution signals to reflect on past retrieval for unsupervised online retrieval refinement. (2) We conduct extensive experiments on two long-term personalized dialogue benchmarks to demonstrate the effectiveness of RMM over strong baselines. (3) We perform detailed analysis on the impact of various design choices to pinpoint the limitations of existing memory management mechanisms with fixed granularity and retrievers, highlighting room for future improvement.
Related Work
Long-term Conversations for LLMs. LLMs have demonstrated the ability to engage in extended, coherent dialogues, yet maintaining context and consistency over long-term interactions remains a challenge. Maharana et al. (2024) introduced the LoCoMo dataset to assess LLMs' performance in sustained dialogues, showing their struggles with long-range temporal and causal understanding. Existing solutions can be broadly categorized into two approaches: (1) Architectural modifications, such as enhancing attention mechanisms (Liu et al., 2024a; Zhang et al., 2024a), optimizing KV caches (LI et al., 2025; Liu et al., 2025), and refining position embeddings (Zhao et al., 2024; Zheng et al., 2024). These methods require white-box access to model internals, making them infeasible for
proprietary or API-based LLMs. (2) Summarization-based methods, which condense long contexts into structured events or topics for direct conditioning or retrieval (Jiang et al., 2025; Li et al., 2024a; Lu et al., 2023; Wang et al., 2023). RMM falls into this category but explicitly addresses the issue of fragmented topics arising from fixed granularity and incorporates retrospective reflection to refine the retrieval process, encouraging more coherent and contextual responses.
Memory-based Personalized Dialogue Agents. The development of memory-based personalized dialogue agents has further enhanced long-term interactions by enabling systems to retain and utilize information from past conversations (Bae et al., 2022). Traditional methods (Walker et al., 1997; Weizenbaum, 1966) laid the groundwork for understanding how systems can model user preferences, intentions, and behaviors across sessions, often using handcrafted rules, heuristics, or symbolic representations. Early approaches, such as CoMemNN (Pei et al., 2021), introduce mechanisms to incrementally enrich user profiles during dialogues. However, collecting substantial annotations for training a personalized system for long-term use is hard (Tseng et al., 2024). Recent advancements focus on integrating LLMs with memory modules (Chhikara et al., 2025; Packer et al., 2024; Rasmussen et al., 2025; Wang et al., 2025; Xu et al., 2025). For instance, the LD-Agent framework (Li et al., 2025) employs long- and short-term memory banks to manage conversational history for retrieval. MemoryBank (Zhong et al., 2024) incorporates a memory updating mechanism inspired by the Ebbinghaus Forgetting Curve, enabling models to retrieve relevant memories considering recency. Theanine (Ong et al., 2025) introduces timeline-based retrieval and utilizes an additional LLM for refinement. These methods typically deploy fixed retrievers with a pre-defined granularity. In contrast, the proposed RMM approach facilitates adaptive retrieval with a revised retrieval granularity.
Problem Formulation
We consider the task of building a personalized dialogue agent in a multi-session conversational setting. In this setting, an agent interacts with a user across multiple distinct sessions. A session represents a distinct interaction period, often delimited by user inactivity, explicit user confirmation of conversation completion, or the initiation of a new dialogue thread. Within each session, the conversation unfolds as a sequence of turns, where a turn consists of a user query and the agent's corresponding response. The agent is equipped with an external memory, serving as the sole repository for information gathered from previous sessions. The agent's objective is to generate contextually relevant and personalized responses to user queries, leveraging both the immediate conversational context within the current session and the relevant information retrieved from the memory.
This task presents two key challenges: first, the agent must proactively identify and store salient information from each session, anticipating future retrieval needs. Second, the agent must accurately retrieve relevant past information from the memory, as incorporating irrelevant context can distract the LLM and degrade response quality (Liu et al., 2024b; Shi et al., 2023). Effectively managing this balance between comprehensive storage and precise retrieval is critical for achieving personalized and coherent multi-session dialogues.
Framework Overview
To tackle the challenges, we introduce Reflective Memory Management (RMM), a novel framework that integrates two mechanisms. Prospective Reflection proactively decomposes dialogue history into topic-based memory representations, optimizing for future retrieval, while Retrospective Reflection dynamically refines the retrieval mechanism through online feedback signals generated during response generation. They together improve the quality of the retrieved memories, contributing to effective personalization.
Our framework comprises four key components. The memory bank stores dialogue history as a collection of memory entries, each represented as a pair (topic summary, raw dialogue), where the 'topic summary' serves as the search key for retrieving the conversational segment. The retriever identifies relevant memories based on the current user query. To enable lightweight adaptation of the retrieval process, we incorporate a reranker, which refines the retriever's initial output by prioritizing the most pertinent memories. Finally, an LLM synthesizes the relevant memories with the current context to produce a personalized response. Crucially, the LLM also provides feedback signals based on its utilization of retrieved memories, which are used to refine the reranker through Retrospective Reflection. Our complete workflow is detailed in Algorithm 1.
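As a concrete illustration of this interface, the following minimal sketch pairs each entry's topic summary (the search key) with its raw dialogue segment. The word-overlap scorer is a toy stand-in for the dense retriever f_θ (which uses embeddings), and all names are our own:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One memory bank entry: (topic summary, raw dialogue).
    The topic summary is the search key; the raw dialogue segment
    is what gets passed to the LLM at generation time."""
    topic_summary: str
    raw_dialogue: str

def score(query: str, entry: MemoryEntry) -> float:
    # Stand-in for the dense retriever f_theta: word overlap between the
    # query and the topic summary (the real system compares embeddings).
    q = set(query.lower().split())
    s = set(entry.topic_summary.lower().split())
    return len(q & s) / max(len(q), 1)

def retrieve_top_k(query: str, bank: list, k: int) -> list:
    # Only the topic summaries are searched, never the raw dialogue.
    return sorted(bank, key=lambda e: score(query, e), reverse=True)[:k]

bank = [
    MemoryEntry("user is allergic to penicillin", "User: I can't take penicillin ..."),
    MemoryEntry("user planning a trip to Japan", "User: I'm visiting Kyoto in May ..."),
]
hits = retrieve_top_k("which antibiotics are safe given my allergy to penicillin", bank, k=1)
```

Keeping the topic summary separate from the raw dialogue lets retrieval match on compact keys while generation still sees the full conversational evidence.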
Algorithm 1 | The RMM workflow.

Input: query q, past messages in current session S, memory bank B, retriever f_θ, reranker g_ϕ, LLM. Output: response a; updated S, g_ϕ, B.

- 1: Retrieve: M_K ← f_θ(q, B)
- 2: Rerank: M_M ← g_ϕ(q, M_K), where M_M = {m_i}_{i=1}^M
- 3: // Retrospective Reflection
- 4: Generate: a, R_M ← LLM(q, S, M_M), where R_M = {r_i}_{i=1}^M
- 5: g_ϕ ← RL_Update(g_ϕ, R_M)
- 6: S.append((q, a))
- 7: // Prospective Reflection
- 8: if session S ends then
- 9:   M ← ExtractMemory(S)
- 10:   for m ∈ M do
- 11:     B ← UpdateMemory(B, m)
- 12:   end for
- 13:   S ← []
- 14: end if
Prospective Reflection: Topic-Based Memory Organization
Traditional memory management systems often rely on fixed boundaries, such as session or turn delimiters, to structure dialogue history. However, these pre-defined boundaries may not align with the underlying semantic units of conversation. As a result, critical information may be fragmented across multiple memory entries, hindering effective retrieval. To address this, we introduce Prospective Reflection, a mechanism for organizing memory based on coherent topics, enabling more granular and semantically relevant future retrieval. Here 'topic' refers to a semantically coherent unit of discussion that may span across one or multiple turns in a session. Each topic is associated with the raw dialogue segment(s) in which it was discussed. These topics can range from fine-grained user intents (e.g., asking about vegan recipes) to broader themes (e.g., travel planning). As illustrated in Figure 2, this process occurs at the conclusion of each session and consists of two key steps: memory extraction and memory update.

Figure 2 | Illustration of Prospective Reflection . After each session, the agent decomposes and summarizes the session into specific topics. These newly generated memories are compared with existing memories in the memory bank. Relevant memories are merged , while others are directly added . Prospective reflection ensures efficient organization of personal knowledge for future retrieval.
First, memory extraction is achieved by using an LLM (prompt in Appendix D.1.1) to extract dialogue snippets from the session with their corresponding summaries based on the distinct topics mentioned. Second, memory update involves integrating the extracted topic-based memories into the memory bank. Specifically, for each extracted memory, we retrieve the Top-K most semantically similar
memories already present in the memory bank. Subsequently, an LLM (prompt in Appendix D.1.2) determines whether the extracted memory should be directly added into the memory bank (e.g., when the extracted memory discusses a new topic) or merged with an existing memory into an updated one (e.g., when the extracted memory provides updated information to a previously discussed topic).
Through Prospective Reflection, the memory bank maintains a coherent and consolidated representation of the evolving dialogue history, organized around meaningful topic structures.
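The add-or-merge logic above can be sketched in a few lines. Here `top_k` and `decide` are toy stand-ins for the retrieval step and the LLM prompts of Prospective Reflection; the overlap thresholds and all names are illustrative assumptions, not the paper's implementation:

```python
# Toy memory format: {"topic": str, "dialogue": str}.

def top_k(mem, bank, k):
    # Stand-in retriever: rank existing entries by topic-word overlap.
    overlap = lambda a, b: len(set(a["topic"].split()) & set(b["topic"].split()))
    return sorted(bank, key=lambda e: overlap(mem, e), reverse=True)[:k]

def decide(mem, candidates):
    # Stand-in for the LLM update decision: merge when topics clearly overlap.
    for cand in candidates:
        if len(set(mem["topic"].split()) & set(cand["topic"].split())) >= 3:
            return "MERGE", cand
    return "ADD", None

def update_memory_bank(bank, extracted, k=3):
    for mem in extracted:
        action, target = decide(mem, top_k(mem, bank, k))
        if action == "MERGE":
            target["topic"] = mem["topic"]                # refreshed summary
            target["dialogue"] += "\n" + mem["dialogue"]  # consolidated segment
        else:
            bank.append(mem)                              # new topic
    return bank

bank = [{"topic": "user likes hiking in Arizona", "dialogue": "User: I hike every weekend."}]
extracted = [
    {"topic": "user likes hiking trails near Phoenix", "dialogue": "User: I found a new trail."},
    {"topic": "user adopted a cat", "dialogue": "User: Meet Milo, my new cat."},
]
bank = update_memory_bank(bank, extracted)
```

In this run, the hiking update is merged into the existing hiking entry while the cat topic is added as a new entry, mirroring the merge/add branches in Figure 2.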
Retrospective Reflection: Retrieval Refinement via LLM Attribution
Reranker Design
While an off-the-shelf retriever can identify semantically-relevant memories, its performance can degrade across diverse dialogue domains and user interaction patterns. Instead of resorting to computationally expensive fine-tuning of the retriever, which requires extensive labeled data, we introduce a lightweight reranker to refine the retrieved memory list. This reranker allows for efficient adaptation to the nuances of specific dialogue domains and user preferences, enabling the system to dynamically adjust its retrieval strategy.
To be specific, the reranker processes the Top-K memory embeddings retrieved by the retriever, refining their relevance with respect to the user query and selecting the Top-M candidates. The whole process includes the following steps.
Embedding Adaptation. Let $\mathbf{q}$ represent the embedding of the query and $\mathbf{m}_i$ represent the embedding of the $i$-th memory entry retrieved by the retriever. The embeddings are fed into the reranker to be refined via a linear layer with residual connections:

$$
\mathbf{q}' = \mathbf{q} + \mathbf{W}_q \mathbf{q}, \qquad \mathbf{m}_i' = \mathbf{m}_i + \mathbf{W}_m \mathbf{m}_i,
$$

where $\mathbf{W}_q$ and $\mathbf{W}_m$ are linear transformation matrices for the query and memory, respectively.
Stochastic Sampling with Gumbel Trick. The adapted query embedding $\mathbf{q}'$ and memory embeddings $\mathbf{m}_i'$ are used to compute relevance scores via dot product: $s_i = \mathbf{q}'^{\top} \mathbf{m}_i'$. To select memory entries based on relevance scores, we employ the Gumbel Trick (Gumbel, 1954), which enables stochastic sampling from a discrete probability distribution while preserving gradients, making it particularly useful in reinforcement learning and differentiable ranking tasks (Jang et al., 2017). We add Gumbel noise $g_i$ (Maddison et al., 2014) to the relevance scores $s_i$ for each memory entry:

$$
\tilde{s}_i = s_i + g_i, \qquad g_i = -\log(-\log u_i),
$$

where $u_i \sim \mathrm{Uniform}(0, 1)$. The perturbed scores $\tilde{s}_i$ are then normalized using the softmax function to compute sampling probabilities: $p_i = \frac{\exp(\tilde{s}_i / \tau)}{\sum_{j=1}^{K} \exp(\tilde{s}_j / \tau)}$, where $\tau > 0$ is the temperature parameter controlling the sharpness of the distribution. Lower $\tau$ results in more deterministic sampling (approaching the maximum of $s_i$), while higher $\tau$ increases stochasticity, encouraging exploration.
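The noisy top-M selection can be sketched as follows. This is a minimal illustrative variant (the standard Gumbel-max form, adding noise to temperature-scaled scores and keeping the top-M), and all names are invented for the sketch:

```python
import math
import random

def gumbel_top_m(scores, m, tau=1.0, seed=0):
    """Perturb relevance scores with Gumbel noise and keep the top-M.
    With tau -> 0 this approaches a deterministic arg-top-M; larger tau
    makes the selection more exploratory (the Gumbel-top-k trick)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    perturbed = []
    for i, s in enumerate(scores):
        u = rng.random()                 # u_i ~ Uniform(0, 1)
        g = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        perturbed.append((s / tau + g, i))
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:m]]

# At a very low temperature the selection is essentially deterministic,
# returning the two highest-scoring memory indices.
top = gumbel_top_m([3.0, 1.0, 2.0], m=2, tau=0.01)
```

Taking the top-M of the perturbed scores is equivalent to sampling M entries without replacement in proportion to the softmax probabilities, which is what makes the stochastic selection trainable by RL.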
By introducing a reranker, RMM ensures efficient retrieval refinement without modifying the retriever itself, making it adaptable to any pre-trained retrieval model while allowing task-specific optimizations through Reinforcement Learning (RL).
LLM Attribution as Rewards
Obtaining high-quality user-specific labeled data for refining the retrieval process is prohibitively expensive. To overcome this challenge, we propose leveraging the inherent capabilities of the LLM
generator itself to provide automated feedback on the quality of retrieved memories. Given the user query with context in the current session, and the retrieved memories, we prompt the LLM (prompt in Appendix D.2) to generate both the response and the associated citations to each individual memory in the context (Kenthapadi et al., 2024). This design uses a single LLM call for generating response and LLM attribution, reducing computational overhead. Moreover, the citations are generated conditioned on the response, which has been shown to be more effective compared to prior or post-hoc citations (Buchmann et al., 2024).
Rewards. As shown in Figure 3, each retrieved memory entry receives either a positive or negative reward based on its citation in the generated response. Specifically, we assign a reward of +1 (Useful) if the generator cites the memory in the final response, and -1 (Not Useful) otherwise. This reward assignment reflects the utility of each memory entry and allows the reranker to learn better retrieval strategies over time, aligning future selections with the generator's actual usage of retrieved evidence. We validate its effectiveness in Section 8.3.
Reranker Update
The reranker is fine-tuned using the REINFORCE algorithm (Williams, 1992) to optimize its relevance predictions based on these binary rewards with the following formulation:

$$
\nabla_{\phi} J(\phi) = \mathbb{E}\left[ \sum_{i=1}^{M} (R_i - b)\, \nabla_{\phi} \log p_i \right],
$$

where $R_i$ is the reward ($+1$ or $-1$) for memory $m_i$, $b$ is a baseline value set as a hyperparameter, and $\phi$ denotes the weights of the reranker.
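As an illustration of this update, here is one REINFORCE step under the simplifying assumption that the relevance scores themselves are the learnable parameters (the actual reranker propagates the same signal through its linear layers); all names are ours:

```python
import math

def softmax(s, tau=1.0):
    e = [math.exp(x / tau) for x in s]
    z = sum(e)
    return [x / z for x in e]

def reinforce_step(scores, selected, rewards, b=0.0, lr=0.1, tau=1.0):
    """One REINFORCE step with a baseline, treating the scores as parameters.
    The gradient of sum_i (R_i - b) * log p_i with respect to score s_j is
    (1/tau) * sum_i (R_i - b) * (1[i == j] - p_j)."""
    p = softmax(scores, tau)
    grad = [0.0] * len(scores)
    for i, R in zip(selected, rewards):
        for j in range(len(scores)):
            grad[j] += (R - b) * ((1.0 if i == j else 0.0) - p[j]) / tau
    return [s + lr * g for s, g in zip(scores, grad)]  # gradient ascent

# A cited (+1) memory at index 0 has its score pushed up relative to the rest,
# so it becomes more likely to be selected for similar future queries.
new_scores = reinforce_step([0.0, 0.0, 0.0], selected=[0], rewards=[+1])
```

A negative reward would move the gradient in the opposite direction, suppressing uncited memories over time.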
Experimental Setup
Implementation Details
In our experiments, we use Gemini-1.5-Flash as the generator and evaluate Gemini-1.5-Pro in Section 8.4. We equip RMM with the following dense retrievers, chosen for their strong semantic representation capabilities and widespread adoption in personalized dialogue systems (Wu et al., 2025).
- Contriever (facebook/contriever) (Izacard et al., 2022): A dense retriever optimized for semantic search leveraging contrastive learning.
- GTE (Alibaba-NLP/gte-Qwen2-7B-instruct) (Li et al., 2023): A retriever designed for instruction-following queries, trained on a vast, multilingual text corpus spanning diverse domains.
- Stella (dunzhang/stella_en_1.5B_v5) (Zhang et al., 2024b): A large embedding-based retriever developed on top of language models.
Contriever is used as the default retriever. Following Wu et al. (2025), for experiments without a reranker, the Top𝐾 is 5. Otherwise, the default Top𝐾 is 20 and Top𝑀 is 5. We explore the

Figure 3 | Illustration of Retrospective Reflection . The Retriever fetches Top𝐾 memory entries from the memory bank, which are refined by the learnable Reranker to select the Top𝑀 most relevant entries. These entries are passed to the LLM along with the query to generate the final response. The LLM assigns binary citation scores ( + 1 for useful and -1 for not useful) to the retrieved memory entries based on their utility in the response. These scores are used as reward signals to update the reranker via an RL update , adapting the selection of relevant memory over time.
impact of retrieval parameters in Appendix 8.7. For LongMemEval, we also consider the results of using an 'Oracle' retriever which retrieves the ground-truth turns annotated in the dataset with the necessary personal knowledge to respond to a question. More implementation and training details are elaborated in Appendix A.
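The retrieve-then-rerank pipeline described above (Top𝐾 candidates from a frozen retriever, Top𝑀 kept by the reranker) can be sketched as follows; the cosine-similarity retrieval and `rerank_fn` are illustrative stand-ins for the actual components:

```python
import numpy as np

def retrieve_then_rerank(query_vec, memory_vecs, rerank_fn, top_k=20, top_m=5):
    """Two-stage memory selection: a frozen dense retriever proposes Top-K
    candidates by cosine similarity, then a learnable reranker keeps Top-M.

    `rerank_fn` is a stand-in for the trained reranker (an assumption of
    this sketch); it maps candidate indices to relevance scores.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    top_k_idx = np.argsort(-(m @ q))[:top_k]           # retrieval stage
    order = np.argsort(-rerank_fn(top_k_idx))[:top_m]  # reranking stage
    return top_k_idx[order]
```

Keeping the retriever frozen and only updating the reranker is what makes the online RL refinement lightweight: the memory bank embeddings never need to be recomputed.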
Datasets and Evaluation Metrics
We experiment on two publicly available benchmark datasets commonly used for personalized dialogue evaluation: MSC (Xu et al., 2022) and LongMemEval (Wu et al., 2025). Additional details about the datasets can be found in Appendix B.
For MSC , the evaluation measures if the generated response matches the human-provided ground truth. We follow Li et al. (2025) to use METEOR (Banerjee and Lavie, 2005) for measuring lexical similarity and BERTScore (Zhang et al., 2020) for measuring semantic similarity. We also provide LLM judge results in Appendix 8.8.
For LongMemEval , we follow the original paper to use Recall@K to evaluate the model's ability to retrieve relevant information for the query from conversation histories and use an LLM judge to measure the Accuracy of the generated answer by comparing it to the human-provided ground truth using Gemini-1.5-Pro. The prompt is presented in Appendix D.3.
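Recall@K can be computed per question as the fraction of annotated evidence turns recovered among the top-K retrieved entries; a minimal sketch (the benchmark's exact aggregation may differ):

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of ground-truth evidence turns found in the top-k
    retrieved entries for one question."""
    top = set(retrieved_ids[:k])
    return len(top & set(gold_ids)) / len(gold_ids)
```

Scores are then averaged over all questions to obtain the Recall@5 and Recall@10 numbers reported in the tables.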
Compared Methods
To benchmark the performance of RMM, we compare it against the following baselines which represent different strategies for managing and retrieving long-term conversational memories, allowing for a comprehensive comparison with RMM.
· No History : No history session is used.
· RAG : These models retrieve relevant turns or sessions for a given user query, concatenate them with the query, and feed the resulting input to the LLM for response generation. We use turns as the default granularity for better performance.
· Long Context : This method directly incorporates as much conversation history as possible into the context window. Older turns are truncated.
· Personalized Dialogue Agents : We consider two agent systems: (1) MemoryBank (Zhong et al., 2024) treats conversation history as a fixed database and modulates retrieval using heuristics based on the forgetting curve. (2) LD-Agent (Li et al., 2025) employs fixed conversation databases with additional retrieval modulation using strategies such as keyword matching.
Experimental Results
Main Results
We present the main results in Table 1 and analyze each method's performance as follows.
History matters: Without any history, the LLM performs poorly, achieving a METEOR score of 5.2% on MSC and 0.0% accuracy on LongMemEval , showing the necessity of historical context.
Long context is not enough: Long-context models struggle due to fixed context windows and the inclusion of noisy context. On MSC , scores remain low ( e.g. , METEOR below 20%, BERT score under 40%), and on LongMemEval , accuracy is lower than 58%. This limitation highlights their inability to retain and utilize long-term knowledge.
Table 1 | Performance comparison of RMM with baseline methods on the MSC and LongMemEval datasets. Metrics include METEOR and BERT Scores for MSC , and Recall@5 and Accuracy (Acc.) scores for LongMemEval . RMM demonstrates superior performance across all metrics, highlighting its effectiveness in retrieval relevance and personalized response generation. No oracle retrieval is available for the MSC dataset. MemoryBank and LD-Agent utilize their specific methods for retrieval. Scores are averaged over 3 runs and are reported in percentage (%).
RAG Models: RAG models outperform Long-Context LLMs by only incorporating relevant histories. With strong retrievers like GTE, RAG achieves 27.5% METEOR and 52.1% BERT Scores on MSC and 62.4% recall and 63.6% accuracy on LongMemEval . We also observe that the performance is retriever-dependent, where stronger retrievers boost the performance.
Personalized Dialogue Agents: MemoryBank and LD-Agent show only moderate improvements over Long-Context LLMs. For instance, LD-Agent achieves 25.4% METEOR and 51.5% BERT score on MSC , but both models fall short of RAG and RMM. Their reliance on heuristic-based retrieval potentially limits adaptability to complex tasks.
Proposed RMM Framework: RMM consistently achieves the best results across datasets and metrics. With GTE, RMM achieves 33.4% METEOR and 57.1% BERT on MSC , and 69.8% recall and 70.4% accuracy on LongMemEval . Even with weaker retrievers like Contriever, RMM maintains competitive performance, demonstrating robustness. The improvements stem from RMM's ability to integrate dynamic memory management with adaptive retrieval optimization, which enables it to retrieve and utilize relevant knowledge effectively, outperforming all baselines.
To further assess the impact of memory integration, we calculate the proportion of test examples where memory improves response quality. On MSC , memory improves on 86% of responses, as the dataset frequently requires recalling prior discussion topics. On LongMemEval , where questions are deliberately designed to test historical recall, memory contributes to quality improvements in 100% of cases. These results show the necessity of memory mechanisms in maintaining long-term coherence. We provide case studies of the memory usage in Appendix C.
Ablation Study
We conduct an ablation study to evaluate the contributions of key components in the RMM framework. We present the results in Table 2 and list our observations below.
(i) Adding Prospective Reflection boosts performance by organizing the memory into structured topics, which reduces redundancy and improves relevance. (ii) Retrospective Reflection alone, without a reranker, misaligns retrieved content, leading to suboptimal results. Directly updating the retriever using RL rewards requires extensive training data for effective full fine-tuning, which is often difficult to obtain in real-world scenarios. Without sufficient data, it can lead to issues like catastrophic forgetting (McCloskey and Cohen, 1989). (iii) The addition of the reranker alongside RR significantly enhances alignment, achieving 27.5% METEOR and 58.8% Recall@5, demonstrating its effectiveness in refining retrieval quality.
Table 2 | Ablation study on the datasets. Variants evaluate the impact of key components in RMM: Prospective Reflection (PR), Retrospective Reflection (RR), and the reranker. RR (W/O reranker) means the retriever is fine-tuned instead. Scores are obtained with Contriever and Gemini-1.5-Flash and reported in percentage (%).
(iv) Finally, the complete RMM framework, which integrates Prospective Reflection, Retrospective Reflection, and the reranker, achieves the best results across all metrics, with a METEOR score of 30.8% on MSC and 60.4% Recall@5 on LongMemEval . This confirms that RMM enables more accurate and efficient future retrieval.
Validation of Citation Scores
Our framework leverages LLM-generated citations to determine reward scores, guiding the retrieval refinement process. To assess the validity of the citation scores, we conduct evaluation on the LongMemEval dataset, using the Gemini-1.5-Pro model as the judge. The experiment tasks the LLM with determining whether cited memories were useful for response generation. The results, presented in Table 3, demonstrate high precision, recall, and F1, confirming the effectiveness of citation-based scoring in our framework.
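This validation compares, for each retrieved memory, the citation-derived label against the judge's usefulness label. The per-class precision, recall, and F1 reported in Table 3 can be computed with a small helper (a sketch of the metric, not the paper's evaluation script):

```python
def prf(pred, gold, positive=1):
    """Precision/recall/F1 for one class, comparing citation-derived
    labels (`pred`) against judge labels (`gold`), both in {+1, -1}."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling it with `positive=1` gives the "Useful memory" row and with `positive=-1` the "Not useful memory" row of Table 3.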
Effect of Different LLMs
To examine the effect of different LLMs as generators, we evaluate both Gemini-1.5-Flash and Gemini-1.5-Pro in Long-Context LLMs and RMM. As shown in Table 4, for Long-Context models, Gemini-1.5-Pro achieves slightly better performance than Gemini-1.5-Flash across all metrics, suggesting that a stronger model improves response quality when relying solely on extended context windows. However, for RMM, Gemini-1.5-Flash outperforms Gemini-1.5-Pro, achieving higher METEOR and BERT scores on MSC and better accuracy on LongMemEval . Similar observations are reported by Wu
Table 3 | Evaluation of citation-based scoring in RR for useful memory identification on LongMemEval (results in %).
Table 4 | Effect of different LLMs on MSC and LongMemEval . Results (in %) compare LongContext LLMs and RMM using the Contriever retriever with Gemini models as generators.

Figure 4 | Granularity analysis on randomly sampled 100 instances from LongMemEval with the GTE retriever and Gemini-1.5-Flash generator. 'Turn' and 'Session' indicate retrieval at a fixed granularity. 'Mix' represents retrieving from a pool combining both turns and sessions. 'PR' refers to the granularity resulting from the proposed Prospective Reflection, while 'Best' corresponds to selecting the optimal granularity (either turn or session) for each instance.

Figure 5 | Impact of offline pretraining on retriever performance for LongMemEval dataset with the same 100 random samples as Figure 4. Results without offline pretraining are shown in blue, while results with offline pretraining are shown in orange. Offline pretraining improves recall and accuracy across all settings.
et al. (2025), where GPT-4o-mini performs better than GPT-4o in personal knowledge QA. This trend can be attributed to stronger LLMs, such as Gemini-1.5-Pro, being more likely to abstain from answering queries involving personal information, possibly due to stronger alignment tuning aimed at enhancing privacy protection.
Effect of Different Granularities
We conduct experiments to show the advantage of the flexible granularity resulting from the proposed Prospective Reflection (PR) over pre-defined fixed granularities as baselines. Results in Figure 4 show that fixed granularities, such as 'Turn' and 'Session', achieve moderate performance, with session-level retrieval outperforming turn-level due to richer contexts. The 'Mix' granularity underperforms, likely due to increased noise from a larger search space. The 'Best' configuration, which selects the optimal granularity per instance, achieves the highest scores, demonstrating the importance of adaptive memory organization. PR improves performance by integrating fragmented conversational segments into cohesive memory structures, approaching the performance of the best oracle granularity.
Offline Supervised Training
We further investigate the applicability of RMM in scenarios where a small amount of labeled retrieval data is available, allowing for offline supervised pretraining (based on the off-the-shelf retriever) before online refinement. Figure 5 illustrates the impact of offline pretraining on retriever performance on the LongMemEval dataset. We randomly select 100 samples as test data with the rest as training and validation sets and apply vanilla supervised contrastive learning for the GTE retriever (Li et al., 2023). As the results show, across all settings, RMM consistently benefits from offline pretraining (orange bars) by outperforming retrievers without pretraining (blue bars). These results demonstrate that offline pretraining can enhance the retriever's ability to identify relevant information, providing a robust foundation for subsequent fine-tuning via RL.
The Impact of Top𝐾 and Top𝑀
The results in Table 5 evaluate the impact of the number of retrieved memories (Top𝐾 ) and the number of reranked memories used for response generation (Top𝑀 ) in the RMM framework. Specifically, we analyze the performance on LongMemEval using Recall@5 (Top𝐾 = 20, Top𝑀 = 5), Recall@10 (Top𝐾 = 50, Top𝑀 = 10), and their corresponding QA accuracy scores.
The results demonstrate two key findings. First, increasing the number of reranked memories (Top𝑀 ) from 5 to 10 consistently improves both retrieval and accuracy metrics across all retrievers. For example, with the GTE retriever, Recall improves from 69.8% to 74.4%, and Accuracy increases from 70.4% to 73.8%. Second, the performance gain is most significant for stronger retrievers like GTE and Stella, highlighting the importance of retrieval quality. RMM with GTE achieves the best results: 70.4% Accuracy with Top𝐾 = 20, Top𝑀 = 5 and 73.8% Accuracy with Top𝐾 = 50, Top𝑀 = 10.
These observations emphasize that careful selection of Top𝐾 and Top𝑀 values can enhance both retrieval relevance and downstream QA performance. The combination of effective retrieval and reranking ensures that RMM efficiently leverages the most relevant information for long-term dialogue tasks.
LLM-as-a-Judge
Table 5 | Impact of Top𝐾 (retrieved memories) and Top𝑀 (reranked memories) on LongMemEval performance. Results include Recall@5 (Top𝐾 = 20, Top𝑀 =5) and Recall@10 (Top𝐾 =50, Top𝑀 = 10), and corresponding Accuracy across different retrievers. Results show that increasing the number of retrieved and reranked memories improves retrieval and QA performance on the LongMemEval dataset.
Table 6 | Results for MSC with LLM-as-a-judge. RMM shows consistent advantages with more finegrained evaluation.
Table 7 | Cohen's Kappa Agreement Between Annotators and LLM
For fair comparison, we follow prior work (Wu et al., 2025; Xu et al., 2022) and use METEOR and BERTScore. Here we include additional results using LLM-as-a-judge. Following Wu et al. (2025), we use Gemini-1.5-Pro to decide whether the generated answer matches the ground truth as a binary annotation. The prompt we used is given in Appendix D.3. LLM-as-a-judge results also show the effectiveness of the proposed RMM.
Furthermore, we conducted a human evaluation on 100 randomly sampled instances from the MSC dataset. For each instance, annotators were shown the user query, the ground-truth response, and the model-generated response, mirroring the setup used for the LLM-based evaluations. We adopted the same instruction used in our LLM judge prompt and asked human annotators to answer Yes/No for each instance. The annotation was independently performed by two NLP researchers, both PhD students working in NLP and not on the author list. We report inter-annotator agreement and agreement between human judgments and the LLM judge in Table 7.
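Cohen's kappa for binary Yes/No judgments can be computed directly from its definition; for 0/1 labels this sketch matches the standard two-rater formulation:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary (0 = No, 1 = Yes) annotation lists.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_yes = (sum(a) / n) * (sum(b) / n)
    p_no = (1 - sum(a) / n) * (1 - sum(b) / n)
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Identical label lists give kappa 1.0, while agreement no better than chance gives 0.0.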
$$ \mathbf{q}^{\prime} = \mathbf{q} + \mathbf{W}_q \mathbf{q}, \quad \mathbf{m}_i^{\prime} = \mathbf{m}_i + \mathbf{W}_m \mathbf{m}_i, $$
$$ \Delta \phi = \eta \cdot (R - b) \cdot \nabla_\phi \log P(\mathcal{M}_M | q, \mathcal{M}_K; \phi), $$
\caption{Reflective Memory Management (RMM) for Dialogue Agents}
\label{alg:rmm}
\textbf{Input:} query $q$, past messages in current session $S$, memory bank $B$, retriever $f_\theta$, reranker $g_\phi$, $\textsc{LLM}$\\
\textbf{Output:} response $a$, updated $S$, $g_\phi$, $B$
\begin{algorithmic}[1]
\State \textbf{Retrieve:} $\mathcal{M}_K \gets f_\theta(q,B)$
\State \textbf{Rerank:} $\mathcal{M}_M \gets g_\phi(q, \mathcal{M}_K)$, where $\mathcal{M}_M=\{m_i\}_{i=1}^{M}$
\State \texttt{\textsc{// Retrospective Reflection}}
\State \textbf{Generate:} $a, R_M \gets \textsc{LLM}(q, S, \mathcal{M}_M)$ where $R_M=\{r_i\}_{i=1}^M$
\State $g_\phi \gets \textbf{RL\_Update}(g_\phi, R_M)$
\State $S.append((q, a))$
\State \texttt{\textsc{// Prospective Reflection}}
\If{session $S$ ends}
\State $\mathcal{M} \gets \textbf{ExtractMemory}(S)$
\For{$m \in \mathcal{M}$}
\State $B \gets \textbf{UpdateMemory}(B, m)$
\EndFor
\State $S \gets []$
\EndIf
\end{algorithmic}
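The algorithm above can be rendered as a Python sketch. Every callable below is a placeholder for the corresponding RMM component (retriever f_θ, reranker g_φ, the LLM, and the Prospective Reflection operators), not the paper's implementation:

```python
def rmm_step(query, session, bank, retriever, reranker, llm, k=20, m=5):
    """One RMM interaction step, mirroring the algorithm above.
    All callables are stand-ins for the paper's components."""
    candidates = retriever(query, bank, top_k=k)      # Retrieve: M_K
    memories = reranker(query, candidates, top_m=m)   # Rerank: M_M
    answer, rewards = llm(query, session, memories)   # Generate response + citations R_M
    reranker.rl_update(rewards)                       # Retrospective Reflection
    session.append((query, answer))
    return answer

def end_of_session(session, bank, extract_memory, update_memory):
    """Prospective Reflection: fold the finished session into the bank."""
    for memory in extract_memory(session):
        bank = update_memory(bank, memory)
    session.clear()
    return bank
```

The two reflections operate on different timescales: Retrospective Reflection updates the reranker after every turn, while Prospective Reflection reorganizes the memory bank only once a session ends.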
Conclusion
We present RMM, a framework that integrates Prospective Reflection for structured, topic-based memory organization and Retrospective Reflection for dynamic memory reranking via reinforcement learning. Experimental results on benchmark datasets demonstrate that RMM outperforms state-of-the-art baselines in retrieval relevance and response quality for personalized dialogue tasks. By identifying limitations in existing memory management approaches, particularly those relying on fixed granularity and static retrievers, we highlight key challenges and avenues for future research in long-term dialogue memory modeling.
Limitations
While the proposed RMM framework demonstrates significant improvements in retrieval relevance and response quality, it is not without limitations. First, RMM relies on reinforcement learning for memory reranking, which can be computationally expensive, especially for large-scale datasets or real-time applications. Second, the current framework primarily focuses on textual data, limiting its applicability to multi-modal dialogue systems that incorporate images, audio, or video. Additionally, the memory updating mechanism may require further optimization to handle dynamically evolving long-term user interactions efficiently.
For future work, we plan to address these limitations by exploring more efficient reinforcement learning techniques and lightweight memory reranking strategies. We also aim to extend RMM to multimodal dialogue systems to accommodate diverse user interactions. Furthermore, we will investigate privacy-preserving techniques to ensure safe deployment of RMM in real-world personalized dialogue applications where sensitive user data is involved.
Ethical Statement
This work focuses on developing a framework for long-term personalized dialogue systems to improve user experiences. However, we acknowledge the potential ethical implications of handling personal data in such systems. The RMM framework relies on historical conversations, which may contain sensitive or private information. To mitigate privacy risks, we recommend adopting robust encryption and privacy-preserving methods, such as differential privacy or federated learning, during data collection and model training.
Additionally, we emphasize the importance of transparent data usage policies and obtaining user consent when deploying personalized dialogue systems. Efforts should also be made to minimize biases in memory retrieval and response generation to ensure fairness and inclusivity across diverse user groups. Future work will continue to prioritize ethical considerations to promote the responsible development and deployment of personalized dialogue technologies.
Implementation and Training Details
Parameter Setup
We use the following hyper-parameters for all experiments:
Dependencies
Our implementation relies on the following tools and libraries:
Hardware and Reproducibility
All experiments are conducted on a server with the following hardware configuration:
Details for MemoryBank and LD-Agent Baselines
We integrate MemoryBank and LD-Agent as baselines, with key features implemented using the LongMemEval codebase 1 . We use Contriever as the default retriever. In particular, they differ in how they structure and access stored information.
MemoryBank (Zhong et al., 2024) retrieves historical context by maintaining a structured memory where both conversational summaries and round-level utterances are stored as key-value pairs. The retrieval process involves directly matching user queries to the most relevant stored information, ensuring efficient context retrieval for response generation.
LD-Agent (Li et al., 2025), on the other hand, enhances retrieval by incorporating keyphrase-based queries. In addition to storing factual and summarized information, its retrieval is based on queries with key phrases

Figure 6 | Convergence of usefulness scores (ratio of useful memories cited) over RL training steps. The score improves as the reranker is updated based on Retrospective Reflection, indicating enhanced alignment between retrieved memory and generated responses.
extracted from past interactions. This enables the model to adapt more effectively to diverse query formulations, retrieving context that aligns with the underlying semantic meaning of the user input.
For both methods, retrieval operates in a non-hierarchical manner, meaning that all stored data is accessed through a uniform search mechanism without additional interaction-based refinement. The retrieved content is then used to provide historical grounding for response generation.
The Convergence of Citation Scores in RL
Figure 6 illustrates the convergence of citation scores (usefulness scores) during reinforcement learning. The x-axis represents the RL training steps, while the y-axis measures the ratio of useful memories cited by the LLM generator. Initially, the usefulness score starts at a low value around 0.2, reflecting the misalignment between retrieved memories and response generation. As training progresses, the score steadily increases, converging to approximately 0.4 by step 1000. This trend highlights the effectiveness of Retrospective Reflection in updating the reranker, allowing the retrieval process to better align with the generator's citation behavior. The gradual convergence indicates stable learning and suggests that RL fine-tuning improves retrieval quality without overfitting.
Dataset Description
We conduct experiments on two publicly available datasets: MSC (Xu et al., 2022) and LongMemEval (Wu et al., 2025). MSC is a benchmark dataset for multi-session conversations, providing turn-level and session-level conversational data with annotations for relevance and response quality. On this dataset, following Li et al. (2025), we evaluate the ability of an LLM agent to produce human-like personalized responses. Each response can be grounded in historical context across multiple previous sessions. The focus is on accurately generating personalized responses by leveraging relevant user preferences and conversation patterns. We followed the methodology outlined by Li et al. (2025) to construct the data for our experiments. Specifically, we use the first 1000 sessions as chat history and the rest for
1 https://github.com/xiaowu0162/LongMemEval
evaluation.
LongMemEval is designed for long-term conversational evaluation. It includes extended histories across turn, session, and mixed granularities. For experiments in Section 8.6, we randomly sample 100 test instances and use the remaining data for training and validation. On this dataset, following Li et al. (2025), we evaluate the system's ability to answer human-designed questions about specific personal knowledge described in the historical sessions. For example, given a query like, 'What car did Mary buy last summer?', the system must retrieve and synthesize information scattered across multiple sessions. The task emphasizes accurately identifying and leveraging relevant details from long-term memory.
Case Studies
We present case studies to illustrate how RMM effectively integrates relevant memory fragments to enhance response quality. The following examples highlight scenarios where historical context is essential for maintaining coherence and accuracy in long-term dialogue.
Case 1: Revisiting Fitness Choices (MSC)
Tracking personal preferences and habits across multiple conversations is essential for maintaining coherent and personalized dialogue. In this case, the user initially considers purchasing a treadmill (Session A), later expresses a preference for using the gym treadmill due to weather constraints (Session B), and finally confirms their gym-going routine (Session C). An effective memory mechanism should correctly track this evolving decision and retrieve the most up-to-date preference.
Case 1: Revisiting Fitness Choices (MSC)
Session A, Turn 3:
Session B, Turn 2:
Session C, Turn 2:
Analysis: The user's decision about treadmill usage shifts across sessions. Initially, in Session A, they express interest in buying a treadmill. By Session B, they reconsider and decide that using the gym treadmill would be sufficient and confirm that they run on the treadmill at the gym. Without memory management, the model generates an outdated response, assuming the user is still undecided about purchasing a treadmill.
Case 2: Tracking Chronological Order of Events (LongMemEval)
In long-term interactions, correctly recalling the sequence of past events is essential for maintaining factual consistency. This case examines whether the model can track the order in which the user attended two different events.
Case 2: Tracking Chronological Order of Events (LongMemEval)
Session A, Turn 1:
Session B, Turn 3:
[Question] Which event did I attend first, the 'Effective Time Management' workshop or the 'Data Analysis using Python' webinar?
[Answer]
Analysis: The correct response requires linking the time reference ('two months ago') with the corresponding event. Without RMM, the model fails to retrieve this detail, resulting in an uncertain and incomplete answer. With RMM, the model correctly recalls the chronological order, demonstrating the advantage of structured memory retrieval in tracking event sequences.
Prompts
Prospective Reflection
Memory Extraction
Function: Memory extraction for SPEAKER_1
Task Description: Given a session of dialogue between SPEAKER_1 and SPEAKER_2, extract the personal summaries of SPEAKER_1, with references to the corresponding turn IDs. Ensure the output adheres to the following rules:
Example: INPUT:
· Turn 0:
· Turn 1:
· Turn 2:
· Turn 3:
· Turn 4:
· Turn 5:
OUTPUT:
{
  "extracted_memories": [
    {"summary": "SPEAKER_1 asked about a new gym in town and suggested older gyms or a treadmill as alternatives.", "reference": [0, 2]},
    {"summary": "SPEAKER_1 usually lifts weights at the gym rather than using a treadmill.", "reference": [3]},
    {"summary": "SPEAKER_1 has heard good things about the NordicTrack treadmill.", "reference": [3]},
    {"summary": "SPEAKER_1 lives in New England and experiences foggy and rainy weather but enjoys the fall season.", "reference": [4]},
    {"summary": "SPEAKER_1 has lived overseas in the tropics before.", "reference": [5]}
  ]
}
Task: Follow the JSON format demonstrated in the example above and extract the personal summaries for SPEAKER_1 from the following dialogue session.
Input: {}
Output:
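A consumer of this prompt would parse the JSON output into (summary, reference) entries before writing them to the memory bank. A minimal sketch (retry logic for malformed LLM output is elided):

```python
import json

def parse_extracted_memories(llm_output: str):
    """Parse the Memory Extraction prompt's JSON output into
    (summary, turn_ids) pairs. Raises ValueError/KeyError on malformed
    output, which a production system would catch and retry."""
    data = json.loads(llm_output)
    return [(m["summary"], m["reference"]) for m in data["extracted_memories"]]
```

Keeping the turn references alongside each summary is what allows memory entries at summary granularity to be traced back to their original utterances.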
Memory Extraction
Task Description: Given a session of dialogue between SPEAKER_1 and SPEAKER_2, extract the personal summaries of SPEAKER_2, with references to the corresponding turn IDs. Ensure the output adheres to the following rules:
Example:
INPUT:
· Turn 3:
OUTPUT:
{
  "extracted_memories": [
    {"summary": "SPEAKER_2 is considering joining a local gym due to frequent rain affecting outdoor runs.", "reference": [0, 1]},
    {"summary": "SPEAKER_2 is debating between buying a treadmill for home use or going to the gym for more workout variety.", "reference": [2]},
    {"summary": "A new gym was recently built near SPEAKER_2, replacing a previous one that was an hour away.", "reference": [3]},
    {"summary": "SPEAKER_2 has access to a nice local running trail.", "reference": [4]},
    {"summary": "SPEAKER_2 notices there is no local running club but is considering starting one.", "reference": [5]}
  ]
}
Task: Follow the JSON format demonstrated in the example above and extract the personal summaries for SPEAKER_2 from the following dialogue session.
Input:
{}
Memory Update
Task Description: Given a list of history personal summaries for a specific user and a new and similar personal summary from the same user, update the personal history summaries following the instructions below:
Format: Add()
Format: Merge(index, merged_summary)
Note: index is the position of the relevant history summary in the list. merged_summary is the merged summary of the new summary and the relevant history summary. Two summaries are considered relevant if they discuss the same aspect of the user's personal information or experiences.
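The Add()/Merge(index, merged_summary) operations emitted by this prompt can be applied to the summary list with a small dispatcher. As a simplifying assumption, this sketch passes the new summary inside Add(...) rather than separately in the prompt:

```python
def apply_update(history: list[str], op: str) -> list[str]:
    """Apply one Memory Update operation emitted by the LLM.

    Merge(index, merged_summary) replaces the relevant history entry;
    Add(summary) appends a new one (in the actual prompt, Add() takes no
    argument and the new summary is supplied separately -- an assumption
    of this sketch).
    """
    if op.startswith("Merge(") and op.endswith(")"):
        index_str, merged = op[len("Merge("):-1].split(",", 1)
        history[int(index_str)] = merged.strip()
    elif op.startswith("Add(") and op.endswith(")"):
        history.append(op[len("Add("):-1])
    return history
```

This is the step that keeps the memory bank consolidated: related summaries are merged in place instead of accumulating near-duplicate entries.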
Example: INPUT:
Introduction
Merge(0, SPEAKER_1 exercises every Monday and Thursday, although he doesn't particularly enjoy it.)
Task: Follow the example format above to update the personal history for the given case. INPUT:
Introduction
Retrospective Reflection
Task Description: Given a user query and a list of memories consisting of personal summaries with their corresponding original turns, generate a natural and fluent response while adhering to the following guidelines:
Examples:
Case 1: Useful Memories Found INPUT:
Speaker 2: That's awesome! What songs do you usually play?
Output: You enjoy hiking, playing the guitar, and stargazing. [0, 1, 2]
Case 2: No Useful Memories INPUT:
Speaker 2: That sounds amazing! Do you go alone or with friends?
Output: I don't have enough information to answer that. [NO_CITE]
Introduction
Input:
Output:
LLM-as-a-Judge
You are an expert language model evaluator. I will provide you with a question, a ground-truth answer, and a model-generated response. Your task is to determine whether the response correctly answers the question by following these evaluation rules:
Examples:
Example 1: Correct Response
Example 2: Incorrect Response
Introduction
Input:
Output:
| Method | Retriever | MSC METEOR (%) ↑ | MSC BERT (%) ↑ | LongMemEval Recall@5 (%) ↑ | LongMemEval Acc. (%) ↑ |
|---|---|---|---|---|---|
| No History | - | 5.2 | 10.6 | - | 0.0 |
| Long Context | - | 14.8 | 31.9 | - | 57.4 |
| RAG | Contriever | 24.8 | 50.8 | 54.3 | 58.8 |
| RAG | Stella | 26.2 | 51.6 | 59.2 | 61.4 |
| RAG | GTE | 27.5 | 52.1 | 62.4 | 63.6 |
| MemoryBank | Specific | 20.1 | 40.3 | 58.6 | 59.6 |
| LD-Agent | Specific | 25.4 | 51.5 | 56.8 | 59.2 |
| RMM (Ours) | Contriever | 30.8 | 55.4 | 60.4 | 61.2 |
| RMM (Ours) | Stella | 31.9 | 56.3 | 65.9 | 64.8 |
| RMM (Ours) | GTE | 33.4 | 57.1 | 69.8 | 70.4 |
| RAG | Oracle | - | - | 100.0 | 90.2 |
| Variant | MSC METEOR | MSC BERT | LongMemEval Recall@5 | LongMemEval Acc. |
|---|---|---|---|---|
| RAG | 24.8 | 50.8 | 54.3 | 58.8 |
| + PR | 28.6 | 53.3 | 57.4 | 59.6 |
| + RR (W/O reranker) | 20.3 | 31.8 | 34.2 | 31.0 |
| + RR | 27.5 | 52.2 | 58.8 | 60.2 |
| RMM | 30.8 | 55.4 | 60.4 | 61.2 |
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Useful memory | 89.4 | 91.1 | 90.2 |
| Not useful memory | 87.2 | 84.6 | 85.9 |
| Overall | 87.6 | 85.8 | 86.7 |
| Method | LLM | MSC METEOR | MSC BERT | LongMemEval Acc. |
|---|---|---|---|---|
| Long Context | Gemini-1.5-Flash | 14.8 | 31.9 | 57.4 |
| Long Context | Gemini-1.5-Pro | 17.4 | 36.1 | 56.6 |
| RMM | Gemini-1.5-Flash | 30.8 | 55.4 | 61.2 |
| RMM | Gemini-1.5-Pro | 24.6 | 50.6 | 58.6 |
| Model | Retriever | LongMemEval Recall@5 | LongMemEval Acc. (top-5) | LongMemEval Recall@10 | LongMemEval Acc. (top-10) |
|---|---|---|---|---|---|
| RMM | Contriever | 60.4 | 61.2 | 67.2 | 66.8 |
| RMM | Stella | 65.9 | 64.8 | 70.6 | 71.0 |
| RMM | GTE | 69.8 | 70.4 | 74.4 | 73.8 |
| Method | LLM | METEOR | BERT | LLM-as-a-Judge (Yes %) |
|---|---|---|---|---|
| Long Context | Gemini-1.5-Flash | 14.8 | 31.9 | 25.4 |
| Long Context | Gemini-1.5-Pro | 17.4 | 36.1 | 22.8 |
| RMM | Gemini-1.5-Flash | 30.8 | 55.4 | 69.7 |
| RMM | Gemini-1.5-Pro | 24.6 | 50.6 | 65.4 |
| Pair | Cohen's Kappa ( 𝜅 ) | Interpretation |
|---|---|---|
| Human A vs. Human B | 0.82 | Almost perfect agreement |
| LLM vs. Human A | 0.71 | Substantial agreement |
| LLM vs. Human B | 0.69 | Substantial agreement |

References
[chen2024large] Chen, Jin, Liu, Zheng, Huang, Xu, Wu, Chenwang, Liu, Qi, Jiang, Gangwei, Pu, Yuanhao, Lei, Yuxuan, Chen, Xiaolong, Wang, Xingmei, Zheng, Kai, Lian, Defu, Chen, Enhong. (2024). When large language models meet personalization: perspectives of challenges and opportunities. World Wide Web. doi:10.1007/s11280-024-01276-1.
[li2024personal] Li, Yuanchun, Wen, Hao, Wang, Weijun, Li, Xiangyu, Yuan, Yizhen, Liu, Guohong, Liu, Jiacheng, Xu, Wenxing, Wang, Xiang, Sun, Yi, others. (2024). Personal llm agents: Insights and survey about the capability, efficiency and security. ArXiv preprint.
[tseng2024two] Tseng, Yu-Min, Huang, Yu-Chao, Hsiao, Teng-Yun, Chen, Wei-Lin, Huang, Chao-Wei, Meng, Yu, Chen, Yun-Nung. (2024). Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.969.
[joko2024doing] Joko, Hideaki, Chatterjee, Shubham, Ramsay, Andrew, De Vries, Arjen P, Dalton, Jeff, Hasibi, Faegheh. (2024). Doing personal laps: Llm-augmented dialogue construction for personalized multi-session conversational search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[eapen2023personalization] Eapen, Joel, Adhithyan, VS. (2023). Personalization and customization of llm responses. International Journal of Research Publication and Reviews.
[kolasani2023optimizing] Saydulu Kolasani. (2023). Optimizing Natural Language Processing, Large Language Models (LLMs) for Efficient Customer Service, and hyper-personalization to enable sustainable growth and revenue. Transactions on Latest Trends in Artificial Intelligence.
[guan2023intelligent] Guan, Yanchu, Wang, Dong, Chu, Zhixuan, Wang, Shiyu, Ni, Feiyue, Song, Ruihua, Zhuang, Chenyi. (2024). Intelligent Agents with LLM-based Process Automation. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3637528.3671646.
[zhang2024simulating] Zhang, Zheyuan, Zhang-Li, Daniel, Yu, Jifan, Gong, Linlu, Zhou, Jinchang, Liu, Zhiyuan, Hou, Lei, Li, Juanzi. (2024). Simulating classroom education with llm-empowered agents. ArXiv preprint.
[shareef2024retailgpt] Shareef, Farooq, Ajith, Rishi, Kaushal, Parth, Sengupta, Karthik. (2024). RetailGPT: A Fine-Tuned LLM Architecture for Customer Experience and Sales Optimization. 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS).
[wu2024longmemeval] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. The Thirteenth International Conference on Learning Representations.
[jin2024long] Jin, Bowen, Yoon, Jinsung, Han, Jiawei, Arik, Sercan O. (2024). Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG. ArXiv preprint.
[liu2024lost] Liu, Nelson F., Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni, Fabio, Liang, Percy. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00638.
[gao2023retrieval] Gao, Yunfan, Xiong, Yun, Gao, Xinyu, Jia, Kangxiang, Pan, Jinliu, Bi, Yuxi, Dai, Yi, Sun, Jiawei, Wang, Haofen. (2023). Retrieval-augmented generation for large language models: A survey. ArXiv preprint.
[li2024hello] Li, Hao, Yang, Chenghao, Zhang, An, Deng, Yang, Wang, Xiang, Chua, Tat-Seng. (2025). Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2025.naacl-long.272.
[dong2024can] Dong, Yijiang River, Hu, Tiancheng, Collier, Nigel. (2024). Can LLM be a Personalized Judge?. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.592.
[zhong2024memorybank] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang. (2024). MemoryBank: Enhancing Large Language Models with Long-Term Memory. Thirty-Eighth AAAI Conference on Artificial Intelligence. doi:10.1609/AAAI.V38I17.29946.
[wen2024ai] Wen, Qingsong, Liang, Jing, Sierra, Carles, Luckin, Rose, Tong, Richard, Liu, Zitao, Cui, Peng, Tang, Jiliang. (2024). AI for Education (AI4EDU): Advancing Personalized Education with LLM and Adaptive Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3637528.3671498.
[williams1981process] Michael David Williams, James D. Hollan. (1981). The process of retrieval from very long-term memory. Cognitive Science.
[whittaker2002managing] Whittaker, S., Jones, Q., Terveen, L.. (2002). Managing long term communications: conversation and contact management. Proceedings of the 35th Annual Hawaii International Conference on System Sciences. doi:10.1109/HICSS.2002.994063.
[li2024personalization] Li, Wei, Zhang, Hao. (2024). Personalization in Large Language Models: Techniques and Applications. Journal of Artificial Intelligence Research.
[wang2023personalized] Wang, Li, Chen, Ming. (2023). Personalized Language Models through Meta-Learning. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
[chen2024meta] Chen, Yu, Liu, Xiaoming. (2024). Meta-Learning for Few-Shot Personalization of Language Models. Proceedings of the 2024 International Conference on Learning Representations.
[ouyang2022training] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
[gao2024learning] Gao, Ling, Sun, Wei. (2024). Learning Personalized Preferences for Language Generation. Proceedings of the 2024 Annual Meeting of the Association for Computational Linguistics.
[zhang2024personalization] Zhang, Zhehao, Rossi, Ryan A, Kveton, Branislav, Shao, Yijia, Yang, Diyi, Zamani, Hamed, Dernoncourt, Franck, Barrow, Joe, Yu, Tong, Kim, Sungchul, others. (2024). Personalization of large language models: A survey. ArXiv preprint.
[radford2019language] Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya, others. (2019). Language models are unsupervised multitask learners. OpenAI blog.
[brown2020language] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
[raffel2020exploring] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res..
[lewis2020retrieval] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
[xu-etal-2022-beyond] Xu, Jing, Szlam, Arthur, Weston, Jason. (2022). Beyond Goldfish Memory: Long-Term Open-Domain Conversation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.356.
[agarwal2023graph] Agarwal, Vibhor, Young, Anthony P, Joglekar, Sagar, Sastry, Nishanth. (2023). A graph-based context-aware model to understand online conversations. ACM Transactions on the Web.
[murre2015replication] Murre, Jaap MJ, Dros, Joeri. (2015). Replication and analysis of Ebbinghaus’ forgetting curve. PloS one.
[gumbel1948statistical] Gumbel, E.J.. (1954). Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures.
[maddison2014sampling] Maddison, Chris J, Tarlow, Daniel, Minka, Tom. (2014). A* Sampling. Advances in Neural Information Processing Systems.
[jang2016categorical] Eric Jang, Shixiang Gu, Ben Poole. (2017). Categorical Reparameterization with Gumbel-Softmax. 5th International Conference on Learning Representations (ICLR 2017).
[buchmann-etal-2024-attribute] Buchmann, Jan, Liu, Xiao, Gurevych, Iryna. (2024). Attribute or Abstain: Large Language Models as Long Document Assistants. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2024.emnlp-main.463.
[izacard2021unsupervised] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave. (2022). Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research.
[yang2024qwen2] Yang, An, Yang, Baosong, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Zhou, Chang, Li, Chengpeng, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, others. (2024). Qwen2 technical report. ArXiv preprint.
[douze2024faiss] Douze, Matthijs, Guzhva, Alexandr, Deng, Chengqi, Johnson, Jeff, Szilvasy, Gergely, Mazaré, Pierre-Emmanuel, et al. (2024). The Faiss library. ArXiv preprint.
[pansecom] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao. (2025). SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents. The Thirteenth International Conference on Learning Representations.
[shi2023large] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. International Conference on Machine Learning (ICML 2023).
[salemi2024optimization] Salemi, Alireza, Kallumadi, Surya, Zamani, Hamed. (2024). Optimization methods for personalizing large language models through retrieval augmentation. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[jiang2024longrag] Jiang, Ziyan, Ma, Xueguang, Chen, Wenhu. (2024). Longrag: Enhancing retrieval-augmented generation with long-context llms. ArXiv preprint.
[wang2024biorag] Wang, Chengrui, Long, Qingqing, Meng, Xiao, Cai, Xunxin, Wu, Chengjun, Meng, Zhen, Wang, Xuezhi, Zhou, Yuanchun. (2024). BioRAG: A RAG-LLM Framework for Biological Question Reasoning. ArXiv preprint.
[asaiself] Asai, Akari, Wu, Zeqiu, Wang, Yizhong, Sil, Avirup, Hajishirzi, Hannaneh. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. The Twelfth International Conference on Learning Representations.
[ampazis2024improving] Ampazis, Nicholas. (2024). Improving RAG Quality for Large Language Models with Topic-Enhanced Reranking. IFIP International Conference on Artificial Intelligence Applications and Innovations.
[li2024dalk] Li, Dawei, Yang, Shu, Tan, Zhen, Baik, Jae Young, Yun, Sukwon, Lee, Joseph, Chacko, Aaron, Hou, Bojian, Duong-Tran, Duy, Ding, Ying, others. (2024). DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature. ArXiv preprint.
[li2024exploring] Li, Dawei, Tan, Zhen, Liu, Huan. (2024). Exploring large language models for feature selection: A data-centric perspective. ArXiv preprint.
[pipitone2024legalbench] Pipitone, Nicholas, Alami, Ghita Houir. (2024). LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain. ArXiv preprint.
[banerjee2005meteor] Banerjee, Satanjeev, Lavie, Alon. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
[izacard2021towards] Izacard, Gautier, Caron, Mathilde, Hosseini, Lucas, Riedel, Sebastian, Bojanowski, Piotr, Joulin, Armand, Grave, Edouard. (2021). Towards unsupervised dense information retrieval with contrastive learning. ArXiv preprint.
[hsu2024calm] Hsu, I, Wang, Zifeng, Le, Long T, Miculicich, Lesly, Peng, Nanyun, Lee, Chen-Yu, Pfister, Tomas, others. (2024). CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation. ArXiv preprint.
[ye2024effective] Ye, Xi, Sun, Ruoxi, Arik, Sercan, Pfister, Tomas. (2024). Effective Large Language Model Adaptation for Improved Grounding and Citation Generation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
[buchmann2024attribute] Buchmann, Jan, Liu, Xiao, Gurevych, Iryna. (2024). Attribute or Abstain: Large Language Models as Long Document Assistants. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
[kim2024theanine] Ong, Kai Tzu-iunn, Kim, Namyoung, Gwak, Minju, Chae, Hyungjoo, Kwon, Taeyoon, Jo, Yohan, Hwang, Seung-won, Lee, Dongha, Yeo, Jinyoung. (2025). Towards Lifelong Dialogue Agents via Timeline-based Memory Management. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2025.naacl-long.435.
[mendoncca2024benchmarking] Mendonça, John, Lavie, Alon, Trancoso, Isabel. (2024). On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation. Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024).
[lee2023prompted] Lee, Gibbeum, Hartmann, Volker, Park, Jongho, Papailiopoulos, Dimitris, Lee, Kangwook. (2023). Prompted LLMs as Chatbot Modules for Long Open-domain Conversation. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.277.
[maharana2024evaluating] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.747.
[lu2023memochat] Lu, Junru, An, Siyu, Lin, Mingbao, Pergola, Gabriele, He, Yulan, Yin, Di, Sun, Xing, Wu, Yunsheng. (2023). Memochat: Tuning llms to use memos for consistent long-range open-domain conversation. ArXiv preprint.
[wang2023recursively] Wang, Qingyue, Ding, Liang, Cao, Yanan, Tian, Zhiliang, Wang, Shi, Tao, Dacheng, Guo, Li. (2023). Recursively summarizing enables long-term dialogue memory in large language models. ArXiv preprint.
[bae2022keep] Bae, Sanghwan, Kwak, Donghyun, Kang, Soyoung, Lee, Min Young, Kim, Sungdong, Jeong, Yuin, Kim, Hyeri, Lee, Sang-Woo, Park, Woomyoung, Sung, Nako. (2022). Keep Me Updated! Memory Management in Long-term Conversations. Findings of the Association for Computational Linguistics: EMNLP 2022. doi:10.18653/v1/2022.findings-emnlp.276.
[pei2021cooperative] Jiahuan Pei, Pengjie Ren, Maarten de Rijke. (2021). A Cooperative Memory Network for Personalized Task-oriented Dialogue Systems with Incomplete User Profiles. The Web Conference 2021 (WWW '21). doi:10.1145/3442381.3449843.
[zhang2024spar] Zhang, Chiyu, Sun, Yifei, Chen, Jun, Lei, Jie, Abdul-Mageed, Muhammad, Wang, Sinong, Jin, Rong, Park, Sem, Yao, Ning, Long, Bo. (2024). SPAR: Personalized Content-Based Recommendation via Long Engagement Attention. ArXiv preprint.
[li2024scbench] Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu. (2025). SCBench: A KV Cache-Centric Analysis of Long-Context Methods. The Thirteenth International Conference on Learning Representations.
[zheng2025dape] Zheng, Chuanyang, Gao, Yihang, Shi, Han, Huang, Minbin, Li, Jingyao, Xiong, Jing, Ren, Xiaozhe, Ng, Michael, Jiang, Xin, Li, Zhenguo, Li, Yu. (2024). DAPE: Data-Adaptive Positional Encoding for Length Extrapolation. Advances in Neural Information Processing Systems.
[jiang2024retrieve] Jiang, Zhouyu, Sun, Mengshu, Liang, Lei, Zhang, Zhiqiang. (2025). Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach. Companion Proceedings of the ACM on Web Conference 2025. doi:10.1145/3701716.3716889.
[li2024alr] Li, Huayang, Verga, Pat, Sen, Priyanka, Yang, Bowen, Viswanathan, Vijay, Lewis, Patrick, Watanabe, Taro, Su, Yixuan. (2024). ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering. ArXiv preprint.
[liu2025chunkkv] Liu, Xiang, Tang, Zhenheng, Dong, Peijie, Li, Zeyu, Li, Bo, Hu, Xuming, Chu, Xiaowen. (2025). ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. ArXiv preprint.
[zhao2023length] Zhao, Liang, Feng, Xiachong, Feng, Xiaocheng, Zhong, Weihong, Xu, Dongliang, Yang, Qing, Liu, Hongtao, Qin, Bing, Liu, Ting. (2024). Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.582.
[liu2023ring] Hao Liu, Matei Zaharia, Pieter Abbeel. (2024). RingAttention with Blockwise Transformers for Near-Infinite Context. The Twelfth International Conference on Learning Representations.
[Zhang2020BERTScore] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi. (2020). BERTScore: Evaluating Text Generation with BERT. 8th International Conference on Learning Representations (ICLR 2020).
[williams1992simple] Williams, Ronald J.. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn.. doi:10.1007/BF00992696.
[zhang2024jasper] Zhang, Dun, Li, Jiacheng, Zeng, Ziyang, Wang, Fulong. (2024). Jasper and Stella: distillation of SOTA embedding models. ArXiv preprint.
[li2023towards] Li, Zehan, Zhang, Xin, Zhang, Yanzhao, Long, Dingkun, Xie, Pengjun, Zhang, Meishan. (2023). Towards general text embeddings with multi-stage contrastive learning. ArXiv preprint.
[mccloskey1989catastrophic] Michael McCloskey, Neal J. Cohen. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. doi:10.1016/S0079-7421(08)60536-8.
[kenthapadi2024grounding] Kenthapadi, Krishnaram, Sameki, Mehrnoosh, Taly, Ankur. (2024). Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey). Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3637528.3671467.
[litman2002designing] Litman, Diane J, Pan, Shimei. (2002). Designing and evaluating an adaptive spoken dialogue system. User Modeling and User-Adapted Interaction.
[weizenbaum1966eliza] Weizenbaum, Joseph. (1966). ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM. doi:10.1145/365153.365168.
[walker1997paradise] Walker, Marilyn A., Litman, Diane J., Kamm, Candace A., Abella, Alicia. (1997). PARADISE: A Framework for Evaluating Spoken Dialogue Agents. 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics. doi:10.3115/976909.979652.
[wang2024agentworkflowmemory] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig. (2025). Agent Workflow Memory. Forty-second International Conference on Machine Learning.
[xu2025mem] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.
[chhikara2025mem0buildingproductionreadyai] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.
[rasmussen2025zep] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory.
[packer2024memgptllmsoperatingsystems] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez. (2024). MemGPT: Towards LLMs as Operating Systems.