
Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents

Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, Mariya Toneva

Abstract

As Large Language Models (LLMs) evolve from text-completion tools into fully fledged agents operating in dynamic environments, they must address the challenge of continuous learning and long-term knowledge retention. Many biological systems solve these challenges with episodic memory, which supports single-shot learning of instance-specific contexts. Inspired by this, we present a framework for LLM agents, centered around five key properties of episodic memory that underlie adaptive and context-sensitive behavior. With various research efforts already covering portions of these properties, this position paper argues that now is the right time for an explicit, integrated focus on episodic memory to catalyze the development of long-term agents. To this end, we outline a roadmap that unites several research directions under the goal of supporting all five properties of episodic memory for more efficient long-term LLM agents.

Mathis Pink 1 Qinyuan Wu 1 Vy Ai Vo 2 Javier Turek 3 Jianing Mu 4 Alexander Huth 4 Mariya Toneva 1

Introduction

Large Language Models (LLMs) are rapidly expanding beyond their origins as text-completion engines. Instead, they are evolving into agentic systems capable of taking meaningful actions in complex environments (Xi et al., 2023). This transformation can enable a range of real-world applications, including autonomous research assistance (Schmidgall et al., 2025), aiding in literature reviews, data analysis, and hypothesis generation; personalized customer support (Li et al., 2024b), where they can recall prior interactions to provide consistent and tailored assistance; and interactive tutoring systems (Lin et al., 2023), which track learning progress, and revisit challenging concepts to ensure effective and personalized education. These diverse applications hint at a vast potential of LLMs to enable intelligent agents capable of meaningful and context-sensitive interaction.

• Equal contribution. 1 Max Planck Institute for Software Systems 2 3 EarthDynamics.ai 4 University of Texas at Austin. Correspondence to: Mathis Pink <mpink@mpi-sws.org>.

Operating and reasoning over extended timescales in dynamic interactive contexts demands that an agent not only recalls what happened, but also when, how, why, and involving whom. Such rich traces of past events, motivations, and outcomes form the basis of context-sensitive behavior, which is especially crucial in large-scale projects involving human stakeholders and multiple actors. For example, a future long-term LLM agent assisting in the ongoing development of a massive software project such as Linux (which has spanned decades, encompasses over 40 million lines of code, and additionally involves countless past contributions, issues, comments, notes, and feature requests) would need to continuously integrate and reason about a vast, evolving historical context while adapting to new requirements. Core necessities for this kind of system are a constant computational cost per new token and stable or improving performance over time.

Ongoing research directions attack the problem of long-term retention and adaptation from different angles and have made impressive progress. However, we still lack approaches that maintain relevant contextualized information over long time frames at a constant cost without degrading performance; these are necessities for the widespread adoption of LLM agents in many long-term settings.

Meanwhile, many biological systems meet the demands of acting in a continually evolving environment with a dedicated memory system that allows for both fast and slow learning: episodic memory (McClelland et al., 1995; Schwartz & Evans, 2001; O'Reilly & Norman, 2002; O'Reilly et al., 2014; Kumaran et al., 2016; Liao & Losonczy, 2024). In this position paper, we argue that the growing demand for LLM agents to operate effectively over extended timescales, alongside ongoing advances in long-context models, external memory systems, and efficient fine-tuning methods, makes episodic memory a timely framework to unify efforts for enabling truly long-term LLM agents.

To lay out the argument for this position, we proceed as follows: In Section 2, we operationalize the concept of episodic memory for LLM agents by highlighting five key properties that distinguish it from other biological types of memory that are also desirable for LLM agents. We proceed to argue in Section 3 for episodic memory as a unifying goal

Figure 1. LLM-Agents with an Episodic Memory system. The LLM agent acts on and gets feedback from an environment. Feedback can come in the form of outputs from programs (E1), from other agents (E2), humans (E3), as well as external real-world data (E4). Actions can modify parts of the environment, and provide feedback for humans or other agents in the environment. Within the agent, an external memory system acts as a bridge between parametric and in-context memory while allowing for fast encoding of and retrieval into in-context memory (the LLM's context window). (a) Consolidation: Episodes in the external memory are consolidated into a model's broader parametric memory to avoid capacity limitations and allow for generalization to new semantic knowledge and procedural skills based on specific instances. (b) Encoding: Limited in-context memory can offload its content into external memory. (c) Retrieval: Stored episodes can later be retrieved and used to reinstate representations into in-context memory.


by showing how various existing approaches to improve LLM memory target different properties that are united in episodic memory. In Section 4, we highlight how unifying these threads under a common goal can spur more holistic progress and outline a roadmap toward implementing episodic memory. Lastly, in Section 5, we discuss alternative views under which episodic memory would not be necessary for long-term LLM agents.

Operationalizing EM for LLMs

To transfer the concept of episodic memory from cognitive science to the context of LLM agents, we highlight five properties of episodic memory that are useful for LLM agents, and that distinguish episodic memory from other memory types in animals and humans. These five properties naturally cluster into two categories: properties that concern the way the system operates with the memory, namely long-term storage, explicit reasoning, and single-shot learning, and properties that concern the content of the stored memory, namely instance-specific and contextualized memories. We first discuss how the combination of these five properties distinguishes episodic memory from other types of memories in animals and humans, and then detail each property and its utility for LLM agents.

Unique Combination of EM Properties

Episodic memory is one of multiple memory systems that exist in animals and humans, distinguished by its unique combination of properties (Table 1). Other biological memory systems that share some, but not all, properties of episodic memory are 1) procedural memory (Milner, 1962; Cohen & Squire, 1980), which allows for long-term storage of memories for implicit operations or task behaviors, such as producing a sequence of a particular type, rather than reasoning about the sequence; 2) semantic memory (Collins & Quillian, 1969; Tulving, 1972), which allows for long-term storage of factual knowledge and explicit reasoning with these stored memories, but lacks specificity to single instances of acquired information and its context; and 3) working memory (Baddeley, 1986; Baddeley & Hitch, 1974), which can share many of the highlighted properties of episodic memory except for the important fact that it does not allow for long-term storage. The unique combination of important properties in episodic memory makes it a promising candidate for translation to AI systems.

Importance of EM Properties for LLM Agents

Episodic Memory Operations

Long-term storage. In humans and other animals, episodic memory functions as a form of long-term memory, capable

of storing knowledge throughout an individual's lifetime (Conway, 2001; Mayes & Roberts, 2001; Squire & Zola, 1996; Hampton & Schwartz, 2004). This distinguishes it from working memory, which is transient. For LLM agents, an effective episodic memory system must similarly support memory retrieval across any number of tokens. This requires mechanisms for long-term memory that maintain an agent's performance throughout a continual interaction with an environment. An adaptive long-term agent should not only prevent a degradation in performance over time; it should also be able to improve by learning new general knowledge and skills.

Explicit reasoning. In classical theories of human memory, episodic memory is described as a subset of declarative or explicit memory (Squire & Zola, 1996; Hampton & Schwartz, 2004). A defining feature of explicit memory is the ability to reflect and reason about the memory content. In the context of LLM agents, the explicitness of memory is necessary as agents need to be able to answer direct queries about stored information or use this information in explicit internal reasoning processes.

Single-shot learning. A key characteristic of episodic memory, as emphasized in complementary learning systems theory, is its ability to be acquired based on a single exposure (Liao & Losonczy, 2024; Schwartz & Evans, 2001; O'Reilly & Norman, 2002; O'Reilly et al., 2014; McClelland et al., 1995; Kumaran et al., 2016; Das et al., 2024b). This fast learning enables the rapid encoding of unique experiences or events. For LLM agents, this capability is particularly crucial in environments where continual deployment may not provide multiple variations or repetitions of specific events. Certain occurrences in an environment may happen only once, necessitating an episodic memory system that is capable of effectively capturing and utilizing information from single exposures.

Episodic Memory Content

Instance-specific memories. Episodic memory stores information specific to an individual sequence of events along with their distinct temporal contexts (Sugar & Moser, 2019; Colgin et al., 2008). This specificity allows episodic memory to capture details unique to a particular occurrence, enabling its application in agentic environments where reasoning about specific past actions and their consequences matters. This can include past lines of reasoning that were associated with a decision to be made by an LLM agent.

Contextual memories. Episodic memory binds context to its memory content, such as when, where, and why an event was encountered (Eichenbaum & Cohen, 2014; O'Keefe & Nadel, 1978; Eichenbaum, 2015). The ability to store many contextual relations associated with a specific event enables retrieval based on contextual cues as well as explicit recall

Table 2. Methods for in-context, external, and parametric memory do not cover all features of episodic memory. ∼ is used for cases where it is unclear whether an aspect of episodic memory is properly satisfied by a method.

of context. For LLM agents, this property is important to not only remember that a specific event happened in the past, but also when, why, and in which broader context it happened.
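To make the binding of content to context concrete, an episodic record might be sketched as a simple typed structure with explicit contextual fields. The record layout and the `recall` helper below are hypothetical illustrations, not a design proposed in this paper:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch: an episode binds what happened to its context
# (when, where, why), so it can later be retrieved by any of these cues.
@dataclass
class Episode:
    content: str            # what happened (e.g., an agent action and its outcome)
    when: datetime          # temporal context
    where: str              # situational context (e.g., project, file, conversation)
    why: str                # goal or motivation active at encoding time
    tags: list = field(default_factory=list)

def recall(store, **context_cues):
    """Return episodes whose context matches every given cue."""
    return [e for e in store
            if all(getattr(e, k) == v for k, v in context_cues.items())]

store = [
    Episode("merged a patch", datetime(2024, 5, 1, tzinfo=timezone.utc),
            "linux/fs", "fix data race"),
    Episode("reverted a commit", datetime(2024, 6, 2, tzinfo=timezone.utc),
            "linux/fs", "regression"),
]
hits = recall(store, where="linux/fs", why="regression")
```

Retrieval by an arbitrary subset of contextual cues is what distinguishes such a record from a bare fact store: the same event can be found via when, where, or why it occurred.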

Current Approaches

While many methods currently exist to modify and augment LLM memory, we argue that they fall short of the memory properties that would enable effective, long-term LLM agents. We group existing methods that seek to improve the memory of LLMs into three categories that are relevant to episodic memory:

  1. In-Context Memory methods extend the effective context length by optimizing computational efficiency and length generalization;
  2. External Memory methods augment a model's in-context memory capacity with a separate module, often with reduced GPU memory requirements and/or computational cost;
  3. Parametric Memory methods modify the LLM parameters that encode memories (primarily learned from the language modeling training data).

In this section, we discuss examples in each category that capture different properties of episodic memory (Table 2). More importantly, we highlight their shortcomings in supporting episodic memory for LLM agents in isolation.

In-Context Memory

In-context memory (ICM) allows LLMs to perform single-shot, instance-specific, and contextualized learning by enabling them to directly attend to representations of encountered sequences (Table 2). ICM capacity is either tightly

limited or extensible but expensive to scale, often requiring sequence parallelization. Recent works seek to extend it by increasing the context window, but models struggle with length generalization beyond training exposures. We review existing methods and their limitations in addressing these challenges.

One active research direction focuses on extending the in-context window to handle significantly longer sequences, enabling LLMs to perform reasoning over extended contexts. This advancement brings LLMs closer to mimicking episodic memory, as it allows models to retain and utilize information across longer contexts. However, transformer-based LLMs face significant challenges, including the high computational cost of processing long sequences and limitations in length generalization. Recent research has sought to address these challenges by reducing memory usage, optimizing inference time, and improving long-sequence generation. Despite these advancements, current methods have yet to achieve the robust, persistent memory capabilities necessary for long-term, open-ended, and context-aware reasoning. Below, we briefly review existing methods and their limitations.

Memory reduction. For transformer-based LLMs, several methods aim to reduce memory and computation costs.

Sparsification and compression methods selectively retain relevant information to optimize memory usage. Sparsification strategies optimize memory by restricting attention computations to the most relevant parts of the sequence (Lou et al., 2024), reducing both storage and computational overhead. Similarly, forgetting mechanisms remove less useful tokens to maintain efficiency (Anonymous, 2024). Other compression-based approaches dynamically reduce KV cache size by storing only the most important tokens and key-value pairs (Liu et al., 2023b; Ge et al., 2024; Tang et al., 2024). Adaptive strategies further refine compression across layers (Yang et al., 2024a; Nawrot et al., 2024) or merge similar states to minimize redundancy (Liu et al., 2024a).
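In spirit (though not in the exact form of any cited method), importance-based compression amounts to evicting cached key-value pairs with low accumulated attention mass. The `compress_kv` helper and its scoring rule below are illustrative assumptions:

```python
import numpy as np

def compress_kv(keys, values, attn_weights, k):
    """Keep only the k cached tokens with the highest accumulated
    attention mass; all other entries are evicted (irreversibly).

    attn_weights: (num_queries, num_cached_tokens) attention matrix.
    """
    importance = attn_weights.sum(axis=0)        # total mass received per cached token
    keep = np.sort(np.argsort(importance)[-k:])  # top-k, preserving original order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
A = rng.random((3, 8))
K2, V2 = compress_kv(K, V, A, k=4)  # cache shrinks from 8 to 4 entries
```

Note that eviction is exactly the irreversible information loss discussed later: once a token's key-value pair is dropped, no future query can attend to it.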

Quantization methods reduce memory footprints by lowering precision or selectively storing information. Quantization techniques store key-value pairs at reduced precision (Liu et al., 2024c; Hooper et al., 2024; Yue et al., 2024; Duanmu et al., 2024), allowing for larger context windows with a relatively minimal performance degradation.
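A minimal sketch of the idea, assuming simple symmetric per-tensor int8 quantization rather than the more elaborate per-channel or mixed-precision schemes in the cited works:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization of a KV tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.default_rng(1).normal(size=(16, 64)).astype(np.float32)
q, s = quantize_int8(kv)
err = np.abs(dequantize(q, s) - kv).max()
# int8 storage is 4x smaller than float32; reconstruction error is
# bounded by half the quantization step.
```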

Inference time reduction. Efficiency improvements during inference focus on optimizing KV cache management and parallelization. Techniques such as paged caching (Kwon et al., 2023; Lee et al., 2024; Zheng et al., 2024) dynamically allocate memory to accommodate longer sequences without excessive overhead. Other methods leverage GPU memory pooling and adaptive chunking (Lin et al., 2024; Agrawal et al., 2024) to process extended contexts efficiently while maintaining fast retrieval and computation speeds. Other strategies improve efficiency by reusing KV tensors across layers (Ye et al., 2024a; Brandon et al., 2024).

Recent work has aimed to introduce episodic memory in LLMs by structuring token sequences into retrievable events (Fountas et al., 2024), enhancing long-context reasoning and outperforming retrieval-based models. However, a fundamental challenge remains the increasing memory and retrieval cost: maintaining the full KV cache for an entire interaction history can quickly become impractical, especially in large-scale, long-duration, and multimodal applications. This limitation is inherent to KV-cache management systems, which must retain the entire cache, leading to significant storage and computational overhead.

Transformer alternatives to reduce both memory and inference time. In addition to optimizing KV cache storage and management, alternative architectures have been proposed to address the limitations of standard transformers in both memory and inference time.

Linear attention (Li et al., 2020; Katharopoulos et al., 2020) approximates full self-attention using kernel-based or low-rank transformations, significantly reducing computational complexity and improving efficiency for long-sequence processing. State-space models (SSMs) (Peng et al., 2023; Gu & Dao, 2023) further achieve linear scaling for sequence handling by maintaining a fixed-size representation, making them inherently memory-efficient. Hybrid architectures (Goldstein et al., 2024) combine these techniques with transformers to compress KV-cache sizes while preserving strong performance. Other alternatives restructure the transformer architecture itself to enhance efficiency. Some models (Sun et al., 2024a) modify the decoder structure to reduce memory usage and latency, while others (Pang et al., 2024) compress sequence information into compact representations to improve inference speed and scalability.
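The core trick of kernel-based linear attention, replacing the quadratic attention matrix with running sums of constant size, can be sketched as follows. This is a simplified illustration using the elu(x)+1 feature map from Katharopoulos et al. (2020), not a faithful reimplementation of any cited model:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map used in kernel-based linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Causal linear attention in O(n): the model state is two running
    sums of fixed size, never the full n x n attention matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d_k)          # running sum of phi(k_t) for normalization
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(5, 3)) for _ in range(3))
O = linear_attention(Q, K, V)
```

The fixed-size state (S, z) is exactly what makes the cost per token constant, and also why such models must compress, rather than retain, an ever-growing history.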

These methods enhance ICM efficiency, but their reliance on compression, approximation, and selective retention comes with limited support for long-term reasoning with episodic memory. This limitation highlights the need for an external memory structure that retains past information. Methods of KV-cache optimization can also discard older context, leading to irreversible information loss and different model behavior (Kirsten et al., 2024). Generally, methods with a constant cost, like SSMs, struggle to handle a continually expanding interaction history in dynamic environments, while methods with an increasing state representation increase in both inference time and GPU memory requirements.

Length Generalization. Length generalization refers to a model's ability to maintain understanding over long sequences, preventing degradation of performance such as

forgetting or losing context midway through processing (Liu et al., 2024b). In essence, humans avoid SSMs' trade-offs by storing compressed representations and retrieving knowledge adaptively, allowing them to manage expanding information effortlessly.

To address this, lightweight solutions (Yen et al., 2024; Xiao et al., 2024) create adapters to process and retrieve long inputs before passing the content to the LLM. Other approaches (Han et al., 2024) refine attention patterns and positional encodings to enhance long-context comprehension. Alternative architectures (Ye et al., 2024b; Dai et al., 2019) improve long-context learning through mechanisms like differential attention and segment-level recurrence. Another promising approach embeds test-time information into the model's parameters, creating a form of long-term memory (Sun et al., 2024b; Behrouz et al., 2024); these methods combine attention with neural memory modules, enabling adaptability to long contexts at the cost of increased inference overhead. All of these approaches have limited capacity and still face eventual forgetting over very long sequences.

External Memory

Many methods propose a separate memory module that stores information when it exceeds the effective operating span of the model. These augmented memory models are usually evaluated on tasks which require using that stored information. As such, these methods typically have long-term and explicit memory (Table 2). However, they often lack information that relates the stored memories to one another, especially contextual details on how the model acquired the memory, or details to help differentiate specific instances. They are typically not evaluated for single-shot learning, especially for specific instances. And finally, there is a lack of proposals to generalize information from these instances and update parametric memory (Figure 1a). Below we review some relevant external memory methods and elaborate on key examples to illustrate these shortcomings.

Slot-based memory with recurrent controllers. A key advance in memory augmentation in the pre-transformer era was the formulation of learnable memory modules external to the main neural network (Bordes et al., 2015; Graves et al., 2014; Sukhbaatar et al., 2015). External memories were stored in individual slots and updated via a recurrent memory controller. These models were shown to retain longer-term information than vanilla long short-term memory (LSTM) networks. Similar memory augmentation methods have been adapted for transformers (Wu et al., 2022a). However, these methods lack a way to store contextual details that LLM agents would need in an episodic memory, as they strongly depend on the details available in the input data. One exception devised a method to record temporal relationships between memories (Graves et al., 2016), but this has yet to be seen in augmented LLMs.
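A minimal sketch of slot-based, content-addressed memory in this spirit may help; the class below and its round-robin overwrite policy are illustrative assumptions, not the mechanism of any cited model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SlotMemory:
    """Content-addressed slot memory in the spirit of memory networks:
    a read is an attention-weighted sum over a fixed number of slots."""
    def __init__(self, n_slots, dim):
        self.M = np.zeros((n_slots, dim))
        self.next = 0

    def write(self, v):
        # Capacity-limited: once all slots are used, old entries are overwritten.
        self.M[self.next % len(self.M)] = v
        self.next += 1

    def read(self, query):
        w = softmax(self.M @ query)   # content-based addressing
        return w @ self.M

mem = SlotMemory(n_slots=4, dim=3)
for i in range(6):                    # 6 writes into 4 slots: the oldest two are lost
    v = np.zeros(3)
    v[i % 3] = 1.0
    mem.write(v)
r = mem.read(np.array([1.0, 0.0, 0.0]))
```

The overwrite in `write` makes the capacity limit discussed below explicit: storage duration is bounded by the number of slots, regardless of how important an old memory was.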

Distributed vs. slot memory. An issue with slot-based memory modules is that they are capacity-limited, both by the number of slots and the dimensionality of each slot representation. While these models adopt forgetting mechanisms to mitigate this, the capacity limit still affects how long memories can be stored. Another approach addresses this downside by storing external memories in a sparse, distributed fashion (Wu et al., 2018) instead of in slots. Recent work (Das et al., 2024a) integrated distributed memory in an LLM, and showed that the model can recall a greater number of facts over longer contexts, compared to baseline LLMs. While they demonstrate how they can perform one-shot memory updates (fact-editing), they do not evaluate single-shot learning of novel facts.

RAG and GraphRAG methods. Retrieval Augmented Generation (RAG) methods maintain an external database of information that is added to the input data to augment LLM generation. Naive RAG implementations encode chunks of text using embedding models (Gao et al., 2023), typically without much metadata or contextual detail about the original text. (One exception is work that preserves the order of retrieved text from the database (Yu et al., 2024).) And while text embedding models can capture some similarity relationships between embeddings, they do not encompass the rich set of relationships that LLM agents will likely need for most applications. GraphRAG models replace the vector embedding database with a structured graph that explicitly encodes relationships as connections between nodes (Peng et al., 2024). Still, these graphs encode a limited number of relationship types, even when researchers branch out beyond pre-existing datasets and learn to build the graphs directly from the input text (Li et al., 2024a; Edge et al., 2024; Gutiérrez et al., 2025). As such, they also lack rich contextual detail.
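Stripped to its essentials, a naive RAG retrieval step is nearest-neighbor search over chunk embeddings. In this sketch the `embed` function is a deterministic stand-in for a learned embedding model; a real system would use one, and similarity between different texts would then be meaningful:

```python
import numpy as np

def embed(text, dim=32):
    """Stand-in embedding: a hash-seeded random unit vector.
    Identical texts map to identical vectors; a learned model
    would also place related texts nearby."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks by cosine similarity to the query."""
    q = embed(query)
    sims = [float(q @ embed(c)) for c in chunks]
    order = np.argsort(sims)[::-1][:top_k]
    return [chunks[i] for i in order]

chunks = ["the cache was flushed at startup",
          "episodic memory binds events to context",
          "gradient descent minimizes the loss"]
hit = retrieve("episodic memory binds events to context", chunks)[0]
```

Note what is absent from this pipeline: nothing records when, why, or in what situation a chunk was encountered, which is exactly the contextual detail the text above identifies as missing.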

External storage of past LLM inputs and outputs. Another type of approach maintains a database of past LLM inputs to avoid recomputing predictions for similar future inputs (Wu et al., 2022b; Khandelwal et al., 2020; Yogatama et al., 2021). Here, contextual information (e.g., details that differentiate specific instances) will only be stored when explicitly given in the LLM input text. That is, the memory is much more dependent on input data, limiting test-time generalization. One proposal to mitigate this formulates a long-term memory module for context that is updated with LLM activations based on the current inputs (Behrouz et al., 2024). Other approaches additionally store LLM outputs, such as generated text (Cheng et al., 2023), summarizations (Wang et al., 2024a; Lee et al., 2023), chain-of-thought steps (Liu et al., 2023a; Lu et al., 2023), and extracted relation triples (Modarressi et al., 2025). One approach specialized for chat interactions stores timestamps and user personality

profiles as context (Zhong et al., 2024). These modifications enable storage of contextual details useful for LLM agents. However, specifying the type of contextual detail is restrictive, so it is preferable to combine this with a more learnable and flexible mechanism for storing context.

Learning to interact with external memory. The approaches described above may fine-tune or instruct the LLM to interact with and update external memory. That is, the LLM learns the functions of a memory controller. For example, several RAG approaches fine-tune the LLM to make better use of the retrieved content (Gao et al., 2023). Other approaches define how LLMs should interact with memory, requiring them to learn specific API calls (Modarressi et al., 2025) or memory hierarchies (Packer et al., 2024). These provide possible mechanisms to add information to external memory, such as contextual details and specific instances. However, most current work does not consider how to modify the LLM to generalize across specific instances to store new knowledge in LLM parameters (Figure 1a). Behrouz et al. (2024) propose one way to generalize across instances by adding a data-independent memory system (a.k.a. meta-memory, persistent memory) in addition to a more data-dependent memory module. However, the data-independent memory is considered to be closer to task memory than knowledge distillation, and the meta-memory parameters are kept separate from the LLM itself.

Parametric Memory

This type of memory allows LLMs to process the information in the input and produce well-suited outputs. Parametric memory values are initially learned through backpropagation with a pretraining dataset. During this process, the parametric memory tends to capture general knowledge and rules ranging from syntax to common sense and factual knowledge. Due to the sheer size of the parametric memory, the amount of data needed for pre-training is usually very large, following power laws (Kaplan et al., 2020). Generally, parametric memory is fixed after training, i.e., it does not change with the input at inference time.

A relevant research direction in parametric memory focuses on adapting LLM parameters to specific domains, tasks, or applications when given limited resources. Efficient fine-tuning methods have been developed in recent years to tackle the runtime and memory consumption of this process. Alternatively, distillation techniques have been proposed to update knowledge and propagate it through a model. A key challenge is the need for updating specific factual knowledge without interfering with other knowledge. Some facts may change over time, requiring surgical precision to update the parameters of a model. The line of work that proposes these updates is known as knowledge editing.

Efficient Fine-tuning. Various works have been proposed to reduce the computational needs (hardware memory) of updating a model to a specific domain. Among these, Low-Rank Adaptation (LoRA) (Hu et al., 2022) applies additive low-rank approximation updates to shift the model parameters. Several methods proposed other ways to further improve efficiency by reducing and localizing updates (Wang et al., 2024b; Valipour et al., 2022; Xu et al., 2021; Yin et al., 2024). Other work learned modifications on representations instead of parameters (Wu et al., 2024; Yin et al., 2024). In all cases, fine-tuning methods require a dataset to adapt a model for a specific task or domain, i.e., they are not capable of single-shot learning or capturing instance-specific and contextually rich information. On the other hand, additional fine-tuned adapter parameters are often frozen after the fine-tuning process, supporting the long-term storage of information. Moreover, these methods tend to preserve the reasoning capabilities while updating the model with newly captured information (Wu et al., 2024).
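The LoRA update itself is compact enough to sketch directly: the frozen weight W is augmented with a trainable low-rank product BA, and B is initialized to zero so the adapted model starts out identical to the base model. The dimensions below are arbitrary for illustration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Low-rank adapted linear layer: y = x W^T + alpha * x (B A)^T.
    Only A (r x d_in) and B (d_out x r) are trained; W stays frozen."""
    return x @ W.T + alpha * (x @ A.T) @ B.T

rng = np.random.default_rng(3)
d_in, d_out, r = 16, 8, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero at init
x = rng.normal(size=(4, d_in))
y0 = lora_forward(x, W, A, B)            # identical to the base layer at init
# Trainable parameters: r*(d_in + d_out) = 48, vs. d_in*d_out = 128
# for full fine-tuning of this layer.
```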

Knowledge Editing. As the environment evolves over time, some factual knowledge becomes outdated (e.g., the president of a country may change after an election). Knowledge editing methods aim to make modifications to the factual knowledge in parametric memory with targeted updates while avoiding interference with other facts. In ROME (Meng et al., 2023a) and MEMIT (Meng et al., 2023b), the first step is to find relevant parameters (in MLPs) that influence the specific fact through causal interventions, and then update the related parameters with low-rank model edits. An alternative research direction proposes to train a hypernetwork (Cao et al., 2021; Tan et al., 2024) that predicts the amount of change for each parameter given the knowledge to be edited. A different method, SERAC, stores the set of edits in an external memory, combined with a scope detector and a counterfactual model to decide when and how to apply the edits (Mitchell et al., 2022). All knowledge editing methods work on facts which are inherently context-free, making it impossible to contextualize the edited knowledge in the history of the agent-environment interaction. However, they mimic the episodic memory traits of learning from a single instance while enabling long-term retention.
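Stripped of the causal-tracing and key-covariance machinery that ROME uses to locate and condition the update, the core rank-one edit can be sketched as follows. This simplified version only enforces that the edited map sends the key to the new value; it is an illustration of the mechanism, not a faithful ROME implementation:

```python
import numpy as np

def rank_one_edit(W, k_star, v_star):
    """Minimal rank-one update so the edited map sends k_star to v_star.
    (ROME additionally whitens by key covariance to localize the edit;
    that is omitted here.)"""
    residual = v_star - W @ k_star
    return W + np.outer(residual, k_star) / (k_star @ k_star)

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 6))   # stand-in for an MLP projection matrix
k = rng.normal(size=6)        # key vector encoding the subject of the fact
v = rng.normal(size=6)        # value vector encoding the new fact
W_edit = rank_one_edit(W, k, v)
```

Because the update is rank-one along a single key direction, it changes the output for that key exactly while perturbing other directions as little as this construction allows, which is the intuition behind low interference with unrelated facts.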

The problem of knowledge editing has been extended to a continual learning setting, where edits are required sequentially over time to correct a model. This leads to the sequential editing problem: hypernetwork prediction quality decreases because the hypernetwork fails to reflect the updated model, and low-rank parameter updates interfere with one another, causing catastrophic forgetting (Gupta et al., 2024). MELO (Yu et al., 2023) adapts dynamic LoRA to this problem and introduces a vector database to search for the blocks to be dynamically activated within the LoRA matrices for each layer. WISE (Wang et al., 2024c) adds duplicates of the MLP's output parameters for some layers in the network, and updates them with each new edit set. A

routing mechanism decides whether to use the original layer or the updated one. It further uses sharding and merging (Yadav et al., 2023) to distribute the edits into random subspaces to improve generalization and parameter utilization.

While continual learning-based knowledge editing allows models to integrate updates over time, it has fundamental limitations. Edited knowledge often fails to generalize: models struggle to infer new relationships from an edit or to reason over multiple steps (Berglund et al., 2023; Yang et al., 2024b). This highlights a key challenge: knowledge editing methods can introduce updates but do not always ensure deeper understanding or adaptability.

Context Distillation. The idea behind these techniques is to transfer in-context learned information, abilities, and task understanding into model parameters via distillation. Snell et al. (2022) proposed a setting where the teacher and the student are the same model, but the student receives less in-context information. This enables the student to learn skills and express knowledge that would otherwise depend on placing information and examples in costly and limited in-context memory. Further, Padmanabhan et al. (2023) propose to exploit context distillation to inject and propagate knowledge through a model. The original model is provided with new definitions and generates continuations; the distillation process then updates a copy of the model on only the generated continuations, implicitly conditioning the updated model on the new entities. This propagates the information into the parameters (i.e., consolidates it) and thus improves inference involving those entities.
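The objective can be illustrated numerically. In this hypothetical sketch, the "teacher" distribution is the model conditioned on the full context and the "student" is the same model with the context stripped; the logits are made up, and a real implementation would backpropagate the gradient below through the student's parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up next-token logits over a 5-token vocabulary.
teacher_logits = np.array([2.0, 0.5, -1.0, 0.1, -0.5])  # conditioned on full context
student_logits = np.array([0.3, 0.2, 0.0, 0.1, 0.0])    # same model, context removed

p_t, p_s = softmax(teacher_logits), softmax(student_logits)

# Distillation loss: KL(teacher || student); its gradient with respect
# to the student logits is simply (p_s - p_t).
kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
grad_student_logits = p_s - p_t
```

Driving this loss to zero makes the context-free student reproduce the context-conditioned teacher, which is precisely the transfer these methods aim for.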

Episodic Memory as a Unifying Framework

Although current work has advanced context-sensitive LLMs that are capable of handling longer sequences, it does not yet deliver the efficient learning that could support long-term LLM agents. Existing methods, which extend in-context (working) memory, integrate external memory, or update parametric memory, address only subsets of episodic memory's five essential properties, as discussed in Section 3. These approaches remain fragmented, impeding the immediate assimilation of new experiences and gradual improvement over time.

We propose that enabling episodic memory offers a unifying perspective that will combine and extend existing methods to advance the capabilities of LLM agents. By incorporating long in-context memory, external memory, and mechanisms for updating parametric memory, agents can more seamlessly adapt to new information, consolidate it, and prevent escalating costs or performance degradation during extended interactions with an environment. This view is based on Complementary Learning Systems Theory (O'Reilly et al., 2014; Kumaran et al., 2016; Arani et al., 2022), in which episodic memory is part of a fast-learning system that stores information from individual instances. Over time, that information is consolidated into a slow-learning system that stores more stable, durable knowledge.

In Figure 1, we present a general architecture and framework that combines these elements under the overarching goal of enabling all five key features of episodic memory for LLM agents as detailed in Section 2. As a roadmap to enable episodic memory in LLM agents, we specifically call for four main research directions (encoding, retrieval, consolidation, and benchmarks), and formulate six research questions under these areas below.

Encoding

RQ1: How to store information from in-context memory in a long-term external memory store?

An external memory store is essential for retaining experience in a structured way that preserves the context of individual instances (Fig. 1, arrow (b)). A straightforward approach is to store text chunks or embeddings in a non-parametric RAG-like database, potentially augmented with metadata for context (Mombaerts et al., 2024). More structured representations, such as GraphRAG, could also facilitate context-sensitive retrieval. However, capacity constraints on these types of databases may make it necessary to rely on more compressed parametric representations.
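As a minimal illustration of such a store, the sketch below keeps unit-normalized embeddings next to free-form contextual metadata. The class name, toy vectors, and metadata fields are hypothetical; a real system would produce embeddings with a learned encoder.

```python
import numpy as np

class EpisodeStore:
    """Minimal non-parametric store: embeddings plus contextual metadata."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim))
        self.episodes = []  # raw text plus metadata such as time and source

    def add(self, text, embedding, **metadata):
        v = embedding / np.linalg.norm(embedding)
        self.vectors = np.vstack([self.vectors, v])
        self.episodes.append({"text": text, **metadata})

    def search(self, query_embedding, k=3):
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.vectors @ q  # cosine similarity against all episodes
        top = np.argsort(scores)[::-1][:k]
        return [(float(scores[i]), self.episodes[i]) for i in top]

store = EpisodeStore(dim=3)
store.add("fixed the login bug", np.array([1.0, 0.0, 0.0]), time="step 12")
store.add("refactored the parser", np.array([0.0, 1.0, 0.0]), time="step 87")
best_score, best_episode = store.search(np.array([0.9, 0.1, 0.0]), k=1)[0]
# → best_episode["text"] == "fixed the login bug"
```

Keeping metadata as open-ended key-value pairs, rather than a fixed schema, is one way to preserve the "when, where, and why" of each episode alongside its content.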

RQ2: How to segment continuous input into discrete episodes, and when to store them in an external memory?

A major design question is when and how to segment a continuous stream of agent experience into episodes to be encoded into an external memory. LLMs have already been shown to be capable of segmenting text into meaningful events in a way that is similar to humans (Michelmann et al., 2023), and recent approaches show that further bundling related segments based on model surprise can improve long-term modeling (Behrouz et al., 2024; Fountas et al., 2024).
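A simple version of surprise-based segmentation can be sketched as follows. The surprisal values here are made up; in practice they would be the model's own per-token negative log-probabilities, and the thresholding rule is one of many possible choices.

```python
import numpy as np

def segment_by_surprise(surprisals, z=1.5):
    """Place an event boundary wherever token surprisal spikes above a
    z-score threshold, a crude stand-in for model surprise."""
    s = np.asarray(surprisals, dtype=float)
    mu, sd = s.mean(), s.std() + 1e-8
    boundaries = [0]
    for i, x in enumerate(s):
        if (x - mu) / sd > z and i > boundaries[-1]:
            boundaries.append(i)
    boundaries.append(len(s))
    # Return (start, end) index pairs, one per episode.
    return list(zip(boundaries[:-1], boundaries[1:]))

episodes = segment_by_surprise([1, 1, 1, 6, 1, 1, 5, 1])  # → [(0, 3), (3, 8)]
```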

Leveraging long-context advances can further improve encoding by providing a space in which new episodes can be equipped with a rich contextualization. Large hidden states or extended attention windows help capture high-fidelity contextual information, which can then be encoded into an external memory in a compressed format for future retrieval.

Retrieval

RQ3: Given an external memory, how to select relevant past episodes for retrieval and reinstatement into in-context memory for the purpose of explicit reasoning?

To employ past experiences in current tasks, an agent must retrieve relevant episodes at the right time and reintegrate them into its in-context memory with an adequate mechanism (Fig. 1, arrow (c)). Common strategies include prepending retrieved text tokens to the input sequence (as in RAG), manipulating representational states within the transformer (e.g., memory tokens (Bulatov et al., 2022)), or adapting internal representations (Wu et al., 2024).
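The first of these strategies, prepending retrieved text, might look like the sketch below; the function name and prompt format are illustrative only.

```python
import numpy as np

def reinstate(query_vec, memory_vecs, memory_texts, prompt, k=2):
    """Prepend the k most similar stored episodes to the prompt (RAG-style).
    Alternative routes would inject memory tokens or adapt hidden states."""
    sims = memory_vecs @ query_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(sims)[::-1][:k]
    retrieved = "\n".join(memory_texts[i] for i in top)
    return f"Relevant past episodes:\n{retrieved}\n\n{prompt}"
```

A key open question is the "right time" part: the sketch retrieves on every call, whereas an agent would ideally learn when reinstatement is worth the added context length.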

RQ4: How can retrieval mechanisms in long-context LLMs improve and accelerate the optimization of external memory retrieval and reinstatement?

Long-context advances can be leveraged to inform when and what to retrieve at sequence lengths that are still feasible. Future research could explore tight integration of external memory with the model's forward pass (Berges et al., 2024) and adopt cross-architecture distillation (Wang et al., 2025) to accelerate the development of external memory structures that retain many of the desirable properties of in-context memory while reducing the resource cost.

Consolidation

RQ5: How to periodically consolidate external memory contents into the LLM's base parameters without forgetting previous knowledge?

Eventually, merging external memory contents into the model's parameters (Fig. 1, arrow (a)) promises to allow new generalized knowledge to be used without explicit retrieval. This process both prevents external memory overflow and supports continuous adaptation of the agent's semantic and procedural backbone to the environment. Relevant techniques include context distillation, parametric knowledge editing, and localized fine-tuning methods that capture newly encountered information without catastrophic interference with other knowledge. Open questions remain about how to decide when to consolidate and how to compress many episodic instances into more abstract parametric knowledge while also retaining previous knowledge and skills.
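A toy illustration of the replay principle behind consolidation: fine-tune on new episodes while rehearsing samples of old data so that prior knowledge is not overwritten. The linear model and the `consolidate` helper are stand-ins; in an LLM the same recipe would mix distilled episodic data with rehearsal data during parameter updates.

```python
import numpy as np

def consolidate(W, new_X, new_Y, replay_X, replay_Y, lr=0.1, steps=500):
    """Gradient-descend a linear map on new episodes mixed with replayed
    old examples, the classic recipe against catastrophic forgetting."""
    X = np.vstack([new_X, replay_X])
    Y = np.vstack([new_Y, replay_Y])
    for _ in range(steps):
        grad = X.T @ (X @ W - Y) / len(X)  # mean-squared-error gradient
        W = W - lr * grad
    return W
```

Dropping the replay terms from `X` and `Y` recovers plain fine-tuning, which fits the new episodes at the cost of drifting away from the old task, the failure mode consolidation is meant to avoid.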

Benchmarks

RQ6: What new types of benchmarks are needed to assess episodic memory in LLM agents?

Finally, evaluating episodic memory effectiveness requires new tasks and metrics. Studies should test the recall of contextualized events after long delays, assessing how well agents remember when, where, and how events occurred. An example of such a study is the test of instance-specific temporal order memory proposed by Pink et al. (2024). Beyond controlled probes, benchmarks must incorporate real-world complexities: agents should demonstrate improving task performance that is linked to the encoding, retrieval, and consolidation of past experiences over extended timescales.

Alternative Views

While we argue that an explicit episodic memory framework is necessary for effective long-term and context-sensitive behavior, alternative perspectives suggest that current or emerging methods might suffice in the future without guidance from the concept of episodic memory.

Scaling in-context memory will be sufficient. One view suggests that advances in long-context methods, such as improved transformers, state-space models, or other architectures with extended context windows, will enable practically unlimited access to past information. Proponents claim that better positional encodings, modified attention mechanisms, and other in-context memory extensions will cover most relevant applications for LLM-based agents.

Contextualized external memory will be sufficient. A second view holds that external memory structures, such as knowledge graphs or retrieval-augmented generation (RAG) systems, could eliminate the need for an episodic memory framework. By contextualizing data chunks and storing them in structured graphs, these systems aim to incorporate past context into current tasks effectively.

'Infinite' in-context memory remains a speculative prospect. Extending limited context windows to include all information needed by an agent requires foreknowledge of the maximum timespan of relevant information. For very long timespans, this will either incur prohibitive computational costs or require compression methods that may lose key details. Relying only on external memory will still incur high storage costs and require forgetting mechanisms. An episodic memory framework addresses these constraints by periodically consolidating information into high-capacity parametric memory (Figure 1, arrow (a)). This has the added benefit of enabling LLM agents to slowly improve over time, as they continue to learn from the past before they forget it.

Conclusion

This position paper argues that to fully realize efficient long-term LLM agents, we must endow LLM agents with episodic memory. We operationalize episodic memory, a term borrowed from cognitive science, for LLMs by highlighting five key characteristics that distinguish episodic memory from other types of memory in biological systems, and argue for why each property is also important for LLM agents. We position the call for episodic memory in LLM agents in the current literature and discuss how episodic memory can serve as a unifying goal for existing research directions. Lastly, we provide a roadmap of research questions towards implementing episodic memory in LLMs. By describing the potential of this research direction, we aim to spark a community-wide shift in how we conceive and engineer long-term memory in the move towards agentic AI: one that more deeply integrates lessons from cognitive science and brings together existing approaches in ML under a unifying goal with strong promise.


Table 1. Five key properties across biological memory systems (✓ = present, ✗ = absent).

Memory Type    Long-term   Explicit   Single-shot   Instance-specific   Contextual relations
Episodic           ✓           ✓           ✓               ✓                    ✓
Procedural         ✓           ✗           ✗               ✗                    ✗
Semantic           ✓           ✓           ✗               ✗                    ✗
Working            ✗           ✓           ✓               ✓                    ✓
Table 2. Episodic memory properties of current LLM memory approaches (✓ = supported, ~ = partial, ✗ = unsupported).

Memory Approach         Long-term   Explicit   Single-shot   Inst.-specific   Contextual rel.
KV-Compression              ✗           ✓           ✓              ✓                 ✓
State-space model           ✗           ✓           ✓              ✓                 ✓
RAG                         ✓           ✓           ~              ~                 ✗
GraphRAG                    ✓           ✓           ~              ~                 ~
Efficient Fine-tuning       ✓           ✓           ✗              ✗                 ✗
Knowledge Editing           ✓           ~           ✗              ~                 ✗
Context Distillation        ✓           ✓           ✗              ✓                 ✗


Large Language Models (LLMs) are rapidly expanding beyond their origins as text-completion engines. Instead, they are evolving into agentic systems capable of taking meaningful actions in complex environments (Xi et al., 2023). This transformation can enable a range of real-world applications, including autonomous research assistance (Schmidgall et al., 2025), aiding in literature reviews, data analysis, and hypothesis generation; personalized customer support (Li et al., 2024b), where they can recall prior interactions to provide consistent and tailored assistance; and interactive tutoring systems (Lin et al., 2023), which track learning progress, and revisit challenging concepts to ensure effective and personalized education. These diverse applications hint at a vast potential of LLMs to enable intelligent agents capable of meaningful and context-sensitive interaction.

Operating and reasoning over extended timescales in dynamic interactive contexts demands that an agent not only recalls what happened, but also when, how, why, and involving whom. Such rich traces of past events, motivations, and outcomes form the basis of context-sensitive behavior—especially crucial in large-scale projects involving human stakeholders and multiple actors. For example, a future long-term LLM agent that is supposed to assist in the ongoing development of a massive software project such as Linux—which has spanned decades, encompasses over 40 million lines of code, and additionally involves countless past contributions, issues, comments, notes, and feature requests—would need to continuously integrate and reason about a vast, evolving historical context while adapting to new requirements. Core necessities for this kind of system are constant computational cost per new token and a stable or improving performance over time.

Ongoing research directions attack the problem of long-term retention and adaptation from different angles and have made impressive progress. However, we are still lacking approaches that maintain relevant contextualized information over long time frames at a constant cost without degrading performance—necessities for a widespread adoption of LLM agents in many long-term settings.

Meanwhile, many biological systems solve the demands for acting in a continually evolving environment with a dedicated memory system that allows for both fast and slow learning: episodic memory (McClelland et al., 1995; Schwartz & Evans, 2001; O’Reilly & Norman, 2002; O’Reilly et al., 2014; Kumaran et al., 2016; Liao & Losonczy, 2024). In this position paper, we argue that the growing demand for LLM agents to operate effectively over extended timescales, alongside ongoing advances in long-context models, external memory systems, and efficient fine-tuning methods, makes episodic memory a timely framework to unify efforts for enabling truly long-term LLM agents.

To lay out the argument for this position, we proceed as follows: In Section 2, we operationalize the concept of episodic memory for LLM agents by highlighting five key properties that distinguish it from other biological types of memory that are also desirable for LLM agents. We proceed to argue in Section 3 for episodic memory as a unifying goal by showing how various existing approaches to improve LLM memory target different properties that are united in episodic memory. In Section 4, we highlight how unifying these threads under a common goal can spur more holistic progress and outline a roadmap toward implementing episodic memory. Lastly, in Section 5, we discuss alternative views under which episodic memory would not be necessary for long-term LLM agents.

To transfer the concept of episodic memory from cognitive science to the context of LLM agents, we highlight five properties of episodic memory that are useful for LLM agents, and that distinguish episodic memory from other memory types in animals and humans. These five properties naturally cluster into two categories: properties that concern the way that the system operates with the memory—namely, long-term storage, explicit reasoning, and single-shot learning, and properties that concern the content of the stored memory—namely, instance-specific and contextualized memories. We first discuss how the combination of these five properties distinguishes episodic memory from other types of memories in animals and humans, and then detail each property and its utility for LLM agents.

Episodic memory is one of multiple memory systems that exist in animals and humans, distinguished by its unique combination of properties (Table 1). Other biological memory systems that share some, but not all, properties of episodic memory are 1) procedural memory (Milner, 1962; Cohen & Squire, 1980), which allows for long-term storage of memories for implicit operations or task behaviors, such as producing a sequence of a particular type, rather than reasoning about the sequence; 2) semantic memory (Collins & Quillian, 1969; Tulving, 1972), which allows for long-term storage of factual knowledge and explicit reasoning with these stored memories, but lacks specificity to single instances of acquired information and its context; and 3) working memory (Baddeley, 1986; Baddeley & Hitch, 1974), which can share many of the highlighted properties of episodic memory except for the important fact that it does not allow for long-term storage. The unique combination of important properties in episodic memory makes it a promising candidate for translation to AI systems.

Long-term storage. In humans and other animals, episodic memory functions as a form of long-term memory, capable of storing knowledge throughout an individual’s lifetime (Conway, 2001; Mayes & Roberts, 2001; Squire & Zola, 1996; Hampton & Schwartz, 2004). This distinguishes it from working memory, which is transient. For LLM agents, an effective episodic memory system must similarly support memory retrieval across any number of tokens. This requires mechanisms for long-term memory that maintain an agent’s performance throughout a continual interaction with an environment. An adaptive long-term agent should not only prevent a degradation in performance over time; it should also be able to improve by learning new general knowledge and skills.

Explicit reasoning. In classical theories of human memory, episodic memory is described as a subset of declarative or explicit memory (Squire & Zola, 1996; Hampton & Schwartz, 2004). A defining feature of explicit memory is the ability to reflect and reason about the memory content. In the context of LLM agents, the explicitness of memory is necessary as agents need to be able to answer direct queries about stored information or use this information in explicit internal reasoning processes.

Single-shot learning. A key characteristic of episodic memory, as emphasized in complementary learning systems theory, is its ability to be acquired based on a single exposure (Liao & Losonczy, 2024; Schwartz & Evans, 2001; O’Reilly & Norman, 2002; O’Reilly et al., 2014; McClelland et al., 1995; Kumaran et al., 2016; Das et al., 2024b). This fast learning enables the rapid encoding of unique experiences or events. For LLM agents, this capability is particularly crucial in environments where continual deployment may not provide multiple variations or repetitions of specific events. Certain occurrences in an environment may happen only once, necessitating an episodic memory system that is capable of effectively capturing and utilizing information from single exposures.

Instance-specific memories. Episodic memory stores information specific to an individual sequence of events along with their distinct temporal contexts (Sugar & Moser, 2019; Colgin et al., 2008). This specificity allows episodic memory to capture details unique to a particular occurrence, enabling its application in agentic environments where reasoning about specific past actions and their consequences matters. This can include past lines of reasoning that were associated with a decision to be made by an LLM agent.

Contextual memories. Episodic memory binds context to its memory content, such as when, where, and why an event was encountered (Eichenbaum & Cohen, 2014; O’Keefe & Nadel, 1978; Eichenbaum, 2015). The ability to store many contextual relations associated with a specific event enables retrieval based on contextual cues as well as explicit recall of context. For LLM agents, this property is important to not only remember that a specific event happened in the past, but also when, why, and in which broader context it happened.

While many methods currently exist to modify and augment LLM memory, we argue that they fall short of the memory properties that would enable effective, long-term LLM agents. We group existing methods that seek to improve the memory of LLMs into three categories that are relevant to episodic memory:

In-Context Memory methods extend the effective context length by optimizing computational efficiency and length generalization;

External Memory methods augment a model’s in-context memory capacity with a separate module, often with reduced GPU memory requirements and/or computational cost;

Parametric Memory methods modify the LLM parameters that encode memories (primarily learned from the language modeling training data).

In this section, we discuss examples in each category that capture different properties of episodic memory (Table 2). More importantly, we highlight their shortcomings in supporting episodic memory for LLM agents in isolation.


In-context memory (ICM) allows LLMs to perform single-shot, instance-specific, and contextualized learning by enabling them to directly attend to representations of encountered sequences (Table 2). ICM capacity is either tightly limited or extensible but expensive to scale, often requiring sequence parallelization. Recent works seek to extend it by increasing the context window, but models struggle with length generalization beyond training exposures. We review existing methods and their limitations in addressing these challenges.

One active research direction focuses on extending the in-context window to handle significantly longer sequences, enabling LLMs to perform reasoning over extended contexts. This advancement brings LLMs closer to mimicking episodic memory, as it allows models to retain and utilize information across longer contexts. However, transformer-based LLMs face significant challenges, including the high computational cost of processing long sequences and limitations in length generalization. Recent research has sought to address these challenges by reducing memory usage, optimizing inference time, and improving long-sequence generation. Despite these advancements, current methods have yet to achieve robust, persistent memory capabilities necessary for long-term, open-ended, and context-aware reasoning. Below, we briefly review existing methods and their limitations.

Memory reduction. For transformer-based LLMs, several methods aim to reduce memory and computation costs.

Sparsification and compression methods selectively retain relevant information to optimize memory usage. Sparsification strategies optimize memory by restricting attention computations to the most relevant parts of the sequence (Lou et al., 2024), reducing both storage and computational overhead. Similarly, forgetting mechanisms remove less useful tokens to maintain efficiency (Anonymous, 2024). Other compression-based approaches dynamically reduce KV cache size by storing only the most important tokens and key-value pairs (Liu et al., 2023b; Ge et al., 2024; Tang et al., 2024). Adaptive strategies further refine compression across layers (Yang et al., 2024a; Nawrot et al., 2024) or merge similar states to minimize redundancy (Liu et al., 2024a).
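The "heavy-hitter" intuition behind several of these eviction schemes can be sketched in a few lines. This is a deliberate simplification: real systems score importance per head and per layer, usually from accumulated attention statistics.

```python
import numpy as np

def compress_kv(keys, values, attn_scores, budget):
    """Keep only the `budget` cache positions that received the most
    accumulated attention, preserving their original order."""
    importance = attn_scores.sum(axis=0)  # total attention per position
    keep = np.sort(np.argsort(importance)[::-1][:budget])
    return keys[keep], values[keep], keep

# Positions 0 and 3 dominate this toy attention pattern.
attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.6, 0.1, 0.1, 0.2],
                 [0.1, 0.1, 0.1, 0.7]])
keys = np.arange(8.0).reshape(4, 2)
small_k, small_v, kept = compress_kv(keys, keys.copy(), attn, budget=2)
```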

Quantization methods reduce memory footprints by lowering precision or selectively storing information. Quantization techniques store key-value pairs at reduced precision (Liu et al., 2024c; Hooper et al., 2024; Yue et al., 2024; Duanmu et al., 2024), allowing for larger context windows with a relatively minimal performance degradation.
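A minimal sketch of the idea, using per-tensor symmetric int8 quantization; production systems typically quantize per channel or per group and handle outliers separately.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a KV tensor to int8."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv = np.linspace(-4.0, 4.0, 9)         # toy key/value activations
q, scale = quantize_int8(kv)           # 1 byte per entry instead of 2-4
recovered = dequantize_int8(q, scale)  # rounding error bounded by scale / 2
```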

Inference time reduction. Efficiency improvements during inference focus on optimizing KV cache management and parallelization. Techniques such as paged caching (Kwon et al., 2023; Lee et al., 2024; Zheng et al., 2024) dynamically allocate memory to accommodate longer sequences without excessive overhead. Other methods leverage GPU memory pooling and adaptive chunking (Lin et al., 2024; Agrawal et al., 2024) to process extended contexts efficiently while maintaining fast retrieval and computation speeds. Further strategies improve efficiency by reusing KV tensors across layers (Ye et al., 2024a; Brandon et al., 2024).

Recent work has aimed to introduce episodic memory in LLMs by structuring token sequences into retrievable events (Fountas et al., 2024), enhancing long-context reasoning and outperforming retrieval-based models. However, a fundamental challenge remains: increasing memory and retrieval costs. Maintaining the full KV-cache for an entire interaction history can quickly become impractical, especially in large-scale, long-duration, and multimodal applications. This limitation is inherent to KV-cache management systems, which must retain the entire cache, leading to significant storage and computational overhead.

Transformer alternatives to reduce both memory and inference time. In addition to optimizing KV cache storage and management, alternative architectures have been proposed to address the limitations of standard transformers in both memory and inference time.

Linear attention (Li et al., 2020; Katharopoulos et al., 2020) approximates full self-attention using kernel-based or low-rank transformations, significantly reducing computational complexity and improving efficiency for long-sequence processing. State-space models (SSMs) (Peng et al., 2023; Gu & Dao, 2023) further achieve linear scaling for sequence handling by maintaining a fixed-size representation, making them inherently memory-efficient. Hybrid architectures (Goldstein et al., 2024) combine these techniques with transformers to compress KV-cache sizes while preserving strong performance. Other alternatives restructure the transformer architecture itself to enhance efficiency. Some models (Sun et al., 2024a) modify the decoder structure to reduce memory usage and latency, while others (Pang et al., 2024) compress sequence information into compact representations to improve inference speed and scalability.

These methods enhance ICM efficiency, but their reliance on compression, approximation, and selective retention limits their support for long-term reasoning with episodic memory. This limitation highlights the need for an external memory structure that retains past information. KV-cache optimization methods can also discard older context, leading to irreversible information loss and altered model behavior (Kirsten et al., 2024). Generally, constant-cost methods such as SSMs struggle to handle a continually expanding interaction history in dynamic environments, while methods with a growing state representation grow in both inference time and GPU memory requirements.
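The kernel trick behind linear attention, mentioned above, can be made concrete. This non-causal sketch uses the elu(x)+1 feature map and runs in O(n·d²) rather than the O(n²·d) of softmax attention.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: out_i = phi(q_i)^T (sum_j phi(k_j) v_j^T)
    normalized by phi(q_i) . sum_j phi(k_j), with phi(x) = elu(x) + 1 > 0."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V              # (d, d_v): one shared summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)    # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

Because each output row is a convex combination of value rows, feeding constant values returns that constant, a handy sanity check. The fixed-size `KV` summary is also what makes the causal variant a constant-memory recurrence, with the trade-offs discussed above.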

Length Generalization. Length generalization refers to a model’s ability to maintain understanding over long sequences, preventing degradation of performance such as forgetting or losing context midway through processing (Liu et al., 2024b). Humans, by contrast, avoid the trade-offs of a fixed-size state (as in SSMs) by storing compressed representations and retrieving knowledge adaptively, allowing them to manage expanding information effortlessly.

To address this, lightweight solutions (Yen et al., 2024; Xiao et al., 2024) create adapters to process and retrieve long inputs before passing the content to the LLM. Other approaches (Han et al., 2024) refine attention patterns and positional encodings to enhance long-context comprehension. Alternative architectures (Ye et al., 2024b; Dai et al., 2019) improve long-context learning through mechanisms like differential attention and segment-level recurrence. Another promising approach embeds test-time information into the model’s parameters, creating a form of long-term memory (Sun et al., 2024b; Behrouz et al., 2024); by combining attention with neural memory modules, these methods gain adaptability for long contexts at the cost of increased inference overhead. All of these approaches have limited capacity and still face eventual forgetting over very long sequences.

Many methods propose a separate memory module that stores information when it exceeds the effective operating span of the model. These augmented memory models are usually evaluated on tasks which require using that stored information. As such, these methods typically have long-term and explicit memory (Table 2). However, they often lack information that relates the stored memories to one another—especially contextual details on how the model acquired the memory, or details to help differentiate specific instances. They are typically not evaluated for single-shot learning, especially for specific instances. And finally, there is a lack of proposals to generalize information from these instances and update parametric memory (Figure 1a). Below we review some relevant external memory methods and elaborate on key examples to illustrate these shortcomings.

Slot-based memory with recurrent controllers. A key advance in memory augmentation in the pre-transformer era was the formulation of learnable memory modules external to the main neural network (Bordes et al., 2015; Graves et al., 2014; Sukhbaatar et al., 2015). External memories were stored in individual slots and updated via a recurrent memory controller. These models were shown to retain longer-term information than vanilla long short-term memory (LSTM) networks. Similar memory augmentation methods have been adapted for transformers (Wu et al., 2022a). However, these methods lack a way to store contextual details that LLM agents would need in an episodic memory, as they strongly depend on the details available in the input data. One exception devised a method to record temporal relationships between memories (Graves et al., 2016), but this has yet to be seen in augmented LLMs.
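The slot-based pattern can be sketched without any learned controller. Here writes use a simple age-based eviction rule where the original models trained a recurrent controller, so treat this purely as a structural illustration.

```python
import numpy as np

class SlotMemory:
    """Toy slot memory: soft content-based reads, oldest-slot eviction."""
    def __init__(self, n_slots, dim):
        self.M = np.zeros((n_slots, dim))
        self.age = np.zeros(n_slots)

    def write(self, v):
        i = int(np.argmax(self.age))  # evict the least recently written slot
        self.M[i] = v
        self.age += 1
        self.age[i] = 0

    def read(self, query):
        w = np.exp(self.M @ query)    # attention over slot contents
        w /= w.sum()
        return w @ self.M             # weighted mixture of slot vectors
```

The fixed number of slots and the fixed eviction rule make the capacity limits discussed below easy to see: once all slots are full, every write destroys a stored memory.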

Distributed vs. slot memory. An issue with slot-based memory modules is that they are capacity-limited, both by the number of slots and the dimensionality of each slot representation. While these models adopt forgetting mechanisms to mitigate this, the capacity limit still affects how long memories can be stored. Another approach addresses this downside by storing external memories in a sparse, distributed fashion (Wu et al., 2018) instead of in slots. Recent work (Das et al., 2024a) integrated distributed memory in an LLM, and showed that the model can recall a greater number of facts over longer contexts, compared to baseline LLMs. While they demonstrate how they can perform one-shot memory updates (fact-editing), they do not evaluate single-shot learning of novel facts.

RAG and GraphRAG methods. Retrieval Augmented Generation (RAG) methods maintain an external database of information that is added to the input data to augment LLM generation. Naive RAG implementations encode chunks of text using embedding models (Gao et al., 2023), typically without much metadata or contextual detail about the original text. (One exception is work that preserves the order of retrieved text from the database (Yu et al., 2024).) And while text embedding models can capture some similarity relationships between embeddings, they do not encompass the rich set of relationships that LLM agents will likely need for most applications. GraphRAG models replace the vector embedding database with a structured graph that explicitly encodes relationships as connections between nodes (Peng et al., 2024). Still, these graphs encode a limited number of relationship types, even when researchers branch out beyond pre-existing datasets and learn to build the graphs directly from the input text (Li et al., 2024a; Edge et al., 2024; Gutiérrez et al., 2025). As such, they also lack rich contextual detail.
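The shortcoming of naive RAG described above can be seen in a self-contained sketch. The bag-of-words "embedding" and the `NaiveRAGStore` class are illustrative stand-ins for neural embedding models and vector databases; note that nothing about when or how a chunk was acquired is stored.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use neural encoders."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class NaiveRAGStore:
    """Stores raw text chunks without contextual metadata (the shortcoming
    discussed above): retrieval is purely similarity-based."""
    def __init__(self):
        self.chunks = []

    def add(self, text):
        self.chunks.append((text, embed(text)))

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = NaiveRAGStore()
store.add("the user prefers python for scripting")
store.add("the meeting was rescheduled to friday")
print(store.retrieve("which language does the user prefer"))
```

Because similarity is the only relationship the store represents, queries about temporal or causal relations between the two chunks cannot be answered from the index alone.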

External storage of past LLM inputs and outputs. Another type of approach maintains a database of past LLM inputs to avoid recomputing predictions for similar future inputs (Wu et al., 2022b; Khandelwal et al., 2020; Yogatama et al., 2021). Here, contextual information (e.g., details that differentiate specific instances) will only be stored when explicitly given in the LLM input text. That is, the memory is much more dependent on input data, limiting test-time generalization. One proposal to mitigate this formulates a long-term memory module for context that is updated with LLM activations based on the current inputs (Behrouz et al., 2024). Other approaches additionally store LLM outputs, such as generated text (Cheng et al., 2023), summarizations (Wang et al., 2024a; Lee et al., 2023), chain-of-thought steps (Liu et al., 2023a; Lu et al., 2023), and extracted relation triples (Modarressi et al., 2025). One approach specialized for chat interactions stores timestamps and user personality profiles as context (Zhong et al., 2024). These modifications enable storage of contextual details useful for LLM agents. However, specifying the type of contextual detail is restrictive, so it is preferable to combine this with a more learnable and flexible mechanism for storing context.
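A minimal sketch of such a store with explicit contextual fields follows. The `Episode`/`EpisodeLog` names and the choice of fields (source, timestamp) are our own illustrative assumptions; the point is that context is recorded alongside content rather than hoped for in the input text.

```python
import time
from dataclasses import dataclass

@dataclass
class Episode:
    """One stored interaction plus contextual detail (when and from whom)."""
    content: str      # LLM input or output text
    source: str       # e.g. "user", "tool", "self"
    timestamp: float  # when the episode was acquired

class EpisodeLog:
    """External store that keeps context alongside content, rather than
    relying on contextual details appearing in the input text itself."""
    def __init__(self):
        self.episodes = []

    def record(self, content, source, timestamp=None):
        t = time.time() if timestamp is None else timestamp
        self.episodes.append(Episode(content, source, t))

    def between(self, t0, t1):
        """Context-sensitive lookup by acquisition time, not just similarity."""
        return [e for e in self.episodes if t0 <= e.timestamp <= t1]

log = EpisodeLog()
log.record("user asked about invoices", "user", timestamp=100.0)
log.record("tool returned 3 invoices", "tool", timestamp=105.0)
log.record("user changed topic to travel", "user", timestamp=900.0)
assert [e.content for e in log.between(90, 200)] == [
    "user asked about invoices", "tool returned 3 invoices"]
```

Hard-coding the schema is exactly the restriction noted above: any contextual detail not anticipated by the fields (location, task, emotional salience) is silently lost.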

Learning to interact with external memory. The approaches described above may fine-tune or instruct the LLM to interact with and update external memory. That is, the LLM learns the functions of a memory controller. For example, several RAG approaches fine-tune the LLM to make better use of the retrieved content (Gao et al., 2023). Other approaches define how LLMs should interact with memory, requiring them to learn specific API calls (Modarressi et al., 2025) or memory hierarchies (Packer et al., 2024). These provide possible mechanisms to add information to external memory, such as contextual details and specific instances. However, most current work does not consider how to modify the LLM to generalize across specific instances to store new knowledge in LLM parameters (Figure 1a). Behrouz et al. (2024) propose one way to generalize across instances by adding a data-independent memory system (a.k.a. meta-memory, persistent memory) in addition to a more data-dependent memory module. However, this data-independent memory functions more as a task memory than as distilled knowledge, and its parameters are kept separate from the LLM itself.

Parametric memory. This type of memory allows LLMs to process the information in the input to produce well-suited outputs. Parametric memory values are initially learned through back-propagation on a pretraining dataset. During this process, the parametric memory tends to capture general knowledge and rules ranging from syntax to common sense and factual knowledge. Due to the sheer size of the parametric memory, the amount of data needed for pre-training is usually very large, following power laws (Kaplan et al., 2020). Generally, parametric memory is fixed after training, i.e., it does not change with the input at inference time.

A relevant research direction in parametric memory focuses on adapting LLM parameters to specific domains, tasks, or applications when given limited resources. Efficient fine-tuning methods have been developed in recent years to tackle the runtime and memory consumption of this process. Alternatively, distillation techniques have been proposed to update knowledge and propagate it through a model. A key challenge is the need for updating specific factual knowledge without interfering with other knowledge. Some facts may change over time, requiring surgical precision to update the parameters of a model. The line of work that proposes these updates is known as knowledge editing.

Efficient Fine-tuning. Various methods have been proposed to reduce the computational needs (hardware memory) of adapting a model to a specific domain. Among these, Low-Rank Adaptation (LoRA) (Hu et al., 2022) applies additive low-rank updates to shift the model parameters. Several methods further improve efficiency by reducing and localizing updates (Wang et al., 2024b; Valipour et al., 2022; Xu et al., 2021; Yin et al., 2024). Other work learns modifications on representations instead of parameters (Wu et al., 2024; Yin et al., 2024). In all cases, fine-tuning methods require a dataset to adapt a model to a specific task or domain, i.e., they are not capable of single-shot learning or of capturing instance-specific, contextually rich information. On the other hand, the additional fine-tuned adapter parameters are often frozen after the fine-tuning process, supporting long-term storage of information. Moreover, these methods tend to preserve reasoning capabilities while updating the model with newly captured information (Wu et al., 2024).
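The additive low-rank idea behind LoRA can be sketched in a few lines of NumPy. This is a schematic of the forward pass W + BA only, with toy dimensions; the training loop, scaling factor, and placement on attention projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                    # layer dims and low rank r << min(d, k)

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus additive low-rank update: (W + B @ A) @ x
    return W @ x + B @ (A @ x)

x = rng.normal(size=k)
# Before any fine-tuning the adapter is a no-op because B is zero-initialized.
assert np.allclose(lora_forward(x), W @ x)
# Only r * (d + k) adapter parameters are trained, versus d * k for W.
print(r * (d + k), "adapter params vs", d * k, "full params")
```

The zero-initialized B is the detail that lets fine-tuning start exactly from the pretrained model; freezing B and A afterwards is what supports the long-term storage noted above.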

Knowledge Editing. As the environment evolves over time, some factual knowledge becomes outdated (e.g., the president of a country may change after an election). Knowledge editing methods aim to modify factual knowledge in parametric memory with targeted updates while avoiding interference with other facts. In ROME (Meng et al., 2023a) and MEMIT (Meng et al., 2023b), the first step is to find the relevant parameters (in MLPs) that influence a specific fact through causal interventions, and then to update those parameters with low-rank model edits. An alternative research direction trains a hyper-network (Cao et al., 2021; Tan et al., 2024) that predicts the change for each parameter given the knowledge to be edited. A different method, SERAC, stores the set of edits in an external memory, combined with a scope detector and a counterfactual model that decide when and how to apply the edits (Mitchell et al., 2022). All knowledge editing methods operate on facts, which are inherently context-free, making it impossible to contextualize the edited knowledge in the history of the agent-environment interaction. However, they do mimic the episodic memory traits of learning from a single instance while enabling long-term retention.
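The core arithmetic of a rank-one edit can be sketched as follows. This is a simplified illustration of the key-value view underlying ROME-style methods, not the actual algorithm: real implementations locate the layer via causal tracing and weight the update by a key covariance term, both of which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))   # stands in for an MLP projection matrix

k_star = rng.normal(size=d_in)       # key: representation of the fact's subject
v_star = rng.normal(size=d_out)      # value: representation encoding the new fact

# Rank-one edit that maps k_star exactly to v_star while leaving W otherwise
# minimally perturbed (in the direction of k_star only).
delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
W_edited = W + delta

assert np.allclose(W_edited @ k_star, v_star)
```

Because the edit is expressed purely as a (key, value) pair, there is no place in this update to record when or in what context the new fact was learned, which is exactly the context-free limitation noted above.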

The problem of knowledge editing has been extended to a continual learning setting, where edits are applied sequentially over time to correct a model. This leads to the sequential editing problem: hyper-network predictions degrade because they fail to reflect the updated model, and low-rank parameter updates interfere with one another, causing catastrophic forgetting (Gupta et al., 2024). MELO (Yu et al., 2023) adapts dynamic LoRA to this problem and introduces a vector database to select which blocks within the LoRA matrices to dynamically activate for each layer. WISE (Wang et al., 2024c) duplicates the MLP output parameters for some layers in the network and updates the copies with each new edit set; a routing mechanism decides whether to use the original layer or the updated one. It further uses sharding and merging (Yadav et al., 2023) to distribute the edits into random subspaces, improving generalization and parameter utilization.

While continual learning-based knowledge editing allows models to integrate updates over time, it has fundamental limitations. Edited knowledge often fails to generalize, struggling with inferring new relationships or reasoning over multiple steps (Berglund et al., 2023; Yang et al., 2024b). This highlights a key challenge: knowledge editing methods can introduce updates but do not always ensure deeper understanding or adaptability.

Context Distillation. The idea behind these techniques is to transfer in-context learned information, abilities, and task understanding into the model parameters via distillation. Snell et al. (2022) proposed a setup in which the teacher and the student are the same model, but the student receives less in-context information. This enables the student to learn skills and express knowledge that would otherwise depend on including information and instances in costly and limited in-context memory. Further, Padmanabhan et al. (2023) propose to exploit context distillation to inject and propagate knowledge through a model. The original model is provided with new definitions and generates continuations; the distillation process then updates a copy of the model on only the generated continuations, implicitly conditioning the updated model on the new entities. This propagates the information into the parameters (i.e., consolidates it) and thus improves inference involving such entities.
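The distillation objective itself reduces to a divergence between two next-token distributions from the same model. Below is a minimal sketch with hand-picked toy logits (all values are illustrative assumptions); a real implementation would backpropagate this loss into the student's parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for dense probability vectors."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token logits over a tiny vocabulary.
teacher_logits = np.array([2.0, 0.5, -1.0, 0.0])  # model conditioned on the new context
student_logits = np.array([0.1, 0.2, 0.0, 0.1])   # same model without the context

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)

# Context distillation loss: push the context-free student toward the
# context-conditioned teacher, so the context's effect moves into the weights.
loss = kl(p_teacher, p_student)
assert loss > 0.0
```

When the student matches the teacher exactly, the loss is zero; minimizing it over many prompts is what "consolidates" the in-context information into parameters.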

Although current work has advanced context-sensitive LLMs that are capable of handling longer sequences, it does not yet deliver efficient learning that could support long-term LLM agents. Existing methods—which extend in-context (working) memory, integrate external memory, or update parametric memory—only address subsets of episodic memory’s five essential properties, as discussed in Section 3. These approaches remain fragmented, impeding the immediate assimilation of new experiences and gradual improvement over time.

We propose that enabling episodic memory offers a unifying perspective that will combine and extend existing methods to advance the capabilities of LLM agents. By incorporating long in-context memory, external memory, and mechanisms for updating parametric memory, agents can more seamlessly adapt to new information, consolidate it, and prevent escalating costs or performance degradation during extended interactions with an environment. This view is based on Complementary Learning Systems Theory (O’Reilly et al., 2014; Kumaran et al., 2016; Arani et al., 2022), in which episodic memory is part of a fast-learning system that stores information from individual instances. Over time, that information is consolidated into a slow-learning system that stores more stable, durable knowledge.

In Figure 1, we present a general architecture and framework that combines these elements under the overarching goal of enabling all five key features of episodic memory for LLM agents as detailed in Section 2. As a roadmap to enable episodic memory in LLM agents, we specifically call for four main research directions (encoding, retrieval, consolidation, and benchmarks), and formulate six research questions under these areas below.

RQ1: How to store information from in-context memory in a long-term external memory store?

An external memory store is essential for retaining experience in a structured way that preserves the context of individual instances (Fig. 1, arrow (b)). A straightforward approach is to store text chunks or embeddings in a non-parametric RAG-like database, potentially augmented with metadata for context (Mombaerts et al., 2024). More structured representations, such as GraphRAG, could also facilitate context-sensitive retrieval. However, capacity constraints on these types of databases may make it necessary to rely on more compressed parametric representations.

RQ2: How to segment continuous input into discrete episodes, and when to store them in an external memory?

A major design question is when and how to segment a continuous stream of agent experience into episodes to be encoded into an external memory. LLMs have already been shown to be capable of segmenting text into meaningful events in a way that is similar to humans (Michelmann et al., 2023), and recent approaches show that further bundling related segments based on model surprise can improve long-term modeling (Behrouz et al., 2024; Fountas et al., 2024).
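A surprise-based boundary rule can be sketched in a few lines. Here the surprisal values are hand-picked toy numbers standing in for -log p(token | history) from an LLM, and the fixed threshold is an assumption; the cited approaches use learned or adaptive criteria.

```python
def segment_by_surprise(tokens, surprisal, threshold=3.0):
    """Split a token stream into episodes at high-surprise points.
    `surprisal` stands in for -log p(token | history) from an LLM."""
    episodes, current = [], []
    for tok, s in zip(tokens, surprisal):
        if s > threshold and current:      # event boundary detected
            episodes.append(current)
            current = []
        current.append(tok)
    if current:
        episodes.append(current)
    return episodes

tokens    = ["met", "alice", "at", "noon", "then", "server", "crashed", "hard"]
surprisal = [1.0,   1.2,     0.5,  0.8,    2.0,    4.5,      1.1,       0.9]
print(segment_by_surprise(tokens, surprisal))
# two episodes, with a boundary just before the surprising "server"
```

Each resulting episode is then a natural unit to encode into external memory together with its context.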

Leveraging long-context advances can further improve encoding by providing a space in which new episodes can be equipped with a rich contextualization. Large hidden states or extended attention windows help capture high-fidelity contextual information, which can then be encoded into an external memory in a compressed format for future retrieval.

RQ3: Given an external memory, how to select relevant past episodes for retrieval and reinstatement into in-context memory for the purpose of explicit reasoning?

To employ past experiences in current tasks, an agent must retrieve relevant episodes at the right time and reintegrate them into its in-context memory with an adequate mechanism (Fig. 1, arrow (c)). Common strategies include prepending retrieved text tokens to the input sequence (as in RAG), manipulating representational states within the transformer (e.g., memory tokens (Bulatov et al., 2022)), or adapting internal representations (Wu et al., 2024).
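The first strategy, prepending retrieved episodes to the prompt, can be sketched as follows. The word-overlap scoring and the character budget are illustrative stand-ins for embedding similarity and a token budget; the function and variable names are our own.

```python
def overlap(a, b):
    """Toy relevance score: shared word count (stand-in for embedding similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def reinstate(query, memory, k=2, budget_chars=200):
    """Prepend the k most relevant stored episodes to the prompt (RAG-style
    reinstatement into in-context memory), respecting a context budget."""
    scored = sorted(memory, key=lambda m: overlap(query, m), reverse=True)
    selected, used = [], 0
    for m in scored[:k]:
        if used + len(m) > budget_chars:
            break                       # in-context memory is a scarce resource
        selected.append(m)
        used += len(m)
    return "\n".join(["Relevant past episodes:"] + selected
                     + ["Current task: " + query])

memory = [
    "deployed service v2 on tuesday",
    "user asked to always reply in french",
    "lunch order was noodles",
]
prompt = reinstate("reply to the user about the v2 deployment", memory)
print(prompt)
```

The budget check makes the trade-off explicit: reinstatement competes with the current task for the same limited context window.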

Long-context advances can be leveraged to inform when and what to retrieve at sequence lengths that are still feasible. Future research could explore tight integration of external memory with the model’s forward pass (Berges et al., 2024) and adopt cross-architecture distillation (Wang et al., 2025) to accelerate the development of external memory structures that retain many of the desirable properties of in-context memory while reducing the resource cost.

RQ5: How to periodically consolidate external memory contents into the LLM’s base parameters without forgetting previous knowledge?

Eventually, merging external memory contents into the model's parameters (Fig. 1, arrow (a)) promises to allow new generalized knowledge to be used without explicit retrieval. This process both prevents external memory overflow and supports continuous adaptation of the agent's semantic and procedural backbone to the environment. Relevant techniques include context distillation, parametric knowledge editing, and localized fine-tuning methods that capture newly encountered information without catastrophic interference with other knowledge. Open questions remain about how to decide when to consolidate and how to compress many episodic instances into more abstract parametric knowledge while also retaining previous knowledge and skills.
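One simple ingredient for consolidation without forgetting is rehearsal: interleave new episodes with samples of prior data in each fine-tuning batch. The sketch below shows only the batch construction (names and ratios are our own assumptions); the actual parameter updates would come from any of the fine-tuning methods discussed above.

```python
import random

def consolidation_batches(new_episodes, rehearsal_pool, batch_size=4,
                          rehearsal_ratio=0.5, seed=0):
    """Build fine-tuning batches that interleave new episodic memories with
    rehearsal of prior data, a standard recipe against catastrophic forgetting."""
    rng = random.Random(seed)
    n_old = int(batch_size * rehearsal_ratio)
    n_new = batch_size - n_old
    batches = []
    for i in range(0, len(new_episodes), n_new):
        new_part = new_episodes[i:i + n_new]
        old_part = rng.sample(rehearsal_pool, min(n_old, len(rehearsal_pool)))
        batches.append(new_part + old_part)
    return batches

new = ["fact A", "fact B", "fact C"]
old = ["skill 1", "skill 2", "skill 3", "skill 4"]
batches = consolidation_batches(new, old)
assert all(any(x in old for x in b) for b in batches)  # every batch rehearses
```

When to trigger consolidation, and how to replace raw replay with more abstract summaries of many episodes, are exactly the open questions raised above.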

RQ6: What new types of benchmarks are needed to assess episodic memory in LLM agents?

Finally, evaluating episodic memory effectiveness requires new tasks and metrics. Studies should test the recall of contextualized events after long delays, assessing how well agents remember when, where, and how events occurred. An example of such a study is the testing of instance-specific temporal order memory proposed by Pink et al. (2024). Beyond controlled probes, benchmarks must incorporate real-world complexities: agents should demonstrate an improving task performance that is linked to encoding, retrieval, and consolidation of past experiences over extended timescales.
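A temporal-order probe of the kind mentioned above can be generated mechanically from any ordered episode log. This is a schematic benchmark generator of our own (names and question template are assumptions), not the protocol of any cited study.

```python
import random

def temporal_order_probe(episodes, n_queries=5, seed=0):
    """Generate 'which came first?' probes from an ordered episode log,
    in the spirit of instance-specific temporal order evaluations."""
    rng = random.Random(seed)
    probes = []
    for _ in range(n_queries):
        i, j = sorted(rng.sample(range(len(episodes)), 2))
        probes.append({
            "question": f"Which came first: '{episodes[i]}' or '{episodes[j]}'?",
            "answer": episodes[i],     # the earlier episode is the gold answer
        })
    return probes

log = ["booked flight", "packed bags", "boarded plane", "landed"]
for p in temporal_order_probe(log, n_queries=2):
    print(p["question"], "->", p["answer"])
```

Scoring an agent's answers against such probes after long delays measures exactly the "when" component of episodic memory that similarity-based retrieval benchmarks miss.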

While we argue that an explicit episodic memory framework is necessary for effective long-term and context-sensitive behavior, there are alternative perspectives suggesting that current or emerging methods might suffice without episodic memory as a guiding concept.

Scaling in-context memory will be sufficient. One view suggests that advances in long-context methods—such as improved transformers, state-space models, or other architectures with extended context windows—will enable practically unlimited access to past information. Proponents claim that better positional encodings, modified attention mechanisms, and other in-context memory extensions will cover most relevant applications for LLM-based agents.

Contextualized external memory will be sufficient. A second view holds that external memory structures—such as knowledge graphs or retrieval-augmented generation (RAG) systems—could eliminate the need for an episodic memory framework. By contextualizing data chunks and storing them in structured graphs, these systems aim to incorporate past context into current tasks effectively.

“Infinite” in-context memory remains a speculative prospect. Extending limited context windows to include all information needed by an agent requires foreknowledge of the maximum timespan of relevant information. For very long timespans, this will either incur prohibitive computational costs or require compression methods that may lose key details. Only relying on external memory will still incur high storage costs, and require forgetting mechanisms. An episodic memory framework addresses these constraints by periodically consolidating information into high-capacity parametric memory (Figure 1, arrow (a)). This has the added benefit of enabling LLM agents to slowly improve over time, as they continue to learn from the past before they forget it.

This position paper argues that to fully realize efficient long-term LLM agents, we must endow LLM agents with episodic memory. We operationalize episodic memory—a term borrowed from cognitive science—for LLMs by highlighting five key characteristics that distinguish episodic memory from other types of memory in biological systems, and argue for why each property is also important for LLM agents. We position the call for episodic memory in LLM agents in the current literature and discuss how episodic memory can serve as a unifying goal for existing research directions. Lastly, we provide a roadmap of research questions towards implementing episodic memory in LLMs. By describing the potential of this research direction, we aim to spark a community-wide shift in how we conceive and engineer long-term memory in the move towards agentic AI—one that more deeply integrates lessons from cognitive science and brings together existing approaches in ML under a unifying goal with strong promise.

Table S2.T1: Properties of episodic memory in comparison to other relevant forms of memory in animals and humans.

| Memory Type | Long-term | Explicit | Single-shot | Instance-specific | Contextual relations |
| --- | --- | --- | --- | --- | --- |
| Episodic | ✓ | ✓ | ✓ | ✓ | ✓ |
| Procedural | ✓ | × | × | × | × |
| Semantic | ✓ | ✓ | × | × | × |
| Working | × | ✓ | ✓ | ✓ | ✓ |

Table S3.T2: Methods for in-context, external, and parametric memory do not cover all features of episodic memory. ∼ is used for cases where it is unclear whether an aspect of episodic memory is properly satisfied by a method.

| | Memory Approach | Long-term | Explicit | Single-shot | Inst.-specific | Contextual rel. |
| --- | --- | --- | --- | --- | --- | --- |
| In-Context | KV-Compression | × | ✓ | ✓ | ✓ | ✓ |
| | State-space-model | × | ✓ | ✓ | ✓ | ✓ |
| External | RAG | ✓ | ✓ | ∼ | ∼ | × |
| | GraphRAG | ✓ | ✓ | ∼ | ∼ | ∼ |
| Parametric | Efficient Fine-tuning | ✓ | ✓ | × | × | × |
| | Knowledge Editing | ✓ | ∼ | × | ∼ | × |
| | Context Distillation | ✓ | ✓ | × | ✓ | × |

Figure 1: LLM agents with an episodic memory system. The LLM agent acts on and gets feedback from an environment. Feedback can come in the form of outputs from programs (E1), from other agents (E2), from humans (E3), as well as from external real-world data (E4). Actions can modify parts of the environment and provide feedback for humans or other agents in the environment. Within the agent, an external memory system acts as a bridge between parametric and in-context memory while allowing for fast encoding of and retrieval into in-context memory (the LLM's context window). (a) Consolidation: episodes in the external memory are consolidated into the model's broader parametric memory to avoid capacity limitations and to allow generalization to new semantic knowledge and procedural skills based on specific instances. (b) Encoding: limited in-context memory can offload its content into external memory. (c) Retrieval: stored episodes can later be retrieved and used to reinstate representations into in-context memory.


References

[vaswani2017attention] Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

[gu2024mambalineartimesequencemodeling] Albert Gu, Tri Dao. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

[yao2023reactsynergizingreasoningacting] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.

[Shinn2023] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

[schmidgall2025agent] Schmidgall, Samuel, Su, Yusheng, Wang, Ze, Sun, Ximeng, Wu, Jialian, Yu, Xiaodong, Liu, Jiang, Liu, Zicheng, Barsoum, Emad. (2025). Agent Laboratory: Using LLM Agents as Research Assistants. arXiv preprint arXiv:2501.04227.

[lin2023artificial] Lin, Chien-Chang, Huang, Anna YQ, Lu, Owen HT. (2023). Artificial intelligence in intelligent tutoring systems toward sustainable education: a systematic review. Smart Learning Environments.

[Kim2023] Kim, Geunwoo, Baldi, Pierre, McAleer, Stephen. (2023). Language Models can Solve Computer Tasks. Advances in Neural Information Processing Systems.

[xi2023risepotentiallargelanguage] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey.

[wu2023autogenenablingnextgenllm] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, Chi Wang. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.

[liu2023agentbenchevaluatingllmsagents] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang. (2023). AgentBench: Evaluating LLMs as Agents.

[Lin23] Lin, Bill Yuchen, Fu, Yicheng, Yang, Karina, Brahman, Faeze, Huang, Shiyu, Bhagavatula, Chandra, Ammanabrolu, Prithviraj, Choi, Yejin, Ren, Xiang. (2023). SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. Advances in Neural Information Processing Systems.

[zhang2024chain] Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, Sercan O Arik. (2024). Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[wang2023interactivenaturallanguageprocessing] Zekun Wang, Ge Zhang, Kexin Yang, Ning Shi, Wangchunshu Zhou, Shaochun Hao, Guangzheng Xiong, Yizhi Li, Mong Yuan Sim, Xiuying Chen, Qingqing Zhu, Zhenzhu Yang, Adam Nik, Qi Liu, Chenghua Lin, Shi Wang, Ruibo Liu, Wenhu Chen, Ke Xu, Dayiheng Liu, Yike Guo, Jie Fu. (2023). Interactive Natural Language Processing.

[li2024personalllmagentsinsights] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu. (2024). Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security.

[CohenSquire1980] Cohen, Neal J., Squire, Larry R.. (1980). Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of “knowing how” and “knowing that”. Science.

[Milner1962] Milner, Brenda. (1962). Les troubles de la mémoire accompagnant des lésions hippocampiques bilatérales. Psychologie Médicale.

[BaddeleyHitch1974] Baddeley, Alan D., Hitch, Graham J.. (1974). Working Memory. The Psychology of Learning and Motivation: Advances in Research and Theory.

[Baddeley1986] Baddeley, Alan D.. (1986). Working Memory.

[Tulving1972a] Tulving, Endel. (1972). Episodic and Semantic Memory. Organization of Memory.

[Tulving1983] Tulving, Endel. (1983). Elements of Episodic Memory.

[CollinsQuillian1969] Collins, Allan M., Quillian, M. Ross. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior.

[Patterson2007] Patterson, Karalyn, Nestor, Peter J., Rogers, Timothy T.. (2007). Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience. doi:10.1038/nrn2277.

[Kumaran2016] Kumaran, Dharshan, Hassabis, Demis, McClelland, James L.. (2016). What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences. doi:10.1016/j.tics.2016.05.004.

[McClelland1995] McClelland, J., McNaughton, B., O'Reilly, R.. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review.

[OReilly2002] O'Reilly, R., Norman, K.. (2002). Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences.

[OReilly2011] O’Reilly, Randall C., Bhattacharyya, Rajan, Howard, Michael D., Ketz, Nicholas. (2011). Complementary Learning Systems. Cognitive Science. doi:10.1111/j.1551-6709.2011.01214.x.

[arani2022learningfastlearningslow] Elahe Arani, Fahad Sarfraz, Bahram Zonooz. (2022). Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System.

[jian2024linkingincontextlearningtransformers] Li Ji-An, Corey Y. Zhou, Marcus K. Benna, Marcelo G. Mattar. (2024). Linking In-context Learning in Transformers to Human Episodic Memory.

[Whittington2023] Whittington, James C.R., Dorrell, William, Behrens, Timothy E.J., Ganguli, Surya, El-Gaby, Mohamady. (2023). On prefrontal working memory and hippocampal episodic memory: Unifying memories stored in weights and activity slots. bioRxiv. doi:10.1101/2023.11.05.565662.

[michelmann2023largelanguagemodelssegment] Sebastian Michelmann, Manoj Kumar, Kenneth A. Norman, Mariya Toneva. (2023). Large language models can segment narrative events similarly to humans.

[Sugar2019] Sugar, J., Moser, M.. (2019). Episodic memory: Neuronal codes for what, where, and when. Hippocampus.

[Colgin2008] Colgin, L., Moser, E., Moser, M.. (2008). Understanding memory through hippocampal remapping. Trends in Neurosciences.

[Liao2024] Liao, Z., Losonczy, A.. (2024). Learning, fast and slow: Single- and many-shot learning in the hippocampus. Annual Review of Neuroscience.

[Schwartz2001] Schwartz, B., Evans, S.. (2001). Episodic memory in primates. American Journal of Primatology: Official Journal of the American Society of Primatologists.

[OReilly2014] O'Reilly, R., Bhattacharyya, R., Howard, M., Ketz, N.. (2014). Complementary learning systems. Cognitive Science.

[Das2024] Das, P., Chaudhury, S., Nelson, E., others. (2024). Larimar: Large language models with episodic memory control. Proceedings of the 41st International Conference on Machine Learning (ICML).

[OKeefe1978] O’Keefe, J., Nadel, L.. (1978). The Hippocampus as a Cognitive Map.

[Eichenbaum2014] Eichenbaum, H., Cohen, N.. (2014). Can we reconcile the declarative memory and spatial navigation views on hippocampal function?. Neuron.

[Eichenbaum2015] Eichenbaum, H.. (2015). The hippocampus as a cognitive map . . . of social space. Neuron.

[Mayes2001] Mayes, A., Roberts, N.. (2001). Theories of episodic memory. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences.

[Conway2001] Conway, M.. (2001). Sensory–perceptual episodic memory and its context: Autobiographical memory. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences.

[Squire1996] Squire, L., Zola, S.. (1996). Structure and function of declarative and nondeclarative memory systems. PNAS.

[Hampton2004] Hampton, R., Schwartz, B.. (2004). Episodic memory in nonhumans: what, and where, is when?. Current Opinion in Neurobiology.

[yen2024longcontextlanguagemodelingparallel] Howard Yen, Tianyu Gao, Danqi Chen. (2024). Long-Context Language Modeling with Parallel Context Encoding.

[brandon2024reducingtransformerkeyvaluecache] William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly. (2024). Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.

[sun2024cacheoncedecoderdecoderarchitectures] Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei. (2024). You Only Cache Once: Decoder-Decoder Architectures for Language Models.

[goldstein2024goldfinchhighperformancerwkvtransformer] Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah. (2024). GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression.

[Kwon2023] Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao, Gonzalez, Joseph, Zhang, Hao, Stoica, Ion. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles. doi:10.1145/3600006.3613165.

[ye2024chunkattentionefficientselfattentionprefixaware] Lu Ye, Ze Tao, Yong Huang, Yang Li. (2024). ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition.

[lin2024infinitellmefficientllmservice] Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin. (2024). Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache.

[lee2024infinigenefficientgenerativeinference] Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim. (2024). InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management.

[xiao2024infllmtrainingfreelongcontextextrapolation] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun. (2024). InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory.

[agrawal2024mnemosyneparallelizationstrategiesefficiently] Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse. (2024). Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations.

[han-etal-2024-lm] Han, Chi, Wang, Qifan, Peng, Hao, Xiong, Wenhan, Chen, Yu, Ji, Heng, Wang, Sinong. (2024). LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2024.naacl-long.222.

[tang2024razorattentionefficientkvcache] Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang. (2024). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.

[Liu2023scissorhands] Liu, Zichang, Desai, Aditya, Liao, Fangshuo, Wang, Weitao, Xie, Victor, Xu, Zhaozhuo, Kyrillidis, Anastasios, Shrivastava, Anshumali. (2023). Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Advances in Neural Information Processing Systems.

[yang2024pyramidinferpyramidkvcache] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao. (2024). PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.

[ge2024modeltellsdiscardadaptive] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. (2024). Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.

[devoto2024simpleeffectivel2normbased] Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini. (2024). A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression.

[nawrot2024dynamicmemorycompressionretrofitting] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti. (2024). Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.

[liu2024minicachekvcachecompression] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang. (2024). MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.

[pang2024anchorbasedlargelanguagemodels] Jianhui Pang, Fanghua Ye, Derek Fai Wong, Xin He, Wanshun Chen, Longyue Wang. (2024). Anchor-based Large Language Models.

[hooper2024kvquant10millioncontext] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.

[yue2024wkvquantquantizingweightkeyvalue] Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie. (2024). WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More.

[duanmu2024skvqslidingwindowkeyvalue] Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin. (2024). SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models.

[anonymous2024forgetting] Anonymous. (2024). Forgetting Transformer: Softmax Attention with a Forget Gate. Submitted to The Thirteenth International Conference on Learning Representations.

[ye2024differentialtransformer] Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei. (2024). Differential Transformer.

[lewis2021retrievalaugmentedgenerationknowledgeintensivenlp] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

[gao2023enablinglargelanguagemodels] Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen. (2023). Enabling Large Language Models to Generate Text with Citations.

[edge2024localglobalgraphrag] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization.

[li2024graphreaderbuildinggraphbasedagent] Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, Wenbo Su, Bo Zheng. (2024). GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models.

[gutiérrez2025hipporagneurobiologicallyinspiredlongterm] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, Yu Su. (2025). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models.

[peng2024graphretrievalaugmentedgenerationsurvey] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang. (2024). Graph Retrieval-Augmented Generation: A Survey.

[yu2024defenserageralongcontext] Tan Yu, Anbang Xu, Rama Akkiraju. (2024). In Defense of RAG in the Era of Long-Context Language Models.

[balaguer2024ragvsfinetuningpipelines] Angels Balaguer, Vinamra Benara, Renato Luiz de Freitas Cunha, Roberto de M. Estevão Filho, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, Ranveer Chandra. (2024). RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture.

[wu2022memorizing] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy. (2022). Memorizing Transformers. International Conference on Learning Representations.

[khandelwal2020generalizationknnlm] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis. (2020). Generalization through Memorization: Nearest Neighbor Language Models. International Conference on Learning Representations.

[yogatama2021semiparametriclm] Yogatama, Dani, de Masson d’Autume, Cyprien, Kong, Lingpeng. (2021). Adaptive Semiparametric Language Models. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00371.

[cheng2023selfmemrag] Cheng, Xin, Luo, Di, Chen, Xiuying, Liu, Lemao, Zhao, Dongyan, Yan, Rui. (2023). Lift Yourself Up: Retrieval-augmented Text Generation with Self-Memory. Advances in Neural Information Processing Systems.

[wang2024selfcontrolledmemllm] Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, Zhoujun Li. (2024). Enhancing Large Language Model with Self-Controlled Memory Framework.

[modarressi2025memllmfinetuningllmsuse] Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze. (2025). MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory.

[packer2024memgptllmsoperatingsystems] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez. (2024). MemGPT: Towards LLMs as Operating Systems.

[lee2023promptllmlongchatbotmem] Lee, Gibbeum, Hartmann, Volker, Park, Jongho, Papailiopoulos, Dimitris, Lee, Kangwook. (2023). Prompted LLMs as Chatbot Modules for Long Open-Domain Conversation. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.277.

[zhong2024memorybankenhancinglargelanguage] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang. (2024). MemoryBank: Enhancing Large Language Models with Long-Term Memory. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v38i17.29946.

[bordes2015largescalesimplequestionanswering] Antoine Bordes, Nicolas Usunier, Sumit Chopra, Jason Weston. (2015). Large-scale Simple Question Answering with Memory Networks.

[sukhbaatar2015endtoendmemorynetworks] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. (2015). End-To-End Memory Networks. Advances in Neural Information Processing Systems.

[kumar2016dynamicmemorynet] Kumar, Ankit, Irsoy, Ozan, Ondruska, Peter, Iyyer, Mohit, Bradbury, James, Gulrajani, Ishaan, Zhong, Victor, Paulus, Romain, Socher, Richard. (2016). Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. Proceedings of The 33rd International Conference on Machine Learning.

[graves2014neuralturingmachines] Alex Graves, Greg Wayne, Ivo Danihelka. (2014). Neural Turing Machines.

[graves2016hybrid] Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.

[wu2018kanerva] Yan Wu, Greg Wayne, Alex Graves, Timothy Lillicrap. (2018). The Kanerva Machine: A Generative Distributed Memory. International Conference on Learning Representations.

[das2024larimarlargelanguagemodels] Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiří Navrátil, Soham Dan, Pin-Yu Chen. (2024). Larimar: Large Language Models with Episodic Memory Control.

[wu2022memformer] Wu, Qingyang, Lan, Zhenzhong, Qian, Kun, Gu, Jing, Geramifard, Alborz, Yu, Zhou. (2022). Memformer: A Memory-Augmented Transformer for Sequence Modeling. Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. doi:10.18653/v1/2022.findings-aacl.29.

[wang2023augmentwithresidualretriever] Wang, Weizhi, Dong, Li, Cheng, Hao, Liu, Xiaodong, Yan, Xifeng, Gao, Jianfeng, Wei, Furu. (2023). Augmenting Language Models with Long-Term Memory. Advances in Neural Information Processing Systems.

[geva2021transformerffkvmem] Geva, Mor, Schuster, Roei, Berant, Jonathan, Levy, Omer. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.446.

[berges2024memorylayersscale] Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh. (2024). Memory Layers at Scale.

[lample2019largememlayers] Lample, Guillaume, Sablayrolles, Alexandre, Ranzato, Marc'Aurelio, Denoyer, Ludovic, Jegou, Herve. (2019). Large Memory Layers with Product Keys. Advances in Neural Information Processing Systems.

[sukhbaatar2019augmentingselfattentionpersistentmemory] Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin. (2019). Augmenting Self-attention with Persistent Memory.

[bulatov2022recurrentmemorytransformer] Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev. (2022). Recurrent Memory Transformer.

[lu2023memochattuningllmsuse] Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, Yunsheng Wu. (2023). MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation.

[snell2022learningdistillingcontext] Charlie Snell, Dan Klein, Ruiqi Zhong. (2022). Learning by Distilling Context.

[distillingcontext-2] Padmanabhan, Shankar, Onoe, Yasumasa, Zhang, Michael, Durrett, Greg, Choi, Eunsol. (2023). Propagating Knowledge Updates to LMs Through Distillation. Advances in Neural Information Processing Systems.

[kang2024latentparaphrasingperturbationlayers] Minki Kang, Sung Ju Hwang, Gibbeum Lee, Jaewoong Cho. (2024). Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models.

[pham2022revisitingselfdistillation] Minh Pham, Minsu Cho, Ameya Joshi, Chinmay Hegde. (2022). Revisiting Self-Distillation.

[wang2025mamballamadistillingaccelerating] Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao. (2025). The Mamba in the Llama: Distilling and Accelerating Hybrid Models.

[yang2024syntheticcontinuedpretraining] Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto. (2024). Synthetic continued pretraining.

[decao2021editingfactualknowledgelanguage] Nicola De Cao, Wilker Aziz, Ivan Titov. (2021). Editing Factual Knowledge in Language Models.

[zhang2024comprehensivestudyknowledgeediting] Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, Huajun Chen. (2024). A Comprehensive Study of Knowledge Editing for Large Language Models.

[mitchell2022memorybasedmodeleditingscale] Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, Chelsea Finn. (2022). Memory-Based Model Editing at Scale.

[tan2024massiveeditinglargelanguage] Chenmien Tan, Ge Zhang, Jie Fu. (2024). Massive Editing for Large Language Models via Meta Learning.

[wang2024wiserethinkingknowledgememory] Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen. (2024). WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models.

[wang2024easyediteasytouseknowledgeediting] Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, Huajun Chen. (2024). EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models.

[meng2023locatingeditingfactualassociations] Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. (2023). Locating and Editing Factual Associations in GPT.

[meng2023masseditingmemorytransformer] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau. (2023). Mass-Editing Memory in a Transformer.

[gupta2024modeleditingscaleleads] Akshat Gupta, Anurag Rao, Gopala Anumanchipalli. (2024). Model Editing at Scale leads to Gradual and Catastrophic Forgetting.

[yadav2023tiesmerging] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal. (2023). TIES-Merging: Resolving Interference When Merging Models. Thirty-seventh Conference on Neural Information Processing Systems.

[wang2024deepeditknowledgeeditingdecoding] Yiwei Wang, Muhao Chen, Nanyun Peng, Kai-Wei Chang. (2024). DeepEdit: Knowledge Editing as Decoding with Constraints.

[yu2023meloenhancingmodelediting] Lang Yu, Qin Chen, Jie Zhou, Liang He. (2023). MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA.

[kaplan2020scalinglawsneurallanguage] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. (2020). Scaling Laws for Neural Language Models.

[xu-etal-2021-raise] Xu, Runxin, Luo, Fuli, Zhang, Zhiyuan, Tan, Chuanqi, Chang, Baobao, Huang, Songfang, Huang, Fei. (2021). Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.749.

[hu2022lora] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2022). LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations.

[va2022DyLoRA] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, Ali Ghodsi. (2022). DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low Rank Adaptation.

[wang2024roselorarowcolumnwisesparse] Haoyu Wang, Tianci Liu, Ruirui Li, Monica Cheng, Tuo Zhao, Jing Gao. (2024). RoseLoRA: Row and Column-wise Sparse Low-rank Adaptation of Pre-trained Language Model for Knowledge Editing and Fine-tuning.

[yin2024lofitlocalizedfinetuningllm] Fangcong Yin, Xi Ye, Greg Durrett. (2024). LoFiT: Localized Fine-tuning on LLM Representations.

[zou2023representationengineeringtopdownapproach] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.

[wu2024reftrepresentationfinetuninglanguage] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts. (2024). ReFT: Representation Finetuning for Language Models.

[jovanovic2024incrementallearninglargelanguage] Mladjan Jovanovic, Peter Voss. (2024). Towards Incremental Learning in Large Language Models: A Critical Review.

[yang2024gateddeltanetworksimproving] Songlin Yang, Jan Kautz, Ali Hatamizadeh. (2024). Gated Delta Networks: Improving Mamba2 with Delta Rule.

[yu2024hoperobustparameterizationlongmemory] Annan Yu, Michael W. Mahoney, N. Benjamin Erichson. (2024). HOPE for a Robust Parameterization of Long-memory State Space Models.

[peng2023rwkvreinventingrnnstransformer] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu. (2023). RWKV: Reinventing RNNs for the Transformer Era.

[martin2018parallelizing] Eric Martin, Chris Cundy. (2018). Parallelizing Linear Recurrent Neural Nets Over Sequence Length. International Conference on Learning Representations.

[sun2024learninglearntesttime] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States.

[behrouz2024titanslearningmemorizetest] Ali Behrouz, Peilin Zhong, Vahab Mirrokni. (2024). Titans: Learning to Memorize at Test Time.

[blundell2016modelfreeepisodiccontrol] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, Demis Hassabis. (2016). Model-Free Episodic Control.

[pritzel2017neuralepisodiccontrol] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, Charles Blundell. (2017). Neural Episodic Control.

[li2023neuralepisodiccontrolstate] Zhuo Li, Derui Zhu, Yujing Hu, Xiaofei Xie, Lei Ma, Yan Zheng, Yan Song, Yingfeng Chen, Jianjun Zhao. (2023). Neural Episodic Control with State Abstraction.

[hu2021generalizableepisodicmemorydeep] Hao Hu, Jianing Ye, Guangxiang Zhu, Zhizhou Ren, Chongjie Zhang. (2021). Generalizable Episodic Memory for Deep Reinforcement Learning.

[CEMDQN-EM-RL] Srivastava, Satyam, Rathore, Heena, Tiwari, Kamlesh. (2023). CEMDQN: Cognitive-inspired Episodic Memory in Deep Q-networks. 2023 International Joint Conference on Neural Networks (IJCNN). doi:10.1109/IJCNN54540.2023.10192032.

[Gershman2017] Gershman, Samuel J., Daw, Nathaniel D.. (2017). Reinforcement Learning and Episodic Memory in Humans and Animals: An Integrative Framework. Annual Review of Psychology. doi:10.1146/annurev-psych-122414-033625.

[fortunato2020generalizationreinforcementlearnersworking] Meire Fortunato, Melissa Tan, Ryan Faulkner, Steven Hansen, Adrià Puigdomènech Badia, Gavin Buttimore, Charlie Deck, Joel Z Leibo, Charles Blundell. (2020). Generalization of Reinforcement Learners with Working and Episodic Memory.

[mombaerts2024metaknowledgeretrievalaugmented] Laurent Mombaerts, Terry Ding, Adi Banerjee, Florian Felice, Jonathan Taws, Tarik Borogovac. (2024). Meta Knowledge for Retrieval Augmented Large Language Models.

[pink2024assessingepisodicmemoryllms] Mathis Pink, Vy A. Vo, Qinyuan Wu, Jianing Mu, Javier S. Turek, Uri Hasson, Kenneth A. Norman, Sebastian Michelmann, Alexander Huth, Mariya Toneva. (2024). Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks.

[moskvichev2023narrativexllargescaledatasetlongterm] Arseny Moskvichev, Ky-Vinh Mai. (2023). NarrativeXL: A Large-scale Dataset For Long-Term Memory Models.

[li2024needlebenchllmsretrievalreasoning] Mo Li, Songyang Zhang, Yunxin Liu, Kai Chen. (2024). NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?.

[fountas2024humanlikeepisodicmemoryinfinite] Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang. (2024). Human-like Episodic Memory for Infinite Context LLMs.

[hatalis2024improvingmemoryllmagents] Hatalis, Kostas, Christou, Despina, Myers, Joshua, Jones, Steven, Lambert, Keith, Amos-Binks, Adam, Dannenhauer, Zohreh, Dannenhauer, Dustin. (2024). Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents. Proceedings of the AAAI Symposium Series. doi:10.1609/aaaiss.v2i1.27688.

[brandon2024reducing] Brandon, William, Mishra, Mayank, Nrusimha, Aniruddha, Panda, Rameswar, Kelly, Jonathan Ragan. (2024). Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. arXiv preprint arXiv:2405.12981.

[liu2024kivi] Liu, Zirui, Yuan, Jiayi, Jin, Hongye, Zhong, Shaochen, Xu, Zhaozhuo, Braverman, Vladimir, Chen, Beidi, Hu, Xia. (2024). KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.

[zheng2024sglang] Zheng, Lianmin, Yin, Liangsheng, Xie, Zhiqiang, Sun, Chuyue, Huang, Jeff, Yu, Cody Hao, Cao, Shiyi, Kozyrakis, Christos, Stoica, Ion, Gonzalez, Joseph E, others. (2024). Sglang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104.

[gu2023mamba] Gu, Albert, Dao, Tri. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[dai1901transformer] Dai, Zihang, Yang, Zhilin, Yang, Yiming, Carbonell, Jaime, Le, Quoc, Salakhutdinov, Ruslan. (2019). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P19-1285.

[lou2024sparser] Lou, Chao, Jia, Zixia, Zheng, Zilong, Tu, Kewei. (2024). Sparser is faster and less is more: Efficient sparse attention for long-range transformers. arXiv preprint arXiv:2406.16747.

[li2020linear] Li, Rui, Su, Jianlin, Duan, Chenxi, Zheng, Shunyi. (2020). Linear attention mechanism: An efficient attention for semantic segmentation. arXiv preprint arXiv:2007.14902.

[katharopoulos2020transformers] Katharopoulos, Angelos, Vyas, Apoorv, Pappas, Nikolaos, Fleuret, François. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. International conference on machine learning.

[liu2024lost] Liu, Nelson F, Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni, Fabio, Liang, Percy. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics.

[kirsten2024impact] Kirsten, Elisabeth, Habernal, Ivan, Nanda, Vedant, Zafar, Muhammad Bilal. (2024). The Impact of Inference Acceleration Strategies on Bias of LLMs. arXiv preprint arXiv:2410.22118.

[berglund2023reversal] Berglund, Lukas, Tong, Meg, Kaufmann, Max, Balesni, Mikita, Stickland, Asa Cooper, Korbak, Tomasz, Evans, Owain. (2023). The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288.

[yang2024large] Yang, Sohee, Gribovskaya, Elena, Kassner, Nora, Geva, Mor, Riedel, Sebastian. (2024). Do Large Language Models Latently Perform Multi-Hop Reasoning?. arXiv preprint arXiv:2402.16837.

[bib1] Agrawal et al. (2024) Agrawal, A., Chen, J., Íñigo Goiri, Ramjee, R., Zhang, C., Tumanov, A., and Choukse, E. Mnemosyne: Parallelization strategies for efficiently serving multi-million context length llm inference requests without approximations, 2024. URL https://arxiv.org/abs/2409.17264.

[bib2] Anonymous. Forgetting transformer: Softmax attention with a forget gate. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=q2Lnyegkr8. under review.

[bib3] Arani et al. (2022) Arani, E., Sarfraz, F., and Zonooz, B. Learning fast, learning slow: A general continual learning method based on complementary learning system, 2022. URL https://arxiv.org/abs/2201.12604.

[bib4] Baddeley, A. D. Working Memory. Clarendon Press, Oxford, UK, 1986.

[bib5] Baddeley, A. D. and Hitch, G. J. Working memory. In Bower, G. H. (ed.), The Psychology of Learning and Motivation: Advances in Research and Theory, volume 8, pp. 47–89. Academic Press, New York, 1974.

[bib6] Behrouz et al. (2024) Behrouz, A., Zhong, P., and Mirrokni, V. Titans: Learning to memorize at test time, 2024. URL https://arxiv.org/abs/2501.00663.

[bib7] Berges et al. (2024) Berges, V.-P., Oğuz, B., Haziza, D., tau Yih, W., Zettlemoyer, L., and Ghosh, G. Memory layers at scale, 2024. URL https://arxiv.org/abs/2412.09764.

[bib8] Berglund et al. (2023) Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., and Evans, O. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288, 2023.

[bib9] Bordes et al. (2015) Bordes, A., Usunier, N., Chopra, S., and Weston, J. Large-scale simple question answering with memory networks, 2015. URL https://arxiv.org/abs/1506.02075.

[bib10] Brandon et al. (2024) Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.

[bib11] Bulatov et al. (2022) Bulatov, A., Kuratov, Y., and Burtsev, M. S. Recurrent memory transformer, 2022. URL https://arxiv.org/abs/2207.06881.

[bib12] Cao et al. (2021) Cao, N. D., Aziz, W., and Titov, I. Editing factual knowledge in language models, 2021. URL https://arxiv.org/abs/2104.08164.

[bib13] Cheng et al. (2023) Cheng, X., Luo, D., Chen, X., Liu, L., Zhao, D., and Yan, R. Lift yourself up: Retrieval-augmented text generation with self-memory. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 43780–43799. Curran Associates, Inc., 2023.

[bib14] Cohen, N. J. and Squire, L. R. Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of “knowing how” and “knowing that”. Science, 210(4466):207–210, 1980.

[bib15] Colgin et al. (2008) Colgin, L., Moser, E., and Moser, M. Understanding memory through hippocampal remapping. Trends in Neurosciences, 2008.

[bib16] Collins, A. M. and Quillian, M. R. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240–247, 1969.

[bib17] Conway, M. Sensory–perceptual episodic memory and its context: Autobiographical memory. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 2001.

[bib18] Dai et al. (2019) Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285/.

[bib19] Das et al. (2024a) Das, P., Chaudhury, S., Nelson, E., Melnyk, I., Swaminathan, S., Dai, S., Lozano, A., Kollias, G., Chenthamarakshan, V., Navrátil, J., Dan, S., and Chen, P.-Y. Larimar: Large language models with episodic memory control, 2024a. URL https://arxiv.org/abs/2403.11901.

[bib20] Das et al. (2024b) Das, P., Chaudhury, S., Nelson, E., et al. Larimar: Large language models with episodic memory control. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024b.

[bib21] Duanmu et al. (2024) Duanmu, H., Yuan, Z., Li, X., Duan, J., Zhang, X., and Lin, D. Skvq: Sliding-window key and value cache quantization for large language models, 2024. URL https://arxiv.org/abs/2405.06219.

[bib22] Edge et al. (2024) Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization, 2024. URL https://arxiv.org/abs/2404.16130.

[bib23] Eichenbaum, H. The hippocampus as a cognitive map . . . of social space. Neuron, 2015.

[bib24] Eichenbaum, H. and Cohen, N. Can we reconcile the declarative memory and spatial navigation views on hippocampal function? Neuron, 2014.

[bib25] Fountas et al. (2024) Fountas, Z., Benfeghoul, M. A., Oomerjee, A., Christopoulou, F., Lampouras, G., Bou-Ammar, H., and Wang, J. Human-like episodic memory for infinite context llms, 2024. URL https://arxiv.org/abs/2407.09450.

[bib26] Gao et al. (2023) Gao, T., Yen, H., Yu, J., and Chen, D. Enabling large language models to generate text with citations, 2023. URL https://arxiv.org/abs/2305.14627.

[bib27] Ge et al. (2024) Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive kv cache compression for llms, 2024. URL https://arxiv.org/abs/2310.01801.

[bib28] Goldstein et al. (2024) Goldstein, D., Obeid, F., Alcaide, E., Song, G., and Cheah, E. Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression, 2024. URL https://arxiv.org/abs/2407.12077.

[bib29] Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. Neural turing machines, 2014. URL https://arxiv.org/abs/1410.5401.

[bib30] Graves et al. (2016) Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

[bib31] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[bib32] Gupta et al. (2024) Gupta, A., Rao, A., and Anumanchipalli, G. Model editing at scale leads to gradual and catastrophic forgetting, 2024. URL https://arxiv.org/abs/2401.07453.

[bib33] Gutiérrez et al. (2025) Gutiérrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., and Su, Y. Hipporag: Neurobiologically inspired long-term memory for large language models, 2025. URL https://arxiv.org/abs/2405.14831.

[bib34] Hampton, R. and Schwartz, B. Episodic memory in nonhumans: what, and where, is when? Current Opinion in Neurobiology, 2004.

[bib35] Han et al. (2024) Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-infinite: Zero-shot extreme length generalization for large language models. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3991–4008, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.222. URL https://aclanthology.org/2024.naacl-long.222/.

[bib36] Hooper et al. (2024) Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quantization, 2024. URL https://arxiv.org/abs/2401.18079.

[bib37] Hu et al. (2022) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.

[bib38] Kaplan et al. (2020) Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

[bib39] Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020.

[bib40] Khandelwal et al. (2020) Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH.

[bib41] Kirsten et al. (2024) Kirsten, E., Habernal, I., Nanda, V., and Zafar, M. B. The impact of inference acceleration strategies on bias of llms. arXiv preprint arXiv:2410.22118, 2024.

[bib42] Kumaran et al. (2016) Kumaran, D., Hassabis, D., and McClelland, J. L. What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, July 2016. ISSN 1364-6613. doi: 10.1016/j.tics.2016.05.004. URL http://dx.doi.org/10.1016/j.tics.2016.05.004.

[bib43] Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pp. 611–626. ACM, October 2023. doi: 10.1145/3600006.3613165. URL http://dx.doi.org/10.1145/3600006.3613165.

[bib44] Lee et al. (2023) Lee, G., Hartmann, V., Park, J., Papailiopoulos, D., and Lee, K. Prompted llms as chatbot modules for long open-domain conversation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.277. URL http://dx.doi.org/10.18653/v1/2023.findings-acl.277.

[bib45] Lee et al. (2024) Lee, W., Lee, J., Seo, J., and Sim, J. InfiniGen: Efficient generative inference of large language models with dynamic kv cache management, 2024. URL https://arxiv.org/abs/2406.19707.

[bib46] Li et al. (2020) Li, R., Su, J., Duan, C., and Zheng, S. Linear attention mechanism: An efficient attention for semantic segmentation. arXiv preprint arXiv:2007.14902, 2020.

[bib47] Li et al. (2024a) Li, S., He, Y., Guo, H., Bu, X., Bai, G., Liu, J., Liu, J., Qu, X., Li, Y., Ouyang, W., Su, W., and Zheng, B. Graphreader: Building graph-based agent to enhance long-context abilities of large language models, 2024a. URL https://arxiv.org/abs/2406.14550.

[bib48] Li et al. (2024b) Li, Y., Wen, H., Wang, W., Li, X., Yuan, Y., Liu, G., Liu, J., Xu, W., Wang, X., Sun, Y., Kong, R., Wang, Y., Geng, H., Luan, J., Jin, X., Ye, Z., Xiong, G., Zhang, F., Li, X., Xu, M., Li, Z., Li, P., Liu, Y., Zhang, Y.-Q., and Liu, Y. Personal llm agents: Insights and survey about the capability, efficiency and security, 2024b. URL https://arxiv.org/abs/2401.05459.

[bib49] Liao & Losonczy (2024) Liao, Z. and Losonczy, A. Learning, fast and slow: Single- and many-shot learning in the hippocampus. Annual Review of Neuroscience, 2024.

[bib50] Lin et al. (2024) Lin, B., Zhang, C., Peng, T., Zhao, H., Xiao, W., Sun, M., Liu, A., Zhang, Z., Li, L., Qiu, X., Li, S., Ji, Z., Xie, T., Li, Y., and Lin, W. Infinite-LLM: Efficient llm service for long context with DistAttention and distributed KVCache, 2024. URL https://arxiv.org/abs/2401.02669.

[bib51] Lin et al. (2023) Lin, C.-C., Huang, A. Y., and Lu, O. H. Artificial intelligence in intelligent tutoring systems toward sustainable education: a systematic review. Smart Learning Environments, 10(1):41, 2023.

[bib52] Liu et al. (2024a) Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. MiniCache: KV cache compression in depth dimension for large language models, 2024a. URL https://arxiv.org/abs/2405.14366.

[bib53] Liu et al. (2023a) Liu, L., Yang, X., Shen, Y., Hu, B., Zhang, Z., Gu, J., and Zhang, G. Think-in-memory: Recalling and post-thinking enable llms with long-term memory, 2023a. URL https://arxiv.org/abs/2311.08719.

[bib54] Liu et al. (2024b) Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.

[bib55] Liu et al. (2023b) Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 52342–52364. Curran Associates, Inc., 2023b.

[bib56] Liu et al. (2024c) Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024c.

[bib57] Lou et al. (2024) Lou, C., Jia, Z., Zheng, Z., and Tu, K. Sparser is faster and less is more: Efficient sparse attention for long-range transformers. arXiv preprint arXiv:2406.16747, 2024.

[bib58] Lu et al. (2023) Lu, J., An, S., Lin, M., Pergola, G., He, Y., Yin, D., Sun, X., and Wu, Y. MemoChat: Tuning llms to use memos for consistent long-range open-domain conversation, 2023. URL https://arxiv.org/abs/2308.08239.

[bib59] Mayes & Roberts (2001) Mayes, A. and Roberts, N. Theories of episodic memory. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 2001.

[bib60] McClelland et al. (1995) McClelland, J., McNaughton, B., and O’Reilly, R. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 1995.

[bib61] Meng et al. (2023a) Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT, 2023a. URL https://arxiv.org/abs/2202.05262.

[bib62] Meng et al. (2023b) Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer, 2023b. URL https://arxiv.org/abs/2210.07229.

[bib63] Michelmann et al. (2023) Michelmann, S., Kumar, M., Norman, K. A., and Toneva, M. Large language models can segment narrative events similarly to humans, 2023. URL https://arxiv.org/abs/2301.10297.

[bib64] Milner (1962) Milner, B. Les troubles de la mémoire accompagnant des lésions hippocampiques bilatérales [Memory impairment accompanying bilateral hippocampal lesions]. Psychologie Médicale, 51:39–52, 1962.

[bib65] Mitchell et al. (2022) Mitchell, E., Lin, C., Bosselut, A., Manning, C. D., and Finn, C. Memory-based model editing at scale, 2022. URL https://arxiv.org/abs/2206.06520.

[bib66] Modarressi et al. (2025) Modarressi, A., Köksal, A., Imani, A., Fayyaz, M., and Schütze, H. MemLLM: Finetuning llms to use an explicit read-write memory, 2025. URL https://arxiv.org/abs/2404.11672.

[bib67] Mombaerts et al. (2024) Mombaerts, L., Ding, T., Banerjee, A., Felice, F., Taws, J., and Borogovac, T. Meta knowledge for retrieval augmented large language models, 2024. URL https://arxiv.org/abs/2408.09017.

[bib68] Nawrot et al. (2024) Nawrot, P., Łańcucki, A., Chochowski, M., Tarjan, D., and Ponti, E. M. Dynamic memory compression: Retrofitting llms for accelerated inference, 2024. URL https://arxiv.org/abs/2403.09636.

[bib69] O’Reilly & Norman (2002) O’Reilly, R. and Norman, K. Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences, 2002.

[bib70] O’Reilly et al. (2014) O’Reilly, R., Bhattacharyya, R., Howard, M., and Ketz, N. Complementary learning systems. Cognitive Science, 2014.

[bib71] O’Keefe & Nadel (1978) O’Keefe, J. and Nadel, L. The Hippocampus as a Cognitive Map. Oxford: Clarendon Press, 1978.

[bib72] Packer et al. (2024) Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. MemGPT: Towards llms as operating systems, 2024. URL https://arxiv.org/abs/2310.08560.

[bib73] Padmanabhan et al. (2023) Padmanabhan, S., Onoe, Y., Zhang, M., Durrett, G., and Choi, E. Propagating knowledge updates to lms through distillation. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 47124–47142. Curran Associates, Inc., 2023.

[bib74] Pang et al. (2024) Pang, J., Ye, F., Wong, D. F., He, X., Chen, W., and Wang, L. Anchor-based large language models, 2024. URL https://arxiv.org/abs/2402.07616.

[bib75] Peng et al. (2023) Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., He, X., Hou, H., Lin, J., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A., Song, G., Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R., Zhang, Z., Zhao, Q., Zhou, P., Zhou, Q., Zhu, J., and Zhu, R.-J. RWKV: Reinventing RNNs for the transformer era, 2023. URL https://arxiv.org/abs/2305.13048.

[bib76] Peng et al. (2024) Peng, B., Zhu, Y., Liu, Y., Bo, X., Shi, H., Hong, C., Zhang, Y., and Tang, S. Graph retrieval-augmented generation: A survey, 2024. URL https://arxiv.org/abs/2408.08921.

[bib77] Pink et al. (2024) Pink, M., Vo, V. A., Wu, Q., Mu, J., Turek, J. S., Hasson, U., Norman, K. A., Michelmann, S., Huth, A., and Toneva, M. Assessing episodic memory in llms with sequence order recall tasks, 2024. URL https://arxiv.org/abs/2410.08133.

[bib78] Schmidgall et al. (2025) Schmidgall, S., Su, Y., Wang, Z., Sun, X., Wu, J., Yu, X., Liu, J., Liu, Z., and Barsoum, E. Agent laboratory: Using llm agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.

[bib79] Schwartz & Evans (2001) Schwartz, B. and Evans, S. Episodic memory in primates. American Journal of Primatology: Official Journal of the American Society of Primatologists, 2001.

[bib80] Snell et al. (2022) Snell, C., Klein, D., and Zhong, R. Learning by distilling context, 2022. URL https://arxiv.org/abs/2209.15189.

[bib81] Squire & Zola (1996) Squire, L. and Zola, S. Structure and function of declarative and nondeclarative memory systems. PNAS, 1996.

[bib82] Sugar & Moser (2019) Sugar, J. and Moser, M. Episodic memory: Neuronal codes for what, where, and when. Hippocampus, 2019.

[bib83] Sukhbaatar et al. (2015) Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. End-to-end memory networks. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://arxiv.org/abs/1503.08895.

[bib84] Sun et al. (2024a) Sun, Y., Dong, L., Zhu, Y., Huang, S., Wang, W., Ma, S., Zhang, Q., Wang, J., and Wei, F. You only cache once: Decoder-decoder architectures for language models, 2024a. URL https://arxiv.org/abs/2405.05254.

[bib85] Sun et al. (2024b) Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., and Guestrin, C. Learning to (learn at test time): Rnns with expressive hidden states, 2024b. URL https://arxiv.org/abs/2407.04620.

[bib86] Tan et al. (2024) Tan, C., Zhang, G., and Fu, J. Massive editing for large language models via meta learning, 2024. URL https://arxiv.org/abs/2311.04661.

[bib87] Tang et al. (2024) Tang, H., Lin, Y., Lin, J., Han, Q., Hong, S., Yao, Y., and Wang, G. RazorAttention: Efficient KV cache compression through retrieval heads, 2024. URL https://arxiv.org/abs/2407.15891.

[bib88] Tulving (1972) Tulving, E. Episodic and semantic memory. In Tulving, E. and Donaldson, W. (eds.), Organization of Memory, pp. 381–403. Academic Press, New York, 1972.

[bib89] Valipour et al. (2022) Valipour, M., Rezagholizadeh, M., Kobyzev, I., and Ghodsi, A. DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low rank adaptation, 2022.

[bib90] Wang et al. (2024a) Wang, B., Liang, X., Yang, J., Huang, H., Wu, S., Wu, P., Lu, L., Ma, Z., and Li, Z. Enhancing large language model with self-controlled memory framework, 2024a. URL https://arxiv.org/abs/2304.13343.

[bib91] Wang et al. (2024b) Wang, H., Liu, T., Li, R., Cheng, M., Zhao, T., and Gao, J. RoseLoRA: Row and column-wise sparse low-rank adaptation of pre-trained language model for knowledge editing and fine-tuning, 2024b. URL https://arxiv.org/abs/2406.10777.

[bib92] Wang et al. (2025) Wang, J., Paliotta, D., May, A., Rush, A. M., and Dao, T. The mamba in the llama: Distilling and accelerating hybrid models, 2025. URL https://arxiv.org/abs/2408.15237.

[bib93] Wang et al. (2024c) Wang, P., Li, Z., Zhang, N., Xu, Z., Yao, Y., Jiang, Y., Xie, P., Huang, F., and Chen, H. WISE: Rethinking the knowledge memory for lifelong model editing of large language models, 2024c. URL https://arxiv.org/abs/2405.14768.

[bib94] Wu et al. (2022a) Wu, Q., Lan, Z., Qian, K., Gu, J., Geramifard, A., and Yu, Z. Memformer: A memory-augmented transformer for sequence modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pp. 308–318. Association for Computational Linguistics, November 2022a. doi: 10.18653/v1/2022.findings-aacl.29. URL https://aclanthology.org/2022.findings-aacl.29/.

[bib95] Wu et al. (2018) Wu, Y., Wayne, G., Graves, A., and Lillicrap, T. The kanerva machine: A generative distributed memory. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1HlA-ZAZ.

[bib96] Wu et al. (2022b) Wu, Y., Rabe, M. N., Hutchins, D., and Szegedy, C. Memorizing transformers. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=TrjbxzRcnf-.

[bib97] Wu et al. (2024) Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., and Potts, C. ReFT: Representation finetuning for language models, 2024. URL https://arxiv.org/abs/2404.03592.

[bib98] Xi et al. (2023) Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huang, X., and Gui, T. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864.

[bib99] Xiao et al. (2024) Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. InfLLM: Training-free long-context extrapolation for llms with an efficient context memory, 2024. URL https://arxiv.org/abs/2402.04617.

[bib100] Xu et al. (2021) Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., and Huang, F. Raise a child in large language model: Towards effective and generalizable fine-tuning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9514–9528, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.749. URL https://aclanthology.org/2021.emnlp-main.749/.

[bib101] Yadav et al. (2023) Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M. TIES-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xtaX3WyCj1.

[bib102] Yang et al. (2024a) Yang, D., Han, X., Gao, Y., Hu, Y., Zhang, S., and Zhao, H. PyramidInfer: Pyramid KV cache compression for high-throughput llm inference, 2024a. URL https://arxiv.org/abs/2405.12532.

[bib103] Yang et al. (2024b) Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024b.

[bib104] Ye et al. (2024a) Ye, L., Tao, Z., Huang, Y., and Li, Y. ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition, 2024a. URL https://arxiv.org/abs/2402.15220.

[bib105] Ye et al. (2024b) Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer, 2024b. URL https://arxiv.org/abs/2410.05258.

[bib106] Yen et al. (2024) Yen, H., Gao, T., and Chen, D. Long-context language modeling with parallel context encoding, 2024. URL https://arxiv.org/abs/2402.16617.

[bib107] Yin et al. (2024) Yin, F., Ye, X., and Durrett, G. Lofit: Localized fine-tuning on llm representations, 2024. URL https://arxiv.org/abs/2406.01563.

[bib108] Yogatama et al. (2021) Yogatama, D., de Masson d’Autume, C., and Kong, L. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics, 9:362–373, 04 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00371. URL https://doi.org/10.1162/tacl_a_00371.

[bib109] Yu et al. (2023) Yu, L., Chen, Q., Zhou, J., and He, L. MELO: Enhancing model editing with neuron-indexed dynamic lora, 2023. URL https://arxiv.org/abs/2312.11795.

[bib110] Yu et al. (2024) Yu, T., Xu, A., and Akkiraju, R. In defense of rag in the era of long-context language models, 2024. URL https://arxiv.org/abs/2409.01666.

[bib111] Yue et al. (2024) Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. WKVQuant: Quantizing weight and key/value cache for large language models gains more, 2024. URL https://arxiv.org/abs/2402.12065.

[bib112] Zheng et al. (2024) Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., et al. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024.

[bib113] Zhong et al. (2024) Zhong, W., Guo, L., Gao, Q., Ye, H., and Wang, Y. MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), Mar. 2024. doi: 10.1609/aaai.v38i17.29946. URL https://ojs.aaai.org/index.php/AAAI/article/view/29946.