: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You, Yifan Zhou, Ruidong Zhang, Lin Zhao, Yohannes Abate, Tianming Liu
Abstract
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing & Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.
Synapse: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang 1 * , Junhao Chen 1 * , Yi Pan 1 , Ling Chen 2 , Weihang You 1 , Yifan Zhou 1 , Ruidong Zhang 1 , Lin Zhao 3 , Yohannes Abate 4 , Tianming Liu 1 †
1 School of Computing, University of Georgia, Athens
2 Department of Biosystems Engineering and Soil Science, University of Tennessee, Knoxville 3 Department of Biomedical Engineering, New Jersey Institute of Technology, Newark 4 Department of Physics and Astronomy, The University of Georgia, Athens
While Large Language Models (LLMs) excel at generalized reasoning, standard retrievalaugmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce SYNAPSE ( Syn ergistic A ssociative P rocessing & S emantic E ncoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, SYNAPSE models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that SYNAPSE significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.
Introduction
The evolution of Large Language Models (LLMs) from static responders to autonomous agents necessitates a fundamental rethinking of memory architecture (Park et al., 2023; Yao et al., 2023; Schick et al., 2023). While LLMs demonstrate remarkable reasoning within finite context windows, their agency is brittle without the ability to accumulate experiences and maintain narrative coherence over long horizons (Gutiérrez et al., 2024; Izacard et al., 2023). The predominant solution, RetrievalAugmented Generation (RAG) (Lewis et al., 2020), externalizes history into vector databases, retrieving information based on semantic similarity (Guu
- Equal contribution
† Corresponding author
et al., 2020; Asai et al., 2024). While effective for factual lookup (Borgeaud et al., 2022), standard RAG imposes a critical limitation on reasoning agents: it treats memory as a static library to be indexed, rather than a dynamic network to be reasoned over (Gutiérrez et al., 2024; Zhu et al., 2025).
We argue that existing systems suffer from Contextual Isolation , a failure mode stemming from the implicit Search Assumption : that the relevance of a past memory is strictly determined by its semantic proximity to the current query (Zhu et al., 2025; Edge et al., 2025; Sarthi et al., 2024). This assumption collapses in scenarios requiring causal or transitive reasoning. Consider a user asking, 'Why am I feeling anxious today?' . A vector-based system might retrieve recent mentions of 'anxiety,' but fail to surface a schedule conflict logged weeks prior. Although this conflict is the root cause, it shares no lexical or embedding overlap with the query. While hierarchical frameworks such as MemGPT (Packer et al., 2024) improve context management, they remain bound by querydriven retrieval, unable to autonomously surface structurally related yet semantically distinct information.
To bridge this gap, we draw inspiration from cognitive science theories of Spreading Activation (Collins and Loftus, 1975; Anderson, 1983), which posit that human memory retrieval is not a search process, but a propagation of energy. Accessing one concept naturally activates semantically, temporally, or causally linked concepts without explicit prompting.
We introduce SYNAPSE, a brain-inspired architecture that reimagines agentic memory. Unlike flat vector stores, SYNAPSE constructs a Unified Episodic-Semantic Graph, where raw interaction logs (episodic nodes) are synthesized into abstract concepts (semantic nodes). Retrieval in SYNAPSE is governed by activation dynamics: input signals inject energy into the graph, which propagates
through temporal and causal edges. This mechanism enables the system to prioritize memories that are structurally salient to the current context, such as the aforementioned schedule conflict, even when direct semantic similarity is absent. To ensure focus, we implement lateral inhibition, a biological mechanism that suppresses irrelevant distractors.
We evaluate SYNAPSE on the rigorous LoCoMo benchmark (Maharana et al., 2024), which involves long-horizon dialogues averaging 16K tokens. SYNAPSE establishes a new state-of-theart (SOTA), significantly outperforming traditional RAG and recent agentic memory systems. Notably, our activation-based approach improves accuracy on complex multi-hop reasoning tasks by up to 23% while reducing token consumption by 95% compared to full-context methods.
· Unified Episodic-Semantic Graph: We propose a dual-layer topology that synergizes granular interaction logs with synthesized abstract concepts, addressing the structural fragmentation inherent in flat vector stores. · Cognitive Dynamics with Uncertainty Gating: We introduce a retrieval mechanism governed by spreading activation and lateral inhibition to prioritize implicit relevance, coupled with a "feeling of knowing" protocol that robustly rejects hallucinations. · SOTAPerformance & Efficiency: SYNAPSE establishes a new state-of-the-art on the LoCoMo benchmark (+7.2 F1), improving multihop reasoning accuracy by 23% while reducing token consumption by 95% compared to full-context methods.
Related Work
Memory Allocation Capabilities
Systems such as MemGPT (Packer et al., 2024), MemoryOS (Li et al., 2025), and LangMem (LangChain Team, 2024) address context limitations by optimizing memory placement via policy-based controllers or hierarchical buffers (Lewis et al., 2020; Nafee et al., 2025; Guu et al., 2020). However, these approaches treat memory items as independent textual units, lacking the mechanisms to model causal or structural relationships during retrieval (Khandelwal et al., 2020). Consequently, they cannot recover linked memories absent surface-level similarity. In contrast, SYNAPSE shifts the focus from storage management to reasoning, where relevance propagates through a structured network rather than relying on independent item retrieval.
Graph-Based and Structured Memory
Recent works introduce structure into agentic memory via explicit linking. A-Mem (Xu et al., 2025) and AriGraph (Anokhin et al., 2025) utilize LLMs to maintain dynamic knowledge graphs, while HippoRAG (Gutiérrez et al., 2024) adapts Personal PageRank for retrieval. Crucially, methods like GraphRAG (Edge et al., 2025) optimize for global sense-making via community detection, summarizing entire datasets at high computational cost. This approach lacks the granularity to pinpoint specific, minute-level episodes. In contrast, SYNAPSE integrates cognitive dynamics (ACT-R) to strictly prioritize local relevance. By propagating activation along specific transitive paths (A → B → C) from query anchors, we recover precise context without traversing the global structure. This "biologically plausible" constraint-specifically the fan effect and inhibition-is not merely rhetorical but architectural: it enforces sparsity and competition, solving the "Hub Explosion" problem that plagues standard random-walk approaches in dense semantic graphs.
Semantic Similarity and Relational Retrieval
Standard retrieval methods like RAG and MemoryBank (Zhong et al., 2024) rely fundamentally on vector similarity (Karpukhin et al., 2020; Khattab and Zaharia, 2020), representing memories as isolated points in embedding space (Hu et al., 2025). Consequently, they struggle with queries requiring causal bridging between semantically dissimilar or distant events (Yang et al., 2018; Qi et al., 2019; Trivedi et al., 2022; Thorne et al., 2018). SYNAPSE overcomes this by encoding relationships as graph edges, enabling retrieval via relational paths (Sun et al., 2018).
Drawing from cognitive Spreading Activation theory (Collins and Loftus, 1975; Anderson, 1983) and ACT-R architectures (Anderson, 1983), we address the limitation of "seed dependence" in existing graph systems. While prior methods fail if the initial vector search misses the relevant subgraph (i.e., a "bad seed"), SYNAPSE uses spreading activation to dynamically recover from suboptimal
seeds, propagating energy to relevant contexts even under weak initial semantic overlap.
Methodology
Building on the cognitive foundations outlined above, we now present SYNAPSE, an agentic memory architecture that addresses Contextual Isolation through dynamic activation propagation. Our key insight is that relevance should emerge from distributed graph dynamics rather than being precomputed through static links or determined solely by vector similarity. The overall framework of our proposed method is detailed in Figure 1.
Unified Episodic-Semantic Graph
We formulate the agent's memory as a directed graph G = ( V , E ) . To capture both specific experiences and generalized knowledge, the vertex set V is partitioned into Episodic Nodes ( V E ) and Semantic Nodes ( V S ).
Node Construction. Each episodic node v e i ∈ V E encapsulates a distinct interaction turn, represented as a tuple ( c i , h i , τ i ) , where c i is the textual content, h i ∈ R d is the dense embedding produced by a sentence encoder ( all-MiniLM-L6-v2 ), and τ i is the timestamp. Semantic nodes v s j ∈ V S represent abstract concepts (e.g., entities, preferences) extracted by the LLM via prompted entity/concept extraction triggered every N = 5 turns. Duplicate detection uses embedding similarity with threshold τ dup = 0 . 92 . The complete graph construction algorithm is provided in Appendix A.1.
Topology. The edges E define the retrieval pathways: (i) Temporal Edges link sequential episodes ( v e t → v e t +1 ); (ii) Abstraction Edges bidirectionally connect episodes to relevant concepts within the same consolidation window ( N = 5 ). This temporal association allows bridging concepts (e.g., "Mark" ↔ "Ski Trip") via co-occurrence even without direct semantic similarity, enabling the "Bridge Node" effect (Figure 1); (iii) Association Edges model latent correlations between concepts.
Graph Maintenance and Scalability. To prevent quadratic graph growth ( O ( |V| 2 ) ) in longhorizon deployments, we enforce strict sparsity constraints: (1) Edge Pruning : Each node is limited to its TopK incoming edges (default K = 15 ); (2) Node Garbage Collection : Nodes with activation consistently below a dormancy threshold ϵ = 0 . 01 for W = 10 windows are archived to disk. This ensures the active graph remains compact ( |V| ≤ 10 , 000 ) while preserving retrieval speed.
Node Construction.
The evolution of Large Language Models (LLMs) from static responders to autonomous agents necessitates a fundamental rethinking of memory architecture (Park et al., 2023; Yao et al., 2023; Schick et al., 2023). While LLMs demonstrate remarkable reasoning within finite context windows, their agency is brittle without the ability to accumulate experiences and maintain narrative coherence over long horizons (Gutiérrez et al., 2024; Izacard et al., 2023). The predominant solution, RetrievalAugmented Generation (RAG) (Lewis et al., 2020), externalizes history into vector databases, retrieving information based on semantic similarity (Guu
- Equal contribution
† Corresponding author
et al., 2020; Asai et al., 2024). While effective for factual lookup (Borgeaud et al., 2022), standard RAG imposes a critical limitation on reasoning agents: it treats memory as a static library to be indexed, rather than a dynamic network to be reasoned over (Gutiérrez et al., 2024; Zhu et al., 2025).
We argue that existing systems suffer from Contextual Isolation , a failure mode stemming from the implicit Search Assumption : that the relevance of a past memory is strictly determined by its semantic proximity to the current query (Zhu et al., 2025; Edge et al., 2025; Sarthi et al., 2024). This assumption collapses in scenarios requiring causal or transitive reasoning. Consider a user asking, 'Why am I feeling anxious today?' . A vector-based system might retrieve recent mentions of 'anxiety,' but fail to surface a schedule conflict logged weeks prior. Although this conflict is the root cause, it shares no lexical or embedding overlap with the query. While hierarchical frameworks such as MemGPT (Packer et al., 2024) improve context management, they remain bound by querydriven retrieval, unable to autonomously surface structurally related yet semantically distinct information.
To bridge this gap, we draw inspiration from cognitive science theories of Spreading Activation (Collins and Loftus, 1975; Anderson, 1983), which posit that human memory retrieval is not a search process, but a propagation of energy. Accessing one concept naturally activates semantically, temporally, or causally linked concepts without explicit prompting.
We introduce SYNAPSE, a brain-inspired architecture that reimagines agentic memory. Unlike flat vector stores, SYNAPSE constructs a Unified Episodic-Semantic Graph, where raw interaction logs (episodic nodes) are synthesized into abstract concepts (semantic nodes). Retrieval in SYNAPSE is governed by activation dynamics: input signals inject energy into the graph, which propagates
through temporal and causal edges. This mechanism enables the system to prioritize memories that are structurally salient to the current context, such as the aforementioned schedule conflict, even when direct semantic similarity is absent. To ensure focus, we implement lateral inhibition, a biological mechanism that suppresses irrelevant distractors.
We evaluate SYNAPSE on the rigorous LoCoMo benchmark (Maharana et al., 2024), which involves long-horizon dialogues averaging 16K tokens. SYNAPSE establishes a new state-of-theart (SOTA), significantly outperforming traditional RAG and recent agentic memory systems. Notably, our activation-based approach improves accuracy on complex multi-hop reasoning tasks by up to 23% while reducing token consumption by 95% compared to full-context methods.
· Unified Episodic-Semantic Graph: We propose a dual-layer topology that synergizes granular interaction logs with synthesized abstract concepts, addressing the structural fragmentation inherent in flat vector stores. · Cognitive Dynamics with Uncertainty Gating: We introduce a retrieval mechanism governed by spreading activation and lateral inhibition to prioritize implicit relevance, coupled with a "feeling of knowing" protocol that robustly rejects hallucinations. · SOTAPerformance & Efficiency: SYNAPSE establishes a new state-of-the-art on the LoCoMo benchmark (+7.2 F1), improving multihop reasoning accuracy by 23% while reducing token consumption by 95% compared to full-context methods.
Topology.
Building on the cognitive foundations outlined above, we now present SYNAPSE, an agentic memory architecture that addresses Contextual Isolation through dynamic activation propagation. Our key insight is that relevance should emerge from distributed graph dynamics rather than being precomputed through static links or determined solely by vector similarity. The overall framework of our proposed method is detailed in Figure 1.
Graph Maintenance and Scalability.
Recent works introduce structure into agentic memory via explicit linking. A-Mem (Xu et al., 2025) and AriGraph (Anokhin et al., 2025) utilize LLMs to maintain dynamic knowledge graphs, while HippoRAG (Gutiérrez et al., 2024) adapts Personal PageRank for retrieval. Crucially, methods like GraphRAG (Edge et al., 2025) optimize for global sense-making via community detection, summarizing entire datasets at high computational cost. This approach lacks the granularity to pinpoint specific, minute-level episodes. In contrast, SYNAPSE integrates cognitive dynamics (ACT-R) to strictly prioritize local relevance. By propagating activation along specific transitive paths (A → B → C) from query anchors, we recover precise context without traversing the global structure. This "biologically plausible" constraint-specifically the fan effect and inhibition-is not merely rhetorical but architectural: it enforces sparsity and competition, solving the "Hub Explosion" problem that plagues standard random-walk approaches in dense semantic graphs.
Cognitive Dynamics: Spreading Activation
Inspired by human semantic memory models (Collins and Loftus, 1975), we implement a dynamic activation process to prioritize information.
Initialization. Given a query q , we identify a set of anchor nodes T via a dual-trigger mechanism: (1) Lexical Trigger : We use BM25 sparse retrieval to capture exact entity matches (e.g., proper nouns like "Kendall"), ensuring precision for named entities; (2) Semantic Trigger : We use dense retrieval ( all-MiniLM-L6-v2 ) to capture conceptual similarity (e.g., "Ski Trip"), maximizing recall for thematic queries. The union of Topk nodes from both streams forms the anchor set T . An initial activation vector a (0) is computed, where energy is injected only into anchors:
$$
$$
where sim ( · ) denotes cosine similarity and α is a scaling hyperparameter.
Propagation with Fan Effect. Following ACTR (Anderson, 1983), we incorporate the fan effect to model attention dilution. The raw activation potential u ( t +1) i is:
$$
$$
where S = 0 . 8 is the spreading factor, fan ( j ) = deg out ( j ) is the out-degree, and w ji represents edge weight: w ji = e -ρ | τ i -τ j | for temporal edges (with time decay ρ = 0 . 01 ) and w ji = sim ( h i , h j ) for semantic edges.
Lateral Inhibition. To model attentional selection, highly activated concepts inhibit competitors before firing. We apply inhibition to the potential u i :
$$
$$

Figure 1: Overview of the SYNAPSE architecture. (Left) Auser query regarding "that guy from the ski trip" activates the graph via Dual Triggers: Lexical matching targets explicit entities ("Kendall"), while Semantic embedding targets implicit concepts ("Ski Trip"). (Center) Spreading Activation dynamically propagates relevance through the Unified Episodic-Semantic Graph. Note how the bridge node "Mark" (purple) is activated despite not appearing in the query, connecting the disjoint concepts of "Ski Trip" and "Dating". (Right) The Triple Hybrid Scoring layer reranks candidates, successfully retrieving the ground truth ("broke up with Mark") while suppressing semantically similar but logically irrelevant distractors ("going skiing") via lateral inhibition.
where T M is the set of M highest-potential nodes (default M = 7 ) to enforce sparsity.
Sigmoid Activation. The inhibited potential is transformed into the final firing rate:
$$
$$
The cycle proceeds strictly as: Propagation (Eq. 2) → Lateral Inhibition (Eq. 3) → Non-linear Activation (Eq. 4). Stability is reached within T = 3 iterations.
Initialization.
While SYNAPSE creates a new Pareto frontier for agentic memory, several limitations warrant discussion, outlining clear directions for future research.
Algorithmic Trade-offs and Scope. First, the mechanisms that enable SYNAPSE to excel at complex reasoning introduce specific trade-offs. One notable limitation is the Cold Start problem: the efficacy of spreading activation relies on a sufficiently connected topology. In nascent conversations with sparse history, the computational overhead of graph maintenance provides diminishing returns compared to simple linear buffers.
Additionally, lateral inhibition can occasionally lead to Cognitive Tunneling, causing performance drops on simple queries where exhaustive retrieval is superior. Finally, our current evaluation is constrained to the text modality via the LoCoMo benchmark. Since embodied agents increasingly require processing visual and auditory cues, a key direction for future work is extending SYNAPSE to Multimodal Episodic Memory. By leveraging aligned embedding spaces, we aim to incorporate image and audio nodes into the unified graph, enabling structural reasoning across diverse modalities.
Dependency on Foundation Models. Our framework exhibits a dual dependency on LLM capabilities. On the upstream side, the topology of the Unified Graph is tightly coupled with the extraction quality of the underlying LLM. While GPT4o-mini demonstrates robust schema adherence, smaller local models may struggle with consistent entity extraction, potentially leading to error propagation. On the downstream side, we rely on LLMas-a-Judge for semantic evaluation. While we mitigate bias by separating the judge from the generator, model-based evaluation can still favor certain stylistic patterns. However, given the demonstrated failure of n -gram metrics (Table 11), we maintain this is a necessary trade-off for accurate assessment.
Privacy and Long-Term Safety. Persistent graph structures introduce distinct privacy risks compared to ephemeral context windows. Centralized storage of semantic profiles creates a vector for "Memory Poisoning," where erroneous facts or malicious injections could permanently corrupt the knowledge store. Moreover, the indefinite retention of user data raises compliance concerns. Future iterations will focus on Automated Graph Auditing to detect inconsistencies and User-Controlled Forgetting (Machine Unlearning) mechanisms to ensure privacy compliance and robust memory maintenance.
Propagation with Fan Effect.
Lateral Inhibition.
Sigmoid Activation.
While SYNAPSE creates a new Pareto frontier for agentic memory, several limitations warrant discussion, outlining clear directions for future research.
Algorithmic Trade-offs and Scope. First, the mechanisms that enable SYNAPSE to excel at complex reasoning introduce specific trade-offs. One notable limitation is the Cold Start problem: the efficacy of spreading activation relies on a sufficiently connected topology. In nascent conversations with sparse history, the computational overhead of graph maintenance provides diminishing returns compared to simple linear buffers.
Additionally, lateral inhibition can occasionally lead to Cognitive Tunneling, causing performance drops on simple queries where exhaustive retrieval is superior. Finally, our current evaluation is constrained to the text modality via the LoCoMo benchmark. Since embodied agents increasingly require processing visual and auditory cues, a key direction for future work is extending SYNAPSE to Multimodal Episodic Memory. By leveraging aligned embedding spaces, we aim to incorporate image and audio nodes into the unified graph, enabling structural reasoning across diverse modalities.
Dependency on Foundation Models. Our framework exhibits a dual dependency on LLM capabilities. On the upstream side, the topology of the Unified Graph is tightly coupled with the extraction quality of the underlying LLM. While GPT4o-mini demonstrates robust schema adherence, smaller local models may struggle with consistent entity extraction, potentially leading to error propagation. On the downstream side, we rely on LLMas-a-Judge for semantic evaluation. While we mitigate bias by separating the judge from the generator, model-based evaluation can still favor certain stylistic patterns. However, given the demonstrated failure of n -gram metrics (Table 11), we maintain this is a necessary trade-off for accurate assessment.
Privacy and Long-Term Safety. Persistent graph structures introduce distinct privacy risks compared to ephemeral context windows. Centralized storage of semantic profiles creates a vector for "Memory Poisoning," where erroneous facts or malicious injections could permanently corrupt the knowledge store. Moreover, the indefinite retention of user data raises compliance concerns. Future iterations will focus on Automated Graph Auditing to detect inconsistencies and User-Controlled Forgetting (Machine Unlearning) mechanisms to ensure privacy compliance and robust memory maintenance.
Triple-Signal Hybrid Retrieval
To maximize recall in open-domain QA tasks, we propose a hybrid scoring function that fuses semantic, contextual, and structural signals. The relevance score S ( v i ) is defined as:
$$
$$
The Topk nodes (default k = 30 ) are retrieved and re-ordered topologically. Factor scores are cached and updated only during consolidation ( N = 5 turns) to maintain query latency independent of history length T . Crucially, these components serve orthogonal roles: (1) PageRank acts as a Global Structural Prior , prioritizing universally important hubs (e.g., main characters) independent of the specific query; (2) Activation acts as a Local Contextual Signal , propagating query-specific relevance. Sensitivity analysis indicates robustness to λ 3 ∈ [0 . 1 , 0 . 3] , confirming PageRank's role as a stable prior. This decoupling ensures that novel but locally relevant details are not drowned out by global hubs.
Uncertainty-Aware Rejection
To robustly handle adversarial queries about nonexistent entities, SYNAPSE integrates a MetaCognitive Verification layer inspired by the "Feeling of Knowing" (FOK) in human memory monitoring. This mechanism operates via a dual-stage cognitive gating protocol:
Confidence-Based Gating We model retrieval confidence C ret as the activation energy of the top-ranked node. If C ret < τ gate (calibrated to τ gate = 0 . 12 ), the system activates a negative acknowledgement protocol, preemptively rejecting the query. This mirrors the brain's ability to rapidly inhibit response generation when memory traces are insufficient.
Explicit Verification Prompting For borderline cases effectively passing the gate, we employ a verification prompt that enforces a "strict evidence" constraint on the LLM: 'Is this EXPLICITLY mentioned? If not, output 'Not mentioned'.' This forces the generator to distinguish between parametric knowledge hallucination and grounded retrieval.
Confidence-Based Gating
Explicit Verification Prompting
We employ a structured extraction approach to synthesize semantic nodes from episodic context. The extraction prompt follows a schema-guided paradigm, as shown in Figure 3.
Experiments
Experimental Setup
Benchmark Dataset. We evaluate SYNAPSE on the LOCOMO benchmark (Maharana et al., 2024), a rigorous testbed for long-term conversational memory. Unlike standard datasets (e.g., MultiSession Chat) with short contexts ( ∼ 1K tokens), LoCoMo features extensive dialogues averaging 16K tokens across up to 35 sessions. We report the F1 Score and BLEU-1 Score across five cognitive categories: Single-Hop ( C 1 ), Temporal ( C 2 ), Open-Domain ( C 3 ), Multi-Hop ( C 4 ), and Adversarial ( C 5 ).
Baselines. To rigorously position SYNAPSE, we benchmark against ten state-of-the-art methods spanning four distinct memory paradigms: Systemlevel, Graph-based, Retrieval-based, and Agentic/Compression. We explicitly prioritized baselines designed for autonomous agentic memory -systems capable of stateful updates and continuous learning. We explicitly distinguish between static RAG (designed for fixed corpora) and agentic memory (designed for evolving interaction). While methods like HippoRAG (Gutiérrez et al., 2024) utilize similar graph propagation, they are optimized for static pre-indexed corpora and lack the incremental update ( O (1) write) and time-decay mechanisms required for continuous agentic dialogue. Thus, they are incompatible with the online read-write nature of the LoCoMo benchmark. Please refer to Appendix Table B and Table 5 for the complete taxonomy.
Implementation Details. For SYNAPSE, we utilize all-MiniLM-L6-v2 for embedding generation (dim=384). The Spreading Activation propagates for T = 3 steps with a retention parameter δ = 0 . 5 and temporal decay ρ = 0 . 01 . The hybrid retrieval weights are set to λ = { 0 . 5 , 0 . 3 , 0 . 2 } (Semantic, Activation, Structural). To ensure a fair "Unified Backbone" comparison, we re-ran all reproducible baselines (marked with † in Table 1) using
GPT-4o-mini with temperature 0 . 1 . For baselines with fixed proprietary backends, we report their default strong model performance. We provide a detailed discussion on the sensitivity of each hyperparameter and justify our selection choices in Appendix C.
Benchmark Dataset.
Baselines.
To comprehensively evaluate the effectiveness of SYNAPSE, we compare it against a diverse set of state-of-the-art long-term memory mechanisms. These baselines represent the current landscape of memory augmentation for LLMs. We classify these methods into four primary categories based on their underlying data structures and retrieval mechanisms, as detailed in Table 5.
Implementation Details.
Main Results
Table 1 details the comprehensive evaluation on the LoCoMo benchmark (GPT-4o-mini), reporting F1 and BLEU-1 scores across five distinct categories along with aggregate rankings.
Overall Performance. SYNAPSE establishes a new state-of-the-art with a weighted average F1 of 40.5 (calculated excluding the adversarial category for fair comparison). This performance represents a substantial margin of +7.2 points over A-Mem (33.3) and outperforms recent graph-based systems such as Zep (39.7) and AriGraph (33.7). Notably, SYNAPSE secures a perfect task ranking of 1.0, demonstrating consistent dominance across all evaluated metrics.
Category-wise Analysis. Our model shows significant advantages in tasks requiring dynamic context reasoning. In Temporal Reasoning , SYNAPSE attains an F1 score of 50.1 compared to 45.9 for AMem. This validates the efficacy of our time-aware activation decay, which correctly prioritizes recent information over semantically similar but obsolete memories. For Multi-Hop Reasoning , the spreading activation mechanism effectively propagates relevance across intermediate nodes, bridging disconnected facts that pure vector search fails to link (35.7 vs. 27.0 for A-Mem). Furthermore, regarding Adversarial Robustness , SYNAPSE achieves near-perfect rejection rates (96.6 F1), significantly exceeding strong baselines like LoCoMo (69.2). Unlike baseline methods that lack explicit rejection protocols and often hallucinate plausible answers, our lateral inhibition and confidence gating empower the model to strictly distinguish valid retrieval from non-existent information.
Adversarial Robustness and Fairness. On GPT4o-mini, SYNAPSE demonstrates exceptional stability against adversarial queries, attaining an Adversarial F1 of 96.6 via its uncertainty-aware rejection mechanism. Here, graph activation serves as an orthogonal confidence signal alongside semantic similarity. Unlike baselines that gate responses using brittle cosine-similarity heuristics-which
∗ To ensure fairness, we report the Performance as the Weighted F1 and BLEU-1 score averaged over the first four categories (excluding Adversarial). Task Rank denotes the mean rank. Statistical significance ( p < 0 . 05 ) is confirmed via paired t-test on instance-level scores ( N = 500 ). More details can be referred to Appendix A.3.
often fail to distinguish paraphrasing from hallucinations-our design effectively separates lowevidence cases from valid retrieval. To prevent score inflation, we calibrated τ gate on a held-out validation set, strictly bounding the false refusal rate below 2.5% on non-adversarial categories (See Appendix C.2 for detailed experiment). Crucially, our performance advantage is not driven solely by rejection: even with the gate disabled, SYNAPSE maintains an average F1 of 40.3 (See Table 3), strictly outperforming Zep (39.7) and AMem (33.3). Paired t-tests confirm that the improvement over Zep remains statistically significant ( p < 0 . 05 ) without gating. Furthermore, we report the weighted average excluding the adversarial category to ensure fair comparison; under this protocol, SYNAPSE retains its top rank with an average F1 of 40.5, validating that the structural retrieval mechanism contributes independently of the rejection module.
Beyond GPT-4o-mini, we evaluate SYNAPSE with multiple backbones and observe consistent trends; the full cross-backbone results and discussion are provided in Appendix F (Table 12).
Qualitative Comparison To further elucidate the mechanisms behind SYNAPSE's superior performance, we conduct a qualitative analysis of retrieval behaviors compared to the strongest baseline, A-Mem. Table 2 presents three representative failure modes of semantic-only retrieval and how SYNAPSE resolves them. In adversarial scenarios (row 1), A-Mem falls victim to Semantic Drift, retrieving hallucinations based on superficial keyword matches (e.g., retrieving 'Rex' for 'dog'). In contrast, SYNAPSE's meta-cognitive layer correctly identifies the adversarial intent and verifies the absence of the entity in the graph, preventing hallucination. For temporal queries (row 2), AMem exhibits Static Bias, favoring outdated but semantically high-scoring memories. SYNAPSE's spreading activation with temporal decay dynamically downweights obsolete information, ensuring the retrieval of current facts. Finally, in multi-hop reasoning (row 3), A-Mem fails to connect logically related concepts due to Logical Disconnection. SYNAPSE's graph traversal capabilities enable it to bridge these gaps, successfully inferring implicit connections through intermediate nodes. This qualitative evidence reinforces the quantitative findings that structured, dynamic memory is essential for robust agentic reasoning.
Overall Performance.
Category-wise Analysis.
Adversarial Robustness and Fairness.
Qualitative Comparison
Ablation Study
To understand the contribution of each component in SYNAPSE, we conduct systematic ablations on GPT-4o-mini by selectively disabling retrieval mechanisms. Results are shown in Table 3.
Micro-Dynamics Analysis. Table 3 reveals that SYNAPSE's performance relies on the synergistic interaction of specific cognitive mechanisms rather than a single component. Specifically, Lateral Inhibition acts as a critical pre-filter for the uncertainty gate. While removing the gate ( τ gate = 0 ) reduces Adversarial F1 to 67.2, further removing

Table 2: Qualitative Comparison of Retrieval Behaviors. SYNAPSE demonstrates superior handling of temporal updates, multi-hop reasoning chains, and adversarial inputs compared to the semantic-only A-Mem baseline.
Table 3: Mechanism Ablation Study. Impact of selectively disabling cognitive components on F1 scores (GPT-4o-mini). Removing specific dynamics causes targeted drops in corresponding task categories, validating our theoretical design.
inhibition ( β = 0 ) destabilizes the graph significantly. Without this winner-take-all competition, low-relevance "hallucination candidates" remain active enough to compete with valid nodes, degrading precision even on standard Single-Hop tasks. This confirms that inhibition is structurally necessary to separate signal from noise before the gating decision is even made.
Mechanism Specificity. Other dynamics target specific cognitive failures. The Fan Effect proves indispensable for associative reasoning; removing it causes a sharp decline in Open-Domain (25.9 → 16.8) and Multi-Hop scores. Without this attention dilution, "hub" nodes (common entities) accumulate excessive activation, flooding the graph with generic associations and drowning out specific signals. Similarly, Node Decay is the sole driver of timeline awareness. Setting δ = 0 destroys Temporal reasoning capabilities (50.1 → 14.2), as the model loses the ability to distinguish between current truths and obsolete facts based on activation energy.
Macro-Architecture Analysis. At the system level, the necessity of our hybrid design is evident. Removing the spreading activation layer ('(-) Activation Dynamics') regresses performance to that of a static graph (Avg 30.5), confirming that dynamics , not just topology, are essential for reasoning. Furthermore, relying on a geometric embedding space alone ('Vectors Only') yields the lowest performance (Avg 25.2), validating that unstructured
Table 4: Efficiency Profile. Comparison on GPT-4o-mini. Latency is measured on a single NVIDIA A100 GPU averaging over 100 queries; "Cost" reflects Total API Cost (Input + Output Tokens) at standard rates.
retrieval is insufficient for the long-horizon consistency required in agentic applications.
Micro-Dynamics Analysis.
Beyond accuracy, practical deployment requires efficient resource utilization. Table 4 compares token usage, latency, and API cost across methods.
Token Efficiency. SYNAPSE consumes only ∼ 814 tokens per query on average, representing a 95% reduction compared to full-context methods (LoCoMo: 16,910; MemGPT: 16,977). This efficiency stems from our selective activation mechanism, which retrieves only the most contextually relevant subgraph rather than injecting entire conversation histories.
Cost-Performance Trade-off. At $0.24 per 1,000 queries, SYNAPSE is 11 × cheaper than fullcontext approaches ($2.66-$2.67) while achieving nearly 2 × higher performance. In terms of Cost Efficiency ( F 1 / $ ), SYNAPSE achieves a score of 167.3, surpassing MemoryOS (126.8) and significantly outperforming LoCoMo (9.6) and MemGPT (10.5). While LangMem achieves comparable cost efficiency (150.7) due to minimal overhead, its absolute performance (34.3 F1) lags behind. Note that graph construction costs are amortized over the lifetime of the agent and are negligible per-query.
Latency Profile. With 1.9s average latency, SYNAPSE is 4 × faster than full-context methods (8.2-8.5s) and faster than ReadAgent (2.3s). We achieve a latency comparable to lightweight methods while delivering SOTA reasoning capabilities.
Mechanism Specificity.
Macro-Architecture Analysis.
Efficiency Analysis
Beyond accuracy, practical deployment requires efficient resource utilization. Table 4 compares token usage, latency, and API cost across methods.
Token Efficiency. SYNAPSE consumes only ∼ 814 tokens per query on average, representing a 95% reduction compared to full-context methods (LoCoMo: 16,910; MemGPT: 16,977). This efficiency stems from our selective activation mechanism, which retrieves only the most contextually relevant subgraph rather than injecting entire conversation histories.
Cost-Performance Trade-off. At $0.24 per 1,000 queries, SYNAPSE is 11 × cheaper than fullcontext approaches ($2.66-$2.67) while achieving nearly 2 × higher performance. In terms of Cost Efficiency ( F 1 / $ ), SYNAPSE achieves a score of 167.3, surpassing MemoryOS (126.8) and significantly outperforming LoCoMo (9.6) and MemGPT (10.5). While LangMem achieves comparable cost efficiency (150.7) due to minimal overhead, its absolute performance (34.3 F1) lags behind. Note that graph construction costs are amortized over the lifetime of the agent and are negligible per-query.
Latency Profile. With 1.9s average latency, SYNAPSE is 4 × faster than full-context methods (8.2-8.5s) and faster than ReadAgent (2.3s). We achieve a latency comparable to lightweight methods while delivering SOTA reasoning capabilities.
Token Efficiency.
Beyond accuracy, practical deployment requires efficient resource utilization. Table 4 compares token usage, latency, and API cost across methods.
Token Efficiency. SYNAPSE consumes only ∼ 814 tokens per query on average, representing a 95% reduction compared to full-context methods (LoCoMo: 16,910; MemGPT: 16,977). This efficiency stems from our selective activation mechanism, which retrieves only the most contextually relevant subgraph rather than injecting entire conversation histories.
Cost-Performance Trade-off. At $0.24 per 1,000 queries, SYNAPSE is 11 × cheaper than fullcontext approaches ($2.66-$2.67) while achieving nearly 2 × higher performance. In terms of Cost Efficiency ( F 1 / $ ), SYNAPSE achieves a score of 167.3, surpassing MemoryOS (126.8) and significantly outperforming LoCoMo (9.6) and MemGPT (10.5). While LangMem achieves comparable cost efficiency (150.7) due to minimal overhead, its absolute performance (34.3 F1) lags behind. Note that graph construction costs are amortized over the lifetime of the agent and are negligible per-query.
Latency Profile. With 1.9s average latency, SYNAPSE is 4 × faster than full-context methods (8.2-8.5s) and faster than ReadAgent (2.3s). We achieve a latency comparable to lightweight methods while delivering SOTA reasoning capabilities.
Cost-Performance Trade-off.
Latency Profile.
Sensitivity Analysis
Figure 2 examines the impact of the Topk retrieval parameter on overall performance. The relatively flat performance curve suggests that SYNAPSE is insensitive to precise k selection within the sufficient
![Figure 2: Sensitivity analysis of Topk retrieval on LoCoMo benchmark. Performance is robust across k ∈ [20 , 40] , with optimal stability around k = 30 . Star markers denote A-Mem baseline performance at their experiment settings.](2601.02744-figure_002.png)
Figure 2: Sensitivity analysis of Topk retrieval on LoCoMo benchmark. Performance is robust across k ∈ [20 , 40] , with optimal stability around k = 30 . Star markers denote A-Mem baseline performance at their experiment settings.
range. We sweep k ∈ [10 , 50] . Crucially, at a modest k = 30 , SYNAPSE significantly outperforms A-Mem while incurring lower retrieval costs, proving that structural precision is more efficient than simply increasing context volume; see Appendix C for further details about more hyperparameters.
Conclusion
We presented SYNAPSE, a cognitive architecture that resolves the Contextual Isolation of standard retrieval systems by emulating biological spreading activation. By modeling memory as a dynamic, associative graph, SYNAPSE effectively unifies disjointed facts and filters irrelevant noise, establishing a new Pareto frontier for efficient, longterm agentic memory. Our results demonstrate that neuro-symbolic mechanisms can successfully bridge the gap between static vector retrieval and adaptive, structured cognition, paving the way for more autonomous and resilient AI agents.
Limitations
While SYNAPSE creates a new Pareto frontier for agentic memory, several limitations warrant discussion, outlining clear directions for future research.
Algorithmic Trade-offs and Scope. First, the mechanisms that enable SYNAPSE to excel at complex reasoning introduce specific trade-offs. One notable limitation is the Cold Start problem: the efficacy of spreading activation relies on a sufficiently connected topology. In nascent conversations with sparse history, the computational overhead of graph maintenance provides diminishing returns compared to simple linear buffers.
Additionally, lateral inhibition can occasionally lead to Cognitive Tunneling, causing performance drops on simple queries where exhaustive retrieval is superior. Finally, our current evaluation is constrained to the text modality via the LoCoMo benchmark. Since embodied agents increasingly require processing visual and auditory cues, a key direction for future work is extending SYNAPSE to Multimodal Episodic Memory. By leveraging aligned embedding spaces, we aim to incorporate image and audio nodes into the unified graph, enabling structural reasoning across diverse modalities.
Dependency on Foundation Models. Our framework exhibits a dual dependency on LLM capabilities. On the upstream side, the topology of the Unified Graph is tightly coupled with the extraction quality of the underlying LLM. While GPT4o-mini demonstrates robust schema adherence, smaller local models may struggle with consistent entity extraction, potentially leading to error propagation. On the downstream side, we rely on LLMas-a-Judge for semantic evaluation. While we mitigate bias by separating the judge from the generator, model-based evaluation can still favor certain stylistic patterns. However, given the demonstrated failure of n -gram metrics (Table 11), we maintain this is a necessary trade-off for accurate assessment.
Privacy and Long-Term Safety. Persistent graph structures introduce distinct privacy risks compared to ephemeral context windows. Centralized storage of semantic profiles creates a vector for "Memory Poisoning," where erroneous facts or malicious injections could permanently corrupt the knowledge store. Moreover, the indefinite retention of user data raises compliance concerns. Future iterations will focus on Automated Graph Auditing to detect inconsistencies and User-Controlled Forgetting (Machine Unlearning) mechanisms to ensure privacy compliance and robust memory maintenance.
Algorithmic Trade-offs and Scope.
Dependency on Foundation Models.
Privacy and Long-Term Safety.
Ethical Considerations
Privacy and Data Retention. The core capability of SYNAPSE to accumulate long-term episodic memory inherently raises privacy concerns regarding the storage of sensitive user information. Unlike stateless LLMs that discard context after a session, our system persists interaction logs in a structured graph. While this persistence enables personalization, it necessitates strict data governance. In real-world deployments, the EpisodicSemantic Graph should be stored locally on the user's device or in encrypted enclaves to prevent unauthorized access. Furthermore, our architecture supports granular forgetting. The temporal decay mechanism ( δ ) and node pruning logic naturally mimic the 'right to be forgotten,' preventing the indefinite retention of obsolete or sensitive data.
Mitigation of False Memories. A critical ethical risk in memory-augmented agents is 'memory hallucination,' where an agent confidently recalls events that never occurred. This phenomenon can lead to harmful advice or misinformation. Our work explicitly addresses this issue through the Uncertainty-Aware Rejection module. By calibrating the gating threshold ( τ gate ) to prioritize precision over recall, as demonstrated in Section C.2, SYNAPSE is designed to fail safely. The system refuses to answer when evidence is insufficient rather than fabricating details. This design choice reflects a commitment to safety-critical reliability over conversational fluency.
Dataset and Compliance. Our experiments utilize the LoCoMo benchmark, which consists of synthesized and fictional long-horizon dialogues. No real-world user data or Personally Identifiable Information (PII) was processed, stored, or exposed during this research. Future deployments involving human subjects would require explicit consent protocols regarding memory persistence duration and scope.
Privacy and Data Retention.
Standard retrieval methods like RAG and MemoryBank (Zhong et al., 2024) rely fundamentally on vector similarity (Karpukhin et al., 2020; Khattab and Zaharia, 2020), representing memories as isolated points in embedding space (Hu et al., 2025). Consequently, they struggle with queries requiring causal bridging between semantically dissimilar or distant events (Yang et al., 2018; Qi et al., 2019; Trivedi et al., 2022; Thorne et al., 2018). SYNAPSE overcomes this by encoding relationships as graph edges, enabling retrieval via relational paths (Sun et al., 2018).
Drawing from cognitive Spreading Activation theory (Collins and Loftus, 1975; Anderson, 1983) and ACT-R architectures (Anderson, 1983), we address the limitation of "seed dependence" in existing graph systems. While prior methods fail if the initial vector search misses the relevant subgraph (i.e., a "bad seed"), SYNAPSE uses spreading activation to dynamically recover from suboptimal
seeds, propagating energy to relevant contexts even under weak initial semantic overlap.
Mitigation of False Memories.
We calibrate the uncertainty gating threshold τ gate on a held-out validation set (10% of samples) to strictly balance robustness against utility. Table 7 illustrates the sensitivity analysis.
Weobserve a clear "elbow" point at τ gate = 0 . 12 . Below this threshold, increasing the gate provides massive gains in Adversarial robustness (60.2 → 96.6) with negligible impact on valid queries. However, pushing beyond 0.12 yields diminishing returns: raising τ gate to 0.15 improves Adversarial F1 by only 0.6 points but nearly doubles the False Refusal Rate (FRR) from 2.1% to 4.2%. Notably, the ability to achieve near-perfect rejection at such a low threshold ( τ ≈ 0 . 12 ) indicates a strong Signalto-Noise Ratio in our graph. The lateral inhibition mechanism effectively suppresses irrelevant nodes close to zero, creating a clean margin between valid retrieval (high activation) and hallucination (low activation), minimizing the need for aggressive thresholding.
Dataset and Compliance.
Experiments
John R Anderson. 1983. A spreading activation theory of memory. Journal of verbal learning and verbal behavior , 22(3):261-295.
Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. 2025. Arigraph: Learning knowledge graph world models with episodic memory for llm agents. Preprint , arXiv:2407.04363.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations .
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, and 9 others. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning , volume 162 of Proceedings of Machine Learning Research , pages 2206-2240. PMLR.
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready ai agents with scalable long-term memory. Preprint , arXiv:2504.19413.
Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From local to global: A graph rag approach to query-focused summarization. Preprint , arXiv:2404.16130.
Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. Hipporag: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems , volume 37, pages 59532-59569. Curran Associates, Inc.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning , volume 119 of Proceedings of Machine Learning Research , pages 3929-3938. PMLR.
Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. 2025. Grag: Graph retrievalaugmented generation. Preprint , arXiv:2405.16506.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane DwivediYu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research , 24(251):1-43.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for opendomain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 6769-6781, Online. Association for Computational Linguistics.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. Preprint , arXiv:1911.00172.
Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval , SIGIR '20, page 39-48, New York, NY, USA. Association for Computing Machinery.
LangChain Team. 2024. Langmem.
Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. In Forty-first International Conference on Machine Learning .
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledgeintensive nlp tasks. In Advances in Neural Information Processing Systems , volume 33, pages 94599474. Curran Associates, Inc.
Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, and 20 others. 2025. Memos: A memory os for ai system. Preprint , arXiv:2507.03724.
Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, and Yanfu Zhang. 2025. Dynamic retriever for incontext knowledge editing via policy optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 16744-16757, Suzhou, China. Association for Computational Linguistics.
Implementation Details
Graph Construction Algorithm
We provide the complete algorithm for incremental graph construction in Algorithm 1. The graph is built online as the agent interacts with users. In practice, pairwise similarity checks (Line 23) are optimized using HNSW indexing to maintain O (log |V| ) scalable updates.
Semantic Extraction Prompt
We employ a structured extraction approach to synthesize semantic nodes from episodic context. The extraction prompt follows a schema-guided paradigm, as shown in Figure 3.
Evaluation Metric Calculation
To ensure a fair evaluation of overall performance, we calculate the Weighted F1 and BLEU-1 score across the four non-adversarial categories. This prevents the overall score from being skewed by
Statistical Analysis.
Table 8 reports the mean F1 scores and standard deviations across three independent runs. The low standard deviations ( ≤ 0 . 5 ) confirm that our method is stable and not dependent on favorable random initialization.
Baseline Methods
To comprehensively evaluate the effectiveness of SYNAPSE, we compare it against a diverse set of state-of-the-art long-term memory mechanisms. These baselines represent the current landscape of memory augmentation for LLMs. We classify these methods into four primary categories based on their underlying data structures and retrieval mechanisms, as detailed in Table 5.
Hyperparameter Sensitivity Analysis
We conduct a systematic sensitivity analysis to examine the robustness of SYNAPSE to hyperparameter choices (Table 6). All experiments are performed on the GPT-4o-mini backbone using the LoCoMo benchmark.
Key Findings
(1) Propagation depth T is the most sensitive parameter, with performance degrading significantly if the graph is traversed too shallowly or too deeply. (2) Node Decay rate δ directly impacts temporal reasoning; an optimal balance ( δ = 0 . 5 ) is needed to retain recent history without noise. (3) Inhibition TopM (Sparsity) shows a clear peak around M = 7 . Setting M too low (3) over-prunes context, while setting it too high (10) introduces irrelevant noise. (4) Spreading factor S = 0 . 8 achieves optimal diffusion, allowing relevance to flow to related concepts without saturating the graph.
Gating Calibration Analysis
We calibrate the uncertainty gating threshold τ gate on a held-out validation set (10% of samples) to strictly balance robustness against utility. Table 7 illustrates the sensitivity analysis.
Weobserve a clear "elbow" point at τ gate = 0 . 12 . Below this threshold, increasing the gate provides massive gains in Adversarial robustness (60.2 → 96.6) with negligible impact on valid queries. However, pushing beyond 0.12 yields diminishing returns: raising τ gate to 0.15 improves Adversarial F1 by only 0.6 points but nearly doubles the False Refusal Rate (FRR) from 2.1% to 4.2%. Notably, the ability to achieve near-perfect rejection at such a low threshold ( τ ≈ 0 . 12 ) indicates a strong Signalto-Noise Ratio in our graph. The lateral inhibition mechanism effectively suppresses irrelevant nodes close to zero, creating a clean margin between valid retrieval (high activation) and hallucination (low activation), minimizing the need for aggressive thresholding.
Additional Quantitative Results
Statistical Stability
Table 8 reports the mean F1 scores and standard deviations across three independent runs. The low standard deviations ( ≤ 0 . 5 ) confirm that our method is stable and not dependent on favorable random initialization.
Performance on Low Vector-Similarity Subsets
We evaluate models on subsets of the LoCoMo test set where the semantic similarity between the evidence and the question falls below specific thresholds (0.5 and 0.3).
As shown in Table 9, SYNAPSE exhibits strong robustness (drop < 8% ), whereas A-MEM suffers significant degradation (drop > 50% ). This validates that our graph spreading mechanism reduces reliance on purely surface-level vector similarity.
Semantic Evaluation via LLM-as-a-Judge
Table 10 presents the LLM-as-a-Judge evaluation results, offering a more nuanced perspective than rigid n -gram metrics. SYNAPSE achieves the highest semantic correctness across all categories (Overall 80.7), significantly outperforming strong baselines like ENGRAM (77.6) and MEMORYOS (67.7).
Structural Advantage in Reasoning. The performance gap is most pronounced in the MultiHop category, where SYNAPSE scores 84.2, establishing a clear margin over MemoryOS (63.7) and AriGraph (28.2). This validates our core hypothesis: while hierarchical or vector-based systems
Table 5: Taxonomy of baseline methods compared in our experiments. We categorize methods based on their core memory representation and retrieval mechanism.
struggle to retrieve disconnected evidence chains, SYNAPSE's spreading activation successfully propagates relevance across intermediate nodes, reconstructing the full reasoning path.
Temporal Consistency. In the Temporal category, SYNAPSE (72.1) and MemoryOS (72.7) are the only two methods surpassing the 70-point threshold. This parity is instructive: MemoryOS explicitly optimizes for memory updates (OS-like read/write), whereas SYNAPSE achieves this implicitly through temporal decay dynamics. The fact that our decay-based mechanism matches a dedicated memory-management system suggests that "forgetting" is as crucial as "remembering" for maintaining an accurate timeline.
Structural Advantage in Reasoning.
Temporal Consistency.
System Instruction: You are an expert knowledge engineer building a semantic graph from conversation history. Your goal is to consolidate episodic details into structured knowledge nodes.
Input Context: 5 recent conversation turns.
Reasoning (Chain of Thought): 1. Analyze: Identify new facts not present in previous context. 2. Classify: Categorize facts into Identity , Preference , Event , or Technical . 3. Extract: Form canonical node names (e.g., "likes camping" → "Camping Preference").
Qualitative Analysis
Metric Divergence
Table 11 provides a granular look at why standard metrics (F1/BLEU) systematically undervalue agentic memory systems. We identify three distinct phenomena where SYNAPSE demonstrates superior intelligence that is penalized by rigid string matching.
Dynamic Temporal Reasoning vs. Static Retrieval. In temporal queries, the ground truth is often a static string extracted from past context (e.g., "Since 2016"). However, SYNAPSE often performs arithmetic reasoning relative to the current timeframe (e.g., "Seven years", assuming the current year is 2023). As shown in Table 11 (row 11), this results in an F1 score of 0.0 despite the answer being factually perfect. This confirms that SYNAPSE is not merely retrieving text chunks but is understanding time as a dynamic variable.
Semantic Completeness vs. Brevity. For questions like "What motivated counseling?", the ground truth is often a concise extraction ("Her journey"). SYNAPSE, leveraging its connected graph, retrieves the broader context of her motivations ("Her own struggles and desire to help"). While this verbosity lowers overlap ratios (F1: 22.2), the LLM Judge correctly identifies it as a more complete and nuanced answer (Score: 100), demonstrating that our method preserves the richness of user history better than extractive baselines.
Inferential Paraphrasing. In Multi-Hop scenarios, SYNAPSE tends to answer with implications
Table 6: Hyperparameter sensitivity analysis on LoCoMo (GPT-4o-mini). Default values are marked with † .
Table 7: Impact of gating threshold τ gate on Adversarial F1 and False Refusal Rate (FRR) on non-adversarial queries. Our selected threshold of 0 . 12 creates a "safe operating window" with <2.5% false refusals.
rather than direct quotes. When asked if someone is an "ally," SYNAPSE synthesizes evidence of support ("Yes, Melanie supports and encourages...") rather than just outputting "Yes". This behavior mimics human memory-reconstructing the gist rather than rote memorization-which is essential for naturalistic interaction but challenging for lexical metrics.
Dynamic Temporal Reasoning vs. Static Retrieval.
Standard retrieval methods like RAG and MemoryBank (Zhong et al., 2024) rely fundamentally on vector similarity (Karpukhin et al., 2020; Khattab and Zaharia, 2020), representing memories as isolated points in embedding space (Hu et al., 2025). Consequently, they struggle with queries requiring causal bridging between semantically dissimilar or distant events (Yang et al., 2018; Qi et al., 2019; Trivedi et al., 2022; Thorne et al., 2018). SYNAPSE overcomes this by encoding relationships as graph edges, enabling retrieval via relational paths (Sun et al., 2018).
Drawing from cognitive Spreading Activation theory (Collins and Loftus, 1975; Anderson, 1983) and ACT-R architectures (Anderson, 1983), we address the limitation of "seed dependence" in existing graph systems. While prior methods fail if the initial vector search misses the relevant subgraph (i.e., a "bad seed"), SYNAPSE uses spreading activation to dynamically recover from suboptimal
seeds, propagating energy to relevant contexts even under weak initial semantic overlap.
Semantic Completeness vs. Brevity.
Inferential Paraphrasing.
Require: Conversation stream { ( u t , r t ) } T t =1 , consolidation interval N = 5 Ensure: Unified graph G = ( V , E ) 1: Initialize V E ←∅ , V S ←∅ , E ← ∅ 2: for each turn t do 3: c t ← concat ( u t , r t ) 4: h t ← Encoder ( c t ) ▷ all-MiniLM-L6-v2 5: v e t ← ( c t , h t , τ t ) ; V E ←V E ∪ { v e t } 6: if t > 1 then 7: E ← E ∪ { ( v e t -1 , v e t , w = 1 . 0 , TEMPORAL ) } 8: end if 9: if t mod N = 0 then ▷ Consolidation trigger 10: context ←{ v e t -N +1 , . . . , v e t } 11: items ← LLM_Extract ( context ) ▷ Entities & Concepts 12: for each item s ∈ items do 13: h s ← Encoder ( s ) 14: if ∃ v s j ∈ V S : sim ( h s , h j ) > 0 . 92 then 15: Update v s j embedding via EMA ▷ Deduplication 16: else 17: v s s ← ( s, h s ) ; V S ←V S ∪ { v s s } 18: end if 19: for each v e k ∈ context do 20: E ← E ∪ { ( v e k , v s s , w = 0 . 8 , ABSTRACTION ) } 21: end for 22: end for 23: for each pair ( v s i , v s j ) ∈ V S ×V S do 24: w ← sim ( h i , h j ) 25: if w > 0 . 92 and j ∈ Top15( N ( i )) then 26: E ← E ∪ { ( v s i , v s j , w, ASSOCIATION ) } 27: end if 28: end for 29: end if 30: end for 31: return G = ( V E ∪ V S , E )
Failure Analysis: Cognitive Tunneling
We analyze a representative failure case (Figure 4) where aggressive activation dynamics lead to the suppression of minor details.
Table 8: Statistical stability of SYNAPSE across 3 random seeds (GPT-4o-mini).
Table 9: LoCoMo QA results (F1, %) on low-similarity subsets. ↓ F1 denotes relative performance drop.
Extended Cross-Backbone Results
Table 12 presents the performance of SYNAPSE and baselines across different LLM backbones (GPT4o, Qwen-1.5b, Qwen-3b). We highlight two consistent observations.
Structured retrieval is more valuable for weaker backbones. On the resource-constrained Qwen3b, SYNAPSE achieves an Average F1 of 36.6, substantially outperforming MemoryOS (22.1) and A-Mem (16.2). This suggests that explicitly structured activation can partially compensate for the limited reasoning capacity of smaller models: rather than relying on the backbone to infer longrange dependencies from retrieved text alone, the retrieval stage itself exposes relationally relevant evidence through activation propagation.
Scaling to stronger backbones preserves the advantage, while exhaustive-context baselines remain strong in trivial lookup. On GPT-4o, SYNAPSE further improves to an Average F1 of 43.4, indicating that stronger backbones can better exploit the retrieved subgraph once the relevant evidence is surfaced. Meanwhile, LoCoMo retains an advantage in simple Single-Hop retrieval (61.6 vs. 46.5), which is expected because it operates on near-exhaustive context access. Importantly, SYNAPSE consistently dominates in complex reasoning categories (e.g., Multi-Hop and Temporal), supporting the claim that the core benefit stems from structured activation rather than brute-force
Table 10: LLM-as-a-Judge Semantic Scores (0-100). SYNAPSE dominates in complex reasoning tasks (Multi-Hop), validating the efficacy of graph-based activation.
Table 11: Expanded Analysis of Metric Divergence . Examples where SYNAPSE generates semantically accurate responses that are penalized by F1 scores due to synonymy, verbosity, or date formatting.
context injection.

Figure 4: Cognitive Tunneling: Lateral inhibition aggressively prunes low-degree details in the presence of highly activated hubs, leading to loss of "minor" facts.
Note: Main results for GPT-4o-mini are provided in Table 1. Values here differ due to different backbones. "All" rows in Table 8 denote the same validation set logic as Table 1.
| Category | Category | Category | Category | Category | Category | Category | Category | Category | Category | Average | Average | Average | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Multi-Hop | Multi-Hop | Temporal | Temporal | Open Domain | Open Domain | Single-Hop | Single-Hop | Adversarial | Adversarial | Performance ∗ | Performance ∗ | Task |
| F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | Rank | |
| MemoryBank † (Zhong et al., 2024) | 5.0 | 4.8 | 9.7 | 7.0 | 5.6 | 5.9 | 6.6 | 5.2 | 7.4 | 6.5 | 6.3 | 5.4 | 11.6 |
| ReadAgent † (Lee et al., 2024) | 9.2 | 6.5 | 12.6 | 8.9 | 5.3 | 5.1 | 9.7 | 7.7 | 9.8 | 9.0 | 9.8 | 7.1 | 11.0 |
| ENGRAM (Patel and Patel, 2025) | 18.3 | 13.2 | 21.9 | 14.7 | 8.6 | 5.5 | 23.1 | 13.7 | 33.5 | 19.4 | 19.3 | 13.1 | 9.2 |
| GraphRAG † (Edge et al., 2025) | 16.5 | 11.8 | 22.4 | 15.2 | 10.1 | 8.4 | 24.5 | 18.2 | 15.2 | 12.0 | 18.3 | 14.2 | 8.8 |
| MemGPT † (Packer et al., 2024) | 26.7 | 17.7 | 25.5 | 19.4 | 9.2 | 7.4 | 41.0 | 34.3 | 43.3 | 42.7 | 28.0 | 20.5 | 7.2 |
| LoCoMo † (Maharana et al., 2024) | 25.0 | 19.8 | 18.4 | 14.8 | 12.0 | 11.2 | 40.4 | 29.1 | 69.2 | 68.8 | 25.6 | 19.9 | 7.0 |
| LangMem (LangChain Team, 2024) | 34.5 | 23.7 | 30.8 | 25.8 | 24.3 | 19.2 | 40.9 | 33.6 | 47.6 | 46.3 | 34.3 | 25.7 | 5.0 |
| A-Mem † (Xu et al., 2025) | 27.0 | 20.1 | 45.9 | 36.7 | 12.1 | 12.0 | 44.7 | 37.1 | 50.0 | 49.5 | 33.3 | 26.2 | 4.8 |
| MemoryOS (Li et al., 2025) | 35.3 | 25.2 | 41.2 | 30.8 | 20.0 | 16.5 | 48.6 | 43.0 | - | - | 38.0 | 29.1 | - |
| AriGraph (Anokhin et al., 2025) | 28.5 | 21.0 | 43.2 | 33.5 | 14.5 | 13.0 | 45.1 | 38.0 | 48.5 | 47.0 | 33.7 | 26.2 | 4.6 |
| Zep (Rasmussen et al., 2025) | 35.5 | 25.8 | 48.5 | 40.2 | 23.1 | 18.0 | 48.0 | 41.5 | 65.4 | 64.0 | 39.7 | 31.2 | 2.6 |
| SYNAPSE (Ours) | 35.7 | 26.2 | 50.1 | 44.5 | 25.9 | 19.2 | 48.9 | 42.9 | 96.6 | 96.4 | 40.5 | 32.6 | 1.0 |
| System Com- ponent | A-Mem (Baseline) | Synapse (Ours) |
|---|---|---|
| Uncertainty- Aware Rejection (Confidence Gating) | Error: Semantic Drift Top-1 Retrieved: 'Melanie's kids love play- ing with their toy dinosaur, Rex.' ✗ False Association: Matches query 'dog' with semantic neighbor 'Rex', ignoring con- text. → Hallucination : 'She has a dog named Rex.' | Success: Confidence Gating Check: C ret < τ gate (0 . 12) Action: Trigger Negative Acknowledgement Protocol. ✓ Rejection: Low confidence preempts gen- eration. → Response : 'No record of such pet found.' |
| Spreading Activation (Dynamic Con- text) | Error: Temporal Obsolescence Top-1 Retrieved: [D4:3] 'Caroline moved from Sweden 4 years ago...' (Score: 0.92) ✗ Static Bias: High cosine similarity to query 'where living' dominates. → Output : 'She lives in Sweden.' | Success: Temporal Decay Action: S final = S sem + λ · S decay ( t ) Trace: D4:3 (Sweden) decay → 0.4. D1:1 (US) boost → 0.95. ✓ Reranking: Prioritizes current state over semantic overlap. → Output : 'Currently in the US.' |
| Knowledge Graph (Structure) | Error: Logical Disconnection Top-1 Retrieved: 'Caroline collects books.' (matches 'Dr. Seuss') ✗ Missing Link: Fails to bridge 'collects books' ↔ 'Dr. Seuss' without explicit over- lap. → Output : 'Uncertain/No info.' | Success: Multi-Hop Inference Action: G walk ( Caroline , Dr. Seuss ,k = 2) Path: Caroline collects - --- → Classic Books contains - ---→ Dr. Seuss ✓ Bridging: Uses graph structure to infer implicit connection. → Output : 'Yes, likely has them.' |
| Configuration | M-Hop | Temp. | Open | Single | Adv. | Avg. |
|---|---|---|---|---|---|---|
| SYNAPSE (Full) | 35.7 | 50.1 | 25.9 | 48.9 | 96.6 | 40.5 |
| Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) |
| (-) Uncertainty Gating ( τ gate = 0 ) | 35.6 | 50.0 | 25.4 | 48.8 | 67.2 | 40.3 |
| (-) Lateral Inhibition ( β = 0 ) | 35.1 | 49.8 | 22.4 | 49.1 | 71.5 | 39.4 |
| (-) Fan Effect (No Dilution) | 30.2 | 48.5 | 16.8 | 47.5 | 94.2 | 36.1 |
| (-) Node Decay ( δ = 0 ) | 34.8 | 14.2 | 24.5 | 48.2 | 95.8 | 30.7 |
| Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) |
| (-) Activation Dynamics | 31.2 | 23.7 | 18.2 | 48.9 | 70.4 | 30.5 |
| (-) Graph Structure | 35.2 | 25.4 | 21.0 | 49.9 | 88.2 | 32.9 |
| Vectors Only (Baseline) | 27.5 | 14.7 | 12.5 | 46.0 | 69.2 | 25.2 |
| Method | Token Length | Latency | Cost/1k Queries | F1 (Excl. Adv.) * | Cost Eff. ( F 1 / $ ) |
|---|---|---|---|---|---|
| LoCoMo (Maharana et al., 2024) | ∼ 16,910 | 8.2s | $2.67 | 25.6 | 9.6 |
| MemGPT (Packer et al., 2024) | ∼ 16,977 | 8.5s | $2.67 | 28 | 10.5 |
| A-Mem (Xu et al., 2025) | ∼ 2,520 | 5.4s | $0.50 | 33.3 | 66.9 |
| MemoryOS (Li et al., 2025) | ∼ 1,198 | 1.5s | $0.30 | 38 | 126.8 |
| ReadAgent (Lee et al., 2024) | ∼ 643 | 2.3s | $0.22 | 9.8 | 45.3 |
| LangMem (LangChain Team, 2024) | ∼ 717 | 0.6s | $0.23 | 34.3 | 150.7 |
| MemoryBank (Zhong et al., 2024) | ∼ 432 | 1.2s | $0.18 | 6.3 | 34.1 |
| SYNAPSE (Ours) | ∼ 814 | 1.9s | $0.24 | 40.5 | 167.3 |
| Category Method | Key Mechanism | Reference |
|---|---|---|
| MemGPT MemoryOS Mem0 | with virtual context (Packer et al., 2024) optimizing read/write (Li et al., 2025) for personalization and (Chhikara et al., 2025) | System-level Hierarchical memory management paging (Main vs. External Context). OS-inspired memory hierarchy operations. Self-improving memory layer |
| AriGraph GraphRAG Zep SYNAPSE | Episodic and semantic memory organized as a dynamic graph structure. (Anokhin et al., 2025) Leverages community detection on knowledge graphs for global/local retrieval. (Edge et al., 2025) Knowledge graph-based memory designed for entity relationships. (Rasmussen et al., 2025) Hybrid spreading activation with dynamic structure (Ours). - | Graph-based |
| Retrieval MemoryBank ENGRAM LangMem | Retrieval-based memory incorporating the Ebbinghaus forgetting curve. Advanced latent memory clustering and retrieval mecha- nism. Memory injection via in-context learning or fine-tuning updates. | (Zhong et al., 2024) (Patel and Patel, 2025) (LangChain Team, 2024) |
| Agentic ReadAgent LoCoMo A-Mem | Agentic system that paginates long context and generates gist memories. Local Context Motion for compressing and selecting relevant blocks. Adaptive agentic memory system capable of self- updating summaries. | (Lee et al., 2024) (Maharana et al., 2024) (Xu et al., 2025) |
| Parameter | Value | M-Hop | Temp. | Avg |
|---|---|---|---|---|
| Spreading S | 0.6 | 32.8 | 47.5 | 38.3 |
| 0.8 † | 35.7 | 50.1 | 40.5 | |
| 1.0 | 33.5 | 48 | 38.8 | |
| Node Decay δ | 0.3 | 34.5 | 51.2 | 40 |
| Node Decay δ | 0.5 † | 35.7 | 50.1 | 40.5 |
| Node Decay δ | 0.7 | 33.8 | 46.5 | 38.1 |
| Steepness γ | 3.0 | 34.9 | 49.3 | 39.8 |
| Steepness γ | 5.0 † | 35.7 | 50.1 | 40.5 |
| Steepness γ | 7.0 | 35.1 | 49.7 | 40 |
| Threshold θ | 0.3 | 32.9 | 48.8 | 39 |
| Threshold θ | 0.5 † | 35.7 | 50.1 | 40.5 |
| Threshold θ | 0.7 | 34.1 | 49.2 | 39.5 |
| Inhibition β | 0.10 | 35.4 | 49.9 | 40.1 |
| Inhibition β | 0.15 † | 35.7 | 50.1 | 40.5 |
| Inhibition β | 0.20 | 35.2 | 49.6 | 39.9 |
| Propagation T | 2 | 31.5 | 46.8 | 37.7 |
| Propagation T | 3 † | 35.7 | 50.1 | 40.5 |
| Propagation T | 4 | 35.2 | 49.8 | 40.1 |
| Inhibition M | 3 | 33.5 | 48.9 | 39.2 |
| Inhibition M | 7 † | 35.7 | 50.1 | 40.5 |
| Inhibition M | 10 | 34.8 | 49.3 | 39.8 |
| τ gate | Adv. F1 | FRR (Non-Adv) | Verdict |
|---|---|---|---|
| 0 | 60.2 | 0.0% | Baseline |
| 0.05 | 94.2 | 0.8% | Conservative |
| 0.1 | 95.8 | 1.5% | Balanced |
| 0.12 | 96.6 | 2.1% | Selected |
| 0.15 | 97.2 | 4.2% | Aggressive |
| 0.2 | 98.1 | 8.5% | Unsafe |
| Category | F1 Score |
|---|---|
| Multi-Hop Temporal Open-Domain Single-Hop | 35.7 ± 0.1 50.1 ± 0.3 25.9 ± 0.2 48.9 ± 0.1 |
| Average | 40.5 ± 0.2 |
| Adversarial | 96.6 ± 0.1 |
| Model | Thres. | M-Hop | Temp. | Open | Single | Adv. | ↓ F1 |
|---|---|---|---|---|---|---|---|
| A-MEM | All | 32.9 | 39.4 | 17.1 | 48.4 | 36.4 | - |
| 0.5 | 20.2 | 19.3 | 11.5 | 28.8 | 19.4 | (43.1%) | |
| 0.3 | 14.6 | 16.3 | 9.5 | 19.7 | 16 | (56.3%) | |
| SYNAPSE | All | 39.3 | 55.5 | 29.5 | 46.5 | 97.8 | - |
| 0.5 | 42.8 | 49.4 | 22.2 | 44.8 | 95.3 | (5.3%) | |
| 0.3 | 42.3 | 47.5 | 21.5 | 43.8 | 93.7 | (7.4%) |
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| MemoryBank (Zhong et al., 2024) | 30.5 | 14.2 | 45.3 | 35.8 | 23.6 |
| ReadAgent (Lee et al., 2024) | 37.1 | 16.5 | 50.2 | 41.5 | 27.6 |
| LoCoMo (Maharana et al., 2024) | 38.5 | 17.8 | 53 | 48.2 | 30.1 |
| A-Mem (Xu et al., 2025) | 39.8 | 18.9 | 54.1 | 49.9 | 31.4 |
| Mem0 (Chhikara et al., 2025) | 67.1 | 51.2 | 75.7 | 58.1 | 57.1 |
| MemGPT (Packer et al., 2024) | 41.2 | 19.5 | 55.8 | 50.4 | 32.2 |
| AriGraph (Anokhin et al., 2025) | 45.5 | 28.2 | 60.1 | 51.5 | 38.2 |
| LangMem (LangChain Team, 2024) | 62.2 | 47.9 | 71.1 | 23.4 | 46.9 |
| Zep (Rasmussen et al., 2025) | 61.7 | 41.4 | 76.6 | 49.3 | 49 |
| MemoryOS (Li et al., 2025) | 78.3 | 63.7 | 54.6 | 72.7 | 67.7 |
| ENGRAM (Patel and Patel, 2025) | 79.9 | 79.8 | 72.9 | 70.8 | 77.6 |
| SYNAPSE | 81.5 | 84.2 | 76.8 | 72.1 | 80.7 |
| Category | Question | Ground Truth | SYNAPSE Output | F1 | Judge |
|---|---|---|---|---|---|
| What is Caroline's identity? | Transgender woman | Caroline is transgender. | 40 | 100 | |
| Who supports Caroline? | Her mentors, family | Her support system, those close to her | 16.7 | 90 | |
| Single-Hop | What motivated counseling? | Her journey and how it im- proved life | Her own struggles and desire to help | 22.2 | 100 |
| What was grandma's gift? | Necklace | A necklace symbolizing love | 33.3 | 100 | |
| Transition changes faced? | Changes to her body | Exploring her changing body | 50 | 100 | |
| Considered an ally? | Yes, she is supportive | Yes, Melanie supports and encour- ages... | 40 | 100 | |
| Multi-Hop | Likely enjoy Vivaldi? | Yes; it's classical | Yes, she enjoys classical music. | 33.3 | 100 |
| Likely have Dr. Seuss? | Yes, since she collects classics | Yes, likely for their creativity... | 15.4 | 100 | |
| Political leaning? | Liberal | Progressive or liberal. | 50 | 100 | |
| Realization after race? | Self-care is important | Importance of taking care of minds | 20 | 100 | |
| How long practicing art? | Since 2016 | Seven years (relative to 2023) | 0 | 100 | |
| Adoption meeting date? | Friday before 15 July | 14 July 2023 | 50 | 100 | |
| Temporal | When was the picnic? | Week before 6 July | 29 June 2023 | 25 | 100 |
| When was charity race? | Sunday before 25 May | 20 May 2023 | 50 | 100 | |
| Pottery class date? | 2 July 2023 | 02 July 2023 | 66.7 | 100 |
| Failure Mode: Cognitive Tunneling |
|---|
| Context: Episode E 15 (Low Degree) ...John put on his green jacket and left for the airport... |
| Retrieval Failure: Query "What color was John's jacket?" |
| Top-1: Airport Trip (Score 0.85) [Supressing E 15 ] - Hub Node Top-2: Taxi Ride (Score 0.72) Target: Green Jacket (Score 0.11 < τ ) - Pruned by Inhibition |
| Mechanism Diagnostics: High-degree "Airport" hub accumulates excessive activation ( S > 0 . 8 ), trigger- ing Lateral Inhibition ( β = 0 . 15 ) which suppresses the weakly connected "Jacket" detail. |
| Category | Category | Category | Category | Category | Category | Category | Category | Category | Category | Average | Average | Average | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Method | Multi-Hop | Multi-Hop | Temporal | Temporal | Open Domain | Open Domain | Single-Hop | Single-Hop | Adversarial | Adversarial | Performance ∗ | Performance ∗ | Task |
| Model | Method | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | Rank |
| GPT-4o | LoCoMo (Maharana et al., 2024) | 28.0 | 18.5 | 9.1 | 5.8 | 16.5 | 14.8 | 61.6 | 54.2 | 52.6 | 51.1 | 29.5 | 22.2 | 2.8 |
| GPT-4o | ReadAgent (Lee et al., 2024) | 14.6 | 10.0 | 4.2 | 3.2 | 8.8 | 8.4 | 12.5 | 10.3 | 6.8 | 6.1 | 11.7 | 8.5 | 5.0 |
| GPT-4o | MemoryBank (Zhong et al., 2024) | 6.5 | 4.7 | 2.5 | 2.4 | 6.4 | 5.3 | 8.3 | 7.1 | 4.4 | 3.7 | 6.0 | 4.7 | 6.0 |
| GPT-4o | MemGPT (Packer et al., 2024) | 30.4 | 22.8 | 17.3 | 13.2 | 12.2 | 11.9 | 60.2 | 53.4 | 35.0 | 34.3 | 32.0 | 25.7 | 3.2 |
| GPT-4o | A-Mem (Xu et al., 2025) | 32.9 | 23.8 | 39.4 | 31.2 | 17.1 | 15.8 | 48.4 | 43.0 | 36.4 | 35.5 | 36.1 | 28.4 | 2.4 |
| GPT-4o | SYNAPSE (Ours) | 39.3 | 29.5 | 55.5 | 50.3 | 29.5 | 23.9 | 46.5 | 38.8 | 97.8 | 97.7 | 43.4 | 35.2 | 1.6 |
| Qwen-1.5b | LoCoMo (Maharana et al., 2024) | 9.1 | 6.6 | 4.3 | 4.0 | 9.9 | 8.5 | 11.2 | 8.7 | 40.4 | 40.2 | 8.5 | 6.6 | 4.0 |
| Qwen-1.5b | ReadAgent (Lee et al., 2024) | 6.6 | 4.9 | 2.6 | 2.5 | 5.3 | 12.2 | 10.1 | 7.5 | 5.4 | 27.3 | 6.3 | 5.3 | 5.8 |
| Qwen-1.5b | MemoryBank (Zhong et al., 2024) | 11.1 | 8.3 | 4.5 | 2.9 | 8.1 | 6.2 | 13.4 | 11.0 | 36.8 | 34.0 | 10.0 | 7.5 | 3.6 |
| Qwen-1.5b | MemGPT (Packer et al., 2024) | 10.4 | 7.6 | 4.2 | 3.9 | 13.4 | 11.6 | 9.6 | 7.3 | 31.5 | 28.9 | 9.1 | 7.0 | 4.6 |
| Qwen-1.5b | A-Mem (Xu et al., 2025) | 18.2 | 11.9 | 24.3 | 19.7 | 16.5 | 14.3 | 23.6 | 19.2 | 46.0 | 43.3 | 20.4 | 15.0 | 2.0 |
| Qwen-1.5b | SYNAPSE (Ours) | 38.1 | 24.6 | 35.5 | 28.6 | 18.1 | 11.7 | 35.8 | 26.6 | 98.1 | 60.1 | 35.9 | 25.0 | 1.0 |
| Qwen-3b | LoCoMo (Maharana et al., 2024) | 4.6 | 4.3 | 3.1 | 2.7 | 4.6 | 6.0 | 7.0 | 5.7 | 17.0 | 14.8 | 4.7 | 4.3 | 4.8 |
| Qwen-3b | ReadAgent (Lee et al., 2024) | 2.5 | 1.8 | 3.0 | 3.0 | 5.6 | 5.2 | 3.3 | 2.5 | 15.8 | 14.0 | 2.9 | 2.4 | 5.8 |
| Qwen-3b | MemoryBank (Zhong et al., 2024) | 3.6 | 3.4 | 1.7 | 2.0 | 6.6 | 6.6 | 4.1 | 3.3 | 13.1 | 10.3 | 3.5 | 3.3 | 6.0 |
| Qwen-3b | MemGPT (Packer et al., 2024) | 5.1 | 4.3 | 2.9 | 3.0 | 7.0 | 7.1 | 7.3 | 5.5 | 14.5 | 12.4 | 5.2 | 4.4 | 4.6 |
| Qwen-3b | A-Mem (Xu et al., 2025) | 12.6 | 9.0 | 27.6 | 25.1 | 7.1 | 7.3 | 17.2 | 13.1 | 27.9 | 25.2 | 16.2 | 13.0 | 2.6 |
| Qwen-3b | MemoryOS (Li et al., 2025) | 21.4 | 15.0 | 26.2 | 22.4 | 10.2 | 8.2 | 23.3 | 15.4 | - | - | 22.1 | 16.2 | - |
| Qwen-3b | SYNAPSE (Ours) | 38.8 | 25.1 | 36.2 | 29.6 | 14.7 | 11.5 | 37.8 | 26.1 | 98.9 | 60.5 | 36.6 | 25.4 | 1.0 |
Structured retrieval is more valuable for weaker backbones.
Scaling to stronger backbones preserves the advantage, while exhaustive-context baselines remain strong in trivial lookup.
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing & Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.
Synapse: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang1††thanks: Equal contribution, Junhao Chen111footnotemark: 1, Yi Pan1, Ling Chen2, Weihang You1, Yifan Zhou1, Ruidong Zhang1, Lin Zhao3, Yohannes Abate4, Tianming Liu1††thanks: Corresponding author 1School of Computing, University of Georgia, Athens 2Department of Biosystems Engineering and Soil Science, University of Tennessee, Knoxville 3Department of Biomedical Engineering, New Jersey Institute of Technology, Newark 4Department of Physics and Astronomy, The University of Georgia, Athens
The evolution of Large Language Models (LLMs) from static responders to autonomous agents necessitates a fundamental rethinking of memory architecture Park et al. (2023); Yao et al. (2023); Schick et al. (2023). While LLMs demonstrate remarkable reasoning within finite context windows, their agency is brittle without the ability to accumulate experiences and maintain narrative coherence over long horizons Gutiérrez et al. (2024); Izacard et al. (2023). The predominant solution, Retrieval-Augmented Generation (RAG) Lewis et al. (2020), externalizes history into vector databases, retrieving information based on semantic similarity Guu et al. (2020); Asai et al. (2024). While effective for factual lookup Borgeaud et al. (2022), standard RAG imposes a critical limitation on reasoning agents: it treats memory as a static library to be indexed, rather than a dynamic network to be reasoned over Gutiérrez et al. (2024); Zhu et al. (2025).
We argue that existing systems suffer from Contextual Isolation, a failure mode stemming from the implicit Search Assumption: that the relevance of a past memory is strictly determined by its semantic proximity to the current query Zhu et al. (2025); Edge et al. (2025); Sarthi et al. (2024). This assumption collapses in scenarios requiring causal or transitive reasoning. Consider a user asking, “Why am I feeling anxious today?”. A vector-based system might retrieve recent mentions of “anxiety,” but fail to surface a schedule conflict logged weeks prior. Although this conflict is the root cause, it shares no lexical or embedding overlap with the query. While hierarchical frameworks such as MemGPT Packer et al. (2024) improve context management, they remain bound by query-driven retrieval, unable to autonomously surface structurally related yet semantically distinct information.
To bridge this gap, we draw inspiration from cognitive science theories of Spreading Activation Collins and Loftus (1975); Anderson (1983), which posit that human memory retrieval is not a search process, but a propagation of energy. Accessing one concept naturally activates semantically, temporally, or causally linked concepts without explicit prompting.
We introduce Synapse, a brain-inspired architecture that reimagines agentic memory. Unlike flat vector stores, Synapse constructs a Unified Episodic-Semantic Graph, where raw interaction logs (episodic nodes) are synthesized into abstract concepts (semantic nodes). Retrieval in Synapse is governed by activation dynamics: input signals inject energy into the graph, which propagates through temporal and causal edges. This mechanism enables the system to prioritize memories that are structurally salient to the current context, such as the aforementioned schedule conflict, even when direct semantic similarity is absent. To ensure focus, we implement lateral inhibition, a biological mechanism that suppresses irrelevant distractors.
We evaluate Synapse on the rigorous LoCoMo benchmark Maharana et al. (2024), which involves long-horizon dialogues averaging 16K tokens. Synapse establishes a new state-of-the-art (SOTA), significantly outperforming traditional RAG and recent agentic memory systems. Notably, our activation-based approach improves accuracy on complex multi-hop reasoning tasks by up to 23% while reducing token consumption by 95% compared to full-context methods.
In summary, our contributions are as follows:
Unified Episodic-Semantic Graph: We propose a dual-layer topology that synergizes granular interaction logs with synthesized abstract concepts, addressing the structural fragmentation inherent in flat vector stores.
Cognitive Dynamics with Uncertainty Gating: We introduce a retrieval mechanism governed by spreading activation and lateral inhibition to prioritize implicit relevance, coupled with a "feeling of knowing" protocol that robustly rejects hallucinations.
SOTA Performance & Efficiency: Synapse establishes a new state-of-the-art on the LoCoMo benchmark (+7.2 F1), improving multi-hop reasoning accuracy by 23% while reducing token consumption by 95% compared to full-context methods.
Systems such as MemGPT Packer et al. (2024), MemoryOS Li et al. (2025), and LangMem LangChain Team (2024) address context limitations by optimizing memory placement via policy-based controllers or hierarchical buffers Lewis et al. (2020); Nafee et al. (2025); Guu et al. (2020). However, these approaches treat memory items as independent textual units, lacking the mechanisms to model causal or structural relationships during retrieval Khandelwal et al. (2020). Consequently, they cannot recover linked memories absent surface-level similarity. In contrast, Synapse shifts the focus from storage management to reasoning, where relevance propagates through a structured network rather than relying on independent item retrieval.
Recent works introduce structure into agentic memory via explicit linking. A-Mem Xu et al. (2025) and AriGraph Anokhin et al. (2025) utilize LLMs to maintain dynamic knowledge graphs, while HippoRAG Gutiérrez et al. (2024) adapts Personal PageRank for retrieval. Crucially, methods like GraphRAG Edge et al. (2025) optimize for global sense-making via community detection, summarizing entire datasets at high computational cost. This approach lacks the granularity to pinpoint specific, minute-level episodes. In contrast, Synapse integrates cognitive dynamics (ACT-R) to strictly prioritize local relevance. By propagating activation along specific transitive paths (A→\toB→\toC) from query anchors, we recover precise context without traversing the global structure. This "biologically plausible" constraint—specifically the fan effect and inhibition—is not merely rhetorical but architectural: it enforces sparsity and competition, solving the "Hub Explosion" problem that plagues standard random-walk approaches in dense semantic graphs.
Standard retrieval methods like RAG and MemoryBank Zhong et al. (2024) rely fundamentally on vector similarity Karpukhin et al. (2020); Khattab and Zaharia (2020), representing memories as isolated points in embedding space Hu et al. (2025). Consequently, they struggle with queries requiring causal bridging between semantically dissimilar or distant events Yang et al. (2018); Qi et al. (2019); Trivedi et al. (2022); Thorne et al. (2018). Synapse overcomes this by encoding relationships as graph edges, enabling retrieval via relational paths Sun et al. (2018).
Drawing from cognitive Spreading Activation theory Collins and Loftus (1975); Anderson (1983) and ACT-R architectures Anderson (1983), we address the limitation of "seed dependence" in existing graph systems. While prior methods fail if the initial vector search misses the relevant subgraph (i.e., a "bad seed"), Synapse uses spreading activation to dynamically recover from suboptimal seeds, propagating energy to relevant contexts even under weak initial semantic overlap.
Building on the cognitive foundations outlined above, we now present Synapse, an agentic memory architecture that addresses Contextual Isolation through dynamic activation propagation. Our key insight is that relevance should emerge from distributed graph dynamics rather than being pre-computed through static links or determined solely by vector similarity. The overall framework of our proposed method is detailed in Figure 1.
We formulate the agent’s memory as a directed graph 𝒢=(𝒱,ℰ)\mathcal{G}=(\mathcal{V},\mathcal{E}). To capture both specific experiences and generalized knowledge, the vertex set 𝒱\mathcal{V} is partitioned into Episodic Nodes (𝒱E\mathcal{V}{E}) and Semantic Nodes (𝒱S\mathcal{V}{S}).
Each episodic node vie∈𝒱Ev_{i}^{e}\in\mathcal{V}{E} encapsulates a distinct interaction turn, represented as a tuple (ci,𝐡i,τi)(c{i},\mathbf{h}{i},\tau{i}), where cic_{i} is the textual content, 𝐡i∈ℝd\mathbf{h}{i}\in\mathbb{R}^{d} is the dense embedding produced by a sentence encoder (all-MiniLM-L6-v2), and τi\tau{i} is the timestamp. Semantic nodes vjs∈𝒱Sv_{j}^{s}\in\mathcal{V}{S} represent abstract concepts (e.g., entities, preferences) extracted by the LLM via prompted entity/concept extraction triggered every N=5N=5 turns. Duplicate detection uses embedding similarity with threshold τdup=0.92\tau{dup}=0.92. The complete graph construction algorithm is provided in Appendix A.1.
The edges ℰ\mathcal{E} define the retrieval pathways: (i) Temporal Edgeslink sequential episodes (vte→vt+1ev_{t}^{e}\rightarrow v_{t+1}^{e}); (ii) Abstraction Edgesbidirectionally connect episodes to relevant concepts within the same consolidation window (N=5N=5). This temporal association allows bridging concepts (e.g., "Mark" ↔\leftrightarrow "Ski Trip") via co-occurrence even without direct semantic similarity, enabling the "Bridge Node" effect (Figure 1); (iii) Association Edgesmodel latent correlations between concepts.
To prevent quadratic graph growth (O(|𝒱|2)O(|\mathcal{V}|^{2})) in long-horizon deployments, we enforce strict sparsity constraints: (1) Edge Pruning: Each node is limited to its Top-KK incoming edges (default K=15K=15); (2) Node Garbage Collection: Nodes with activation consistently below a dormancy threshold ϵ=0.01\epsilon=0.01 for W=10W=10 windows are archived to disk. This ensures the active graph remains compact (|𝒱|≤10,000|\mathcal{V}|\leq 10,000) while preserving retrieval speed.
Inspired by human semantic memory models (Collins and Loftus, 1975), we implement a dynamic activation process to prioritize information.
Given a query qq, we identify a set of anchor nodes 𝒯\mathcal{T} via a dual-trigger mechanism: (1) Lexical Trigger: We use BM25 sparse retrieval to capture exact entity matches (e.g., proper nouns like "Kendall"), ensuring precision for named entities; (2) Semantic Trigger: We use dense retrieval (all-MiniLM-L6-v2) to capture conceptual similarity (e.g., "Ski Trip"), maximizing recall for thematic queries. The union of Top-kk nodes from both streams forms the anchor set 𝒯\mathcal{T}. An initial activation vector 𝐚(0)\mathbf{a}^{(0)} is computed, where energy is injected only into anchors:
where sim(⋅)\text{sim}(\cdot) denotes cosine similarity and α\alpha is a scaling hyperparameter.
Following ACT-R (Anderson, 1983), we incorporate the fan effect to model attention dilution. The raw activation potential 𝐮i(t+1)\mathbf{u}_{i}^{(t+1)} is:
where S=0.8S=0.8 is the spreading factor, fan(j)=degout(j)\text{fan}(j)=\text{deg}{out}(j) is the out-degree, and wjiw{ji} represents edge weight: wji=e−ρ|τi−τj|w_{ji}=e^{-\rho|\tau_{i}-\tau_{j}|} for temporal edges (with time decay ρ=0.01\rho=0.01) and wji=sim(𝐡i,𝐡j)w_{ji}=\text{sim}(\mathbf{h}{i},\mathbf{h}{j}) for semantic edges.
To model attentional selection, highly activated concepts inhibit competitors before firing. We apply inhibition to the potential 𝐮i\mathbf{u}_{i}:
where 𝒯M\mathcal{T}_{M} is the set of MM highest-potential nodes (default M=7M=7) to enforce sparsity.
The inhibited potential is transformed into the final firing rate:
The cycle proceeds strictly as: Propagation (Eq. 2) →\rightarrow Lateral Inhibition (Eq. 3) →\rightarrow Non-linear Activation (Eq. 4). Stability is reached within T=3T=3 iterations.
To maximize recall in open-domain QA tasks, we propose a hybrid scoring function that fuses semantic, contextual, and structural signals. The relevance score 𝒮(vi)\mathcal{S}(v_{i}) is defined as:
The Top-kk nodes (default k=30k=30) are retrieved and re-ordered topologically. Factor scores are cached and updated only during consolidation (N=5N=5 turns) to maintain query latency independent of history length TT. Crucially, these components serve orthogonal roles: (1) PageRank acts as a Global Structural Prior, prioritizing universally important hubs (e.g., main characters) independent of the specific query; (2) Activation acts as a Local Contextual Signal, propagating query-specific relevance. Sensitivity analysis indicates robustness to λ3∈[0.1,0.3]\lambda_{3}\in[0.1,0.3], confirming PageRank’s role as a stable prior. This decoupling ensures that novel but locally relevant details are not drowned out by global hubs.
To robustly handle adversarial queries about non-existent entities, Synapse integrates a Meta-Cognitive Verification layer inspired by the "Feeling of Knowing" (FOK) in human memory monitoring. This mechanism operates via a dual-stage cognitive gating protocol:
We model retrieval confidence 𝒞ret\mathcal{C}{ret} as the activation energy of the top-ranked node. If 𝒞ret<τgate\mathcal{C}{ret}<\tau_{gate} (calibrated to τgate=0.12\tau_{gate}=0.12), the system activates a negative acknowledgement protocol, preemptively rejecting the query. This mirrors the brain’s ability to rapidly inhibit response generation when memory traces are insufficient.
For borderline cases effectively passing the gate, we employ a verification prompt that enforces a "strict evidence" constraint on the LLM: “Is this EXPLICITLY mentioned? If not, output ’Not mentioned’.” This forces the generator to distinguish between parametric knowledge hallucination and grounded retrieval.
We evaluate Synapse on the LoCoMo benchmark Maharana et al. (2024), a rigorous testbed for long-term conversational memory. Unlike standard datasets (e.g., Multi-Session Chat) with short contexts (∼\sim1K tokens), LoCoMo features extensive dialogues averaging 16K tokens across up to 35 sessions. We report the F1 Score and BLEU-1 Score across five cognitive categories: Single-Hop (C1C_{1}), Temporal (C2C_{2}), Open-Domain (C3C_{3}), Multi-Hop (C4C_{4}), and Adversarial (C5C_{5}).
To rigorously position Synapse, we benchmark against ten state-of-the-art methods spanning four distinct memory paradigms: System-level, Graph-based, Retrieval-based, and Agentic/Compression. We explicitly prioritized baselines designed for autonomous agentic memory—systems capable of stateful updates and continuous learning. We explicitly distinguish between static RAG (designed for fixed corpora) and agentic memory (designed for evolving interaction). While methods like HippoRAG Gutiérrez et al. (2024) utilize similar graph propagation, they are optimized for static pre-indexed corpora and lack the incremental update (O(1)O(1) write) and time-decay mechanisms required for continuous agentic dialogue. Thus, they are incompatible with the online read-write nature of the LoCoMo benchmark. Please refer to Appendix Table B and Table 5 for the complete taxonomy.
For Synapse, we utilize all-MiniLM-L6-v2 for embedding generation (dim=384). The Spreading Activation propagates for T=3T=3 steps with a retention parameter δ=0.5\delta=0.5 and temporal decay ρ=0.01\rho=0.01. The hybrid retrieval weights are set to λ={0.5,0.3,0.2}\lambda={0.5,0.3,0.2} (Semantic, Activation, Structural). To ensure a fair "Unified Backbone" comparison, we re-ran all reproducible baselines (marked with †\dagger in Table 1) using GPT-4o-mini with temperature 0.10.1. For baselines with fixed proprietary backends, we report their default strong model performance. We provide a detailed discussion on the sensitivity of each hyperparameter and justify our selection choices in Appendix C.
∗ To ensure fairness, we report the Performance as the Weighted F1 and BLEU-1 score averaged over the first four categories (excluding Adversarial). Task Rank denotes the mean rank. Statistical significance (p<0.05p<0.05) is confirmed via paired t-test on instance-level scores (N=500N=500). More details can be referred to Appendix A.3.
Table 1 details the comprehensive evaluation on the LoCoMo benchmark (GPT-4o-mini), reporting F1 and BLEU-1 scores across five distinct categories along with aggregate rankings.
Synapse establishes a new state-of-the-art with a weighted average F1 of 40.5 (calculated excluding the adversarial category for fair comparison). This performance represents a substantial margin of +7.2 points over A-Mem (33.3) and outperforms recent graph-based systems such as Zep (39.7) and AriGraph (33.7). Notably, Synapse secures a perfect task ranking of 1.0, demonstrating consistent dominance across all evaluated metrics.
Our model shows significant advantages in tasks requiring dynamic context reasoning. In Temporal Reasoning, Synapse attains an F1 score of 50.1 compared to 45.9 for A-Mem. This validates the efficacy of our time-aware activation decay, which correctly prioritizes recent information over semantically similar but obsolete memories. For Multi-Hop Reasoning, the spreading activation mechanism effectively propagates relevance across intermediate nodes, bridging disconnected facts that pure vector search fails to link (35.7 vs. 27.0 for A-Mem). Furthermore, regarding Adversarial Robustness, Synapse achieves near-perfect rejection rates (96.6 F1), significantly exceeding strong baselines like LoCoMo (69.2). Unlike baseline methods that lack explicit rejection protocols and often hallucinate plausible answers, our lateral inhibition and confidence gating empower the model to strictly distinguish valid retrieval from non-existent information.
On GPT-4o-mini, Synapse demonstrates exceptional stability against adversarial queries, attaining an Adversarial F1 of 96.6 via its uncertainty-aware rejection mechanism. Here, graph activation serves as an orthogonal confidence signal alongside semantic similarity. Unlike baselines that gate responses using brittle cosine-similarity heuristics—which often fail to distinguish paraphrasing from hallucinations—our design effectively separates low-evidence cases from valid retrieval. To prevent score inflation, we calibrated τgate\tau_{gate} on a held-out validation set, strictly bounding the false refusal rate below 2.5% on non-adversarial categories (See Appendix C.2 for detailed experiment). Crucially, our performance advantage is not driven solely by rejection: even with the gate disabled, Synapse maintains an average F1 of 40.3 (See Table 3), strictly outperforming Zep (39.7) and A-Mem (33.3). Paired t-tests confirm that the improvement over Zep remains statistically significant (p<0.05p<0.05) without gating. Furthermore, we report the weighted average excluding the adversarial category to ensure fair comparison; under this protocol, Synapse retains its top rank with an average F1 of 40.5, validating that the structural retrieval mechanism contributes independently of the rejection module.
Beyond GPT-4o-mini, we evaluate Synapse with multiple backbones and observe consistent trends; the full cross-backbone results and discussion are provided in Appendix F (Table 12).
To further elucidate the mechanisms behind Synapse’s superior performance, we conduct a qualitative analysis of retrieval behaviors compared to the strongest baseline, A-Mem. Table 2 presents three representative failure modes of semantic-only retrieval and how Synapse resolves them. In adversarial scenarios (row 1), A-Mem falls victim to Semantic Drift, retrieving hallucinations based on superficial keyword matches (e.g., retrieving “Rex” for “dog”). In contrast, Synapse’s meta-cognitive layer correctly identifies the adversarial intent and verifies the absence of the entity in the graph, preventing hallucination. For temporal queries (row 2), A-Mem exhibits Static Bias, favoring outdated but semantically high-scoring memories. Synapse’s spreading activation with temporal decay dynamically downweights obsolete information, ensuring the retrieval of current facts. Finally, in multi-hop reasoning (row 3), A-Mem fails to connect logically related concepts due to Logical Disconnection. Synapse’s graph traversal capabilities enable it to bridge these gaps, successfully inferring implicit connections through intermediate nodes. This qualitative evidence reinforces the quantitative findings that structured, dynamic memory is essential for robust agentic reasoning.
To understand the contribution of each component in Synapse, we conduct systematic ablations on GPT-4o-mini by selectively disabling retrieval mechanisms. Results are shown in Table 3.
Table 3 reveals that Synapse’s performance relies on the synergistic interaction of specific cognitive mechanisms rather than a single component. Specifically, Lateral Inhibition acts as a critical pre-filter for the uncertainty gate. While removing the gate (τgate=0\tau_{gate}=0) reduces Adversarial F1 to 67.2, further removing inhibition (β=0\beta=0) destabilizes the graph significantly. Without this winner-take-all competition, low-relevance "hallucination candidates" remain active enough to compete with valid nodes, degrading precision even on standard Single-Hop tasks. This confirms that inhibition is structurally necessary to separate signal from noise before the gating decision is even made.
Other dynamics target specific cognitive failures. The Fan Effect proves indispensable for associative reasoning; removing it causes a sharp decline in Open-Domain (25.9 →\rightarrow 16.8) and Multi-Hop scores. Without this attention dilution, "hub" nodes (common entities) accumulate excessive activation, flooding the graph with generic associations and drowning out specific signals. Similarly, Node Decay is the sole driver of timeline awareness. Setting δ=0\delta=0 destroys Temporal reasoning capabilities (50.1 →\rightarrow 14.2), as the model loses the ability to distinguish between current truths and obsolete facts based on activation energy.
At the system level, the necessity of our hybrid design is evident. Removing the spreading activation layer (“(-) Activation Dynamics”) regresses performance to that of a static graph (Avg 30.5), confirming that dynamics, not just topology, are essential for reasoning. Furthermore, relying on a geometric embedding space alone (“Vectors Only”) yields the lowest performance (Avg 25.2), validating that unstructured retrieval is insufficient for the long-horizon consistency required in agentic applications.
Beyond accuracy, practical deployment requires efficient resource utilization. Table 4 compares token usage, latency, and API cost across methods.
Synapse consumes only ∼\sim814 tokens per query on average, representing a 95% reduction compared to full-context methods (LoCoMo: 16,910; MemGPT: 16,977). This efficiency stems from our selective activation mechanism, which retrieves only the most contextually relevant subgraph rather than injecting entire conversation histories.
At $0.24 per 1,000 queries, Synapse is 11×\times cheaper than full-context approaches ($2.66–$2.67) while achieving nearly 2×\times higher performance. In terms of Cost Efficiency (F1/$F_{1}/\mathdollar), Synapse achieves a score of 167.3, surpassing MemoryOS (126.8) and significantly outperforming LoCoMo (9.6) and MemGPT (10.5). While LangMem achieves comparable cost efficiency (150.7) due to minimal overhead, its absolute performance (34.3 F1) lags behind. Note that graph construction costs are amortized over the lifetime of the agent and are negligible per-query.
With 1.9s average latency, Synapse is 4×\times faster than full-context methods (8.2–8.5s) and faster than ReadAgent (2.3s). We achieve a latency comparable to lightweight methods while delivering SOTA reasoning capabilities.
Figure 2 examines the impact of the Top-kk retrieval parameter on overall performance. The relatively flat performance curve suggests that Synapse is insensitive to precise kk selection within the sufficient range. We sweep k∈[10,50]k\in[10,50]. Crucially, at a modest k=30k=30, Synapse significantly outperforms A-Mem while incurring lower retrieval costs, proving that structural precision is more efficient than simply increasing context volume; see Appendix C for further details about more hyperparameters.
We presented Synapse, a cognitive architecture that resolves the Contextual Isolation of standard retrieval systems by emulating biological spreading activation. By modeling memory as a dynamic, associative graph, Synapse effectively unifies disjointed facts and filters irrelevant noise, establishing a new Pareto frontier for efficient, long-term agentic memory. Our results demonstrate that neuro-symbolic mechanisms can successfully bridge the gap between static vector retrieval and adaptive, structured cognition, paving the way for more autonomous and resilient AI agents.
While Synapse creates a new Pareto frontier for agentic memory, several limitations warrant discussion, outlining clear directions for future research.
First, the mechanisms that enable Synapse to excel at complex reasoning introduce specific trade-offs. One notable limitation is the Cold Start problem: the efficacy of spreading activation relies on a sufficiently connected topology. In nascent conversations with sparse history, the computational overhead of graph maintenance provides diminishing returns compared to simple linear buffers.
Additionally, lateral inhibition can occasionally lead to Cognitive Tunneling, causing performance drops on simple queries where exhaustive retrieval is superior. Finally, our current evaluation is constrained to the text modality via the LoCoMo benchmark. Since embodied agents increasingly require processing visual and auditory cues, a key direction for future work is extending Synapse to Multimodal Episodic Memory. By leveraging aligned embedding spaces, we aim to incorporate image and audio nodes into the unified graph, enabling structural reasoning across diverse modalities.
Our framework exhibits a dual dependency on LLM capabilities. On the upstream side, the topology of the Unified Graph is tightly coupled with the extraction quality of the underlying LLM. While GPT-4o-mini demonstrates robust schema adherence, smaller local models may struggle with consistent entity extraction, potentially leading to error propagation. On the downstream side, we rely on LLM-as-a-Judge for semantic evaluation. While we mitigate bias by separating the judge from the generator, model-based evaluation can still favor certain stylistic patterns. However, given the demonstrated failure of nn-gram metrics (Table 11), we maintain this is a necessary trade-off for accurate assessment.
Persistent graph structures introduce distinct privacy risks compared to ephemeral context windows. Centralized storage of semantic profiles creates a vector for "Memory Poisoning," where erroneous facts or malicious injections could permanently corrupt the knowledge store. Moreover, the indefinite retention of user data raises compliance concerns. Future iterations will focus on Automated Graph Auditing to detect inconsistencies and User-Controlled Forgetting (Machine Unlearning) mechanisms to ensure privacy compliance and robust memory maintenance.
The core capability of Synapse to accumulate long-term episodic memory inherently raises privacy concerns regarding the storage of sensitive user information. Unlike stateless LLMs that discard context after a session, our system persists interaction logs in a structured graph. While this persistence enables personalization, it necessitates strict data governance. In real-world deployments, the Episodic-Semantic Graph should be stored locally on the user’s device or in encrypted enclaves to prevent unauthorized access. Furthermore, our architecture supports granular forgetting. The temporal decay mechanism (δ\delta) and node pruning logic naturally mimic the “right to be forgotten,” preventing the indefinite retention of obsolete or sensitive data.
A critical ethical risk in memory-augmented agents is “memory hallucination,” where an agent confidently recalls events that never occurred. This phenomenon can lead to harmful advice or misinformation. Our work explicitly addresses this issue through the Uncertainty-Aware Rejection module. By calibrating the gating threshold (τgate\tau_{gate}) to prioritize precision over recall, as demonstrated in Section C.2, Synapse is designed to fail safely. The system refuses to answer when evidence is insufficient rather than fabricating details. This design choice reflects a commitment to safety-critical reliability over conversational fluency.
Our experiments utilize the LoCoMo benchmark, which consists of synthesized and fictional long-horizon dialogues. No real-world user data or Personally Identifiable Information (PII) was processed, stored, or exposed during this research. Future deployments involving human subjects would require explicit consent protocols regarding memory persistence duration and scope.
We provide the complete algorithm for incremental graph construction in Algorithm 1. The graph is built online as the agent interacts with users. In practice, pairwise similarity checks (Line 23) are optimized using HNSW indexing to maintain O(log|𝒱|)O(\log|\mathcal{V}|) scalable updates.
We employ a structured extraction approach to synthesize semantic nodes from episodic context. The extraction prompt follows a schema-guided paradigm, as shown in Figure 3.
To ensure a fair evaluation of overall performance, we calculate the Weighted F1 and BLEU-1 score across the four non-adversarial categories. This prevents the overall score from being skewed by categories with smaller sample sizes. The weighted average is computed as:
where SkS_{k} is the F1 (BLEU-1) score for category kk, and NkN_{k} is the number of instances. The specific instance counts for the LoCoMo benchmark are: Multi-Hop (N=841N=841), Single-Hop (N=282N=282), Temporal (N=321N=321), and Open-Domain (N=96N=96), resulting in a total of Ntotal=1540N_{total}=1540 valid evaluation samples.
We explicitly exclude the Adversarial category (C5C_{5}) from this weighted average. Since Synapse achieves near-perfect performance on adversarial rejection (96.6 F1) due to our dedicated gating mechanism, including it would disproportionately inflate our overall score compared to baselines that lack such modules. By omitting it, we ensure a fair comparison that highlights our model’s superior retrieval and reasoning capabilities across standard tasks with specific numbers, rather than masking gaps with rejection success.
Task Rank denotes the arithmetic mean rank of a method across all five evaluation categories, serving as a holistic metric for model versatility. To validate result reliability, we conduct a paired t-test on instance-level F1 scores comparing Synapse against the second-best performing baseline. Differences are considered statistically significant at p<0.05p<0.05. This verification is performed on a representative subset of N=500N=500 instances to confirm that improvements are robust against stochastic variance.
To comprehensively evaluate the effectiveness of Synapse, we compare it against a diverse set of state-of-the-art long-term memory mechanisms. These baselines represent the current landscape of memory augmentation for LLMs. We classify these methods into four primary categories based on their underlying data structures and retrieval mechanisms, as detailed in Table 5.
We conduct a systematic sensitivity analysis to examine the robustness of Synapse to hyperparameter choices (Table 6). All experiments are performed on the GPT-4o-mini backbone using the LoCoMo benchmark.
(1) Propagation depth TT is the most sensitive parameter, with performance degrading significantly if the graph is traversed too shallowly or too deeply. (2) Node Decay rate δ\delta directly impacts temporal reasoning; an optimal balance (δ=0.5\delta=0.5) is needed to retain recent history without noise. (3) Inhibition Top-MM (Sparsity) shows a clear peak around M=7M=7. Setting MM too low (3) over-prunes context, while setting it too high (10) introduces irrelevant noise. (4) Spreading factor S=0.8S=0.8 achieves optimal diffusion, allowing relevance to flow to related concepts without saturating the graph.
We calibrate the uncertainty gating threshold τgate\tau_{gate} on a held-out validation set (10% of samples) to strictly balance robustness against utility. Table 7 illustrates the sensitivity analysis.
We observe a clear "elbow" point at τgate=0.12\tau_{gate}=0.12. Below this threshold, increasing the gate provides massive gains in Adversarial robustness (60.2 →\rightarrow 96.6) with negligible impact on valid queries. However, pushing beyond 0.12 yields diminishing returns: raising τgate\tau_{gate} to 0.15 improves Adversarial F1 by only 0.6 points but nearly doubles the False Refusal Rate (FRR) from 2.1% to 4.2%. Notably, the ability to achieve near-perfect rejection at such a low threshold (τ≈0.12\tau\approx 0.12) indicates a strong Signal-to-Noise Ratio in our graph. The lateral inhibition mechanism effectively suppresses irrelevant nodes close to zero, creating a clean margin between valid retrieval (high activation) and hallucination (low activation), minimizing the need for aggressive thresholding.
Table 8 reports the mean F1 scores and standard deviations across three independent runs. The low standard deviations (≤0.5\leq 0.5) confirm that our method is stable and not dependent on favorable random initialization.
We evaluate models on subsets of the LoCoMo test set where the semantic similarity between the evidence and the question falls below specific thresholds (0.5 and 0.3).
As shown in Table 9, Synapse exhibits strong robustness (drop <8%<8%), whereas A-Mem suffers significant degradation (drop >50%>50%). This validates that our graph spreading mechanism reduces reliance on purely surface-level vector similarity.
Table 10 presents the LLM-as-a-Judge evaluation results, offering a more nuanced perspective than rigid nn-gram metrics. Synapse achieves the highest semantic correctness across all categories (Overall 80.7), significantly outperforming strong baselines like ENGRAM (77.6) and MemoryOS (67.7).
The performance gap is most pronounced in the Multi-Hop category, where Synapse scores 84.2, establishing a clear margin over MemoryOS (63.7) and AriGraph (28.2). This validates our core hypothesis: while hierarchical or vector-based systems struggle to retrieve disconnected evidence chains, Synapse’s spreading activation successfully propagates relevance across intermediate nodes, reconstructing the full reasoning path.
In the Temporal category, Synapse (72.1) and MemoryOS (72.7) are the only two methods surpassing the 70-point threshold. This parity is instructive: MemoryOS explicitly optimizes for memory updates (OS-like read/write), whereas Synapse achieves this implicitly through temporal decay dynamics. The fact that our decay-based mechanism matches a dedicated memory-management system suggests that "forgetting" is as crucial as "remembering" for maintaining an accurate timeline.
Table 11 provides a granular look at why standard metrics (F1/BLEU) systematically undervalue agentic memory systems. We identify three distinct phenomena where Synapse demonstrates superior intelligence that is penalized by rigid string matching.
In temporal queries, the ground truth is often a static string extracted from past context (e.g., "Since 2016"). However, Synapse often performs arithmetic reasoning relative to the current timeframe (e.g., "Seven years", assuming the current year is 2023). As shown in Table 11 (row 11), this results in an F1 score of 0.0 despite the answer being factually perfect. This confirms that Synapse is not merely retrieving text chunks but is understanding time as a dynamic variable.
For questions like "What motivated counseling?", the ground truth is often a concise extraction ("Her journey"). Synapse, leveraging its connected graph, retrieves the broader context of her motivations ("Her own struggles and desire to help"). While this verbosity lowers overlap ratios (F1: 22.2), the LLM Judge correctly identifies it as a more complete and nuanced answer (Score: 100), demonstrating that our method preserves the richness of user history better than extractive baselines.
In Multi-Hop scenarios, Synapse tends to answer with implications rather than direct quotes. When asked if someone is an "ally," Synapse synthesizes evidence of support ("Yes, Melanie supports and encourages…") rather than just outputting "Yes". This behavior mimics human memory—reconstructing the gist rather than rote memorization—which is essential for naturalistic interaction but challenging for lexical metrics.
We analyze a representative failure case (Figure 4) where aggressive activation dynamics lead to the suppression of minor details.
Table 12 presents the performance of Synapse and baselines across different LLM backbones (GPT-4o, Qwen-1.5b, Qwen-3b). We highlight two consistent observations.
Note: Main results for GPT-4o-mini are provided in Table 1. Values here differ due to different backbones. "All" rows in Table 8 denote the same validation set logic as Table 1. Category Average Model Method Multi-Hop Temporal Open Domain Single-Hop Adversarial Performance∗ Task F1 BLEU F1 BLEU F1 BLEU F1 BLEU F1 BLEU F1 BLEU Rank GPT-4o LoCoMo Maharana et al. (2024) 28.0 18.5 9.1 5.8 16.5 14.8 61.6 54.2 52.6 51.1 29.5 22.2 2.8 ReadAgent Lee et al. (2024) 14.6 10.0 4.2 3.2 8.8 8.4 12.5 10.3 6.8 6.1 11.7 8.5 5.0 MemoryBank Zhong et al. (2024) 6.5 4.7 2.5 2.4 6.4 5.3 8.3 7.1 4.4 3.7 6.0 4.7 6.0 MemGPT Packer et al. (2024) 30.4 22.8 17.3 13.2 12.2 11.9 60.2 53.4 35.0 34.3 32.0 25.7 3.2 A-Mem Xu et al. (2025) 32.9 23.8 39.4 31.2 17.1 15.8 48.4 43.0 36.4 35.5 36.1 28.4 2.4 Synapse (Ours) 39.3 29.5 55.5 50.3 29.5 23.9 46.5 38.8 97.8 97.7 43.4 35.2 1.6 Qwen-1.5b LoCoMo Maharana et al. (2024) 9.1 6.6 4.3 4.0 9.9 8.5 11.2 8.7 40.4 40.2 8.5 6.6 4.0 ReadAgent Lee et al. (2024) 6.6 4.9 2.6 2.5 5.3 12.2 10.1 7.5 5.4 27.3 6.3 5.3 5.8 MemoryBank Zhong et al. (2024) 11.1 8.3 4.5 2.9 8.1 6.2 13.4 11.0 36.8 34.0 10.0 7.5 3.6 MemGPT Packer et al. (2024) 10.4 7.6 4.2 3.9 13.4 11.6 9.6 7.3 31.5 28.9 9.1 7.0 4.6 A-Mem Xu et al. (2025) 18.2 11.9 24.3 19.7 16.5 14.3 23.6 19.2 46.0 43.3 20.4 15.0 2.0 Synapse (Ours) 38.1 24.6 35.5 28.6 18.1 11.7 35.8 26.6 98.1 60.1 35.9 25.0 1.0 Qwen-3b LoCoMo Maharana et al. (2024) 4.6 4.3 3.1 2.7 4.6 6.0 7.0 5.7 17.0 14.8 4.7 4.3 4.8 ReadAgent Lee et al. (2024) 2.5 1.8 3.0 3.0 5.6 5.2 3.3 2.5 15.8 14.0 2.9 2.4 5.8 MemoryBank Zhong et al. (2024) 3.6 3.4 1.7 2.0 6.6 6.6 4.1 3.3 13.1 10.3 3.5 3.3 6.0 MemGPT Packer et al. (2024) 5.1 4.3 2.9 3.0 7.0 7.1 7.3 5.5 14.5 12.4 5.2 4.4 4.6 A-Mem Xu et al. (2025) 12.6 9.0 27.6 25.1 7.1 7.3 17.2 13.1 27.9 25.2 16.2 13.0 2.6 MemoryOS Li et al. (2025) 21.4 15.0 26.2 22.4 10.2 8.2 23.3 15.4 – – 22.1 16.2 – Synapse (Ours) 38.8 25.1 36.2 29.6 14.7 11.5 37.8 26.1 98.9 60.5 36.6 25.4 1.0
On the resource-constrained Qwen-3b, Synapse achieves an Average F1 of 36.6, substantially outperforming MemoryOS (22.1) and A-Mem (16.2). This suggests that explicitly structured activation can partially compensate for the limited reasoning capacity of smaller models: rather than relying on the backbone to infer long-range dependencies from retrieved text alone, the retrieval stage itself exposes relationally relevant evidence through activation propagation.
On GPT-4o, Synapse further improves to an Average F1 of 43.4, indicating that stronger backbones can better exploit the retrieved subgraph once the relevant evidence is surfaced. Meanwhile, LoCoMo retains an advantage in simple Single-Hop retrieval (61.6 vs. 46.5), which is expected because it operates on near-exhaustive context access. Importantly, Synapse consistently dominates in complex reasoning categories (e.g., Multi-Hop and Temporal), supporting the claim that the core benefit stems from structured activation rather than brute-force context injection.
Table: S4.T1: Main results on the LoCoMo benchmark (GPT-4o-mini). Normalized results across all categories. Extended results for other backbones are provided in Appendix F.
| Category | Average | ||||||||||||
| Method | Multi-Hop | Temporal | Open Domain | Single-Hop | Adversarial | Performance∗ | Task | ||||||
| F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | Rank | |
| MemoryBank† Zhong et al. (2024) | 5.0 | 4.8 | 9.7 | 7.0 | 5.6 | 5.9 | 6.6 | 5.2 | 7.4 | 6.5 | 6.3 | 5.4 | 11.6 |
| ReadAgent† Lee et al. (2024) | 9.2 | 6.5 | 12.6 | 8.9 | 5.3 | 5.1 | 9.7 | 7.7 | 9.8 | 9.0 | 9.8 | 7.1 | 11.0 |
| ENGRAM Patel and Patel (2025) | 18.3 | 13.2 | 21.9 | 14.7 | 8.6 | 5.5 | 23.1 | 13.7 | 33.5 | 19.4 | 19.3 | 13.1 | 9.2 |
| GraphRAG† Edge et al. (2025) | 16.5 | 11.8 | 22.4 | 15.2 | 10.1 | 8.4 | 24.5 | 18.2 | 15.2 | 12.0 | 18.3 | 14.2 | 8.8 |
| MemGPT† Packer et al. (2024) | 26.7 | 17.7 | 25.5 | 19.4 | 9.2 | 7.4 | 41.0 | 34.3 | 43.3 | 42.7 | 28.0 | 20.5 | 7.2 |
| LoCoMo† Maharana et al. (2024) | 25.0 | 19.8 | 18.4 | 14.8 | 12.0 | 11.2 | 40.4 | 29.1 | 69.2 | 68.8 | 25.6 | 19.9 | 7.0 |
| LangMem LangChain Team (2024) | 34.5 | 23.7 | 30.8 | 25.8 | 24.3 | 19.2 | 40.9 | 33.6 | 47.6 | 46.3 | 34.3 | 25.7 | 5.0 |
| A-Mem† Xu et al. (2025) | 27.0 | 20.1 | 45.9 | 36.7 | 12.1 | 12.0 | 44.7 | 37.1 | 50.0 | 49.5 | 33.3 | 26.2 | 4.8 |
| MemoryOS Li et al. (2025) | 35.3 | 25.2 | 41.2 | 30.8 | 20.0 | 16.5 | 48.6 | 43.0 | – | – | 38.0 | 29.1 | – |
| AriGraph Anokhin et al. (2025) | 28.5 | 21.0 | 43.2 | 33.5 | 14.5 | 13.0 | 45.1 | 38.0 | 48.5 | 47.0 | 33.7 | 26.2 | 4.6 |
| Zep Rasmussen et al. (2025) | 35.5 | 25.8 | 48.5 | 40.2 | 23.1 | 18.0 | 48.0 | 41.5 | 65.4 | 64.0 | 39.7 | 31.2 | 2.6 |
| Synapse (Ours) | 35.7 | 26.2 | 50.1 | 44.5 | 25.9 | 19.2 | 48.9 | 42.9 | 96.6 | 96.4 | 40.5 | 32.6 | 1.0 |
Table: S4.T2: Qualitative Comparison of Retrieval Behaviors. Synapse demonstrates superior handling of temporal updates, multi-hop reasoning chains, and adversarial inputs compared to the semantic-only A-Mem baseline.
| System Component | A-Mem (Baseline) | Synapse (Ours) |
|---|---|---|
| Uncertainty-Aware Rejection (Confidence Gating) | Error: Semantic Drift Top-1 Retrieved: “Melanie’s kids love playing with their toy dinosaur, Rex.” ✗ False Association: Matches query ‘dog‘ with semantic neighbor ‘Rex‘, ignoring context. →\rightarrow Hallucination: “She has a dog named Rex.” | Success: Confidence Gating Check: 𝒞ret<τgate(0.12)\mathcal{C}{ret}<\tau{gate}(0.12) Action: Trigger Negative Acknowledgement Protocol. ✓ Rejection: Low confidence preempts generation. →\rightarrow Response: “No record of such pet found.” |
| Spreading Activation (Dynamic Context) | Error: Temporal Obsolescence Top-1 Retrieved: [D4:3] “Caroline moved from Sweden 4 years ago…” (Score: 0.92) ✗ Static Bias: High cosine similarity to query “where living” dominates. →\rightarrow Output: “She lives in Sweden.” | Success: Temporal Decay Action: Sfinal=Ssem+λ⋅Sdecay(t)S_{final}=S_{sem}+\lambda\cdot S_{decay}(t) Trace: D4:3 (Sweden) decay →\rightarrow 0.4. D1:1 (US) boost →\rightarrow 0.95. ✓ Reranking: Prioritizes current state over semantic overlap. →\rightarrow Output: “Currently in the US.” |
| Knowledge Graph (Structure) | Error: Logical Disconnection Top-1 Retrieved: “Caroline collects books.” (matches ‘Dr. Seuss‘) ✗ Missing Link: Fails to bridge ‘collects books‘ ↔\leftrightarrow ‘Dr. Seuss‘ without explicit overlap. →\rightarrow Output: “Uncertain/No info.” | Success: Multi-Hop Inference Action: 𝒢walk(Caroline,Dr. Seuss,k=2)\mathcal{G}_{walk}(\text{Caroline},\text{Dr. Seuss},k=2) Path: Caroline→collectsClassic Books→containsDr. Seuss\text{Caroline}\xrightarrow{\text{collects}}\text{Classic Books}\xrightarrow{\text{contains}}\text{Dr. Seuss} ✓ Bridging: Uses graph structure to infer implicit connection. →\rightarrow Output: “Yes, likely has them.” |
Table: S4.T3: Mechanism Ablation Study. Impact of selectively disabling cognitive components on F1 scores (GPT-4o-mini). Removing specific dynamics causes targeted drops in corresponding task categories, validating our theoretical design.
| Configuration | M-Hop | Temp. | Open | Single | Adv. | Avg. |
| Synapse (Full) | 35.7 | 50.1 | 25.9 | 48.9 | 96.6 | 40.5 |
| Micro-Dynamics Ablation (Mechanism-Level) | ||||||
| (-) Uncertainty Gating (τgate=0\tau_{gate}=0) | 35.6 | 50.0 | 25.4 | 48.8 | 67.2 | 40.3 |
| (-) Lateral Inhibition (β=0\beta=0) | 35.1 | 49.8 | 22.4 | 49.1 | 71.5 | 39.4 |
| (-) Fan Effect (No Dilution) | 30.2 | 48.5 | 16.8 | 47.5 | 94.2 | 36.1 |
| (-) Node Decay (δ=0\delta=0) | 34.8 | 14.2 | 24.5 | 48.2 | 95.8 | 30.7 |
| Macro-Architecture Ablation (System-Level) | ||||||
| (-) Activation Dynamics | 31.2 | 23.7 | 18.2 | 48.9 | 70.4 | 30.5 |
| (-) Graph Structure | 35.2 | 25.4 | 21.0 | 49.9 | 88.2 | 32.9 |
| Vectors Only (Baseline) | 27.5 | 14.7 | 12.5 | 46.0 | 69.2 | 25.2 |
Table: S4.T4: Efficiency Profile. Comparison on GPT-4o-mini. Latency is measured on a single NVIDIA A100 GPU averaging over 100 queries; "Cost" reflects Total API Cost (Input + Output Tokens) at standard rates.
| Method | Token Length | Latency | Cost/1k Queries | F1 (Excl. Adv.)* | Cost Eff. (F1/$F_{1}/\mathdollar) |
|---|---|---|---|---|---|
| LoCoMo Maharana et al. (2024) | ∼\sim16,910 | 8.2s | $2.67 | 25.6 | 9.6 |
| MemGPT Packer et al. (2024) | ∼\sim16,977 | 8.5s | $2.67 | 28.0 | 10.5 |
| A-Mem Xu et al. (2025) | ∼\sim2,520 | 5.4s | $0.50 | 33.3 | 66.9 |
| MemoryOS Li et al. (2025) | ∼\sim1,198 | 1.5s | $0.30 | 38.0 | 126.8 |
| ReadAgent Lee et al. (2024) | ∼\sim643 | 2.3s | $0.22 | 9.8 | 45.3 |
| LangMem LangChain Team (2024) | ∼\sim717 | 0.6s | $0.23 | 34.3 | 150.7 |
| MemoryBank Zhong et al. (2024) | ∼\sim432 | 1.2s | $0.18 | 6.3 | 34.1 |
| Synapse (Ours) | ∼\sim814 | 1.9s | $0.24 | 40.5 | 167.3 |
Table: A1.T5: Taxonomy of baseline methods compared in our experiments. We categorize methods based on their core memory representation and retrieval mechanism.
| Category | Method | Key Mechanism | Reference |
| System-level | MemGPT | Hierarchical memory management with virtual context paging (Main vs. External Context). | Packer et al. (2024) |
| MemoryOS | OS-inspired memory hierarchy optimizing read/write operations. | Li et al. (2025) | |
| Mem0 | Self-improving memory layer for personalization and continuity. | Chhikara et al. (2025) | |
| Graph-based | AriGraph | Episodic and semantic memory organized as a dynamic graph structure. | Anokhin et al. (2025) |
| GraphRAG | Leverages community detection on knowledge graphs for global/local retrieval. | Edge et al. (2025) | |
| Zep | Knowledge graph-based memory designed for entity relationships. | Rasmussen et al. (2025) | |
| Synapse | Hybrid spreading activation with dynamic structure (Ours). | – | |
| Retrieval | MemoryBank | Retrieval-based memory incorporating the Ebbinghaus forgetting curve. | Zhong et al. (2024) |
| ENGRAM | Advanced latent memory clustering and retrieval mechanism. | Patel and Patel (2025) | |
| LangMem | Memory injection via in-context learning or fine-tuning updates. | LangChain Team (2024) | |
| Agentic | ReadAgent | Agentic system that paginates long context and generates gist memories. | Lee et al. (2024) |
| LoCoMo | Local Context Motion for compressing and selecting relevant blocks. | Maharana et al. (2024) | |
| A-Mem | Adaptive agentic memory system capable of self-updating summaries. | Xu et al. (2025) |
Table: A3.T6: Hyperparameter sensitivity analysis on LoCoMo (GPT-4o-mini). Default values are marked with †\dagger.
| Parameter | Value | M-Hop | Temp. | Avg |
| Spreading SS | 0.6 | 32.8 | 47.5 | 38.3 |
| 0.8† | 35.7 | 50.1 | 40.5 | |
| 1.0 | 33.5 | 48.0 | 38.8 | |
| Node Decay δ\delta | 0.3 | 34.5 | 51.2 | 40.0 |
| 0.5† | 35.7 | 50.1 | 40.5 | |
| 0.7 | 33.8 | 46.5 | 38.1 | |
| Steepness γ\gamma | 3.0 | 34.9 | 49.3 | 39.8 |
| 5.0† | 35.7 | 50.1 | 40.5 | |
| 7.0 | 35.1 | 49.7 | 40.0 | |
| Threshold θ\theta | 0.3 | 32.9 | 48.8 | 39.0 |
| 0.5† | 35.7 | 50.1 | 40.5 | |
| 0.7 | 34.1 | 49.2 | 39.5 | |
| Inhibition β\beta | 0.10 | 35.4 | 49.9 | 40.1 |
| 0.15† | 35.7 | 50.1 | 40.5 | |
| 0.20 | 35.2 | 49.6 | 39.9 | |
| Propagation TT | 2 | 31.5 | 46.8 | 37.7 |
| 3† | 35.7 | 50.1 | 40.5 | |
| 4 | 35.2 | 49.8 | 40.1 | |
| Inhibition MM | 3 | 33.5 | 48.9 | 39.2 |
| 7† | 35.7 | 50.1 | 40.5 | |
| 10 | 34.8 | 49.3 | 39.8 |
Table: A3.T7: Impact of gating threshold τgate\tau_{gate} on Adversarial F1 and False Refusal Rate (FRR) on non-adversarial queries. Our selected threshold of 0.120.12 creates a "safe operating window" with <2.5% false refusals.
| τgate\tau_{gate} | Adv. F1 | FRR (Non-Adv) | Verdict |
|---|---|---|---|
| 0.00 | 60.2 | 0.0% | Baseline |
| 0.05 | 94.2 | 0.8% | Conservative |
| 0.10 | 95.8 | 1.5% | Balanced |
| 0.12 | 96.6 | 2.1% | Selected |
| 0.15 | 97.2 | 4.2% | Aggressive |
| 0.20 | 98.1 | 8.5% | Unsafe |
Table: A4.T8: Statistical stability of Synapse across 3 random seeds (GPT-4o-mini).
| Category | F1 Score |
|---|---|
| Multi-Hop | 35.7 ±\pm 0.1 |
| Temporal | 50.1 ±\pm 0.3 |
| Open-Domain | 25.9 ±\pm 0.2 |
| Single-Hop | 48.9 ±\pm 0.1 |
| Adversarial | 96.6 ±\pm 0.1 |
| Average | 40.5 ±\pm 0.2 |
Table: A4.T9: LoCoMo QA results (F1, %) on low-similarity subsets. ↓\downarrowF1 denotes relative performance drop.
| Model | Thres. | M-Hop | Temp. | Open | Single | Adv. | ↓\downarrowF1 |
|---|---|---|---|---|---|---|---|
| A-Mem | All | 32.9 | 39.4 | 17.1 | 48.4 | 36.4 | – |
| 0.5 | 20.2 | 19.3 | 11.5 | 28.8 | 19.4 | (43.1%) | |
| 0.3 | 14.6 | 16.3 | 9.5 | 19.7 | 16.0 | (56.3%) | |
| Synapse | All | 39.3 | 55.5 | 29.5 | 46.5 | 97.8 | – |
| 0.5 | 42.8 | 49.4 | 22.2 | 44.8 | 95.3 | (5.3%) | |
| 0.3 | 42.3 | 47.5 | 21.5 | 43.8 | 93.7 | (7.4%) |
Table: A4.T10: LLM-as-a-Judge Semantic Scores (0-100). Synapse dominates in complex reasoning tasks (Multi-Hop), validating the efficacy of graph-based activation.
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| MemoryBank Zhong et al. (2024) | 30.5 | 14.2 | 45.3 | 35.8 | 23.6 |
| ReadAgent Lee et al. (2024) | 37.1 | 16.5 | 50.2 | 41.5 | 27.6 |
| LoCoMo Maharana et al. (2024) | 38.5 | 17.8 | 53.0 | 48.2 | 30.1 |
| A-Mem Xu et al. (2025) | 39.8 | 18.9 | 54.1 | 49.9 | 31.4 |
| Mem0 Chhikara et al. (2025) | 67.1 | 51.2 | 75.7 | 58.1 | 57.1 |
| MemGPT Packer et al. (2024) | 41.2 | 19.5 | 55.8 | 50.4 | 32.2 |
| AriGraph Anokhin et al. (2025) | 45.5 | 28.2 | 60.1 | 51.5 | 38.2 |
| LangMem LangChain Team (2024) | 62.2 | 47.9 | 71.1 | 23.4 | 46.9 |
| Zep Rasmussen et al. (2025) | 61.7 | 41.4 | 76.6 | 49.3 | 49.0 |
| MemoryOS Li et al. (2025) | 78.3 | 63.7 | 54.6 | 72.7 | 67.7 |
| ENGRAM Patel and Patel (2025) | 79.9 | 79.8 | 72.9 | 70.8 | 77.6 |
| Synapse | 81.5 | 84.2 | 76.8 | 72.1 | 80.7 |
Table: A5.T11: Expanded Analysis of Metric Divergence. Examples where Synapse generates semantically accurate responses that are penalized by F1 scores due to synonymy, verbosity, or date formatting.
| Category | Question | Ground Truth | Synapse Output | F1 | Judge |
|---|---|---|---|---|---|
| Single-Hop | What is Caroline’s identity? | Transgender woman | Caroline is transgender. | 40.0 | 100 |
| Who supports Caroline? | Her mentors, family | Her support system, those close to her | 16.7 | 90 | |
| What motivated counseling? | Her journey and how it improved life | Her own struggles and desire to help | 22.2 | 100 | |
| What was grandma’s gift? | Necklace | A necklace symbolizing love | 33.3 | 100 | |
| Transition changes faced? | Changes to her body | Exploring her changing body | 50.0 | 100 | |
| Multi-Hop | Considered an ally? | Yes, she is supportive | Yes, Melanie supports and encourages… | 40.0 | 100 |
| Likely enjoy Vivaldi? | Yes; it’s classical | Yes, she enjoys classical music. | 33.3 | 100 | |
| Likely have Dr. Seuss? | Yes, since she collects classics | Yes, likely for their creativity… | 15.4 | 100 | |
| Political leaning? | Liberal | Progressive or liberal. | 50.0 | 100 | |
| Realization after race? | Self-care is important | Importance of taking care of minds | 20.0 | 100 | |
| Temporal | How long practicing art? | Since 2016 | Seven years (relative to 2023) | 0.0 | 100 |
| Adoption meeting date? | Friday before 15 July | 14 July 2023 | 50.0 | 100 | |
| When was the picnic? | Week before 6 July | 29 June 2023 | 25.0 | 100 | |
| When was charity race? | Sunday before 25 May | 20 May 2023 | 50.0 | 100 | |
| Pottery class date? | 2 July 2023 | 02 July 2023 | 66.7 | 100 |
Overview of the Synapse architecture. (Left) A user query regarding "that guy from the ski trip" activates the graph via Dual Triggers: Lexical matching targets explicit entities ("Kendall"), while Semantic embedding targets implicit concepts ("Ski Trip"). (Center) Spreading Activation dynamically propagates relevance through the Unified Episodic-Semantic Graph. Note how the bridge node "Mark" (purple) is activated despite not appearing in the query, connecting the disjoint concepts of "Ski Trip" and "Dating". (Right) The Triple Hybrid Scoring layer reranks candidates, successfully retrieving the ground truth ("broke up with Mark") while suppressing semantically similar but logically irrelevant distractors ("going skiing") via lateral inhibition.
Sensitivity analysis of Top-kk retrieval on LoCoMo benchmark. Performance is robust across k∈[20,40]k\in[20,40], with optimal stability around k=30k=30. Star markers denote A-Mem baseline performance at their experiment settings.
$$ \mathbf{a}_i^{(0)} = \begin{cases} \alpha \cdot \text{sim}(\mathbf{h}_i, \mathbf{h}_q) & \text{if } v_i \in \mathcal{T} \ 0 & \text{otherwise} \end{cases} $$
$$ \label{eq:propagation} \mathbf{u}i^{(t+1)} = (1 - \delta) \mathbf{a}i^{(t)} + \sum{j \in \mathcal{N}(i)} \frac{S \cdot w{ji} \cdot \mathbf{a}_j^{(t)}}{\text{fan}(j)} $$ \tag{eq:propagation}
$$ \label{eq:inhibition} \begin{split} \hat{\mathbf{u}}_i^{(t+1)} = \max\Big(0, \ & \mathbf{u}i^{(t+1)} \ & - \beta \sum{k \in \mathcal{T}_M} (\mathbf{u}_k^{(t+1)} - \mathbf{u}_i^{(t+1)}) \ & \cdot \mathbb{I}[\mathbf{u}_k^{(t+1)} > \mathbf{u}_i^{(t+1)}]\Big) \end{split} $$ \tag{eq:inhibition}
$$ \label{eq:sigmoid} \mathbf{a}_i^{(t+1)} = \sigma(\hat{\mathbf{u}}_i^{(t+1)}) = \frac{1}{1 + \exp(-\gamma(\hat{\mathbf{u}}_i^{(t+1)} - \theta))} $$ \tag{eq:sigmoid}
$$ \text{Weighted F1 (BLEU-1)} = \frac{\sum_{k \in \mathcal{C}} N_k \cdot S_k}{\sum_{k \in \mathcal{C}} N_k} $$
$$ \displaystyle\mathcal{S}(v_{i})= $$ \tag{S3.E5X}
$$ \label{eq:scoring} \begin{aligned} \mathcal{S}(v_i) = \ & \lambda_1 \cdot \text{sim}(\mathbf{h}_i, \mathbf{h}_q) \ + \ & \lambda_2 \cdot \mathbf{a}_i^{(T)} \ + \ & \lambda_3 \cdot \text{PageRank}(v_i) \end{aligned} $$ \tag{eq:scoring}
Algorithm: algorithm
[h]
\caption{Incremental Graph Construction}
\label{alg:graph_construction}
\small
\begin{algorithmic}[1]
\Require Conversation stream $\{(u_t, r_t)\}_{t=1}^T$, consolidation interval $N=5$
\Ensure Unified graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
\State Initialize $\mathcal{V}_E \gets \emptyset$, $\mathcal{V}_S \gets \emptyset$, $\mathcal{E} \gets \emptyset$
\For{each turn $t$}
\State $c_t \gets \texttt{concat}(u_t, r_t)$
\State $\mathbf{h}_t \gets \texttt{Encoder}(c_t)$ \Comment{all-MiniLM-L6-v2}
\State $v_t^e \gets (c_t, \mathbf{h}_t, \tau_t)$; \quad $\mathcal{V}_E \gets \mathcal{V}_E \cup \{v_t^e\}$
\If{$t > 1$}
\State $\mathcal{E} \gets \mathcal{E} \cup \{(v_{t-1}^e, v_t^e, w=1.0, \textsc{Temporal})\}$
\EndIf
\If{$t \mod N = 0$} \Comment{Consolidation trigger}
\State $\texttt{context} \gets \{v_{t-N+1}^e, \ldots, v_t^e\}$
\State $\texttt{items} \gets \texttt{LLM\_Extract}(\texttt{context})$ \Comment{Entities \& Concepts}
\For{each item $s \in \texttt{items}$}
\State $\mathbf{h}_s \gets \texttt{Encoder}(s)$
\If{$\exists v_j^s \in \mathcal{V}_S: \text{sim}(\mathbf{h}_s, \mathbf{h}_j) > 0.92$}
\State Update $v_j^s$ embedding via EMA \Comment{Deduplication}
\Else
\State $v_s^s \gets (s, \mathbf{h}_s)$; $\mathcal{V}_S \gets \mathcal{V}_S \cup \{v_s^s\}$
\EndIf
\For{each $v_k^e \in \texttt{context}$}
\State $\mathcal{E} \gets \mathcal{E} \cup \{(v_k^e, v_s^s, w=0.8, \textsc{Abstraction})\}$
\EndFor
\EndFor
\For{each pair $(v_i^s, v_j^s) \in \mathcal{V}_S \times \mathcal{V}_S$}
\State $w \gets \text{sim}(\mathbf{h}_i, \mathbf{h}_j)$
\If{$w > 0.92$ \textbf{and} $j \in \text{Top-}15(\mathcal{N}(i))$}
\State $\mathcal{E} \gets \mathcal{E} \cup \{(v_i^s, v_j^s, w, \textsc{Association})\}$
\EndIf
\EndFor
\EndIf
\EndFor
\State \Return $\mathcal{G} = (\mathcal{V}_E \cup \mathcal{V}_S, \mathcal{E})$
\end{algorithmic}
| Category | Category | Category | Category | Category | Category | Category | Category | Category | Category | Average | Average | Average | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Multi-Hop | Multi-Hop | Temporal | Temporal | Open Domain | Open Domain | Single-Hop | Single-Hop | Adversarial | Adversarial | Performance ∗ | Performance ∗ | Task |
| F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | Rank | |
| MemoryBank † (Zhong et al., 2024) | 5.0 | 4.8 | 9.7 | 7.0 | 5.6 | 5.9 | 6.6 | 5.2 | 7.4 | 6.5 | 6.3 | 5.4 | 11.6 |
| ReadAgent † (Lee et al., 2024) | 9.2 | 6.5 | 12.6 | 8.9 | 5.3 | 5.1 | 9.7 | 7.7 | 9.8 | 9.0 | 9.8 | 7.1 | 11.0 |
| ENGRAM (Patel and Patel, 2025) | 18.3 | 13.2 | 21.9 | 14.7 | 8.6 | 5.5 | 23.1 | 13.7 | 33.5 | 19.4 | 19.3 | 13.1 | 9.2 |
| GraphRAG † (Edge et al., 2025) | 16.5 | 11.8 | 22.4 | 15.2 | 10.1 | 8.4 | 24.5 | 18.2 | 15.2 | 12.0 | 18.3 | 14.2 | 8.8 |
| MemGPT † (Packer et al., 2024) | 26.7 | 17.7 | 25.5 | 19.4 | 9.2 | 7.4 | 41.0 | 34.3 | 43.3 | 42.7 | 28.0 | 20.5 | 7.2 |
| LoCoMo † (Maharana et al., 2024) | 25.0 | 19.8 | 18.4 | 14.8 | 12.0 | 11.2 | 40.4 | 29.1 | 69.2 | 68.8 | 25.6 | 19.9 | 7.0 |
| LangMem (LangChain Team, 2024) | 34.5 | 23.7 | 30.8 | 25.8 | 24.3 | 19.2 | 40.9 | 33.6 | 47.6 | 46.3 | 34.3 | 25.7 | 5.0 |
| A-Mem † (Xu et al., 2025) | 27.0 | 20.1 | 45.9 | 36.7 | 12.1 | 12.0 | 44.7 | 37.1 | 50.0 | 49.5 | 33.3 | 26.2 | 4.8 |
| MemoryOS (Li et al., 2025) | 35.3 | 25.2 | 41.2 | 30.8 | 20.0 | 16.5 | 48.6 | 43.0 | - | - | 38.0 | 29.1 | - |
| AriGraph (Anokhin et al., 2025) | 28.5 | 21.0 | 43.2 | 33.5 | 14.5 | 13.0 | 45.1 | 38.0 | 48.5 | 47.0 | 33.7 | 26.2 | 4.6 |
| Zep (Rasmussen et al., 2025) | 35.5 | 25.8 | 48.5 | 40.2 | 23.1 | 18.0 | 48.0 | 41.5 | 65.4 | 64.0 | 39.7 | 31.2 | 2.6 |
| SYNAPSE (Ours) | 35.7 | 26.2 | 50.1 | 44.5 | 25.9 | 19.2 | 48.9 | 42.9 | 96.6 | 96.4 | 40.5 | 32.6 | 1.0 |
| System Com- ponent | A-Mem (Baseline) | Synapse (Ours) |
|---|---|---|
| Uncertainty- Aware Rejection (Confidence Gating) | Error: Semantic Drift Top-1 Retrieved: 'Melanie's kids love play- ing with their toy dinosaur, Rex.' ✗ False Association: Matches query 'dog' with semantic neighbor 'Rex', ignoring con- text. → Hallucination : 'She has a dog named Rex.' | Success: Confidence Gating Check: C ret < τ gate (0 . 12) Action: Trigger Negative Acknowledgement Protocol. ✓ Rejection: Low confidence preempts gen- eration. → Response : 'No record of such pet found.' |
| Spreading Activation (Dynamic Con- text) | Error: Temporal Obsolescence Top-1 Retrieved: [D4:3] 'Caroline moved from Sweden 4 years ago...' (Score: 0.92) ✗ Static Bias: High cosine similarity to query 'where living' dominates. → Output : 'She lives in Sweden.' | Success: Temporal Decay Action: S final = S sem + λ · S decay ( t ) Trace: D4:3 (Sweden) decay → 0.4. D1:1 (US) boost → 0.95. ✓ Reranking: Prioritizes current state over semantic overlap. → Output : 'Currently in the US.' |
| Knowledge Graph (Structure) | Error: Logical Disconnection Top-1 Retrieved: 'Caroline collects books.' (matches 'Dr. Seuss') ✗ Missing Link: Fails to bridge 'collects books' ↔ 'Dr. Seuss' without explicit over- lap. → Output : 'Uncertain/No info.' | Success: Multi-Hop Inference Action: G walk ( Caroline , Dr. Seuss ,k = 2) Path: Caroline collects - --- → Classic Books contains - ---→ Dr. Seuss ✓ Bridging: Uses graph structure to infer implicit connection. → Output : 'Yes, likely has them.' |
| Configuration | M-Hop | Temp. | Open | Single | Adv. | Avg. |
|---|---|---|---|---|---|---|
| SYNAPSE (Full) | 35.7 | 50.1 | 25.9 | 48.9 | 96.6 | 40.5 |
| Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) | Micro-Dynamics Ablation (Mechanism-Level) |
| (-) Uncertainty Gating ( τ gate = 0 ) | 35.6 | 50.0 | 25.4 | 48.8 | 67.2 | 40.3 |
| (-) Lateral Inhibition ( β = 0 ) | 35.1 | 49.8 | 22.4 | 49.1 | 71.5 | 39.4 |
| (-) Fan Effect (No Dilution) | 30.2 | 48.5 | 16.8 | 47.5 | 94.2 | 36.1 |
| (-) Node Decay ( δ = 0 ) | 34.8 | 14.2 | 24.5 | 48.2 | 95.8 | 30.7 |
| Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) | Macro-Architecture Ablation (System-Level) |
| (-) Activation Dynamics | 31.2 | 23.7 | 18.2 | 48.9 | 70.4 | 30.5 |
| (-) Graph Structure | 35.2 | 25.4 | 21.0 | 49.9 | 88.2 | 32.9 |
| Vectors Only (Baseline) | 27.5 | 14.7 | 12.5 | 46.0 | 69.2 | 25.2 |
| Method | Token Length | Latency | Cost/1k Queries | F1 (Excl. Adv.) * | Cost Eff. ( F 1 / $ ) |
|---|---|---|---|---|---|
| LoCoMo (Maharana et al., 2024) | ∼ 16,910 | 8.2s | $2.67 | 25.6 | 9.6 |
| MemGPT (Packer et al., 2024) | ∼ 16,977 | 8.5s | $2.67 | 28 | 10.5 |
| A-Mem (Xu et al., 2025) | ∼ 2,520 | 5.4s | $0.50 | 33.3 | 66.9 |
| MemoryOS (Li et al., 2025) | ∼ 1,198 | 1.5s | $0.30 | 38 | 126.8 |
| ReadAgent (Lee et al., 2024) | ∼ 643 | 2.3s | $0.22 | 9.8 | 45.3 |
| LangMem (LangChain Team, 2024) | ∼ 717 | 0.6s | $0.23 | 34.3 | 150.7 |
| MemoryBank (Zhong et al., 2024) | ∼ 432 | 1.2s | $0.18 | 6.3 | 34.1 |
| SYNAPSE (Ours) | ∼ 814 | 1.9s | $0.24 | 40.5 | 167.3 |
| Category Method | Key Mechanism | Reference |
|---|---|---|
| MemGPT MemoryOS Mem0 | with virtual context (Packer et al., 2024) optimizing read/write (Li et al., 2025) for personalization and (Chhikara et al., 2025) | System-level Hierarchical memory management paging (Main vs. External Context). OS-inspired memory hierarchy operations. Self-improving memory layer |
| AriGraph GraphRAG Zep SYNAPSE | Episodic and semantic memory organized as a dynamic graph structure. (Anokhin et al., 2025) Leverages community detection on knowledge graphs for global/local retrieval. (Edge et al., 2025) Knowledge graph-based memory designed for entity relationships. (Rasmussen et al., 2025) Hybrid spreading activation with dynamic structure (Ours). - | Graph-based |
| Retrieval MemoryBank ENGRAM LangMem | Retrieval-based memory incorporating the Ebbinghaus forgetting curve. Advanced latent memory clustering and retrieval mecha- nism. Memory injection via in-context learning or fine-tuning updates. | (Zhong et al., 2024) (Patel and Patel, 2025) (LangChain Team, 2024) |
| Agentic ReadAgent LoCoMo A-Mem | Agentic system that paginates long context and generates gist memories. Local Context Motion for compressing and selecting relevant blocks. Adaptive agentic memory system capable of self- updating summaries. | (Lee et al., 2024) (Maharana et al., 2024) (Xu et al., 2025) |
| Parameter | Value | M-Hop | Temp. | Avg |
|---|---|---|---|---|
| Spreading S | 0.6 | 32.8 | 47.5 | 38.3 |
| 0.8 † | 35.7 | 50.1 | 40.5 | |
| 1.0 | 33.5 | 48 | 38.8 | |
| Node Decay δ | 0.3 | 34.5 | 51.2 | 40 |
| Node Decay δ | 0.5 † | 35.7 | 50.1 | 40.5 |
| Node Decay δ | 0.7 | 33.8 | 46.5 | 38.1 |
| Steepness γ | 3.0 | 34.9 | 49.3 | 39.8 |
| Steepness γ | 5.0 † | 35.7 | 50.1 | 40.5 |
| Steepness γ | 7.0 | 35.1 | 49.7 | 40 |
| Threshold θ | 0.3 | 32.9 | 48.8 | 39 |
| Threshold θ | 0.5 † | 35.7 | 50.1 | 40.5 |
| Threshold θ | 0.7 | 34.1 | 49.2 | 39.5 |
| Inhibition β | 0.10 | 35.4 | 49.9 | 40.1 |
| Inhibition β | 0.15 † | 35.7 | 50.1 | 40.5 |
| Inhibition β | 0.20 | 35.2 | 49.6 | 39.9 |
| Propagation T | 2 | 31.5 | 46.8 | 37.7 |
| Propagation T | 3 † | 35.7 | 50.1 | 40.5 |
| Propagation T | 4 | 35.2 | 49.8 | 40.1 |
| Inhibition M | 3 | 33.5 | 48.9 | 39.2 |
| Inhibition M | 7 † | 35.7 | 50.1 | 40.5 |
| Inhibition M | 10 | 34.8 | 49.3 | 39.8 |
| τ gate | Adv. F1 | FRR (Non-Adv) | Verdict |
|---|---|---|---|
| 0 | 60.2 | 0.0% | Baseline |
| 0.05 | 94.2 | 0.8% | Conservative |
| 0.1 | 95.8 | 1.5% | Balanced |
| 0.12 | 96.6 | 2.1% | Selected |
| 0.15 | 97.2 | 4.2% | Aggressive |
| 0.2 | 98.1 | 8.5% | Unsafe |
| Category | F1 Score |
|---|---|
| Multi-Hop Temporal Open-Domain Single-Hop | 35.7 ± 0.1 50.1 ± 0.3 25.9 ± 0.2 48.9 ± 0.1 |
| Average | 40.5 ± 0.2 |
| Adversarial | 96.6 ± 0.1 |
| Model | Thres. | M-Hop | Temp. | Open | Single | Adv. | ↓ F1 |
|---|---|---|---|---|---|---|---|
| A-MEM | All | 32.9 | 39.4 | 17.1 | 48.4 | 36.4 | - |
| 0.5 | 20.2 | 19.3 | 11.5 | 28.8 | 19.4 | (43.1%) | |
| 0.3 | 14.6 | 16.3 | 9.5 | 19.7 | 16 | (56.3%) | |
| SYNAPSE | All | 39.3 | 55.5 | 29.5 | 46.5 | 97.8 | - |
| 0.5 | 42.8 | 49.4 | 22.2 | 44.8 | 95.3 | (5.3%) | |
| 0.3 | 42.3 | 47.5 | 21.5 | 43.8 | 93.7 | (7.4%) |
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| MemoryBank (Zhong et al., 2024) | 30.5 | 14.2 | 45.3 | 35.8 | 23.6 |
| ReadAgent (Lee et al., 2024) | 37.1 | 16.5 | 50.2 | 41.5 | 27.6 |
| LoCoMo (Maharana et al., 2024) | 38.5 | 17.8 | 53 | 48.2 | 30.1 |
| A-Mem (Xu et al., 2025) | 39.8 | 18.9 | 54.1 | 49.9 | 31.4 |
| Mem0 (Chhikara et al., 2025) | 67.1 | 51.2 | 75.7 | 58.1 | 57.1 |
| MemGPT (Packer et al., 2024) | 41.2 | 19.5 | 55.8 | 50.4 | 32.2 |
| AriGraph (Anokhin et al., 2025) | 45.5 | 28.2 | 60.1 | 51.5 | 38.2 |
| LangMem (LangChain Team, 2024) | 62.2 | 47.9 | 71.1 | 23.4 | 46.9 |
| Zep (Rasmussen et al., 2025) | 61.7 | 41.4 | 76.6 | 49.3 | 49 |
| MemoryOS (Li et al., 2025) | 78.3 | 63.7 | 54.6 | 72.7 | 67.7 |
| ENGRAM (Patel and Patel, 2025) | 79.9 | 79.8 | 72.9 | 70.8 | 77.6 |
| SYNAPSE | 81.5 | 84.2 | 76.8 | 72.1 | 80.7 |
| Category | Question | Ground Truth | SYNAPSE Output | F1 | Judge |
|---|---|---|---|---|---|
| What is Caroline's identity? | Transgender woman | Caroline is transgender. | 40 | 100 | |
| Who supports Caroline? | Her mentors, family | Her support system, those close to her | 16.7 | 90 | |
| Single-Hop | What motivated counseling? | Her journey and how it im- proved life | Her own struggles and desire to help | 22.2 | 100 |
| What was grandma's gift? | Necklace | A necklace symbolizing love | 33.3 | 100 | |
| Transition changes faced? | Changes to her body | Exploring her changing body | 50 | 100 | |
| Considered an ally? | Yes, she is supportive | Yes, Melanie supports and encour- ages... | 40 | 100 | |
| Multi-Hop | Likely enjoy Vivaldi? | Yes; it's classical | Yes, she enjoys classical music. | 33.3 | 100 |
| Likely have Dr. Seuss? | Yes, since she collects classics | Yes, likely for their creativity... | 15.4 | 100 | |
| Political leaning? | Liberal | Progressive or liberal. | 50 | 100 | |
| Realization after race? | Self-care is important | Importance of taking care of minds | 20 | 100 | |
| How long practicing art? | Since 2016 | Seven years (relative to 2023) | 0 | 100 | |
| Adoption meeting date? | Friday before 15 July | 14 July 2023 | 50 | 100 | |
| Temporal | When was the picnic? | Week before 6 July | 29 June 2023 | 25 | 100 |
| When was charity race? | Sunday before 25 May | 20 May 2023 | 50 | 100 | |
| Pottery class date? | 2 July 2023 | 02 July 2023 | 66.7 | 100 |
| Failure Mode: Cognitive Tunneling |
|---|
| Context: Episode E 15 (Low Degree) ...John put on his green jacket and left for the airport... |
| Retrieval Failure: Query "What color was John's jacket?" |
| Top-1: Airport Trip (Score 0.85) [Supressing E 15 ] - Hub Node Top-2: Taxi Ride (Score 0.72) Target: Green Jacket (Score 0.11 < τ ) - Pruned by Inhibition |
| Mechanism Diagnostics: High-degree "Airport" hub accumulates excessive activation ( S > 0 . 8 ), trigger- ing Lateral Inhibition ( β = 0 . 15 ) which suppresses the weakly connected "Jacket" detail. |
| Category | Category | Category | Category | Category | Category | Category | Category | Category | Category | Average | Average | Average | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Method | Multi-Hop | Multi-Hop | Temporal | Temporal | Open Domain | Open Domain | Single-Hop | Single-Hop | Adversarial | Adversarial | Performance ∗ | Performance ∗ | Task |
| Model | Method | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | F1 | BLEU | Rank |
| GPT-4o | LoCoMo (Maharana et al., 2024) | 28.0 | 18.5 | 9.1 | 5.8 | 16.5 | 14.8 | 61.6 | 54.2 | 52.6 | 51.1 | 29.5 | 22.2 | 2.8 |
| GPT-4o | ReadAgent (Lee et al., 2024) | 14.6 | 10.0 | 4.2 | 3.2 | 8.8 | 8.4 | 12.5 | 10.3 | 6.8 | 6.1 | 11.7 | 8.5 | 5.0 |
| GPT-4o | MemoryBank (Zhong et al., 2024) | 6.5 | 4.7 | 2.5 | 2.4 | 6.4 | 5.3 | 8.3 | 7.1 | 4.4 | 3.7 | 6.0 | 4.7 | 6.0 |
| GPT-4o | MemGPT (Packer et al., 2024) | 30.4 | 22.8 | 17.3 | 13.2 | 12.2 | 11.9 | 60.2 | 53.4 | 35.0 | 34.3 | 32.0 | 25.7 | 3.2 |
| GPT-4o | A-Mem (Xu et al., 2025) | 32.9 | 23.8 | 39.4 | 31.2 | 17.1 | 15.8 | 48.4 | 43.0 | 36.4 | 35.5 | 36.1 | 28.4 | 2.4 |
| GPT-4o | SYNAPSE (Ours) | 39.3 | 29.5 | 55.5 | 50.3 | 29.5 | 23.9 | 46.5 | 38.8 | 97.8 | 97.7 | 43.4 | 35.2 | 1.6 |
| Qwen-1.5b | LoCoMo (Maharana et al., 2024) | 9.1 | 6.6 | 4.3 | 4.0 | 9.9 | 8.5 | 11.2 | 8.7 | 40.4 | 40.2 | 8.5 | 6.6 | 4.0 |
| Qwen-1.5b | ReadAgent (Lee et al., 2024) | 6.6 | 4.9 | 2.6 | 2.5 | 5.3 | 12.2 | 10.1 | 7.5 | 5.4 | 27.3 | 6.3 | 5.3 | 5.8 |
| Qwen-1.5b | MemoryBank (Zhong et al., 2024) | 11.1 | 8.3 | 4.5 | 2.9 | 8.1 | 6.2 | 13.4 | 11.0 | 36.8 | 34.0 | 10.0 | 7.5 | 3.6 |
| Qwen-1.5b | MemGPT (Packer et al., 2024) | 10.4 | 7.6 | 4.2 | 3.9 | 13.4 | 11.6 | 9.6 | 7.3 | 31.5 | 28.9 | 9.1 | 7.0 | 4.6 |
| Qwen-1.5b | A-Mem (Xu et al., 2025) | 18.2 | 11.9 | 24.3 | 19.7 | 16.5 | 14.3 | 23.6 | 19.2 | 46.0 | 43.3 | 20.4 | 15.0 | 2.0 |
| Qwen-1.5b | SYNAPSE (Ours) | 38.1 | 24.6 | 35.5 | 28.6 | 18.1 | 11.7 | 35.8 | 26.6 | 98.1 | 60.1 | 35.9 | 25.0 | 1.0 |
| Qwen-3b | LoCoMo (Maharana et al., 2024) | 4.6 | 4.3 | 3.1 | 2.7 | 4.6 | 6.0 | 7.0 | 5.7 | 17.0 | 14.8 | 4.7 | 4.3 | 4.8 |
| Qwen-3b | ReadAgent (Lee et al., 2024) | 2.5 | 1.8 | 3.0 | 3.0 | 5.6 | 5.2 | 3.3 | 2.5 | 15.8 | 14.0 | 2.9 | 2.4 | 5.8 |
| Qwen-3b | MemoryBank (Zhong et al., 2024) | 3.6 | 3.4 | 1.7 | 2.0 | 6.6 | 6.6 | 4.1 | 3.3 | 13.1 | 10.3 | 3.5 | 3.3 | 6.0 |
| Qwen-3b | MemGPT (Packer et al., 2024) | 5.1 | 4.3 | 2.9 | 3.0 | 7.0 | 7.1 | 7.3 | 5.5 | 14.5 | 12.4 | 5.2 | 4.4 | 4.6 |
| Qwen-3b | A-Mem (Xu et al., 2025) | 12.6 | 9.0 | 27.6 | 25.1 | 7.1 | 7.3 | 17.2 | 13.1 | 27.9 | 25.2 | 16.2 | 13.0 | 2.6 |
| Qwen-3b | MemoryOS (Li et al., 2025) | 21.4 | 15.0 | 26.2 | 22.4 | 10.2 | 8.2 | 23.3 | 15.4 | - | - | 22.1 | 16.2 | - |
| Qwen-3b | SYNAPSE (Ours) | 38.8 | 25.1 | 36.2 | 29.6 | 14.7 | 11.5 | 37.8 | 26.1 | 98.9 | 60.5 | 36.6 | 25.4 | 1.0 |

References
[lewis2020rag] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, K. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems.
[maharana2024locomo] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents.
[lee2024readagent] Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, Ian Fischer. (2024). A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts. Forty-first International Conference on Machine Learning.
[zhong2023memorybank] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez. (2024). MemGPT: Towards LLMs as Operating Systems. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v38i17.29946.
[xu2025amem] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.
[chhikara2025mem0] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.
[langmem2024] {LangChain Team. (2024). LangMem.
[ji2025memoryos] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Huayi Lai, Hao Wu, Bo Tang, Zhengren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofen Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong. (2025). MemOS: A Memory OS for AI System.
[openai2023gpt4] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph. (2024). GPT-4 Technical Report.
[anthropic2024claude] Anthropic. (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku. Anthropic Technical Report.
[tulving1972episodic] Tulving, Endel, others. (1972). Episodic and semantic memory. Organization of memory.
[anderson1983spreading] Anderson, John R. (1983). A spreading activation theory of memory. Journal of verbal learning and verbal behavior.
[park2023generative] Park, Joon Sung, O'Brien, Joseph, Cai, Carrie Jun, Morris, Meredith Ringel, Liang, Percy, Bernstein, Michael S.. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. doi:10.1145/3586183.3606763.
[collins1975spreading] Collins, Allan M, Loftus, Elizabeth F. (1975). A spreading-activation theory of semantic processing.. Psychological review.
[li2024personalllmagentsinsights] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu. (2024). Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security.
[steenstra2024virtual] Steenstra, Ian, Nouraei, Farnaz, Arjmand, Mehdi, Bickmore, Timothy. (2024). Virtual Agents for Alcohol Use Counseling: Exploring LLM-Powered Motivational Interviewing. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents. doi:10.1145/3652988.3673932.
[katsis2025mtragmultiturnconversationalbenchmark] Katsis, Yannis, Rosenthal, Sara, Fadnis, Kshitij, Gunasekara, Chulaka, Lee, Young-Suk, Popa, Lucian, Shah, Vraj, Zhu, Huaiyu, Contractor, Danish, Danilevsky, Marina. (2025). mtRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems. Transactions of the Association for Computational Linguistics. doi:10.1162/TACL.a.19.
[sharma2024retrievalaugmentedQA] Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte. (2024). Retrieval Augmented Generation for Domain-specific Question Answering.
[mialon2023augmented] Gr{'e. (2023). Augmented Language Models: a Survey. Transactions on Machine Learning Research.
[borgeaud2022retro] Borgeaud, Sebastian, Mensch, Arthur, Hoffmann, Jordan, Cai, Trevor, Rutherford, Eliza, Millican, Katie, Van Den Driessche, George Bm, Lespiau, Jean-Baptiste, Damoc, Bogdan, Clark, Aidan, De Las Casas, Diego, Guy, Aurelia, Menick, Jacob, Ring, Roman, Hennigan, Tom, Huang, Saffron, Maggiore, Loren, Jones, Chris, Cassirer, Albin, Brock, Andy, Paganini, Michela, Irving, Geoffrey, Vinyals, Oriol, Osindero, Simon, Simonyan, Karen, Rae, Jack, Elsen, Erich, Sifre, Laurent. (2022). Improving Language Models by Retrieving from Trillions of Tokens. Proceedings of the 39th International Conference on Machine Learning.
[codex] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. (2021). Evaluating Large Language Models Trained on Code.
[yao2023react] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations.
[schick2023toolformer] Schick, Timo, Dwivedi-Yu, Jane, Dessi, Roberto, Raileanu, Roberta, Lomeli, Maria, Hambro, Eric, Zettlemoyer, Luke, Cancedda, Nicola, Scialom, Thomas. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems.
[jimenez2024hipporag] Guti'{e. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems. doi:10.52202/079017-1902.
[izacard2023atlas] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave. (2023). Atlas: Few-shot Learning with Retrieval Augmented Language Models. Journal of Machine Learning Research.
[guu2020retrieval] Guu, Kelvin, Lee, Kenton, Tung, Zora, Pasupat, Panupong, Chang, Mingwei. (2020). Retrieval Augmented Language Model Pre-Training. Proceedings of the 37th International Conference on Machine Learning.
[asai2024self] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. (2024). Self-{RAG. The Twelfth International Conference on Learning Representations.
[zhu-etal-2025-knowledge] Zhu, Xiangrong, Xie, Yuexiang, Liu, Yi, Li, Yaliang, Hu, Wei. (2025). Knowledge Graph-Guided Retrieval Augmented Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.18653/v1/2025.naacl-long.449.
[edge2024local] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, Jonathan Larson. (2025). From Local to Global: A Graph RAG Approach to Query-Focused Summarization.
[sarthi2024raptor] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D Manning. (2024). {RAPTOR. The Twelfth International Conference on Learning Representations.
[nafee2025dynamic] Nafee, Mahmud Wasif, Jiang, Maiqi, Chen, Haipeng, Zhang, Yanfu. (2025). Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2025.emnlp-main.848.
[khandelwal2019generalization] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis. (2020). Generalization through Memorization: Nearest Neighbor Language Models.
[modarressi2024arigraph] Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev. (2025). AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents.
[zep2024graphiti] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory.
[engram2025] Daivik Patel, Shrenik Patel. (2025). ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents.
[karpukhin2020dense] Karpukhin, Vladimir, Oguz, Barlas, Min, Sewon, Lewis, Patrick, Wu, Ledell, Edunov, Sergey, Chen, Danqi, Yih, Wen-tau. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/v1/2020.emnlp-main.550.
[khattab2020colbert] Khattab, Omar, Zaharia, Matei. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/3397271.3401075.
[hu-etal-2025-grag] Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, Liang Zhao. (2025). GRAG: Graph Retrieval-Augmented Generation.
[yang2018hotpotqa] Yang, Zhilin, Qi, Peng, Zhang, Saizheng, Bengio, Yoshua, Cohen, William, Salakhutdinov, Ruslan, Manning, Christopher D.. (2018). {H. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D18-1259.
[xiong2020answering] Qi, Peng, Lin, Xiaowen, Mehr, Leo, Wang, Zijian, Manning, Christopher D.. (2019). Answering Complex Open-domain Questions Through Iterative Query Generation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1261.
[trivedi2022musique] Trivedi, Harsh, Balasubramanian, Niranjan, Khot, Tushar, Sabharwal, Ashish. (2022). MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00475.
[thorne-etal-2018-fever] Thorne, James, Vlachos, Andreas, Christodoulopoulos, Christos, Mittal, Arpit. (2018). {FEVER. Proceedings of the 2018 Conference of the North {A. doi:10.18653/v1/N18-1074.
[sun2018open] Sun, Haitian, Dhingra, Bhuwan, Zaheer, Manzil, Mazaitis, Kathryn, Salakhutdinov, Ruslan, Cohen, William. (2018). Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D18-1455.
[bib1] John R Anderson. 1983. A spreading activation theory of memory. Journal of verbal learning and verbal behavior, 22(3):261–295.
[bib2] Anokhin et al. (2025) Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. 2025. Arigraph: Learning knowledge graph world models with episodic memory for llm agents. Preprint, arXiv:2407.04363.
[bib3] Asai et al. (2024) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
[bib4] Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, and 9 others. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR.
[bib5] Chhikara et al. (2025) Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready ai agents with scalable long-term memory. Preprint, arXiv:2504.19413.
[bib6] Allan M Collins and Elizabeth F Loftus. 1975. A spreading-activation theory of semantic processing. Psychological review, 82(6):407.
[bib7] Edge et al. (2025) Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From local to global: A graph rag approach to query-focused summarization. Preprint, arXiv:2404.16130.
[bib8] Gutiérrez et al. (2024) Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. Hipporag: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems, volume 37, pages 59532–59569. Curran Associates, Inc.
[bib9] Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
[bib10] Hu et al. (2025) Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. 2025. Grag: Graph retrieval-augmented generation. Preprint, arXiv:2405.16506.
[bib11] Izacard et al. (2023) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43.
[bib12] Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
[bib13] Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. Preprint, arXiv:1911.00172.
[bib14] Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA. Association for Computing Machinery.
[bib15] LangChain Team. 2024. Langmem.
[bib16] Lee et al. (2024) Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. In Forty-first International Conference on Machine Learning.
[bib17] Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
[bib18] Li et al. (2025) Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, and 20 others. 2025. Memos: A memory os for ai system. Preprint, arXiv:2507.03724.
[bib19] Maharana et al. (2024) Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating very long-term conversational memory of llm agents. Preprint, arXiv:2402.17753.
[bib20] Nafee et al. (2025) Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, and Yanfu Zhang. 2025. Dynamic retriever for in-context knowledge editing via policy optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16744–16757, Suzhou, China. Association for Computational Linguistics.
[bib21] Packer et al. (2024) Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. Memgpt: Towards llms as operating systems. Preprint, arXiv:2310.08560.
[bib22] Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA. Association for Computing Machinery.
[bib23] Daivik Patel and Shrenik Patel. 2025. Engram: Effective, lightweight memory orchestration for conversational agents. Preprint, arXiv:2511.12960.
[bib24] Qi et al. (2019) Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.
[bib25] Rasmussen et al. (2025) Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A temporal knowledge graph architecture for agent memory. Preprint, arXiv:2501.13956.
[bib26] Sarthi et al. (2024) Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations.
[bib27] Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, pages 68539–68551. Curran Associates, Inc.
[bib28] Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, Brussels, Belgium. Association for Computational Linguistics.
[bib29] Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
[bib30] Trivedi et al. (2022) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
[bib31] Xu et al. (2025) Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-mem: Agentic memory for llm agents. Preprint, arXiv:2502.12110.
[bib32] Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
[bib33] Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
[bib34] Zhong et al. (2024) Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731.
[bib35] Zhu et al. (2025) Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, and Wei Hu. 2025. Knowledge graph-guided retrieval augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8912–8924, Albuquerque, New Mexico. Association for Computational Linguistics.