
Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory

Sizhe Yuen, Francisco Gomez Medina, and Ting Su (The Alan Turing Institute); Yali Du (King's College London, yali.du@kcl.ac.uk); Adam J. Sobey (University of Southampton and The Alan Turing Institute, ajs502@soton.ac.uk, asobey@turing.ac.uk)

Abstract

Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory that preserves specialized perspectives while focusing on task-relevant information. Our approach utilises a generic memory template applicable to new problems without the need to hand-craft specific memory prompts. We benchmark our approach on the PDDL, FEVER, and ALFWorld datasets, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing state-of-the-art or comparable performance across all three, with the highest consistency. An additional evaluation is performed on a complex data pipeline design task, and we demonstrate that our approach produces higher quality designs across 5 metrics: scalability, reliability, usability, cost-effectiveness, and documentation, plus additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks.


Introduction

Recent advances in large language models (LLMs) have enabled their application as autonomous or semi-autonomous agents capable of complex reasoning and decision-making (Huang et al., 2024). Multi-agent LLM systems, where multiple LLM instances interact to solve problems collaboratively, have shown particular promise for tasks requiring diverse expertise (Park et al., 2023; Qian et al., 2025). These systems leverage the complementary capabilities of specialized agents to address challenges that would be difficult for single-agent approaches to resolve effectively.

Despite their theoretical advantages, multi-agent LLM systems face several implementation challenges that limit their practical effectiveness, ranging from coordination overhead to consistency of role adherence among the agents (Li et al., 2024c). Most critically, the fixed-size context windows of LLMs restrict their ability to maintain long-term conversational context, an issue that is exacerbated in multi-agent frameworks with multiple agents in a single conversation. This leads to issues such as perspective inconsistency, forgetting of key requirements, and procedural drift. Current solutions such as Retrieval-Augmented Generation (RAG) (Lewis et al., 2020; Gao et al., 2024) and agentic memory approaches (Packer et al., 2024; Xu et al., 2025; Chhikara et al., 2025) are designed for single-agent and user-interaction scenarios, and do not account for the volume of information growing with the number of agents.


To address these challenges, we introduce Intrinsic Memory Agents, a novel multi-agent architecture that uses agent-specific memories aligned with conversational objectives. Unlike previous approaches, our system updates memories that are specific to each agent, ensuring heterogeneity and memories that reflect both historical context and recent developments while preserving agent-specific perspectives. The intrinsic nature of memory updates, derived directly from agent outputs rather than external summarization, ensures unique memories that maintain consistency with agent-specific reasoning patterns and domain expertise. We evaluate our approach through benchmarking and through a specific data pipeline design case study to show its practical usage. The evaluation demonstrates that our Intrinsic Memory Agents approach yields significant improvements in conversational coherence, role consistency, and collaborative efficiency compared to conventional multi-agent implementations. These improvements translate to qualitative enhancements in solution quality without increasing the number of conversation turns, suggesting broad applicability across domains where multi-agent LLM systems are deployed.

The main contributions of our work are as follows:

· Intrinsic Memory Updates: memory updates derived from agent outputs rather than external summarization.

· Agent-Specific Memory: independent memories maintained for each agent to preserve perspective autonomy.

Related Work

Recent years have seen significant progress in the development of multi-agent systems powered by LLMs. These systems have been applied in various domains, such as software development, scientific experimentation, gaming, and social simulation (Li et al., 2024c). For example, in software development, multi-agent systems enable concurrent consideration of architectural design, security, user experience, and performance optimization (Hong et al., 2024). Hallucinations due to outdated knowledge or retrieval extraction issues remain a major challenge limiting the effectiveness of multi-agent systems (Huang et al., 2025). The use of a shared knowledge base or memory storage is therefore important for maintaining up-to-date, coherent, and correct information among agents.

Memory in Agent-based systems

In agent-based systems, memory is pivotal for maintaining context, learning from historical interactions, and making informed decisions. As Zhang et al. (2024) noted, memory supports tasks such as ensuring conversation consistency and effective role-playing for single-agent systems. In multi-agent systems, memory facilitates coordination, communication, and collaborative problem-solving, as Guo et al. (2024) discussed.

Memory in LLMs can be categorized into short-term memory and long-term memory. Short-term memory is information that fits within the model's fixed context window. Commercial LLMs such as GPT-4o (OpenAI, 2024) and Claude (Anthropic, 2024) are able to process large contexts of over 100K tokens, and some models, such as Gemini 2.5 Pro (Comanici et al., 2025), can process over 1 million tokens in their context window. However, the hard limit of the context window size remains, and increasing the context length does not necessarily increase the reasoning or learning capabilities of the LLM (Li et al., 2024b). This is because long contexts can place pieces of relevant information far apart from each other within the context window.

Long-term memory is information that persists beyond the context window or single instance of an LLM. This information can be stored in external databases and retrieved using RAG techniques (Lewis et al., 2020; Gao et al., 2024). Long-term memory aims to alleviate the issue of short-term memory's limited capacity, but introduces other disadvantages such as retrieval noise, the complexity of building a retrieval system, latency, and storage costs (Asai et al., 2024; Yu et al., 2024).

The limitations of context length and existing memory mechanisms are particularly pronounced in multi-agent settings, where the volume of information exchanged grows with the number of agents involved (Li et al., 2024a). As multi-agent conversations extend, the probability that critical information lies far away within, or is excluded entirely from, the accessible context increases dramatically. This information loss undermines the primary advantage of multi-agent systems: the integration of diverse, specialized perspectives toward cohesive solutions (He et al., 2025). This is exacerbated by current long-term memory approaches, which provide a homogeneous memory for the agents, decreasing the benefits of having agents focused on a single part of the task. Our proposed approach therefore focuses on the heterogeneity of agents and their memories, ensuring that each agent maintains a memory that is uniquely relevant to its role.

Agentic Memory

Agentic memory offers a solution to long-term memory and limited contextual information by periodically condensing conversation history into concise summaries (Wang et al., 2025; Chen et al., 2024). These approaches generate sequential or hierarchical summaries that capture key decisions and insights from previous exchanges. Some agentic memory approaches combine with RAG approaches by storing the summarized contexts for retrieval later in the conversation (Xu et al., 2022), or by storing in- and out-of-context memory in a hierarchical system to dynamically adapt the current context (Packer et al., 2024; Xu et al., 2025). While agentic memory methods provide better contextual integration than pure retrieval approaches, they frequently lose critical details during the condensation process. Furthermore, the undirected and unstructured nature of general summarization often fails to preserve role-specific perspectives and specialized knowledge that are essential to effective multi-agent collaboration.

Our proposed Intrinsic Memory Agents approach similarly uses agentic memory to summarize and store information. Unlike existing approaches, we introduce a heterogeneous memory for each agent in the multi-agent system to maintain specialized roles in collaborative tasks, and apply a memory template to each agent to ensure cohesive memory throughout. This addresses the limitations of existing memory mechanisms by ensuring that each agent maintains its own memory, reflecting both historical context and new information while maintaining heterogeneous agent-specific perspectives and expertise.

Intrinsic Memory Agents

Existing agentic memory approaches are all designed for single-agent scenarios, to remember crucial details when interacting with an end-user. Due to the multi-turn, long conversations between agents, a direct implementation of single-agent agentic memory becomes complicated and resource-intensive, with each agent requiring its own retrieval system and distinct contextual updates.

We propose Intrinsic Memory Agents, a framework for multi-agent LLM systems that maintains agent-specific memories aligned with conversational objectives. Figure 1 illustrates the architecture of our framework. A user first issues a query; an agent then comments based on its role description; the shared conversation is updated; the memory of the agent that commented is updated; consensus is checked; and the cycle begins again. The context in this case is made up of both the agent's intrinsic memory and the conversation, meaning that as the conversation continues the agents increasingly diverge in their interpretation of that context.

Framework Definition

Let us define the multi-agent system A = {A_1, A_2, ..., A_N} consisting of N agents. Each agent A_n = (R_n, M_n, LLM_n) is characterized by a role specification R_n that defines the agent's expertise domain and objectives, a memory M_n that evolves throughout the conversation, and an LLM instance LLM_n, which may share parameters with other agents.

The conversation consists of a sequence of turns T = (t_1, t_2, ..., t_M), where each turn t_m involves an agent selection function σ(t_m) → A_n that determines which agent speaks, an input context C_{n,m} constructed for the selected agent, an output O_{n,m} generated by the selected agent, and a memory update operation for the selected agent.

Critically, our framework separates the input context construction and memory update processes, allowing for agent-specific memory maintenance while preserving a shared conversation space.
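These definitions can be sketched in code. The sketch below is a minimal illustration, not the authors' implementation: `Agent`, `run_turn`, and the `select`/`generate`/`update_memory` callables are assumed names standing in for σ, the LLM call, and the memory update function respectively.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    role: str         # R_n: expertise domain and objectives
    memory: str = ""  # M_n: evolves throughout the conversation

def run_turn(agents, conversation, select, generate, update_memory):
    """One conversation turn t_m: select an agent, build its context from
    the shared conversation plus its own memory, generate an output,
    append it to the shared space, and update only that agent's memory."""
    agent = select(agents, conversation)             # sigma(t_m) -> A_n
    context = (agent.memory, tuple(conversation))    # C_{n,m}
    output = generate(agent.role, context)           # O_{n,m}
    conversation.append(output)                      # shared conversation space
    agent.memory = update_memory(agent.memory, output)  # M_{n,m}
    return output
```

Note how the shared conversation and the per-agent memory are kept separate, matching the framework's split between input context construction and memory update.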

Figure 1: Intrinsic Memory Agents Framework. For n agents and m conversation turns, each agent A n contains its own role description R n and language model L n . Its memory M n,m is updated based on the input context C n,m and output O n,m .


Memory Update Mechanism

For each agent in the system, we maintain a memory M_n that evolves over time. Let M_{n,m} represent the memory of agent n after m conversation turns. The memory update process works as follows:

Agent A_n receives input context C_{n,m} consisting of relevant conversation history H_m and previous memory M_{n,m-1},

$$
C_{n,m} = f_{\text{context}}(H_m, M_{n,m-1}), \tag{1}
$$

and agent A_n generates output O_{n,m} using its LLM instance LLM_n,

$$
O_{n,m} = \text{LLM}_n(C_{n,m}). \tag{2}
$$

Then, given the generated output O_{n,m} and the previous memory M_{n,m-1}, we update the slot content using a memory update function,

$$
M_{n,m} = f_{\text{memory update}}(M_{n,m-1}, O_{n,m}). \tag{3}
$$

The memory update function f_{\text{memory update}} is implemented as a prompted LLM operation. Specifically, given the previous memory M_{n,m-1} at turn m-1 and the agent output O_{n,m} at turn m, the update function constructs the prompt shown in Figure 8; the LLM's response to this prompt becomes the updated memory M_{n,m}. The context construction function f_{\text{context}} in Equation 1 determines what information is provided to an agent when generating a response. The algorithm first appends the initial task description and the agent's memory to the context, then uses the remaining token budget to include as much of the recent conversation history as fits. The full pseudo-code is given in Algorithm 1 in Appendix A.

This algorithm prioritizes:

  1. The initial task description to maintain objective alignment.
  2. The agent's structured memory to preserve role consistency.
  3. The most recent conversation turns to maintain immediate context.

By prioritizing memory inclusion over exhaustive conversation history, the algorithm ensures that agents maintain role consistency and task alignment even when conversation length exceeds context window limitations. An ablation study on the structure of the memory template (Appendix B) shows that a generic or dynamic LLM-generated template consistently outperforms a hand-crafted one, whose performance is sensitive to poor template design.
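These priorities can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: `build_context` is an assumed name, and the character-based `count` default stands in for real token counting.

```python
def build_context(task, memory, history, budget, count=len):
    """Sketch of f_context: always include the task description and the
    agent's memory first, then fill the remaining budget with the most
    recent conversation turns, newest first, emitted in original order."""
    context = [task, memory]
    used = count(task) + count(memory)
    recent = []
    for turn in reversed(history):       # walk backwards from newest turn
        if used + count(turn) > budget:  # stop once the budget is exhausted
            break
        recent.append(turn)
        used += count(turn)
    return context + list(reversed(recent))
```

Because the task and memory are added unconditionally, they survive even when the history alone would overflow the budget, which is the role-consistency guarantee described above.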

Quantitative benchmarks

To evaluate our approach, we test our memory agents against the PDDL (Planning Domain Definition Language), FEVER (Fact Extraction and VERification) (Thorne et al., 2018), and ALFWorld (Shridhar et al., 2021) numeric benchmarks. PDDL involves structured planning tasks from AgentBoard (Ma et al., 2024), where the agents generate executable plans for abstract problem domains, evaluating their reasoning and coordination. FEVER is a dataset for evidence-based claim verification, requiring agents to retrieve and reason over textual evidence to assess a given factual claim. Finally, ALFWorld is a text-based interactive environment that simulates household tasks with natural language instructions and descriptions; it tests an agent's ability to navigate and execute complex sequential actions to complete tasks.

For the numerical benchmarks, we follow the same experimental methodology as G-Memory (Zhang et al., 2025), another memory framework for multi-agent systems. We re-run the G-Memory framework [1], as we cannot directly compare against the published G-Memory results, which were benchmarked with GPT-4o-mini as the base language model. We chose the G-Memory framework as a comparison because it implements a variety of existing memory architectures, allowing us to compare our Intrinsic Memory Agents with existing architectures and benchmarks. G-Memory uses Autogen for multi-agent simulation, matching our use of Autogen for our architecture. We chose the three benchmarks to cover a range of structured planning, comprehension, and reasoning tasks, all of which align with the data pipeline case study detailed in Section 5. We run Gemma3:12b for the numeric benchmarks using Ollama, with 5 independent runs, each with its own fixed seed for reproducibility. We use this larger model for the numeric benchmarks because initial tests with the Llama3.2:3b model found poor results for every benchmark and memory framework. Our computational infrastructure is a high performance computing cluster with A100 GPUs, running GNU/Linux 4.18.0-553.el8_10.x86_64.

Benchmarking results

Figure 2: Intrinsic Memory performance across the three benchmarks, the blue bars are our Intrinsic Memory.


Figure 2 shows the average reward of each memory system on ALFWorld, FEVER, and PDDL, with error bars showing the standard deviation across 5 independent runs. While our Intrinsic Memory mechanism does not obtain the highest reward on every benchmark, the results indicate that our approach performs consistently strongly compared to the other memory mechanisms.

In the ALFWorld benchmark, the Voyager and Generative approaches obtain the highest average rewards, at 0.072 and 0.061 mean reward respectively. However, they also show the highest standard deviations among all memory mechanisms, at 0.035 and 0.031 respectively, indicating high variability. In contrast, our method with the generic and LLM-generated templates obtains the 3rd and 4th best performance, at 0.048 and 0.045 mean reward, with much lower variance at 0.0083 and 0.0003 respectively. These results indicate strong and consistent performance on the ALFWorld set of problems. Similarly, on the FEVER dataset all memory approaches obtain similar performance, with the best approach, MetaGPT, also showing the highest standard deviation. Our Intrinsic Memory mechanism shows the lowest standard deviation on this dataset, with mean rewards ranked second for the generic template and fourth for the LLM-generated template, further evidence of our consistency. Finally, in the PDDL benchmark, both our Intrinsic Memory approaches outperform all other memory mechanisms, at 0.260 and 0.254 mean reward for the generic and LLM-generated templates respectively, without a significantly higher standard deviation than other approaches. Table 4 in the appendix shows the mean rewards, standard deviations, and average token counts for each benchmark and memory mechanism in detail.

[1] https://github.com/bingreeky/GMemory

The PDDL dataset consists of structured planning tasks, which fits the intended use case of Intrinsic Memory for agent discussion, planning, and design. As Intrinsic Memory assigns agent-specific memory, it can more clearly distinguish the plans and actions needed to complete tasks. Intrinsic Memory uses more tokens to generate structured templates per agent per round of discussion, a worthwhile trade-off between reward score and token cost. In contrast, the FEVER tasks focus on fact extraction, where reasoning plays a larger role than raw memory. We find that our Intrinsic Memory performs just as well as other memory methods, indicating that memory methods in general are less applicable to the FEVER problems and that our performance is in line with other memory mechanisms. Finally, in ALFWorld, the two best performing memory mechanisms, Voyager and Generative, also have the highest standard deviations, showing a lack of consistency compared to the Intrinsic Memory approach, where agent-specific memory helps maintain consistent performance compared to global or cross-trial memory implementations.

Data Pipeline Design Case Study

As a practical case study to evaluate our approach, we applied our memory agents to a collaborative data pipeline design, a complex task requiring multiple perspectives. We generate 10 independent outputs with eight specialized agents:

  1. Evaluation Agent (EA) evaluates the output solutions.
  2. Knowledge Integration Agent (KIA) summarizes each discussion round (e.g. after every agent has contributed at least once).
  3. Data Engineer Agent (DEA) determines the data processing needs.
  4. Infrastructure Engineer (IA) designs the cloud infrastructure.
  5. Business Objective Engineer (BOA) checks against business requirements.
  6. Machine Learning Engineer (MLE) provides ML implementation.
  7. Conversation Delegation Agent (CDA) is responsible for facilitating the collaborative process.
  8. Documentation Joining Agent (DJE) is responsible for producing final output after consensus is reached among agents.

The agents are tasked with designing a cloud-based data pipeline architecture through a structured process involving proposals, discussions, and consensus formation. The full prompts and task descriptions can be found in Appendix D.

The output requirements include a concise summary, high-level plan, resource estimates, and a structured JSON specification.

System Configurations

We evaluated two system configurations: First, the Baseline System which consists of a standard multi-agent implementation without intrinsic memory. It uses standard prompt templates for each agent role, relying exclusively on conversation history for context. Second is our Intrinsic Memory System approach with agent-specific memories. It implements agent-specific memories and updates them intrinsically based on agent outputs, and constructs context using both conversation history and agent memories.

Both systems used identical agent roles and task specifications, with Llama-3.2-3b as the underlying LLM. Each agent role was initialized with the same role description and initial instructions across the two system configurations to ensure a fair comparison.

An agent selection function iterates through each worker agent and the conversation delegation agent (CDA), ensuring that all agents are represented in the discussion. Once all agents have accepted a proposed solution, marked through the 'ACCEPT' flag, the CDA emits a 'FINALIZE' flag, prompting the Documentation Engineer Agent to produce the final data pipeline output. The full algorithm for finalisation and ordering of agents is displayed in Algorithm 2 in Appendix A.
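The round structure and the 'ACCEPT'/'FINALIZE' flags can be sketched as below. Algorithm 2 in Appendix A is the authoritative version; `run_until_consensus` and the `respond` callable are illustrative assumptions standing in for the real agent queries.

```python
def run_until_consensus(workers, respond, max_rounds=10):
    """Iterate through every worker agent each round; once all of them
    reply with the 'ACCEPT' flag, the conversation delegation agent
    emits 'FINALIZE' so the documentation agent can produce the output."""
    for round_no in range(1, max_rounds + 1):
        replies = [respond(worker, round_no) for worker in workers]
        if all(reply == "ACCEPT" for reply in replies):
            return "FINALIZE", round_no   # consensus reached this round
    return "NO_CONSENSUS", max_rounds     # guard against endless debate
```

The `max_rounds` guard is an assumption added here for safety; the paper's loop terminates on consensus.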

Evaluation Metrics

To evaluate the quality of the data pipeline designs generated by our memory agents, and to compare to the pipeline designs generated from default Autogen, we use an LLM as a judge (Zheng et al., 2023) to score each pipeline design and provide a qualitative analysis to support these scores. We evaluated the multi-agent system performance under the following metrics:

· Scalability: ability of the data pipeline to handle increasing data volumes or user loads.

· Reliability: ability of the data pipeline to handle failures and ensure data integrity.

· Usability: is there enough detail in the data pipeline design for developers to implement it?

· Cost-effectiveness: balance between the costs and benefits of each chosen component.

· Documentation: how well-justified and documented is the choice of elements for the data pipeline?

The scalability and reliability metrics are chosen as core requirements of modern data pipelines. Scalability reflects the ability to grow the pipeline to handle larger volumes of data and users, while reliability ensures the pipeline is consistent and fault-tolerant, both of which are crucial if the pipeline were to be deployed. The usability and documentation metrics reflect the level of detail and the design decisions taken. A strong design is not useful if it lacks detail or is too abstract to be practically implemented; usability measures whether the output designs are detailed and clear enough for engineering teams to implement. Design decisions must be well-documented, with clear justifications and explanations for each component, which reveals the reasoning behind the agents' choices. Finally, the cost-effectiveness metric evaluates whether the pipeline design balances the need for computational resources against their cost. Run-time metrics such as latency and throughput are not included in our evaluation, as we only present the data pipeline designs for evaluation and do not implement them in code.
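The LLM-as-a-judge scoring over these five metrics can be sketched as follows. The prompt wording, metric key names, and the `parse_scores` validation are assumptions for illustration; the paper's actual judge prompt is not reproduced here.

```python
import json

# Hypothetical metric keys; the paper uses the five metrics listed above.
METRICS = ["scalability", "reliability", "usability",
           "cost_effectiveness", "documentation"]

def judge_prompt(design):
    """Build an illustrative judge prompt asking for 1-10 scores as JSON."""
    return ("Score the following data pipeline design from 1 to 10 on "
            + ", ".join(METRICS) + ". Reply with a JSON object only.\n\n"
            + design)

def parse_scores(reply):
    """Validate the judge's JSON reply: every metric present, scores in 1-10."""
    scores = json.loads(reply)
    assert set(scores) == set(METRICS), "judge must score every metric"
    assert all(1 <= v <= 10 for v in scores.values()), "scores must be 1-10"
    return scores
```

Validating the reply before aggregation guards against the judge omitting a metric or returning out-of-range scores.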

Data pipeline design performance

The median and standard deviation of each quality metric are presented in Figure 3. The Intrinsic Memory system shows consistent improvement on all metrics compared to the baseline Autogen.

The Documentation metric focuses on clarity and how well-justified the design choices are. While Intrinsic Memory boosts the Documentation score over the baseline, the score is still relatively low at a mean of 4.9. This suggests that retaining memory of the conversation alone does not guarantee good justification: while some context and attributes of each component are remembered, the reasons for choosing the components are not. This could be a problem with the training corpus, pointing to a need for better annotated training data. Similarly, the Usability score is low, with means of 3.32 and 4.9 for the baseline and Intrinsic Memory, respectively.

The improved quality comes at the cost of additional tokens, outlined in Table 1. Intrinsic Memory uses on average 32% more tokens than the baseline, as its outputs are more descriptive on average, although the number of conversation turns is similar and the difference is not statistically significant. This indicates that the addition of a memory module incurs token overhead to maintain, but does not increase the number of conversation turns between agents.
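The p-values in Table 1 come from a Wilcoxon rank-sum test. A self-contained sketch of that test under the normal approximation is given below; in practice a library routine such as scipy.stats.ranksums would be used, `rank_sum_p` is an illustrative name, and a tie-correction term is omitted for brevity.

```python
from statistics import NormalDist

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.
    Ranks the pooled samples (averaging ranks over ties), sums the ranks
    of the first sample, and compares against the null mean and variance."""
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = {}
    i = 0
    while i < len(combined):              # assign average ranks over ties
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + j + 1) / 2             # 1-based ranks i+1 .. j averaged
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w = sum(r for k, r in ranks.items() if combined[k][1] == 0)
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2         # null mean of the rank sum
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Identical samples give p = 1, while well-separated samples give a small p, matching the significant/non-significant split reported in Table 1.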

Figure 3: LLM-as-a-Judge metrics for the Data Pipeline design case study.


Table 1: Mean efficiency and LLM-as-a-Judge metrics after 10 independent runs, with p-values calculated using a Wilcoxon rank-sum test. Usability and number of conversation turns are highlighted in italics as the metrics that do not show statistical significance between the baseline and our Intrinsic Memory approach.

Qualitative analysis of data pipeline outputs

Figure 4 shows snippets for one component from the highest-scoring outputs for the intrinsic memory agent system and baseline Autogen system.

The Intrinsic Memory Agent system outperforms the baseline system across the five quality metrics. In terms of scalability, the Intrinsic Memory Agent system provides an overall assessment of scalability, specifically around varying data volumes, whereas the baseline system captures that measure only as a 'maintenance difficulty' for each component of the pipeline. In terms of reliability, the Intrinsic Memory Agent provides considerations for each component, such as AWS Kinesis's secure streaming capabilities and the use of Docker containers within Amazon SageMaker to improve the stability and reproducibility of ML pipelines. The Intrinsic Memory Agent also provides a more descriptive output for usability, with a clearer pathway to implementation. In terms of cost, the Intrinsic Memory Agent makes specific calculations and observations on cost-effectiveness and resource requirements, including the reasoning behind each component choice, whereas the baseline system limits itself to overall evaluations of implementation and maintainability difficulties.

Finally, the Intrinsic Memory Agent ultimately provides justification and documents its recommendation under each component, including pros and cons for each component choice.

Overall, the Intrinsic Memory Agent provides a more descriptive answer and more value to engineers by specifying tools, configurations, and trade-offs. For example, its Data Streaming design recommends Amazon Kinesis, whereas the baseline simply states 'Ingest data from various sources (camera, lidar, radar) at high speeds.' Similarly, the IMA cites the specific connections between components that must be implemented (for example, Amazon S3 => Amazon EC2 through an API). Although some precise configuration settings remain unspecified, the baseline merely names each component without offering implementation details or alternatives.

"Component 1": "Data Ingestion (Amazon S3)",
"AWSName": "AmazonS3",
"Pros": ["Scalable", "durable", "secure storage for raw data"],
"Cons": ["Additional cost for storing large amounts of data"],
"Design": "Use S3 as a central repository for all data sources, with separate buckets for each source if needed.",
"Details": "Implement S3 event notifications to trigger processing workflows upon new data arrival."

(a) Intrinsic Memory Agent system sample from the highest-scoring output. This data pipeline received scores of Scalability: 7, Reliability: 5, Usability: 6, Cost-effectiveness: 6, Documentation: 7.

"Component 1":
"Name": "Data Ingestion",
"Description": "Ingest data from various sources (camera, lidar, radar) at high speeds",
"Implementation difficulties": 7,
"Maintainability difficulties": 6

(b) Baseline Autogen system sample from the highest-scoring output. This data pipeline received scores of Scalability: 5, Reliability: 4, Usability: 3, Cost-effectiveness: 3, Documentation: 2.

The components specified by the Intrinsic Memory Agent are also more relevant to the problem specification. The data pipeline design task explicitly specifies that the input data contains lidar and radar sources, for which SageMaker and Kinesis are particularly relevant, specifically for lidar data processing. This contrasts with the vague 'Lidar data processing' component of the baseline, which only contains a general description of processing the data without providing any details.

Discussion and Limitations

Although the Intrinsic Memory Agent approach shows improved performance across the data pipeline generation task and the selected benchmarks, further validation is required across a broader set of complex tasks, potentially with varying numbers of agents and models. Furthermore, our approach's performance and consistency come at the cost of increased token usage, due to the additional update calls. Further work on reducing the number of update calls, updating only when necessary, would help to alleviate the additional usage.

The results demonstrate that a movement towards heterogeneity of agents leads to an improvement in performance of the multi-agent system, allowing agents to focus more specifically on an area of the design. This indicates that methods to provide additional heterogeneity, such as the ability to fine-tune agents towards their specialisation, might see additional performance gains, alongside the personalization of memories focused on individual experience.

Conclusion

This paper introduces Intrinsic Memory Agents, a novel multi-agent LLM framework that constructs agent-specific heterogeneous memories to enhance multi-agent collaboration in discussion and planning tasks. Evaluation on the PDDL dataset and on a practical data pipeline design problem demonstrates our framework's improved performance on structured planning tasks, with a 15.5% increase over the next best memory architecture, at the cost of increased token usage. Further evaluation on the ALFWorld and FEVER datasets demonstrates our approach's consistent performance, ranking among the top memory mechanisms with the lowest standard deviation, while the best memory mechanisms on these benchmarks show high standard deviation. Our strong performance using a generic memory template demonstrates the generalisability of our approach to other problems, without the need to hand-craft high-quality memory templates.

Results on the data pipeline case study further show the Intrinsic Memory Agents' enhanced ability to collaborate on complex tasks. The Intrinsic Memory Agent system outperforms the baseline system across all quality measures of scalability, reliability, usability, cost-effectiveness, and documentation. It also follows the task specification more closely, providing more actionable recommendations by suggesting specific tools and frameworks, as well as trade-off details for each component in the pipeline.

Reproducibility Statement

We have taken care to ensure that our experiments and results are transparent and reproducible by detailing the models, computational setup, code, statistical tests, and prompts used in our experiments. The LLM model (Llama3.2:3b) is cited, named, and referenced in the main text. The computational infrastructure used, including the GPU model names and operating system, is specified in section 4 of the main text. For code, the names and versions of relevant Python libraries are specified within the supplementary code files. A Wilcoxon rank-sum test is used to test statistical significance for the data pipeline case study. P-values and standard deviation measures are included in the performance analysis in section 5.3. Five independent runs with different set seeds are used for the numeric benchmarks, with the seeds specified within the supplementary code. All code for running the data pipeline case study and numeric benchmarks is included in the supplementary materials. Finally, the selected prompts of the multi-agent and Intrinsic Memory architecture are shown in Appendix D. Further prompts for each agent can be found as part of the supplementary code, under the 'prompts' directory.
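The Wilcoxon rank-sum test used for the case study is typically run through a statistics library; purely as an illustration of what it computes, below is a minimal pure-Python version using the normal approximation. This sketch omits the tie-corrected variance and continuity correction and is not the implementation used in the paper.

```python
import math

def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Illustrative only: no tie correction of the variance and no
    continuity correction, unlike a full statistics package.
    """
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    # Assign midranks so tied values share the average of their ranks.
    rank_of = [0.0] * (n1 + n2)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + 1 + j) / 2          # average of ranks i+1 .. j
        for k in range(i, j):
            rank_of[k] = mid
        i = j
    r1 = sum(r for r, (_, group) in zip(rank_of, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2        # Mann-Whitney U for the first sample
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p
```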

References

Algorithms

This section contains the context construction algorithm presented in Section 3 and the finalisation algorithm presented in Section 5.1.

Algorithm 1 The context construction algorithm, which takes the current conversation history, memory of the agent, and maximum number of tokens. It appends the most recent conversation turn and agent memory to the context first, before using the remainder of the tokens to append the rest of the conversation history, ensuring the memory and most recent output is always included.

def construct_context(conversation_history, agent_memory, max_tokens):
    context = []
    # Include the initial task description
    context.append(conversation_history[0])
    context.append(agent_memory)
    # Add most recent conversation turns until context limit is reached
    remaining_tokens = max_tokens - count_tokens(context)
    recent_turns = []
    for turn in reversed(conversation_history[1:]):
        turn_tokens = count_tokens(turn)
        if turn_tokens <= remaining_tokens:
            recent_turns.insert(0, turn)
            remaining_tokens -= turn_tokens
        else:
            break
    context.extend(recent_turns)
    return context
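Algorithm 1 can be exercised end-to-end with a stub tokenizer. The sketch below restates the function with a whitespace token count standing in for a real tokenizer; the example strings and the `count_tokens` stub are illustrative assumptions, not the paper's implementation.

```python
def count_tokens(item):
    # Crude whitespace token count, standing in for a real tokenizer.
    if isinstance(item, list):
        return sum(count_tokens(x) for x in item)
    return len(item.split())

def construct_context(conversation_history, agent_memory, max_tokens):
    # Task description and agent memory are always included first.
    context = [conversation_history[0], agent_memory]
    remaining = max_tokens - count_tokens(context)
    recent = []
    # Walk the history backwards so the most recent turns are kept.
    for turn in reversed(conversation_history[1:]):
        cost = count_tokens(turn)
        if cost <= remaining:
            recent.insert(0, turn)
            remaining -= cost
        else:
            break
    context.extend(recent)
    return context

history = [
    "design a data pipeline",   # initial task description
    "turn one text here",
    "turn two text",
    "final turn",
]
ctx = construct_context(history, "memory: prefers S3", max_tokens=12)
```

With a 12-token budget, the task description and memory cost 7 tokens, so only the two most recent turns fit; the oldest turn is dropped.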

Algorithm 2 The finalisation algorithm that specifies the order of agents speaking. It is a modified round-robin discussion between the agents: The discussion begins with each of the worker agents (BOA, DEA, MLA, IA) contributing to the conversation, with each worker's turn being followed by the conversation delegation agent (CDA). Once the workers have each had their turn, the knowledge integration agent and evaluation agent make their contributions, and the cycle begins again. The CDA is programmed to dedicate a certain number of turns to discussion, proposals, and consensus. The number of turns dedicated to each conversation stage is tracked, and once the consensus round is reached, each agent is asked to confirm if they agree with the proposed solution or not. If all agree on the proposed solution as being acceptable, the CDA will emit a 'FINALIZE' response, triggering the documentation joining agent (DJE) to compile the agreed response and format it according to the task requirements.

workers = [BOA, DEA, MLA, IA]

def select_next_speaker(last_speaker, groupchat):
    global turn_counter, worker_counter
    turn_counter += 1
    if 'FINALIZATION' in groupchat.messages[-1]['content']:
        return DJE
    if last_speaker is CDA:
        w = workers[worker_counter % 4]
        worker_counter += 1
        print(f'worker_counter: {worker_counter}')
        return w
    elif last_speaker in workers:
        if worker_counter % 4 == 0:
            return KIA
        else:
            return CDA
    elif last_speaker is KIA:
        return ERA
    elif last_speaker is ERA:
        return CDA
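The transition logic of Algorithm 2 can be simulated with string placeholders for the agents. The sketch below is a simplified re-implementation (function name, the `state` dict, and the agent strings are illustrative, not the paper's code); running it from the CDA shows the modified round-robin order the caption describes.

```python
# String placeholders for the agents in Algorithm 2.
WORKERS = ["BOA", "DEA", "MLA", "IA"]
CDA, KIA, ERA, DJE = "CDA", "KIA", "ERA", "DJE"

def next_speaker(last_speaker, last_message, state):
    if "FINALIZATION" in last_message:
        return DJE
    if last_speaker == CDA:
        worker = WORKERS[state["worker_counter"] % 4]
        state["worker_counter"] += 1
        return worker
    if last_speaker in WORKERS:
        # After the fourth worker, hand over to knowledge integration.
        return KIA if state["worker_counter"] % 4 == 0 else CDA
    if last_speaker == KIA:
        return ERA
    if last_speaker == ERA:
        return CDA

state = {"worker_counter": 0}
order, last = [], CDA
for _ in range(10):
    last = next_speaker(last, "", state)
    order.append(last)
```

One full cycle visits each worker with the CDA interleaved, then KIA and ERA, before returning control to the CDA.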

Ablation study

We conduct an ablation study to understand the sensitivity of the Intrinsic Memory Agents to the structure of the templates that the agents use to update their memory. We evaluate three approaches: a manual template, a generic template, and an LLM-generated template.

The approaches are evaluated on the PDDL, FEVER, and ALFWorld benchmark problems, tested across different base language models with multiple independent runs using set seeds for reproducibility and consistency.

Template approaches

Manual template

In the manual template approach, a strict structured template is used that agents must follow to update their internal memory. The manual templates are created to track the relevant information of the given task, with specific fields.

Generic template

For the generic approach, we use a universal template structure, giving only general fields to be filled in with the current task description and trajectory, allowing the model to use any form of memory representation, and not strictly requiring the same structured format to be used each time.

LLM-generated template

Finally, we leverage each agent's own capabilities to create a template during initialisation which it deems suitable for the task at hand. This approach removes the need to create manual templates for each type of task while allowing for memory templates specific to the task. The template generation prompt is as follows:

Ablation results

We perform the ablation study on two additional models: Gemma3-12b and Mistral-7b. Tables 2 and 3 display the mean and standard deviation of each template approach on the PDDL, FEVER, and ALFWorld benchmarks. We find that the approaches using an LLM-generated template or a generic template outperform the manual template approach. In general, the generic template performs best, but performance for the LLM-generated and generic approaches is similar using Gemma3. Token usage varies based on the underlying model: for example, Mistral uses more tokens on average for the generic approach, whereas Gemma uses more tokens for the LLM-based approach. Note that the cost of the template generation in the LLM approach is a fixed one-time cost at the beginning of each problem, and therefore adds little overhead compared to the rest of the runtime.

You are a MEMORY UPDATER for a PDDL-style planning agent. Your job:
- Maintain a compact JSON memory capturing stable, reusable information across tasks and domains.
- Only store information that improves future planning: common strategies, mistakes, valid action patterns, state-transition insights.
- Do not store long histories. Keep everything concise and deduplicated.

Inputs you receive each update call:
- current memory: the previous memory as json (may be empty re-init).
- latest turn: the agent's most recent Thought/Action/Observation.
- current task: one of blockworld, barman, gripper, tyreworld.
- goal: current goal description.

OUTPUT:
- Return ONLY the updated memory as valid JSON following the template below.
- No extra commentary.

-------------------------- MEMORY TEMPLATE (ALWAYS FOLLOW) --------------------------

{
  "task summary": "brief description of PDDL planning setting",
  "global strategies": ["high-level reusable planning heuristics across domains"],
  "blockworld": {
    "valid action patterns": ["pickup X", "putdown X", "stack X ..."],
    "good strategies": ["free target base"],
    "invalid patterns": ["wrong format"],
    "mistakes": ["attempting action without arm grasp", "... block ... full"]
  },
  "barman": {
    "valid action patterns": ["fill-shot ...", "pour-shot-to-clean-shaker ..."],
    "good strategies": ["ensure hand availability before filling"],
    "invalid patterns": ["fill without holding glass"],
    "mistakes": ["grasp with occupied hand"]
  },
  "gripper": {
    "valid action patterns": ["move R1 R2", "pick O Room Gripper", "drop O Room Gripper"],
    "good strategies": ["carry multiple items before moving rooms"],
    "invalid patterns": ["drop object in wrong room"],
    "mistakes": ["pick while gripper full"]
  },
  "tyreworld": {
    "valid action patterns": ["open X", "fetch O C", "loosen N H", "jack-up H"],
    "good strategies": ["open boot early to access tools"],
    "invalid patterns": ["loosen nut without wrench"],
    "mistakes": ["inflate wheel without pump"]
  },
  "tasks": [
    {
      "id": "identifier or hash of goal",
      "goal": "exact goal text",
      "status": "pending|solved",
      "helpful observations": ["short state insights from valid steps"],
      "invalid actions": ["summaries of failed attempts"],
      "progress notes": ["short planning insights for this task"]
    }
  ]
}

  1. Parse current memory. If empty or invalid, initialize using the template above.

  2. Update the domain-specific sections: from the latest turn, add new useful action patterns, invalid patterns, or mistakes. Keep lists short, deduplicated, and generalisable.

  3. Update global strategies if the latest turn reveals a robust cross-domain heuristic.

  4. Update the relevant task entry: if no entry exists for this goal, create one. Add helpful observations if new actionable state insights appear. Add invalid actions if the latest turn shows an invalid move. Add progress notes for general reasoning improvements. If the task is finished, mark status = "solved".

  5. Return ONLY the updated JSON memory, nothing else.

Figure 5: Manually generated prompt for the LLM agent in the PDDL task.
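The update procedure in the prompt above can be mirrored in code. The sketch below implements the parse/initialise/deduplicate/mark-solved steps over a minimal subset of the template; the function name, the reduced `TEMPLATE`, and the argument shapes are illustrative assumptions, not the paper's implementation.

```python
import json

# Minimal template subset (illustrative) mirroring the prompt's structure.
TEMPLATE = {"task summary": "", "global strategies": [], "tasks": []}

def update_memory(current_memory_json, new_strategies, goal, solved=False):
    # Step 1: parse current memory; initialise from template if empty/invalid.
    try:
        memory = json.loads(current_memory_json)
        if not isinstance(memory, dict):
            raise ValueError("memory must be a JSON object")
    except (TypeError, ValueError):
        memory = json.loads(json.dumps(TEMPLATE))  # fresh deep copy
    # Steps 2-3: merge new strategies, keeping the list deduplicated.
    strategies = memory.setdefault("global strategies", [])
    for s in new_strategies:
        if s not in strategies:
            strategies.append(s)
    # Step 4: update or create the task entry for this goal.
    tasks = memory.setdefault("tasks", [])
    entry = next((t for t in tasks if t.get("goal") == goal), None)
    if entry is None:
        entry = {"goal": goal, "status": "pending"}
        tasks.append(entry)
    if solved:
        entry["status"] = "solved"
    # Step 5: return only the updated JSON memory.
    return json.dumps(memory)
```

Note that `json.JSONDecodeError` is a subclass of `ValueError`, so an empty or malformed memory string falls back to the template.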

Use your latest response to populate and update the current memory with factual information to solve the task based on the task description.

## Task Description:
{ task description }

## Current Task Trajectory:
{ task trajectory }

## Current Memory:
{ current memory }

I have an AI agent that has to complete a task. The agent has a memory that is updated each time the LLM responds by comparing the latest response and the existing memory, and adding any new important information. The memory should be templated based on the nature of the task following a json-style format. The memory update is conducted as a prompted LLM call to update the memory. Provide the instructions to the agent for such an update operation, as well as the generic memory template for this particular task. Provide the full answer as a single prompt. Only include the most crucial details to the updating instructions to preserve token usage. Do not explain or describe the prompt, simply return the prompt and nothing more. This is the task description: { task description }

Table 2: Mistral:7b

Full benchmarking results

Table 4: Rewards and tokens for each memory framework on the three benchmark problem sets.

Full prompts and example outputs

You are maintaining the memory of an agent working as [ROLE] in a multi-agent conversation. Use the old memory and the newest output by the agent to populate and update the current memory json with factual information.

For context, old memory content: [MEMORY CONTENT]

Current content generated by the agent: [AGENT OUTPUT]

Update the memory content to incorporate new information while preserving key historical context. The updated content should be concise and focus on information relevant to both the old memory and the newly generated output.

Figure 8: Prompt of the memory update function, where ROLE is the agent's role specification $R_n$; MEMORY CONTENT is the current content $M_{n,m-1}$; AGENT OUTPUT is the agent's output $O_{n,m}$.
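Instantiating the Figure 8 placeholders reduces to string templating. The wording below is taken from the figure, while the constant and function names are illustrative, not the paper's code:

```python
# Template text from Figure 8, with the bracketed placeholders turned into
# str.format fields (names here are illustrative).
MEMORY_UPDATE_PROMPT = (
    "You are maintaining the memory of an agent working as {role} in a "
    "multi-agent conversation. Use the old memory and the newest output by "
    "the agent to populate and update the current memory json with factual "
    "information.\n\n"
    "For context, old memory content: {memory_content}\n\n"
    "Current content generated by the agent: {agent_output}\n\n"
    "Update the memory content to incorporate new information while "
    "preserving key historical context. The updated content should be "
    "concise and focus on information relevant to both the old memory and "
    "the newly generated output."
)

def build_update_prompt(role, memory_content, agent_output):
    return MEMORY_UPDATE_PROMPT.format(
        role=role, memory_content=memory_content, agent_output=agent_output
    )

prompt = build_update_prompt("data engineer", "{}", "Use S3 for raw storage.")
```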

''' This discussion session is set up to discuss the best data pipeline for a real time data intensive machine learning training and inference self driving application. The goal is to discuss and find consensus on how to set up the data pipeline, including each component in the data pipeline.

You can assume that we have access to AWS.

Data Description: Real-time data of cars driving in street. There are 6 camera sources with data in .jpg format; 1 lidar source in .pcd.bin format; and 5 radar sources with data in .pcd format.

Discussion and Design:

-Emphasize comprehensive understanding of the data sources, processing requirements, and desired outcomes.

-Encourage each other to engage in an open discussion on potential technologies, components, and architectures that can handle the diverse data streams and real-time nature of the data.

-Keep the conversation on design and evaluating the pros and cons of different design choices, considering scalability, maintainability, and cost-effectiveness.

-The team should agree on a final architectural design, justifying the choices made.

-The team should produce the required document PIPELINE OVERVIEW.json.

-Provide a high-level plan and rationale explaining the design, and why it is well-suited for the use case.

-Estimate the cloud resources, implementation efforts, and associated costs, providing a rough breakdown and complexity rating.

-Generate a 'PIPELINE OVERVIEW.json' file, detailing the proposed complete architecture in JSON format with the following fields:

-"Platform": A cloud service provider's name if the cloud solution is the best, or "local server" if locally hosted servers are preferred.

-"Component 1": The first component in the pipeline framework.

-"Component 2": The second component in the pipeline framework. Continue until all required components are listed.

-"Implementation difficulties": A rating from 1 to 10 (lowest to highest).

Remember, this is a collaborative conversation focused on architectural challenges, not project execution. Emphasize well-thought-out design, and refrain from assigning tasks with deadlines.

'''

''' You are an expert in data pipeline design evaluation. Your task is to evaluate the following data pipeline design based on its description. For each of the specified metrics, assign a score from 1 to 10, where 1 is the lowest and 10 is the highest. Provide a brief justification for each score. Be critical and harsh if the design is poor and give it a low score. Base your evaluation solely on the provided description. Do not assume any additional information.

Metrics:
1. Scalability: Ability to handle increasing data volumes or user loads.
2. Reliability: Ability to handle failures and ensure data integrity.
3. Usability: Enough detail for developers to implement the design.
4. Cost-effectiveness: Balance between costs and benefits.
5. Documentation: How well-justified is the choice of elements for the data pipeline.

Provide your evaluation in the following format in a json dict: [metric1]: score: [score], justification: [justification], [metric2]... '''

Figure 10: Prompt for the LLM-as-a-Judge to evaluate data pipeline designs.
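The judge's output format in Figure 10 is only loosely specified, so downstream scoring needs a small parser. The sketch below assumes the reply is a JSON dict of {metric: {"score", "justification"}}; this schema, the function name, and the range check are assumptions for illustration, not the paper's exact code.

```python
import json

METRICS = ["Scalability", "Reliability", "Usability",
           "Cost-effectiveness", "Documentation"]

def parse_judge_scores(raw_reply):
    """Extract the five 1-10 metric scores from the judge's JSON reply."""
    data = json.loads(raw_reply)
    scores = {}
    for metric in METRICS:
        score = int(data[metric]["score"])
        if not 1 <= score <= 10:
            raise ValueError(f"{metric} score out of 1-10 range: {score}")
        scores[metric] = score
    return scores

# A synthetic reply in the assumed schema.
reply = json.dumps({m: {"score": 7, "justification": "adequate"} for m in METRICS})
scores = parse_judge_scores(reply)
```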

"Platform": "AWS" "Component 1": "Amazon S3 (Data Storage)" "pros": "Scalable, durable, and secure storage for raw data." "cons": "Additional cost for storing large amounts of data." "design": "Use S3 as a central repository for all data sources, with separate buckets for each source if needed." "details": "Implement S3 event notifications to trigger data processing workflows upon new data arrival." "Component 2": "Amazon EC2 (GPU Instances) | Image & Point Cloud Processing" "pros": "Scalable, high-performance computing resources for complex image and point cloud processing." "cons": "Higher cost due to instance usage and GPU pricing." "design": "Use EC2 NVIDIA GPU instances for camera source data processing." "details": "Implement Docker-based containerization for reproducibility and efficient execution." "Component 3": "Amazon SageMaker | Model Training & Processing" "pros": "Fully managed, high-performance infrastructure for ML training and inference." "cons": "Higher cost for GPU-backed instances." "design": "Use SageMaker for lidar and radar data processing, including preprocessing and feature extraction." "details": "Use Docker containers to ensure stable, reproducible ML pipelines." "Component 4": "Amazon Kinesis | Real-Time Streaming" "pros": "Scalable, fast, and secure streaming and analytical capabilities." "cons": "Additional cost for high-throughput streaming." "design": "Use Kinesis for real-time ingestion, analytics, and event-driven triggers." "details": "Implement aggregation, filtering, and visualization workflows to support decisions." "Component 5": "Amazon Redshift | Batch Analytics" "pros": "High-performance analytics engine with strong scaling capabilities." "cons": "Additional cost for warehousing large datasets." "design": "Use Redshift for batch analytical workloads across multi-source data." "details": "Implement analytics queries and visualization pipelines for reporting and BI."

Figure 11: Example output from the intrinsic memory agent system which scored the highest. The intrinsic memory agent system adheres to the initial prompt well, providing an answer in the correct format, sufficient justification for the choice of components, and implementation details.

"Component 6": "Amazon IAM | Identity & Access Control" "pros": "Secure, scalable identity and permissions management." "cons": "Requires careful design to avoid security misconfiguration." "design": "Implement IAM roles and policies to manage access across all pipeline components." "details": "Use Cognito for user authentication and authorization where appropriate." "Component 7": "Amazon Cognito | Authentication" "pros": "Managed, secure identity federation and authentication." "cons": "Additional cost depending on usage tiers." "design": "Use Cognito for managing user identity, login, and token issuance." "details": "" "Connections 1": "Amazon S3 → Amazon EC2 (GPU instances) (API)" "Connections 2": "Amazon EC2 (GPU instances) → Amazon SageMaker (API)" "Connections 3": "Amazon SageMaker → Amazon Kinesis (API)" "Connections 4": "Amazon Kinesis → Amazon Redshift (API)" "Connections 5": "Amazon IAM → Amazon Cognito (API)" "Potential Challenge 1": "Scalability" "Potential Challenge 2": "Performance" "Potential Challenge 3": "Security" "Rationale": "Pipeline is designed to be scalable, maintainable, and secure using microservices-based architecture." "Complexity Rating": "7" "Estimated Cloud Resource 1": "S3: 10{20 TB of storage" "Estimated Cloud Resource 2": "EC2 (GPU instances): 4{8 GPU-enabled instances" "Estimated Cloud Resource 3": "SageMaker: 2{4 GPU-enabled instances" "Estimated Cloud Resource 4": "Kinesis: 2{4 streams" "Estimated Cloud Resource 5": "Redshift: 2{4 clusters" "Implementation Effort": "12{16 weeks" "Cost 1": "S3: $500{$2,000 per month" "Cost 2": "EC2 (GPU instances): $5,000{$20,000 per month" "Cost 3": "SageMaker: $2,000{$10,000 per month" "Cost 4": "Kinesis: $1,000{$5,000 per month" "Cost 5": "Redshift: $2,000{$10,000 per month"

"Platform": "AWS", "Component 1": "Name": "Data Ingestion", "Description": "Ingest data from various sources (camera, lidar, radar) at high speeds", "Implementation difficulties": 7, "Maintainability difficulties": 6 , "Component 2": "Name": "Image Processing", "Description": "Process camera data in real-time to extract relevant features", "Implementation difficulties": 8, "Maintainability difficulties": 7 , "Component 3": "Name": "Object Detection", "Description": "Detect objects of interest (e.g., pedestrians, cars, lanes) within image frames and track their movement over time", "Implementation difficulties": 9, "Maintainability difficulties": 8 , "Component 4": "Name": "Lidar Data Processing", "Description": "Process lidar data in real-time to estimate distances, velocities, and other relevant metrics", "Implementation difficulties": 8, "Maintainability difficulties": 7 , "Component 5": "Name": "Radar Data Processing", "Description": "Process radar data in real-time to estimate distances, velocities, and other relevant metrics", "Implementation difficulties": 8, "Maintainability difficulties": 7 , "Component 6": "Name": "Data Fusion", "Description": "Fuse the outputs from different sensors (camera, lidar, radar) to create a more accurate representation of the environment", "Implementation difficulties": 9, "Maintainability difficulties": 8 , "Component 7": "Name": "Model Training", "Description": "Train machine learning models on large datasets using AWS SageMaker's Training Grounds feature", "Implementation difficulties": 8, "Maintainability difficulties": 7 , "Component 8": "Name": "Inference", "Description": "Perform real-time inference on trained models, making predictions on new, unseen data", "Implementation difficulties": 9, "Maintainability difficulties": 8

Figure 13: Example output from baseline Autogen system which scored the highest.

Metric              | Baseline Autogen | Intrinsic Memory | p-value
Tokens              | 36077            | 47830            | 0.0195
Conversation turns  | 14.3             | 16               | 0.2632
Scalability         | 3.7              | 5.7              | 0.0004
Reliability         | 2.3              | 4.9              | 0.0003
Usability           | 3.2              | 4.9              | 0.0093
Cost-effectiveness  | 2.3              | 4.7              | 0.001
Documentation       | 3.8              | 5.4              | 0.0077
Benchmark | Memory            | Rewards Mean | Rewards Std | Average tokens
PDDL      | Intrinsic         | 0.063648     | 0.004113    | 848,052
PDDL      | Intrinsic-LLM     | 0.069740     | 0.010224    | 609,102
PDDL      | Intrinsic-Generic | 0.066198     | 0.008096    | 613,631
FEVER     | Intrinsic         | 0.119974     | 0.024228    | 1,395,395
FEVER     | Intrinsic-LLM     | 0.300746     | 0.051672    | 949,377
FEVER     | Intrinsic-Generic | 0.379853     | 0.041365    | 1,079,041
ALFWorld  | Intrinsic         | 0.015390     | 0.000317    | 1,789,989
ALFWorld  | Intrinsic-LLM     | 0.014944     | 0.000046    | 991,884
ALFWorld  | Intrinsic-Generic | 0.029926     | 0.000116    | 1,027,602
Benchmark | Memory            | Rewards Mean | Rewards Std | Average tokens
PDDL      | Intrinsic         | 0.253164     | 0.049818    | 379,786
PDDL      | Intrinsic-LLM     | 0.253986     | 0.032846    | 359,544
PDDL      | Intrinsic-Generic | 0.260255     | 0.022382    | 350,541
FEVER     | Intrinsic         | 0.608603     | 0.006874    | 178,456
FEVER     | Intrinsic-LLM     | 0.649391     | 0.004576    | 86,849
FEVER     | Intrinsic-Generic | 0.653500     | 0.005910    | 88,274
ALFWorld  | Intrinsic         | 0.025065     | 0.000262    | 942,916
ALFWorld  | Intrinsic-LLM     | 0.045002     | 0.000350    | 883,913
ALFWorld  | Intrinsic-Generic | 0.048296     | 0.008294    | 879,134
Benchmark | Memory            | Rewards Mean | Rewards Std | Average tokens
ALFWorld  | ChatDev           | 0.020904     | 0.006491    | 174,451
ALFWorld  | Empty             | 0.020408     | 0.000000    | 113,817
ALFWorld  | G-Memory          | 0.040851     | 0.019529    | 104,492
ALFWorld  | Generative        | 0.060656     | 0.031094    | 181,211
ALFWorld  | MemoryBank        | 0.031623     | 0.004932    | 160,273
ALFWorld  | MetaGPT           | 0.030777     | 0.000547    | 136,716
ALFWorld  | Voyager           | 0.072112     | 0.034554    | 102,482
ALFWorld  | Intrinsic-Generic | 0.048296     | 0.008294    | 784,119
ALFWorld  | Intrinsic-LLM     | 0.045002     | 0.000350    | 882,158
FEVER     | ChatDev           | 0.617500     | 0.016583    | 80,656
FEVER     | Empty             | 0.646667     | 0.015706    | 55,539
FEVER     | G-Memory          | 0.628792     | 0.027763    | 267,823
FEVER     | Generative        | 0.651250     | 0.011087    | 117,278
FEVER     | MemoryBank        | 0.632839     | 0.010771    | 65,997
FEVER     | MetaGPT           | 0.667000     | 0.028853    | 48,998
FEVER     | Voyager           | 0.643000     | 0.025884    | 62,555
FEVER     | Intrinsic-Generic | 0.653500     | 0.005910    | 88,274
FEVER     | Intrinsic-LLM     | 0.649391     | 0.004576    | 86,849
PDDL      | ChatDev           | 0.222746     | 0.021114    | 70,765
PDDL      | Empty             | 0.224329     | 0.019766    | 52,075
PDDL      | G-Memory          | 0.152222     | 0.039180    | 162,712
PDDL      | Generative        | 0.164944     | 0.021977    | 84,113
PDDL      | MemoryBank        | 0.158083     | 0.013697    | 66,299
PDDL      | MetaGPT           | 0.197106     | 0.018546    | 92,436
PDDL      | Voyager           | 0.191088     | 0.031562    | 68,483
PDDL      | Intrinsic-Generic | 0.260255     | 0.022382    | 352,301
PDDL      | Intrinsic-LLM     | 0.253986     | 0.032846    | 359,302

Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through structured agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory templates that preserve specialized perspectives while focusing on task-relevant information. We benchmark our approach on the PDDL dataset, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing an improvement of 38.6% with the highest token efficiency. An additional evaluation is performed on a complex data pipeline design task, where we demonstrate that our approach produces higher quality designs across 5 metrics: scalability, reliability, usability, cost-effectiveness, and documentation, with additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through structured, intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks.

Recent advances in large language models (LLMs) have enabled their application as autonomous or semi-autonomous agents capable of complex reasoning and decision-making (Huang et al. 2024). Multi-agent LLM systems, where multiple LLM instances interact to solve problems collaboratively, have shown particular promise for tasks requiring diverse expertise (Park et al. 2023; Qian et al. 2025). These systems leverage the complementary capabilities of specialized agents to address challenges that would be difficult for single-agent approaches to resolve effectively.

Despite their theoretical advantages, multi-agent LLM systems face several implementation challenges that limit their practical effectiveness, from coordination overhead to consistency in role adherence among the agents (Li et al. 2024b). Most critically, the fixed-size context windows of LLMs restrict their ability to maintain long-term conversational context, an issue that is exacerbated in multi-agent frameworks with multiple agents in a single conversation. This leads to issues such as perspective inconsistency, forgetting of key requirements, and procedural drift. Current solutions such as Retrieval-Augmented Generation (RAG) (Lewis et al. 2020; Gao et al. 2024) and agentic memory approaches (Packer et al. 2024; Xu et al. 2025; Chhikara et al. 2025) are designed for single-agent, user-interaction scenarios, and do not account for the volume of information growing with the number of agents.

To address these challenges, we introduce Intrinsic Memory Agents, a novel multi-agent architecture that uses structured, agent-specific memories aligned with conversational objectives. Unlike previous approaches, our system maintains memories that are specific to each agent, ensuring heterogeneity; these memories reflect both historical context and recent developments while preserving agent-specific perspectives. The intrinsic nature of memory updates, derived directly from agent outputs rather than external summarization, ensures unique memories that maintain consistency with agent-specific reasoning patterns and domain expertise. We evaluate our approach through benchmarking and through a specific data pipeline design case study to show its practical usage. The evaluation demonstrates that our Intrinsic Memory Agents approach yields significant improvements in conversational coherence, role consistency, and collaborative efficiency compared to conventional multi-agent implementations. These improvements translate to qualitative enhancements in solution quality without increasing the number of conversation turns, suggesting broad applicability across domains where multi-agent LLM systems are deployed.

The main contributions of our work are as follows:

Structured Memory Templates: Predefined memory structures aligned with agent roles and conversational objectives.

Recent years have seen significant progress in the development of multi-agent systems powered by LLMs. These systems have been applied in various domains, such as software development, scientific experimentation, gaming, and social simulation (Li et al. 2024b). For example, in software development, multi-agent systems enable concurrent consideration of different aspects such as architectural design, security, user experience, and performance optimization (Hong et al. 2024). Hallucinations due to outdated knowledge or retrieval extraction issues remain a major challenge that limits the effectiveness of multi-agent systems. The use of a shared knowledge base or memory storage is important for maintaining up-to-date, coherent, and correct information among agents.

In agent-based systems, memory is pivotal for maintaining context, learning from historical interactions, and making informed decisions. As Zhang et al. (2024) noted, memory supports tasks such as ensuring conversation consistency and effective role-playing for single-agent systems. In multi-agent systems, memory facilitates coordination, communication, and collaborative problem-solving, as Guo et al. (2024) discussed.

Memory in LLMs can be categorized into short-term memory and long-term memory. Short-term memory is information that fits within the model's fixed context window. Commercial LLMs such as GPT-4o (OpenAI 2024) and Claude (Anthropic 2024) are able to process large contexts of over 100K tokens, with some models, such as Gemini 2.5 Pro (Comanici et al. 2025), able to process over 1 million tokens in their context windows. However, the hard limit of the context window size remains, and increasing the context length does not necessarily increase the reasoning or learning capabilities of the LLM (Li et al. 2024a). This is because a long context can push related pieces of relevant information further apart within the context window.

Long-term memory is information that persists beyond the context window or single instance of an LLM. This information can be stored in external databases and retrieved using RAG techniques (Lewis et al. 2020; Gao et al. 2024). Long-term memory aims to alleviate the issue of short-term memory’s limited capacity, but introduces other disadvantages such as retrieval noise, the complexity of building a retrieval system, latency, and storage costs (Asai et al. 2024; Yu, Zhang, and Feng 2024).

The limitations of context length and existing memory mechanisms are particularly pronounced in multi-agent settings, where the volume of information exchanged grows approximately linearly with the number of agents involved. As multi-agent conversations extend, the probability of critical information being at a long distance or even excluded from the accessible context increases dramatically. This information loss undermines the primary advantage of multi-agent systems: the integration of diverse, specialized perspectives toward cohesive solutions. This is exacerbated by current long-term memory approaches which provide a homogeneous memory for the agents, decreasing the benefits of having agents focused on a single part of the task. Our proposed approach therefore focuses on the heterogeneity of agents and their memories, ensuring role consistency and coherence.

Agentic memory offers a solution to long-term memory and limited contextual information by periodically condensing conversation history into concise summaries (Wang et al. 2025; Chen et al. 2024). These approaches generate sequential or hierarchical summaries that capture key decisions and insights from previous exchanges. Some agentic memory approaches combine with RAG approaches by storing the summarized contexts for retrieval later in the conversation (Xu, Szlam, and Weston 2022), or by storing in- and out-of-context memory in a hierarchical system to dynamically adapt the current context (Packer et al. 2024; Xu et al. 2025). While agentic memory methods provide better contextual integration than pure retrieval approaches, they frequently lose critical details during the condensation process. Furthermore, the undirected and unstructured nature of general summarization often fails to preserve role-specific perspectives and specialized knowledge that are essential to effective multi-agent collaboration.

Our proposed Intrinsic Memory Agents framework similarly uses an agentic memory approach to summarize and store information. Unlike existing approaches, we introduce structured heterogeneous memory for each agent in the multi-agent system to maintain specialized roles in collaborative tasks, and apply a structured template to each agent, ensuring a cohesive memory structure. This addresses the limitations of existing memory mechanisms by ensuring that each agent maintains its own memory, reflecting both historical context and new information while maintaining heterogeneous agent-specific perspectives and expertise.

The various agentic memory approaches are all designed for single-agent scenarios, to remember crucial details when interacting with an end-user. Due to the long, multi-turn conversations between agents, a direct implementation of single-agent agentic memory becomes complicated and resource-intensive, with each agent requiring its own retrieval system and distinct contextual updates.

We propose Intrinsic Memory Agents, a framework for multi-agent LLM systems that maintains agent-specific structured memories aligned with conversational objectives. Figure 1 illustrates the architecture of our Intrinsic Memory Agents framework. In this approach, the user issues a query; the first agent comments based on its role description; the conversation is updated, followed by a memory update for the agent that commented; consensus is checked; and the cycle starts again. The context in this case comprises both the agent's intrinsic memory and the conversation, meaning that as the conversation continues the agents increasingly diverge in their interpretation of that context.

Let us define the multi-agent system $\mathcal{A}=\{A_1, A_2, \ldots, A_N\}$ consisting of $N$ agents. Each agent $A_n=(R_n, M_n, LLM_n)$ is characterized by a role specification $R_n$ that defines the agent's expertise domain and objectives, a structured memory $M_n$ that evolves throughout the conversation, and an LLM instance $LLM_n$, which may share parameters between agents.
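The agent definition $A_n=(R_n, M_n, LLM_n)$ can be sketched minimally in code. This is an illustrative data model, not the paper's implementation; all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Illustrative sketch of an agent A_n = (R_n, M_n, LLM_n): a role
# specification, a structured memory, and an LLM callable (possibly shared).
@dataclass
class Agent:
    role: str                                              # R_n: expertise domain and objectives
    memory: Dict[str, Any] = field(default_factory=dict)   # M_n: structured memory, evolves per turn
    llm: Callable[[str], str] = lambda prompt: ""          # LLM_n: prompt -> completion

agents = [Agent(role=r) for r in ("infrastructure", "business objectives", "machine learning")]
```

Sharing one `llm` callable across several `Agent` instances mirrors the paper's note that LLM instances may share parameters between agents.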

The conversation consists of a sequence of turns $T=\{t_1, t_2, \ldots, t_M\}$ where each turn $t_m$ involves an agent selection function $\sigma(t_m)\rightarrow A_n$ that determines which agent speaks, an input context $C_{n,m}$ constructed for the selected agent, an output $O_{n,m}$ generated by the selected agent, and a memory update operation for the selected agent.

Critically, our framework separates the input context construction and memory update processes, allowing for agent-specific memory maintenance while preserving a shared conversation space.

For each agent $A_n$, we define a structured memory template $MT_n$ that specifies the organization of agent-specific memories. The template consists of a set of memory slots $MT_n=\{S_1, S_2, \ldots, S_K\}$. Each slot $S_k$ is defined by a descriptive identifier (e.g., "domain_expertise", "current_position", "proposed_solution"), formatted in JSON. The template can be nested so that each slot can have its own descriptive identifier with further detailed information. The structured nature of the memory templates ensures that updates remain focused on role-relevant information while maintaining consistency with the agent's expertise domain.
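A template of this shape might look as follows. The top-level slot names follow the paper's examples; the nested sub-slots for a hypothetical Infrastructure Engineer agent are our own illustrative assumptions.

```python
import json

# Illustrative structured memory template. The top-level identifiers
# ("domain_expertise", "current_position", "proposed_solution") follow the
# examples in the text; the nested sub-slots are hypothetical.
infrastructure_template = {
    "domain_expertise": "cloud infrastructure design",
    "current_position": "",
    "proposed_solution": {
        "components": [],     # nested slot: pipeline components proposed so far
        "open_concerns": [],  # nested slot: unresolved reliability or cost issues
    },
}

print(json.dumps(infrastructure_template, indent=2))
```

The nesting shows how a slot can itself carry further detailed identifiers while remaining plain JSON.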

For each agent in the system, we maintain a structured memory $M_n$ that evolves over time. Let $M_{n,m}$ represent the memory of agent $n$ after $m$ conversation turns. The memory update process works as follows:

Agent $A_n$ receives input context $C_{n,m}$ consisting of relevant conversation history $H_m$ and previous memory $M_{n,m-1}$,

Then, with the generated output $O_{n,m}$ and the previous memory $M_{n,m-1}$, we update the slot content using a memory update function.

The memory update function $f_{\text{memory\_update}}$ is implemented as a prompted LLM operation. Specifically, for the previous memory $M_{n,m-1}$ at turn $m-1$ and agent output $O_{n,m}$ at turn $m$, the update function constructs the prompt as shown in Figure 2.

The LLM's response to this prompt becomes the updated memory $M_{n,m}$.
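A minimal sketch of $f_{\text{memory\_update}}$ as a prompted LLM call is shown below. The prompt wording is a hypothetical stand-in for the one in Figure 2, and `llm` is any callable mapping a prompt string to a completion; the stub used here only exists for demonstration.

```python
import json

def memory_update(llm, previous_memory: dict, agent_output: str) -> dict:
    # Sketch of f_memory_update: the previous memory and the agent's latest
    # output are folded into a prompt; the LLM's JSON reply is the new memory.
    prompt = (
        "You are updating an agent's structured memory.\n"
        f"Previous memory (JSON):\n{json.dumps(previous_memory)}\n"
        f"Agent's latest output:\n{agent_output}\n"
        "Return the updated memory as JSON with the same slots."
    )
    return json.loads(llm(prompt))

# Stub LLM for illustration: records the latest output in one memory slot.
def stub_llm(prompt: str) -> str:
    output = prompt.split("Agent's latest output:\n")[1].rsplit("\nReturn", 1)[0]
    return json.dumps({"current_position": output})

updated = memory_update(stub_llm, {"current_position": ""}, "Use AWS Kinesis for ingestion")
```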

The context construction function $f_{\text{context}}$ presented in Equation 1 determines what information is provided to an agent when generating a response. Our implementation is displayed in Algorithm 1.

$M_{n,0} \leftarrow$ initialized;

def count_tokens(text):
    # Stand-in token counter for illustration; the implementation
    # would use the underlying model's tokenizer.
    if isinstance(text, list):
        return sum(count_tokens(t) for t in text)
    return len(text.split())

def construct_context(conversation_history, agent_memory, max_tokens):
    context = []

    # Include the initial task description and the agent's structured memory
    context.append(conversation_history[0])
    context.append(agent_memory)

    # Add most recent conversation turns until the context limit is reached
    remaining_tokens = max_tokens - count_tokens(context)
    recent_turns = []

    for turn in reversed(conversation_history[1:]):
        turn_tokens = count_tokens(turn)
        if turn_tokens <= remaining_tokens:
            recent_turns.insert(0, turn)
            remaining_tokens -= turn_tokens
        else:
            break

    context.extend(recent_turns)
    return context

This algorithm prioritizes:

The initial task description to maintain objective alignment.

The agent’s structured memory to preserve role consistency.

By prioritizing memory inclusion over exhaustive conversation history, the algorithm ensures that agents maintain role consistency and task alignment even when conversation length exceeds context window limitations.
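The prioritization can be exercised with a toy example. Below, the algorithm is restated compactly with a whitespace "tokenizer" standing in for a real one; the conversation content is invented for illustration.

```python
# Compact restatement of the context construction algorithm with a
# whitespace tokenizer, to show that the task description and agent memory
# survive even when older turns must be dropped.
def count_tokens(x):
    return sum(len(t.split()) for t in x) if isinstance(x, list) else len(x.split())

def construct_context(history, memory, max_tokens):
    context = [history[0], memory]          # task description + agent memory first
    remaining = max_tokens - count_tokens(context)
    recent = []
    for turn in reversed(history[1:]):      # newest turns first
        if count_tokens(turn) <= remaining:
            recent.insert(0, turn)
            remaining -= count_tokens(turn)
        else:
            break
    return context + recent

history = ["design a data pipeline", "turn one is quite long " * 10, "turn two", "turn three"]
context = construct_context(history, "memory: prefers AWS Kinesis", max_tokens=20)
```

With a 20-token budget, the long early turn is dropped while the task description, the memory, and the two most recent turns are kept.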

To evaluate our approach, we test our memory agents against the PDDL (Planning Domain Definition Language) numeric benchmark. PDDL involves structured planning tasks from AgentBoard (Ma et al. 2024), where the agents generate executable plans for abstract problem domains, evaluating their reasoning and coordination.

For numerical benchmarks, we follow the same experimental methodology as G-Memory (Zhang et al. 2025), another memory framework for multi-agent systems. We re-run the G-Memory framework (https://github.com/bingreeky/GMemory) as we cannot directly compare to the published G-Memory results, which were benchmarked with GPT-4o-mini as the base language model. We chose the G-Memory framework as a comparison because it implements a variety of existing memory architectures, allowing us to compare our Intrinsic Memory Agents with existing architectures and benchmarks. G-Memory uses Autogen for multi-agent simulation, matching our use of Autogen for our architecture. We chose the PDDL dataset as its structured planning task is the intended use case for the Intrinsic Memory Agents and aligns with the data pipeline case study detailed in Section 5. We run Llama3.1:8b for the numeric benchmarks using Ollama (https://ollama.com/library), with a single run and a set seed for reproducibility. We use the larger 8b model as initial tests with the 3b model found poor results for every benchmark and memory framework. Our computational infrastructure utilizes a high performance computing cluster with A100 GPUs, running on GNU/Linux 4.18.0-553.el8_10.x86_64.

Table 1 shows the rewards and tokens used for different memory architectures with a multi-agent system on the PDDL benchmark problems. We find substantially better average rewards for PDDL at 0.0833, compared to the next highest, MetaGPT, at 0.0601. The improved rewards come at the cost of increased token usage, the highest among all memory architectures, requiring 12,558 more tokens than MetaGPT. Based on token efficiency, defined as average reward per token, Intrinsic Memory shows the best result at $5.933\times 10^{-7}$, with ChatDev second at $5.466\times 10^{-7}$.
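Token efficiency as defined here is simply average reward divided by average tokens. The computation below uses the Table 1 figures; small differences from the quoted values are due to rounding in the table entries.

```python
# Token efficiency = average reward per token, from Table 1.
results = {
    "Intrinsic Memory": (0.0833, 140_418),
    "ChatDev": (0.0583, 107_043),
    "MetaGPT": (0.0601, 127_860),
}
efficiency = {name: reward / tokens for name, (reward, tokens) in results.items()}
```

Intrinsic Memory's higher reward more than offsets its larger token budget, so its reward-per-token remains the best of the three.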

The PDDL dataset consists of structured planning tasks, which fit the intended use case of Intrinsic Memory for agent discussion, planning, and design. As Intrinsic Memory assigns agent-specific memory, it can more clearly distinguish the planning and actions needed to complete tasks. Intrinsic Memory uses more tokens to generate structured templates per agent per round of discussion, a trade-off that pays off in both reward score and token efficiency. Notably, all memory architectures except ChatDev and Intrinsic Memory, the two most token-efficient methods, utilize cross-trial memory, in which memory is stored and carried across tasks; this significantly helps provide few-shot examples when agents are stuck in a loop in the environment.

As a practical case study to evaluate our approach, we applied our memory agents to a collaborative data pipeline design, a complex task requiring multiple perspectives. We perform 10 independent runs with eight specialized agents, including:

Evaluation Agent (EA) evaluates the output solutions.

Knowledge Integration Agent (KIA) summarizes each discussion round (e.g. after every agent has contributed at least once).

Infrastructure Engineer (IA) designs the cloud infrastructure.

Business Objective Engineer (BOA) checks against business requirements.

Machine Learning Engineer (MLE) provides ML implementation.

Documentation Joining Agent (DJE) is responsible for producing final output after consensus is reached among agents.

The agents are tasked with designing a cloud-based data pipeline architecture through a structured process involving proposals, discussions, and consensus formation. The full prompts and task descriptions can be found in Appendix A.

The output requirements include a concise summary, high-level plan, resource estimates, and a structured JSON specification.

We compare two system configurations.

Baseline System, a standard multi-agent implementation without structured memory, which:

Uses standard prompt templates for each agent role.

Relies exclusively on conversation history for context.

Is limited to the most recent conversation turns due to context window constraints.

Intrinsic Memory System, our proposed approach, which additionally:

Implements role-specific memory templates.

Updates memories intrinsically based on agent outputs.

Both systems used identical agent roles and task specifications, with Llama-3.2-3b as the underlying LLM. Each agent role was initialized with the same role description and initial instructions across the two system configurations to ensure a fair comparison.

We implemented an agent selection function that iterates through each worker agent and the conversation delegation agent (CDA), ensuring that all unique perspectives are represented in the agent discussion. Once all agents have accepted a proposed solution, marked through the "ACCEPT" flag, the CDA emits a "FINALIZE" flag, prompting the documentation joining agent to produce the final data pipeline output, as shown in Algorithm 2.
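The consensus check at the heart of this loop can be sketched as follows. This is a simplified illustration of the ACCEPT/FINALIZE mechanism; the function name, the round bookkeeping, and the toy transcripts are all assumptions, not the paper's implementation.

```python
# Simplified sketch of the finalisation check: each round collects the
# workers' outputs; once every agent has emitted "ACCEPT", the CDA's
# "FINALIZE" flag triggers the documentation agent.
def run_consensus(workers, responses):
    """responses: dict mapping worker name -> list of outputs, one per round."""
    n_rounds = max(len(v) for v in responses.values())
    for round_idx in range(n_rounds):
        votes = [responses[w][round_idx] for w in workers if round_idx < len(responses[w])]
        if votes and all(v == "ACCEPT" for v in votes):
            return "FINALIZE"   # CDA signals the documentation agent to compile output
    return "CONTINUE"           # no consensus yet; discussion proceeds

flag = run_consensus(
    ["BOA", "MLE", "IA"],
    {"BOA": ["propose", "ACCEPT"], "MLE": ["discuss", "ACCEPT"], "IA": ["discuss", "ACCEPT"]},
)
```

In this toy transcript, round one has no consensus, while round two is unanimous, so the flag is "FINALIZE".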

To evaluate the quality of the data pipeline designs generated by our memory agents, and to compare them to the pipeline designs generated by default Autogen, we use an LLM as a judge (Zheng et al. 2023) to score each pipeline design, and provide a qualitative analysis to support these scores. We evaluated the multi-agent system performance under the following metrics:

Scalability: ability of the data pipeline to handle increasing data volumes or user loads.

Reliability: consistency and fault tolerance of the pipeline.

Usability: is there enough detail in the data pipeline design for developers to implement the design?

Cost-effectiveness: balance between costs and benefits of each chosen component.

Documentation: how well-justified and documented is the choice of elements for the data pipeline?
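A minimal LLM-as-a-judge scoring loop over these metrics might look like the sketch below. The prompt wording, the 1-to-10 scale, and the stub judge are all assumptions for illustration; `judge` stands in for any prompt-to-text model call.

```python
import re

# Hypothetical LLM-as-a-judge scorer: one numeric score per quality metric.
METRICS = ["scalability", "reliability", "usability", "cost-effectiveness", "documentation"]

def score_design(judge, design: str) -> dict:
    scores = {}
    for metric in METRICS:
        reply = judge(f"Rate the {metric} of this pipeline design from 1 to 10:\n{design}")
        scores[metric] = int(re.search(r"\d+", reply).group())  # extract the first number
    return scores

# Stub judge standing in for a real model; always answers 7.
stub_judge = lambda prompt: "Score: 7"
scores = score_design(stub_judge, "Kinesis -> Lambda -> S3")
```

Repeating this over the 10 independent runs of each system yields the per-metric distributions that the significance tests are computed on.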

The scalability and reliability metrics are chosen as core requirements in modern data pipelines. Scalability reflects the ability to grow the pipeline to handle larger volumes of data and users, while reliability ensures the pipeline is consistent and fault-tolerant, both of which are crucial if the pipeline were to be deployed. The usability and documentation metrics reflect the detail and design decisions taken. A strong design is not useful if it does not contain enough detail or is too abstract to be practically implemented. Usability measures whether the output designs are detailed and clear enough for engineering teams to implement. Design decisions must be well-documented, with clear justifications and explanations for each component, which reveals the reasoning behind the agents' choices. Finally, the cost-effectiveness metric evaluates whether the pipeline design has considered and balanced the need for computation resources with the cost of those resources. Run-time metrics such as latency and throughput are not included in our evaluation as we only present the designs of the data pipelines for evaluation and do not implement them in code.

The median and standard deviation of each quality metric are presented in Figure 3. The Intrinsic Memory system shows consistent improvement on all metrics compared to the baseline Autogen, with Usability the only metric without a statistically significant difference.

The Documentation metric focuses on clarity and how well-justified the design choices are. While Intrinsic Memory boosts the Documentation score over the baseline, the score is still relatively low at a mean of 3.56. This suggests that retaining memory of the conversation alone does not guarantee good justification: while some context and attributes of each component are remembered, the reasons for choosing the components are not. This could be a problem with the training corpus, indicating a requirement for better annotated training data. Similarly, the Usability score is low, with means of 3 and 3.67 for the baseline and Intrinsic Memory, respectively.

The improved quality comes at the cost of additional tokens, outlined in Table 2. Intrinsic Memory uses on average 32% more tokens than the baseline as its outputs are more descriptive, although the number of conversation turns is similar and the difference is not statistically significant. This indicates that the addition of a memory module incurs token overhead to maintain, but does not increase the number of conversation turns between agents.

Figure 4 shows snippets for one component from the highest-scoring outputs for the intrinsic memory agent system and baseline Autogen system.

The Intrinsic Memory Agent system outperforms the baseline system across the five quality metrics. In terms of scalability, the Intrinsic Memory Agent system is capable of providing an overall assessment of scalability, specifically around varying data volumes, whereas the baseline system encapsulates that measure only in the form of "maintenance difficulty" for each component of the pipeline. In terms of reliability, the Intrinsic Memory Agent provides considerations for AWS Kinesis's fault tolerance, as well as the need for appropriate buffering and queuing mechanisms to handle varying data volumes. The Intrinsic Memory Agent also provides a more descriptive output for Usability and a clearer pathway to implementation. Neither system makes specific observations on cost-effectiveness on an individual component basis, but the Intrinsic Memory Agent does provide an overall numerical evaluation of the pipeline's cost-effectiveness. Finally, the Intrinsic Memory Agent provides justification and documents its recommendation under each component, including pros and cons for each component choice.

Overall, the Intrinsic Memory Agent provides a more descriptive answer and more value to engineers by specifying tools, configurations, and trade-offs. For example, its Data Ingestion design recommends Apache Storm or AWS Lambda, whereas the baseline simply states "Ingest data from various sources (camera, lidar, radar) at high speeds." Similarly, the Intrinsic Memory Agent cites OpenCV and TensorFlow for image processing, PCL and Open3D for point-cloud handling, and MATLAB plus machine-learning libraries for radar signals. Although some precise configuration settings remain unspecified, the baseline merely names each component without offering implementation details or alternatives.

The components specified by the Intrinsic Memory Agent are more relevant to the problem specification. The data pipeline design task explicitly specifies that the input contains lidar and radar data sources, for which PCL and Open3D are libraries specifically used for lidar data processing. This contrasts with the vague "Lidar data processing" component of the baseline, which contains only a general description to process the data without providing any details.

Although the Intrinsic Memory Agent approach shows improved performance across the data pipeline generation task and the selected benchmarks, further validation is required across a broader set of complex tasks, potentially with varying numbers of agents, models, and memory templates. The structured memory templates are currently created manually, which does not transfer easily across tasks. An automated or generalized method of producing structured memory templates would improve the Intrinsic Memory Agent's ability to adapt to new tasks.

The results demonstrate that a movement towards heterogeneity of agents leads to an improvement in performance of the multi-agent system, allowing agents to focus more specifically on an area of the design. This indicates that methods to provide additional heterogeneity, such as the ability to fine-tune agents towards their specialisation, might see additional performance gains, alongside the personalization of memories focused on individual experience.

This paper introduces Intrinsic Memory Agents, a novel multi-agent LLM framework that constructs agent-specific, heterogeneous, structured memories to enhance multi-agent collaboration in discussion and planning tasks. Evaluation on the PDDL dataset and on a practical data pipeline design problem demonstrates our framework's improved performance on structured planning tasks, with a 38% increase over the next best memory architecture, while maintaining the best token efficiency despite the increased token usage.

Results on the data pipeline case study further show the Intrinsic Memory Agents' enhanced ability to collaborate on complex tasks. The Intrinsic Memory Agent system outperforms the baseline system across all quality measures of scalability, reliability, usability, cost-effectiveness, and documentation, as well as showing an ability to more closely follow the task specification, providing more actionable recommendations by suggesting specific tools and frameworks and detailing the trade-offs of each component in the pipeline.

Table 1: Performance comparison between different memory architectures on PDDL. We record the average reward per task and the average number of tokens used per task. The best results for each benchmark are highlighted in bold.

Memory | Average rewards | Average tokens
No Memory | 0.0231 | 117,437
G-Memory | 0.0231 | 118,316
Generative | 0.0530 | 121,105
MemoryBank | 0.0305 | 80,599
MetaGPT | 0.0601 | 127,860
Voyager | 0.0250 | 133,268
ChatDev | 0.0583 | 107,043
Intrinsic Memory | 0.0833 | 140,418

Table 2: Mean efficiency and LLM-as-a-Judge metrics after 10 independent runs, with p-values calculated using a Wilcoxon rank-sum test. Usability and number of conversation turns are highlighted in italics as the metrics that do not show statistical significance between the baseline and our Intrinsic Memory approach.

Metric | Baseline Autogen | Intrinsic Memory | p-value
Tokens | 36,077 | 47,830 | 0.0195
Conversation turns | 14.3 | 16 | 0.2632
Scalability | 5 | 7 | 0.0041
Reliability | 3.56 | 4.89 | 0.005
Usability | 3 | 3.67 | 0.0948
Cost-effectiveness | 3.22 | 4.67 | 0.004
Documentation | 2 | 3.56 | 0.0017

Figure 1: Intrinsic Memory Agents Framework. For $n$ agents and $m$ conversation turns, each agent $A_n$ contains its own role description $R_n$ and language model $L_n$. Its memory $M_{n,m}$ is updated based on the input context $C_{n,m}$ and output $O_{n,m}$.

Figure 3: LLM-as-a-Judge metrics for the Data Pipeline design case study.

$$ C_{n,m} = f_{\text{context}}(H_m, M_{n,m-1}) \tag{1} $$

$$ O_{n,m} = L_n(C_{n,m}). $$

Algorithm 1: The context construction algorithm, which takes the current conversation history, the memory of the agent, and the maximum number of tokens. It appends the initial task description and the agent's memory to the context first, before using the remainder of the token budget to append the most recent turns of the conversation history, ensuring the memory and task description are always included.
Algorithm 2: The finalisation algorithm that specifies the order of agents speaking. It is a modified round-robin discussion between the agents: the discussion begins with each of the worker agents (BOA, DEA, MLA, IA) contributing to the conversation, with each worker's turn followed by the conversation delegation agent (CDA). Once the workers have each had their turn, the knowledge integration agent and evaluation agent make their contributions, and the cycle begins again. The CDA is programmed to dedicate a certain number of turns to discussion, proposals, and consensus. The number of turns dedicated to each conversation stage is tracked, and once the consensus round is reached, each agent is asked to confirm whether it agrees with the proposed solution. If all agree that the proposed solution is acceptable, the CDA will emit a "FINALIZE" response, triggering the documentation joining agent (DJE) to compile the agreed response and format it according to the task requirements.
Metric | Baseline Autogen | Intrinsic Memory | p-value
Tokens | 36,077 | 47,830 | 0.0195
Conversation turns | 14.3 | 16 | 0.2632
Scalability | 3.75 | 7 | 0.0004
Reliability | 2.37 | 4.9 | 0.0003
Usability | 3.25 | 4.9 | 0.0093
Cost-effectiveness | 2.37 | 4.7 | 0.001
Documentation | 3.87 | 5.4 | 0.0077
Benchmark | Memory | Rewards mean | Rewards std | Average tokens
PDDL | Intrinsic | 0.063648 | 0.004113 | 848,052
PDDL | Intrinsic-LLM | 0.069740 | 0.010224 | 609,102
PDDL | Intrinsic-Generic | 0.066198 | 0.008096 | 613,631
FEVER | Intrinsic | 0.119974 | 0.024228 | 1,395,395
FEVER | Intrinsic-LLM | 0.300746 | 0.051672 | 949,377
FEVER | Intrinsic-Generic | 0.379853 | 0.041365 | 1,079,041
ALFWorld | Intrinsic | 0.015390 | 0.000317 | 1,789,989
ALFWorld | Intrinsic-LLM | 0.014944 | 0.000046 | 991,884
ALFWorld | Intrinsic-Generic | 0.029926 | 0.000116 | 1,027,602
Benchmark | Memory | Rewards mean | Rewards std | Average tokens
PDDL | Intrinsic | 0.253164 | 0.049818 | 379,786
PDDL | Intrinsic-LLM | 0.253986 | 0.032846 | 359,544
PDDL | Intrinsic-Generic | 0.260255 | 0.022382 | 350,541
FEVER | Intrinsic | 0.608603 | 0.006874 | 178,456
FEVER | Intrinsic-LLM | 0.649391 | 0.004576 | 86,849
FEVER | Intrinsic-Generic | 0.653500 | 0.005910 | 88,274
ALFWorld | Intrinsic | 0.025065 | 0.000262 | 942,916
ALFWorld | Intrinsic-LLM | 0.045002 | 0.000350 | 883,913
ALFWorld | Intrinsic-Generic | 0.048296 | 0.008294 | 879,134
Benchmark | Memory | Rewards mean | Rewards std | Average tokens
ALFWorld | ChatDev | 0.020904 | 0.006491 | 174,451
ALFWorld | Empty | 0.020408 | 0.000000 | 113,817
ALFWorld | G-Memory | 0.040851 | 0.019529 | 104,492
ALFWorld | Generative | 0.060656 | 0.031094 | 181,211
ALFWorld | MemoryBank | 0.031623 | 0.004932 | 160,273
ALFWorld | MetaGPT | 0.030777 | 0.000547 | 136,716
ALFWorld | Voyager | 0.072112 | 0.034554 | 102,482
ALFWorld | Intrinsic-Generic | 0.048296 | 0.008294 | 784,119
ALFWorld | Intrinsic-LLM | 0.045002 | 0.000350 | 882,158
FEVER | ChatDev | 0.617500 | 0.016583 | 80,656
FEVER | Empty | 0.646667 | 0.015706 | 55,539
FEVER | G-Memory | 0.628792 | 0.027763 | 267,823
FEVER | Generative | 0.651250 | 0.011087 | 117,278
FEVER | MemoryBank | 0.632839 | 0.010771 | 65,997
FEVER | MetaGPT | 0.667000 | 0.028853 | 48,998
FEVER | Voyager | 0.643000 | 0.025884 | 62,555
FEVER | Intrinsic-Generic | 0.653500 | 0.005910 | 88,274
FEVER | Intrinsic-LLM | 0.649391 | 0.004576 | 86,849
PDDL | ChatDev | 0.222746 | 0.021114 | 70,765
PDDL | Empty | 0.224329 | 0.019766 | 52,075
PDDL | G-Memory | 0.152222 | 0.039180 | 162,712
PDDL | Generative | 0.164944 | 0.021977 | 84,113
PDDL | MemoryBank | 0.158083 | 0.013697 | 66,299
PDDL | MetaGPT | 0.197106 | 0.018546 | 92,436
PDDL | Voyager | 0.191088 | 0.031562 | 68,483
PDDL | Intrinsic-Generic | 0.260255 | 0.022382 | 352,301
PDDL | Intrinsic-LLM | 0.253986 | 0.032846 | 359,302

References

[dettmers2023qlora] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.

[jiang2023mistral] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.

[huang2025-hallucination-survey] Huang, Lei, Yu, Weijiang, Ma, Weitao, Zhong, Weihong, Feng, Zhangyin, Wang, Haotian, Chen, Qianglong, Peng, Weihua, Feng, Xiaocheng, Qin, Bing, Liu, Ting. (2025). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst.. doi:10.1145/3703155.

[lewis2020retrieval] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, Küttler, Heinrich, Lewis, Mike, Yih, Wen-tau, Rocktäschel, Tim, Riedel, Sebastian, Kiela, Douwe. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

[karpukhin2020dense] Karpukhin, Vladimir, Oğuz, Barlas, Min, Sewon, Lewis, Patrick, Wu, Ledell, Edunov, Sergey, Chen, Danqi, Yih, Wen-tau. (2020). Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.

[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[chen-etal-2017-reading] Chen, Danqi, Fisch, Adam, Weston, Jason, Bordes, Antoine. (2017). Reading Wikipedia to Answer Open-Domain Questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/P17-1171.

[yang2018hotpotqa] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.

[joshi2017triviaqa] Joshi, Mandar, Choi, Eunsol, Weld, Daniel, Zettlemoyer, Luke. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[alberti2019naturalquestions] Chris Alberti, Kenton Lee, Michael Collins. (2019). A BERT Baseline for the Natural Questions.

[brill1992simple] Brill, Eric. (1992). A simple rule-based part of speech tagger. Proceedings of ANLP.

[fernandez2007application] Fernández, Santiago, Graves, Alex, Schmidhuber, Jürgen. (2007). An application of recurrent neural networks to discriminative keyword spotting. ICANN.

[ciampaglia2015computational] Ciampaglia, Giovanni Luca, Shiralkar, Prashant, Rocha, Luis M, Bollen, Johan, Menczer, Filippo, Flammini, Alessandro. (2015). Computational fact checking from knowledge networks. PloS one.

[lewis2020bart] Lewis, Mike, Liu, Yinhan, Goyal, Naman, Ghazvininejad, Marjan, Mohamed, Abdelrahman, Levy, Omer, Stoyanov, Veselin, Zettlemoyer, Luke. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

[chalkidis2020legal] Chalkidis, Ilias, Fergadiotis, Manos, Malakasiotis, Prodromos, Aletras, Nikolaos, Androutsopoulos, Ion. (2020). LEGAL-BERT: The Muppets straight out of Law School. Findings of the Association for Computational Linguistics: EMNLP 2020.

[ye2024boosting] Ye, Linhao, Lei, Zhikai, Yin, Jianghao, Chen, Qin, Zhou, Jie, He, Liang. (2024). Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[jeong2024adaptive] Jeong, Soyeong, Baek, Jinheon, Cho, Sukmin, Hwang, Sung Ju, Park, Jong C. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[asai2024selfrag] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. The 12th International Conference on Learning Representations.

[lin2024radit] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Richard James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih. (2024). RA-DIT: Retrieval-Augmented Dual Instruction Tuning. The 12th International Conference on Learning Representations.

[shi2024replug] Shi, Weijia, Min, Sewon, Yasunaga, Michihiro, Seo, Minjoon, James, Richard, Lewis, Mike, Zettlemoyer, Luke, Yih, Wen-tau. (2024). REPLUG: Retrieval-Augmented Black-Box Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[glass2022re2g] Glass, Michael, Rossiello, Gaetano, Chowdhury, Md Faisal Mahbub, Naik, Ankita, Cai, Pengshan, Gliozzo, Alfio. (2022). Re2G: Retrieve, Rerank, Generate. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[tufano2024autodev] Michele Tufano, Anisha Agarwal, Jinu Jang, Roshanak Zilouchian Moghaddam, Neel Sundaresan. (2024). AutoDev: Automated AI-Driven Development.

[li2024autoflow] Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang. (2024). AutoFlow: Automated Workflow Generation for Large Language Model Agents.

[xue2024comfybench] Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, Lei Bai. (2024). ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems.

[fan2024workflowllm] Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun. (2024). WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models.

[wen2024llm-aws] Jinfeng Wen, Zhenpeng Chen, Federica Sarro, Zixi Zhu, Yi Liu, Haodi Ping, Shangguang Wang. (2024). LLM-Based Misconfiguration Detection for AWS Serverless Computing.

[zhou2024llm-data-management] Xuanhe Zhou, Xinyang Zhao, Guoliang Li. (2024). LLM-Enhanced Data Management.

[zhang2024chainbuddy] Jingyue Zhang, Ian Arawjo. (2024). ChainBuddy: An AI Agent System for Generating LLM Pipelines.

[mistral2023mistral7b] Mistral AI. (2023). Mistral-7B-Instruct-v0.2.

[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, Rozière, Baptiste, Goyal, Naman, Hambro, Eric, Azhar, Faisal, Rodriguez, Aurelien, Joulin, Armand, Grave, Edouard, Lample, Guillaume. (2023). LLaMA: Open and Efficient Foundation Language Models.

[google2024gemma] Google. (2024). Gemma: Lightweight, State-of-the-Art Open Models.

[qwen2024qwen] Alibaba Cloud. (2024). Qwen-1.8B-Chat.

[openai2023chatgpt] OpenAI. (2023). GPT-3.5 Turbo.

[wu2023autogen] Wu, Qingyun, Bansal, Gagan, Zhang, Jieyu, Wu, Yiran, Li, Beibin, Zhu, Erkang, Jiang, Li, Zhang, Xiaoyun, Zhang, Shaokun, Liu, Jiale, Awadallah, Ahmed Hassan, White, Ryen W, Burger, Doug, Wang, Chi. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.

[cobbe2021training] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Łukasz, Miller, John, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, Hesse, Christopher, Schulman, John, Witteveen, Tim, Amodei, Dario. (2021). Training Verifiers to Solve Math Word Problems.

[amini2019mathqa] Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, Hannaneh Hajishirzi. (2019). MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms.

[mihaylov2018openbookqa] Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

[rajpurkar2018know] Pranav Rajpurkar, Robin Jia, Percy Liang. (2018). Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822.

[hu2021lora] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models.

[sun2020testtimetraining] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, Moritz Hardt. (2020). Test-Time Training with Self-Supervision for Generalization under Distribution Shifts.

[shazeer2017mixtureofexperts] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.

[zhang2024surveymemoryllm] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, Ji-Rong Wen. (2024). A Survey on the Memory Mechanism of Large Language Model based Agents.

[guo2024llm-multiagents-survey] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges.

[Li2024-multi-agent-survey] Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, Yi Yang. (2024). A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth. doi:10.1007/s44336-024-00009-2.

[khan2024-rag-infrastructure] Ayman Asad Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson. (2024). Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report.

[gao2024memorysharingllm] Hang Gao, Yongfeng Zhang. (2024). Memory Sharing for Large Language Model based Agents.

[cobbe2021gsm8k] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[OpenBookQA2018] Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. EMNLP.

[rajpurkar2018squadv2] Pranav Rajpurkar, Robin Jia, Percy Liang. (2018). Know What You Don't Know: Unanswerable Questions for SQuAD.

[park2023generativeagents] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior.

[qian2025scalingmulti-agent] Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun. (2025). Scaling Large Language Model-based Multi-Agent Collaboration.

[hong2024metagpt] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework.

[su2025scientific-idea-multi-agent] Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, Nanqing Dong. (2025). Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System.

[gao2024rag-survey] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey.

[chen2023agentverse] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou. (2023). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors.

[huang2024-planning-agents] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, Enhong Chen. (2024). Understanding the planning of LLM agents: A survey.

[edge2024-graphrag] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, Jonathan Larson. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization.

[yu2024-autorag] Tian Yu, Shaolei Zhang, Yang Feng. (2024). Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models.

[openai2024-gpt4ocard] OpenAI. (2024). GPT-4o System Card.

[anthropic-claude35] Anthropic. (2024). Claude 3.5 Sonnet.

[comanici2025-gemini25] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

[li2024-longcontextllmsstruggle] Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen. (2024). Long-context LLMs Struggle with Long In-context Learning.

[wang2025-recursivelysummarizing] Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, Liang Ding. (2025). Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models.

[xu-2022-beyond-goldfish-memory] Jing Xu, Arthur Szlam, Jason Weston. (2022). Beyond Goldfish Memory: Long-Term Open-Domain Conversation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.356.

[wang2025-mirix] Yu Wang, Xi Chen. (2025). MIRIX: Multi-Agent Memory System for LLM-Based Agents.

[xu2025-a-mem] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.

[mem0] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413.

[thorne-etal-2018-fever] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal. (2018). FEVER: a Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). doi:10.18653/v1/N18-1074.

[ma2024-agentboard] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He. (2024). AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents.

[ALFWorld20] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. Proceedings of the International Conference on Learning Representations (ICLR).

[zhang2025-gmemory] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan. (2025). G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems.

[zheng2023-llmjudge] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

[rasmussen2025-zep] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory.

[vaswani2017-attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (2017). Attention Is All You Need.

[packer2024-memgpt] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez. (2024). MemGPT: Towards LLMs as Operating Systems.

[chen2024-compressivememory] Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, Jia Li. (2024). Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations.

[Khan:2024] Ayman Asad Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson. (2024). Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report.

[bingreeky_gmemory_2025] bingreeky. (2025). GMemory.

[jain2024livecodebench] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974.

[li2024agentsneed] Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye. (2024). More Agents Is All You Need.

[he2025llmbasedmultiagentsystemssoftware] Junda He, Christoph Treude, David Lo. (2025). LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead.

[bib2] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. The Twelfth International Conference on Learning Representations.

[bib10] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33: 9459–9474.
