
MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, Shuicheng Yan


OPPO AI Agent Team, LV-NUS lab

Self-evolving memory systems are reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the static nature of the memory system itself: while memory facilitates agent-level evolution, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

Date: December 23, 2025

Code: https://github.com/bingreeky/MemEvolve


Figure 1 The comparison between MemEvolve and several popular self-evolving agent memory systems across benchmarks. The underlying framework is Flash-Searcher (Qin et al., 2025)+GPT-5-Mini.


Figure 2 The paradigm of agent self-evolution admits a natural analogy to human learning. At one extreme, a mediocre learner fails to benefit from experience (agents without memory). More capable skillful learners can extract reusable skills from past experience, albeit through a fixed and pre-defined abstraction scheme. In contrast, an adaptive learner simultaneously accumulates experience and dynamically adjusts the strategy by which experience is consolidated and utilized. This final regime precisely characterizes the objective of MemEvolve .


Introduction

Language agents and agent systems, empowered by increasingly capable foundation models (Team et al., 2025a,b) and sophisticated scaffolding (Wang et al., 2024a; LangChain, 2023), have advanced rapidly, demonstrating unprecedented performance across complex tasks such as deep research (Chen et al., 2025), scientific discovery (Bai et al., 2025; Wei et al., 2025b), and industrial report generation (Zhang et al., 2025g). A key driving force behind this success is the agent memory system (Zhang et al., 2024b; Hu et al., 2025c), which persistently captures interactions between the agent and environment, distills them into diverse forms of knowledge and skills, and thereby enables large language model (LLM)-based agents to evolve continuously in task solving and world exploration (Wu et al., 2025c).

Naturally, the choice of memory paradigm plays a decisive role in shaping an agent's capacity for on-the-fly self-evolution. Initial designs centered on raw trajectory storage and few-shot prompting (Zhong et al., 2024; Wen et al., 2024), which were later superseded by more abstracted textual artifacts such as tips, shortcuts, and reasoning templates (Ouyang et al., 2025; Zhang et al., 2025b; Ye et al., 2025; Tang et al., 2025). Recent advances have also explored structured tool interfaces (e.g., APIs (Zheng et al., 2025), MCPs (Qiu et al., 2025b,a; Zhang et al., 2025h)) and code-level repositories (Zhang et al., 2025e; Wang et al., 2025a) as memory carriers. Amid this growing diversity, an inquisitive practitioner might ask: What kind of memory architecture most effectively drives agent self-improvement?

We posit that no universally optimal memory architecture exists. For instance, a memory system that distills reusable APIs from past trajectories may excel in tasks such as web browsing, yet offer limited utility for mathematical and scientific reasoning. Conversely, memories predicated on self-critique, while powerful in reasoning-intensive domains (Cai et al., 2025), show diminished efficacy in coding and tool-use scenarios, as empirically discussed in (Zhang et al., 2025d). We contend that these trade-offs arise from the static nature of current memory systems. Researchers typically design a fixed memory pipeline (i.e., memory ingestion/abstraction/retrieval (Zhang et al., 2025i)) and embed it within an agent, assuming it will sustain long-term evolution through mere exposure to new experiences. Yet this overlooks a crucial reality: distinct tasks are coupled with distinct memory affordances. A memory system that cannot adapt itself to the task at hand is fundamentally misaligned with the very premise of open-ended agent evolution.

To elucidate this dilemma, consider the analogy of human learning. Both high- and low-performing students inevitably make mistakes, yet their distinction lies in the meta-cognitive strategies they employ to learn from these errors. An underperforming student might resort to rote memorization, superficially recording an error without genuine comprehension (Zhong et al., 2024; Orhan, 2023). In contrast, a more skillful student engages in higher-order learning: they not only record errors but also distill transferable insights through reflection (Shinn et al., 2023; Zhao et al., 2024) or derive reusable schemas (Zheng et al., 2025; Qiu et al., 2025b). Current memory systems effectively model a skillful learner. Herein lies the critical gap: the most effective human learners are not merely skillful, but adaptive. They dynamically alter their learning strategies based on the subject, for instance, prioritizing memorization for literary analysis while abstracting solution templates for mathematics. It is precisely this transition, from a skillful to an adaptive learner (as shown in Figure 2), that we argue agent memory systems must undergo. To put it more formally:

How can a memory system not only facilitate the agent system's evolution but also meta-evolve its own architecture to achieve superior task-domain performance gains while preserving generalizability?

To address this challenge, we introduce MemEvolve, a framework that facilitates the dual evolution of an agent's experience and its memory architecture. Conceptually, MemEvolve operates as a bilevel optimization process: the inner loop performs a first-order evolution, where the agent, guided by a fixed memory system, adapts to a continuous stream of new tasks by populating its experience base. The outer loop drives a second-order evolution, meta-learning a more effective memory architecture to accelerate future learning. This allows the agent not only to evolve, but to evolve more efficiently and intelligently over time.

However, the vast and heterogeneous design space of memory systems (e.g., knowledge graphs, skill libraries, vector databases) presents a significant challenge to controllable optimization. To render this optimization tractable, we introduce a modular design, decomposing any memory architecture into four key components: ♣ Encode (perceiving and formatting experiences), ♦ Store (committing information), ♥ Retrieve (context-aware recall), and ♠ Manage (consolidation and forgetting). MemEvolve evolves the programmatic implementations of these modules in a model-driven fashion, using feedback from the agent's performance in the inner loop. This process establishes a virtuous cycle: an improved memory architecture from the outer loop enhances the agent's learning efficiency. In turn, a more capable agent generates higher-quality trajectories, providing a more precise fitness signal for the outer loop to drive the next round of architectural evolution.

To ground our framework within the diverse landscape of existing self-improving agent memories, we systematically re-implement twelve representative architectures in a unified modular design space, including ExpeL (Zhao et al., 2024), Agent Workflow Memory (Wang et al., 2024b), and Dynamic Cheatsheet (Suzgun et al., 2025). The resulting framework, denoted as EvolveLab , serves both as an empirical foundation for MemEvolve 's evolutionary process and as a standardized codebase to facilitate future research on self-evolving agents. Our contributions are as follows:

  • ❶ Unified Codebase: We introduce EvolveLab, a modular design space for self-improving agent memory systems encompassing four key components (encoding, storage, retrieval, and management), providing unified implementations and benchmark support for a wide range of prevailing agent memory systems.
  • ❷ Meta-Evolution Framework: We propose MemEvolve, a meta-evolutionary framework that jointly evolves both agents' experiential knowledge and their underlying memory architecture, in which agent systems not only accumulate experience but also progressively refine their mechanism for learning from it.
  • ❸ Experimental Evaluation: Extensive experiments on four challenging agentic benchmarks demonstrate that MemEvolve delivers (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) cross-domain, cross-framework, and cross-LLM generalization, where memory systems evolved on TaskCraft yield 2.0-9.09% gains on unseen benchmarks and backbone models.

Related Work

LLM Agent Systems. The past two years have witnessed rapid advances in LLM-based agent systems across multiple dimensions (Tran et al., 2025; Fang et al., 2025a). In terms of system complexity, development has progressed from early single-agent setups with manually defined workflows and limited tool configurations (Wu et al., 2023; Significant-Gravitas, 2023) to sophisticated multi-agent architectures featuring diverse MCP integrations and automated orchestration (Zhang et al., 2024a, 2025a; Wang et al., 2025b; Zhang et al., 2025c). From the perspective of task domains, capabilities have expanded from relatively constrained areas such as coding and mathematical reasoning (Hong et al., 2024; Yin et al., 2023) to more challenging domains, including deep research and scientific discovery (Du et al., 2025; Ghareeb et al., 2025). Today, numerous open-source multi-agent systems demonstrate competitive performance on demanding benchmarks such as GAIA (Mialon et al., 2023), HLE (Phan et al., 2025), BrowseComp (Wei et al., 2025a), and xBench (Chen et al., 2025), including CAMEL's OWL (Hu et al., 2025a), Tencent's CK-Pro (Fang et al., 2025c), Skywork's AgentOrchestra (Zhang et al., 2025f), and ByteDance's AIME (Shi et al., 2025b), among others.

Agent Memory Architectures. Agent memory systems can be broadly divided by objective into personalized memory and self-improving memory (Zhang et al., 2024b; Hu et al., 2025c). The former enables agent chatbots to dynamically capture user-specific information and preferences, while the latter focuses on distilling knowledge and skills from continual interactions with the environment to enhance performance, the focus adopted in this work. Self-improving memories are primarily differentiated by their storage modality. Early systems stored raw agent trajectories as few-shot examples (Wang et al., 2023; Zhong et al., 2024; Packer et al., 2023); subsequent designs abstracted these experiences into higher-level lessons, insights (Yang et al., 2025; Sun and Zeng, 2025; Wu et al., 2025b), procedural tips (Wang et al., 2025c; Zheng et al., 2025; Fang et al., 2025b), and more recently, reusable tools and structured repositories (Zhao et al., 2025; Qiu et al., 2025a,b; Zhang et al., 2025e). Despite their differences in representation, these approaches share the same ambition, i.e., to enable agents to learn, adapt, and improve in a human-like manner.

EvolveLab: A Unified Codebase for Self-Evolving Memory

In this section, we first formalize the LLM-based agentic system and its associated memory architecture, then present the modular design space of EvolveLab , which comprehensively captures the characteristics of existing self-evolving agent memories, and finally introduce the unified codebase EvolveLab .

Preliminary

We formalize an LLM-based agentic system as $\mathcal{M} = \langle \mathcal{I}, \mathcal{S}, \mathcal{A}, \Psi, \Omega \rangle$, where $\mathcal{I} = \{1, \cdots, N\}$ indexes the agents, $\mathcal{S}$ denotes the shared state space, $\mathcal{A} = \bigcup_{i \in \mathcal{I}} \mathcal{A}_i$ represents the joint action space, and $\Psi(s_{t+1} \mid s_t, a_t, \mu(t))$ describes the environment dynamics, with $\mu(t) \in \mathcal{I}$ indicating the active agent at time step $t$. The system leverages a memory module $\Omega$, which maintains a continuously evolving memory state $\mathcal{M}_t$. At each step, the active agent observes the current state $s_t$, considers a task-specific query $\mathcal{Q}$, and interacts with $\Omega$ to retrieve contextually relevant memory $c_t$, conditioned on its interaction history $\mathcal{H}_t$. The agent $\mu(t)$'s policy $\pi_{\mu(t)}$ then delivers an action:

$$
a_t \sim \pi_{\mu(t)}\big(a \mid s_t, \mathcal{Q}, c_t\big), \qquad c_t = \Omega\big(\mathcal{M}_t, s_t, \mathcal{Q}, \mathcal{H}_t\big).
$$

Following task execution, a trajectory $\tau = (s_0, a_0, \ldots, s_T)$ is recorded, with overall performance evaluated via a terminal reward $R(\tau)$. The memory system assimilates new experience units $\epsilon$, which can vary in granularity (from individual state-action transitions to aggregated segments or complete trajectories), and updates the memory state as

$$
\mathcal{M}_{t+1} = \Omega\big(\mathcal{M}_t, \epsilon\big),
$$

where Ω abstracts the memory's mechanisms for integrating and organizing new experiences or knowledge.
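The interaction loop above can be sketched in a few lines of toy code. Everything here (the `Memory` class, keyword-match retrieval, the `run_episode` helper) is an illustrative assumption, not the paper's implementation; it only mirrors the formal loop: retrieve $c_t$, act, record $\tau$, then update $\mathcal{M}_t$.

```python
# Toy sketch (assumed names) of the agent-memory interaction loop:
# retrieve context c_t, act via the policy, then assimilate the
# recorded trajectory back into memory via Omega.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Toy memory state M_t: a growing list of experience units."""
    units: list = field(default_factory=list)

    def retrieve(self, state, query):
        # c_t: naive keyword match against stored experience units
        return [u for u in self.units if query in u]

    def update(self, experience):
        # M_{t+1} = Omega(M_t, eps): here, plain accumulation
        self.units.append(experience)

def run_episode(memory, query, policy, env_step, horizon=3):
    """Roll out one trajectory tau, then write it back to memory."""
    state, trajectory = "s0", []
    for _ in range(horizon):
        context = memory.retrieve(state, query)   # c_t
        action = policy(state, query, context)    # a_t ~ pi
        trajectory.append((state, action))
        state = env_step(state, action)           # s_{t+1} ~ Psi
    memory.update(f"{query}: {trajectory}")       # assimilate eps
    return trajectory

mem = Memory()
traj = run_episode(mem, "q1", lambda s, q, c: f"act({s})", lambda s, a: s + "'")
```

The point of the sketch is the separation of concerns: the policy only sees the retrieved context, while all persistence goes through the memory module.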

Modular Design Space of Memory Systems

The heterogeneous and rapidly evolving landscape of self-improving agent memories presents challenges for systematic analysis and controlled experimentation. To address this, we propose a modular design space that decomposes any memory system $\Omega$ into four functionally distinct yet interdependent components, $\Omega = (\mathcal{E}, \mathcal{U}, \mathcal{R}, \mathcal{G})$, representing the encode, store, retrieve, and manage operations, respectively.

· Encode ($\mathcal{E}$): Transforms raw experiences, such as trajectory segments $\tau_t = (s_t, a_t, s_{t+1})$, tool outputs, or self-critiques, into structured representations $e_t = \mathcal{E}(\epsilon_t)$. Encoding may be as simple as compressing raw traces (Zheng et al., 2023) or as sophisticated as extracting generalizable lessons (Zheng et al., 2025).

· Store ($\mathcal{U}$): Integrates encoded experiences into the persistent memory $\mathcal{M}_t$, yielding $\mathcal{M}_{t+1} = \mathcal{U}(\mathcal{M}_t, e_t)$. Storage can take the form of vector databases (Zhao et al., 2024), knowledge graphs (Zhang et al., 2025b; Rasmussen et al., 2025), or other structures.

· Retrieve ($\mathcal{R}$): Provides task-relevant memory content, formalized as $c_t = \mathcal{R}(\mathcal{M}_t, s_t, \mathcal{Q})$, which informs the agent's policy decision $a_t$. Retrieved content may include reusable tools (Zhang et al., 2025f), planning experience (Tang et al., 2025), or distilled procedural knowledge (Wu et al., 2025b; Yang et al., 2025; Fang et al., 2025b).

· Manage ($\mathcal{G}$): Performs offline and asynchronous operations such as consolidation, abstraction, or selective forgetting to maintain long-term memory quality and efficiency, denoted $\mathcal{M}'_t = \mathcal{G}(\mathcal{M}_t)$.

This modular abstraction allows us to represent each memory system as a specific combination of programmatic implementations for $(\mathcal{E}, \mathcal{U}, \mathcal{R}, \mathcal{G})$, forming a 'genotype' that facilitates the meta-evolutionary process of MemEvolve.

Table 1 A taxonomy of self-improving agent memory systems implemented in EvolveLab. The 'Mul.' column indicates whether a system supports only single-agent settings or is also compatible with multi-agent systems. 'Gran.' specifies the granularity at which memory is provided (step-wise vs. trajectory-wise), and 'Online' indicates whether memory is updated on-the-fly or maintained as an offline experience repository.

EvolveLab Codebase

Based on the above design space, we introduce EvolveLab , a unified and extensible codebase designed for the systematic implementation and evaluation of self-evolving memories, serving as a standardized resource for the community.

Implementation. The cornerstone of EvolveLab is its modular and hierarchical design. Every memory architecture re-implemented in our codebase (see Table 1) inherits from a singular abstract base class, BaseMemoryProvider, which enforces the unified four-component interface: ♣ Encode, ♦ Store, ♥ Retrieve, and ♠ Manage. This ensures that diverse memory mechanisms can be managed, modified, and evolved under a consistent programmatic structure. More details on the implementations can be found at Section A.
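A minimal sketch of what such a base class might look like. The name BaseMemoryProvider and the four component names come from the text above; the method signatures and the trivial ListMemory subclass are illustrative assumptions, not the actual EvolveLab API.

```python
# Sketch of the unified four-component interface: every memory system
# subclasses one abstract base and implements encode/store/retrieve/manage.
from abc import ABC, abstractmethod

class BaseMemoryProvider(ABC):
    @abstractmethod
    def encode(self, experience):      # E: raw experience -> e_t
        ...
    @abstractmethod
    def store(self, encoded):          # U: commit e_t into M_t
        ...
    @abstractmethod
    def retrieve(self, state, query):  # R: context-aware recall c_t
        ...
    @abstractmethod
    def manage(self):                  # G: consolidation / forgetting
        ...

class ListMemory(BaseMemoryProvider):
    """A deliberately trivial realization of the interface."""
    def __init__(self):
        self.items = []
    def encode(self, experience):
        return str(experience).strip()
    def store(self, encoded):
        self.items.append(encoded)
    def retrieve(self, state, query):
        return [m for m in self.items if query in m]
    def manage(self):
        # Dedupe while preserving insertion order
        self.items = list(dict.fromkeys(self.items))

mem = ListMemory()
mem.store(mem.encode("use the search tool before browsing"))
mem.store(mem.encode("use the search tool before browsing"))
mem.manage()
```

Because every architecture conforms to the same interface, the meta-evolution operator can swap out any one component's implementation without touching the agent's code.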

Evaluation. Beyond unified implementation, EvolveLab provides a standardized testbed for rigorously assessing memory architectures across diverse agentic tasks. The framework offers out-of-the-box support for multiple challenging benchmarks, including GAIA (Mialon et al., 2023), xBench (Chen et al., 2025), and DeepResearchBench (Du et al., 2025). EvolveLab accommodates two evaluation paradigms: an ■ online mode, where the experiential memory base is updated on-the-fly as the agent system processes a continuous stream of tasks, and an ■ offline mode, where the memory system first accumulates experience from a static set of trajectories before being assessed on separate, unseen tasks. To ensure robust and versatile assessment, we support multiple evaluation protocols, including exact string matching and flexible LLM-as-a-Judge.
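The two paradigms can be contrasted in a small sketch; the `solve`/`update` callables and the set-based toy memory are hypothetical stand-ins for an agent run and a memory write.

```python
# Toy contrast of the two evaluation paradigms: online mode interleaves
# solving and memory updates over a task stream; offline mode freezes the
# memory after an accumulation phase, then scores held-out tasks.

def evaluate_online(memory, tasks, solve, update):
    scores = []
    for task in tasks:
        scores.append(solve(task, memory))  # answer with current memory
        update(memory, task)                # then learn from this task
    return sum(scores) / len(scores)

def evaluate_offline(memory, train_tasks, test_tasks, solve, update):
    for task in train_tasks:                # accumulation phase only
        update(memory, task)
    return sum(solve(t, memory) for t in test_tasks) / len(test_tasks)

# Toy agent: "solves" a task iff an identical task is already in memory.
solve = lambda task, mem: 1.0 if task in mem else 0.0
update = lambda mem, task: mem.add(task)

online = evaluate_online(set(), ["a", "b", "a", "b"], solve, update)
offline = evaluate_offline(set(), ["a", "b"], ["a", "c"], solve, update)
```

In the online run the repeated tasks are solved only on their second occurrence, which is exactly the on-the-fly improvement the online mode is designed to measure.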

MemEvolve: A Meta-Evolving Memory Framework

Dual-Evolution Process

Traditional self-improving memory systems operate under a fixed memory architecture , where the memory interface Ω is predefined and remains static. Within this architecture, the agent iteratively populates and updates its memory state M t through interaction with the environment and task experiences. For a trajectory τ induced by a query Q , the memory evolution follows

$$
\mathcal{M}_{t+1} = \Omega\big(\mathcal{M}_t, \epsilon_\tau\big), \qquad \epsilon_\tau \sim E(\tau),
$$

where E(⋅) denotes an experience extraction operator that maps a trajectory to a set of experience units, and ϵ τ is an element sampled from this set. While this enables the accumulation of knowledge, it fundamentally precludes architectural adaptation, as the memory interface Ω itself remains immutable.

To transcend this limitation, we propose a dual-evolution process that jointly evolves (i) the agent's memory base and (ii) the underlying memory architecture (as illustrated in Figure 3). Instead of a single static $\Omega$, we maintain, at each evolutionary iteration $k$, a finite set of candidate memory systems $\{\Omega^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}$, where each $\Omega^{(k)}_j$ is instantiated as a concrete realization of the four-component memory interface, $\Omega^{(k)}_j \triangleq (\mathcal{E}^{(k)}_j, \mathcal{U}^{(k)}_j, \mathcal{R}^{(k)}_j, \mathcal{G}^{(k)}_j)$. The initial iteration starts from a singleton set, $|\mathcal{J}^{(0)}| = 1$, corresponding to a hand-designed baseline memory, while later iterations admit

Figure 3 The overview of our proposed MemEvolve .


multiple competing candidates. Given a batch of trajectories $\mathcal{T}^{(k)}_j$ independently generated by executing the agent with memory system $\Omega^{(k)}_j$, the dual-evolution process consists of two nested loops:

· Inner Loop (Experience Evolution). For each candidate memory system $\Omega^{(k)}_j$, the associated memory state $\mathcal{M}^{(k)}_{t,j}$, initialized as an empty memory at the beginning of iteration $k$, is updated along trajectories $\tau \in \mathcal{T}^{(k)}_j$ via

$$
\mathcal{M}^{(k)}_{t+1,j} = \Omega^{(k)}_j\big(\mathcal{M}^{(k)}_{t,j}, \epsilon_\tau\big), \qquad \epsilon_\tau \sim E(\tau), \; \tau \in \mathcal{T}^{(k)}_j.
$$

Executing the agent with $\Omega^{(k)}_j$ over $\mathcal{T}^{(k)}_j$ yields, for each trajectory $\tau$, a feedback vector $f_j(\tau) \in \mathbb{R}^d$, where $d = 3$ corresponds to the number of evaluation metrics (i.e., task success, token consumption, and latency). An aggregation operator $\mathcal{S}$ summarizes the inner-loop outcomes for each candidate as

$$
F^{(k)}_j = \mathcal{S}\big(\{\, f_j(\tau) : \tau \in \mathcal{T}^{(k)}_j \,\}\big).
$$

· Outer Loop (Architectural Evolution). The set of memory architectures is then updated based on the collection of summaries $\{F^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}$. A meta-evolution operator $\mathcal{F}$ selects high-performing candidates and proposes new variants, producing the next iteration's candidate set:

$$
\{\Omega^{(k+1)}_j\}_{j \in \mathcal{J}^{(k+1)}} = \mathcal{F}\big(\{\Omega^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}, \{F^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}\big).
$$

Specifically, $\mathcal{F}$ ranks candidates according to $F^{(k)}_j$, retains the top-$K$ memory systems, where $K$ denotes a fixed survivor budget, and generates new architectures by modifying or recombining all four components $(\mathcal{E}, \mathcal{U}, \mathcal{R}, \mathcal{G})$ of the selected candidates. We detail the implementation of $\mathcal{F}(\cdot)$ in Section 4.2.

Unified view. At a higher level, each iteration k alternates between (i) evolving the memory experience base from an empty initialization under a fixed set of architectures, and (ii) evolving the memory architectures themselves based on the induced performance:

$$
\{\mathcal{M}^{(k)}_j, F^{(k)}_j\} \xleftarrow{\;\text{inner}\;} \{\Omega^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}, \qquad \{\Omega^{(k+1)}_j\}_{j \in \mathcal{J}^{(k+1)}} \xleftarrow{\;\text{outer}\;} \mathcal{F}\big(\{\Omega^{(k)}_j\}, \{F^{(k)}_j\}\big).
$$

By iterating this dual-evolution process, the agent does not merely accumulate experience within a fixed memory system; instead, both the memory base and the governing memory architectures co-evolve, yielding increasingly adaptive and resource-aware memory-driven behavior over time.
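Under stated assumptions (a toy three-dimensional fitness function and naive mutation-by-copy in place of LLM-driven redesign), the nested loops can be sketched as:

```python
# Hedged sketch of the dual-evolution loop: the inner loop scores each
# candidate architecture on a task batch (feedback vector: success, -tokens,
# -latency); the outer loop keeps the top-K candidates and proposes variants.
import random

def inner_loop(architecture, tasks):
    """Return the aggregated summary F_j: mean feedback over the batch."""
    feedback = [architecture["fitness"](t) for t in tasks]  # f_j(tau), d=3
    n = len(feedback)
    return tuple(sum(dim) / n for dim in zip(*feedback))

def outer_loop(candidates, tasks, survivors=1, descendants=3):
    """One meta-evolution step: rank by task success, retain top-K, mutate."""
    summaries = [(inner_loop(c, tasks), c) for c in candidates]
    summaries.sort(key=lambda x: x[0][0], reverse=True)  # rank by Perf
    parents = [c for _, c in summaries[:survivors]]
    next_gen = []
    for p in parents:
        for s in range(descendants):
            # Placeholder for diagnose-and-design: here, a renamed copy
            next_gen.append(dict(p, name=f"{p['name']}.{s}"))
    return next_gen

rng = random.Random(0)
base = {"name": "omega0",
        "fitness": lambda t: (rng.random(), -100.0, -2.0)}  # (Perf, -Tok, -Lat)
gen1 = outer_loop([base], tasks=range(5))
```

With `survivors=1` and `descendants=3`, this mirrors the paper's configuration of one retained architecture expanded to three descendants per iteration.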

Diagnose-and-Design Evolution

We now detail the meta-evolution operator F , which governs the architectural update in each evolutionary iteration. Conceptually, F decomposes into two coordinated components: (i) architectural selection, which identifies a subset of high-performing memory systems to serve as evolutionary parents, and (ii) diagnose-and-design evolution, which generates new memory architectures from each selected parent through a structured diagnosis procedure followed by a constrained redesign within the modular memory design space.

Architectural Selection. Given the candidate set $\{\Omega^{(k)}_j\}_{j \in \mathcal{J}^{(k)}}$ and their corresponding summaries $\{F^{(k)}_j\}$, we define each summary vector as

$$
F^{(k)}_j = \big(\mathrm{Perf}^{(k)}_j,\; -\mathrm{Token}^{(k)}_j,\; -\mathrm{Latency}^{(k)}_j\big),
$$

where higher values are preferred in all dimensions. Candidates are first ranked by non-dominated sorting over $F^{(k)}_j$, yielding a Pareto rank $\rho^{(k)}_j$. Within the same Pareto rank, candidates are further ordered by the primary performance metric $\mathrm{Perf}^{(k)}_j$. The top-$K$ candidates are selected as the parent set:

$$
\mathcal{P}^{(k)} = \operatorname*{Top\text{-}K}_{j \in \mathcal{J}^{(k)}} \big(\Omega^{(k)}_j \;;\; \rho^{(k)}_j \uparrow,\; \mathrm{Perf}^{(k)}_j \downarrow\big).
$$

This selection step ensures that architectural evolution is guided by systems that exhibit favorable trade-offs between task effectiveness and resource efficiency, while prioritizing task performance among Pareto-equivalent candidates.
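The selection rule can be sketched as follows; the summary tuples and helper names are illustrative assumptions, using maximization in all three dimensions as defined above.

```python
# Sketch of architectural selection: non-dominated (Pareto) sorting over
# (Perf, -Token, -Latency), with ties inside a rank broken by Perf.

def dominates(a, b):
    """a Pareto-dominates b if >= everywhere and > somewhere (maximizing)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_rank(points):
    """Assign rank 0 to the non-dominated front, 1 to the next, and so on."""
    ranks, remaining, rank = {}, set(range(len(points))), 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

def select_parents(summaries, k=1):
    """Order by (Pareto rank asc, Perf desc) and keep the top-K indices."""
    ranks = pareto_rank(summaries)
    order = sorted(range(len(summaries)),
                   key=lambda i: (ranks[i], -summaries[i][0]))
    return order[:k]

# Three candidates: (Perf, -Token, -Latency); higher is better everywhere.
summaries = [(0.60, -900, -12), (0.70, -1500, -20), (0.70, -800, -10)]
parents = select_parents(summaries, k=2)
```

Here candidate 2 dominates both others (equal-or-better in every dimension), so it alone forms the first front; among the rank-1 candidates, candidate 1 wins the tie on Perf.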

Diagnose-and-Design Evolution. For each parent architecture $\Omega^{(k)}_p \in \mathcal{P}^{(k)}$, $\mathcal{F}$ generates a set of $S$ descendants $\{\Omega^{(k+1)}_{p,s}\}_{s=1}^{S}$ through a two-phase process:

· Diagnosis. Each parent architecture is examined using trajectory-level evidence from its own execution batch $\mathcal{T}^{(k)}_p$. For each trajectory, the agent provides outcome statistics (e.g., success indicators, token costs) together with a structured description of the associated task query. A replay interface grants access to the corresponding trajectories $\tau \in \mathcal{T}^{(k)}_p$, enabling targeted inspection of memory behavior, including retrieval failures, ineffective abstractions, or storage inefficiencies. The diagnosis phase thus produces a structured defect profile $\mathcal{D}(\Omega^{(k)}_p)$, characterizing architectural bottlenecks across the four memory components $(\mathcal{E}^{(k)}_p, \mathcal{U}^{(k)}_p, \mathcal{R}^{(k)}_p, \mathcal{G}^{(k)}_p)$.

· Design. Conditioned on the defect profile $\mathcal{D}(\Omega^{(k)}_p)$, a redesigned architecture is constructed by modifying only the permissible implementation sites within the modular interface, thereby ensuring compatibility and isolating architectural changes to the designated design space. The design step produces $S$ variants by instantiating distinct but valid configurations of the four components:

$$
\Omega^{(k+1)}_{p,s} = \big(\mathcal{E}^{(k+1)}_{p,s}, \mathcal{U}^{(k+1)}_{p,s}, \mathcal{R}^{(k+1)}_{p,s}, \mathcal{G}^{(k+1)}_{p,s}\big), \qquad s = 1, \dots, S.
$$

These variants differ in encoding strategies, storage rules, retrieval constraints, or management policies, yet all conform to the unified memory-system interface and remain executable by the agent.

Resulting update. Aggregating all descendants across parents yields the next set of candidate architectures:

$$
\{\Omega^{(k+1)}_j\}_{j \in \mathcal{J}^{(k+1)}} = \bigcup_{\Omega^{(k)}_p \in \mathcal{P}^{(k)}} \{\Omega^{(k+1)}_{p,s}\}_{s=1}^{S}.
$$

This diagnose-and-design evolution operationalizes $\mathcal{F}$, producing increasingly adaptive memory systems while ensuring that architectural updates are both empirically grounded and structurally constrained within the unified design space.

Experiments

Experiment Setup

Benchmarks. We evaluate the proposed framework across four challenging agentic benchmarks, including GAIA (Mialon et al., 2023), WebWalkerQA (Wu et al., 2025a), xBench-DeepSearch (xBench-DS) (Chen et al., 2025), as well as TaskCraft (Shi et al., 2025a). Further statistics and details are provided in Section B.1.

Table 2 Performance of various agent frameworks on the WebWalkerQA, xBench-DS, TaskCraft, and GAIA benchmarks.

Method Configurations. We run the dual-evolution process for $K_{\max} = 3$ iterations. In the outer loop, the survivor budget is set to $K = 1$; at each iteration, only the top-ranked architecture is retained and expanded to $S = 3$ descendants. In the inner loop, each candidate architecture $\Omega^{(k)}_j$ is evaluated on a batch $\mathcal{T}^{(k)}_j$ of 60 task trajectories, consisting of 40 newly sampled tasks and 20 tasks reused from the previous iteration to stabilize inter-iteration comparison.

Agent Framework. We integrate MemEvolve into two representative agentic frameworks: SmolAgent (Roucher et al., 2025), a lightweight two-agent architecture, and Flash-Searcher (Qin et al., 2025), a high-performance single-agent deep research system. To assess the generalization and plug-and-play capability of MemEvolve , we further evaluate it on two held-out multi-agent systems: Tencent's Cognitive Kernel-Pro (CK-Pro) (Fang et al., 2025c), a three-agent framework comprising main/file/web agents; and OWL (Hu et al., 2025b), a hierarchical system including planner, coordinator, web, document, and coding agents. This diversity in architecture and system complexity enables a comprehensive examination of the adaptability of MemEvolve across heterogeneous agentic scaffolds.

Model Configurations. We instantiate MemEvolve using GPT-5-mini (OpenAI, 2025) as the LLM backbone both for the underlying agentic frameworks and for supporting the meta-evolution operator $\mathcal{F}(\cdot)$. To further evaluate the cross-LLM generalization capability of MemEvolve, we additionally consider alternative backbones, including DeepSeek V3.2 (DeepSeek-AI et al., 2025) and Kimi K2 (Team et al., 2025a). For clarity, we explicitly report the specific LLM backbone used by each agentic framework in the following experiments.

Table 3 Performance, cost, delay, and steps across datasets under different memory settings for Flash-Searcher. Here, cost denotes the average API cost incurred per task query, delay measures the average execution latency (seconds) per task, and #steps reports the number of agent interaction steps required to complete each task.

Main Results

We report the pass@1-3 performance of MemEvolve integrated with SmolAgent and Flash-Searcher in Table 2, together with its generalization results when paired with unseen LLMs (Kimi K2, DeepSeek V3.2). Notably, on the relatively simple TaskCraft benchmark, we evolve two distinct memory systems using MemEvolve with each of the two frameworks. These evolved memory systems are then fixed and evaluated on WebWalkerQA and xBench-DS, i.e., without conducting dataset-specific meta-evolution.

Memory System Matters For Agent Systems. As shown in Table 2, equipping agentic systems with effective memory architectures is critical to performance.

Figure 4 The cross-framework generalization analysis. We transfer the memory system evolved on TaskCraft+ to and . Red percentages denote the relative score gains of each framework after integrating MemEvolve over its memory-free counterpart.

On xBench, +GPT-5-Mini achieves an initial pass@1 of 51%; after integrating MemEvolve, pass@1 increases by 6%, while pass@3 rises to 68.0%. Similarly, +GPT-5-Mini improves from 69% to 74% on xBench when augmented with MemEvolve. These results clearly demonstrate the substantial impact of a well-designed memory system on agent performance. At the same time, memory is not a panacea and remains bounded by the capabilities of the underlying agentic framework. On GAIA, MemEvolve+ attains a pass@3 of 72.12%, comparable to AgentKB, while avoiding the construction of large and costly offline knowledge bases. In contrast, the gains with MemEvolve+ are even more pronounced, achieving a pass@3 of 80.61%, surpassing several strong multi-agent systems such as OWL-Workforce and CK-Pro under the same metric.

MemEvolve Exhibits Cross-Task, Cross-Model, and Cross-Framework Generalization. Recall that the memory systems used on WebWalkerQA and xBench are directly inherited from those evolved on TaskCraft, without any task-specific meta-evolution. Nevertheless, these transferred memories yield consistent gains on more challenging benchmarks (WebWalkerQA+: 58.82 → 61.18%; xBench+: 69.0 → 74.0%), indicating that MemEvolve captures task-agnostic principles of memory design rather than overfitting to individual datasets. MemEvolve also demonstrates strong cross-LLM generalization. Although meta-evolution is conducted using GPT-5-Mini, memory systems evolved on TaskCraft+ transfer effectively to Kimi K2 and DeepSeek V3.2 without manual adaptation. Notably, Kimi K2+ improves by 17.06% on WebWalkerQA and 10.0% on TaskCraft. Finally, MemEvolve exhibits compelling cross-framework generalization. As shown in Figure 4, directly transferring the memory system evolved on TaskCraft+ to heterogeneous agentic frameworks consistently improves performance despite substantial architectural differences. These results demonstrate that MemEvolve learns framework-agnostic memory abstractions that are readily pluggable across diverse agentic systems.

Figure 5 Evolution of cumulative accuracy across question indices. Cumulative accuracy at index i is defined as the average accuracy over the first i questions. The curves exhibit larger fluctuations at early indices due to limited sample size, and gradually stabilize as more questions are accumulated.


Figure 6 Illustration of the progressive evolution from the fixed AgentKB architecture to increasingly agentic and efficient memory architectures. Each stage reflects structural and functional modifications in memory encoding, storing, retrieval, and maintenance, culminating in high-performing systems such as Riva and Cerebra.

Self-Evolving Memory Comparison

We further compare the memory systems automatically evolved by MemEvolve against prevailing human-designed self-improving memory systems. In Table 3, we integrate seven representative self-improving memory systems implemented in EvolveLab with Flash-Searcher, and report performance, per-task cost, execution latency, and execution steps. Results for MemEvolve are obtained using the system evolved on TaskCraft with GPT-5-Mini.

Existing Memory Systems Fail to Deliver Consistent Gains. Despite faithful re-implementations aligned with the original designs, many existing memory systems do not yield stable improvements. For example, DILU improves performance on xBench and WebWalkerQA, yet degrades GAIA by 2.42%. Dynamic Cheatsheet achieves a 1.76% gain on WebWalkerQA via skill condensation, but performs poorly on GAIA and xBench. More extreme cases are also observed: ExpeL underperforms on all three benchmarks. Upon closer inspection, this is unsurprising, as ExpeL was originally designed for relatively simple embodied or QA settings (e.g., ALFWorld, HotpotQA), and its prompts and mechanisms are ill-suited for long-horizon, long-context deep research. These results underscore the necessity of task-aware memory design.

Figure 7 Illustration of how evolved memories are instantiated during real-world tasks from GAIA and xBench. The memory system adaptively provides stage-specific guidance, ranging from high-level planning and task decomposition to fine-grained tool-use suggestions and salient context recall, thereby steering the agent toward efficient and successful task completion.

MemEvolve Delivers Robust and Consistent Improvements. In contrast to prior approaches, MemEvolve yields stable and robust performance gains. Although the underlying memory system is evolved on TaskCraft, it consistently achieves improvements of 3.54%∼5.0% across all three evaluated benchmarks. Importantly, these gains are not achieved by substantially increasing the per-task cost. As shown in Table 3, MemEvolve maintains API costs comparable to the No-Memory baseline across all benchmarks (e.g., GAIA: $0.085 vs. $0.086; xBench: $0.136 vs. $0.141), while its execution delay remains on a similar scale to other self-improving baselines (e.g., GAIA: 693.33 s vs. 584.88 s for AWM and 559.81 s for Cheatsheet; xBench: 773.06 s vs. 761.33 s for AWM and 818.07 s for Cheatsheet). Figure 5 further illustrates the cumulative success rate of different self-evolving memory systems as task execution progresses. Although performance exhibits higher variance in the early stages due to limited sample size, MemEvolve gradually stabilizes and converges to a consistently superior performance regime. This indicates that MemEvolve discovers principled and effective memory designs rather than relying on brittle, task-specific heuristics.

At first glance, such generalization may appear to conflict with our original motivation that memory systems cannot generalize across all domains and therefore require task-specific evolution. We argue this is not the case. Memory systems evolved on TaskCraft are unlikely to transfer effectively to fundamentally different task families (e.g., embodied action), where environments, action spaces, and tool sets differ substantially. Nevertheless, MemEvolve enables the discovery of broadly applicable memory architectures within a shared task regime, while retaining the capacity for further task-specific adaptation when required.

Meta-Evolving Dynamics

Having established the substantial performance gains delivered by MemEvolve, we further examine how meta-evolution is executed in practice and which components are modified or enhanced during the evolutionary process. As illustrated in Figure 6, MemEvolve starts from the predefined structure of AgentKB and iteratively evolves toward increasingly efficient memory architectures. Figures 9 and 10 highlight two high-performing memory systems discovered along this trajectory, denoted Riva and Cerebra. Figure 8 presents a system evolved from the simplest few-shot example memory baseline, referred to as Lightweight.

Agents Spontaneously Evolve Efficient Memory Architectures. As illustrated in Figure 6, the initial AgentKB memory system adopts a frozen design for both encoding and storage, lacking the capability to assimilate new experiences.

Starting from this baseline, MemEvolve explores a spectrum of evolutionary directions. Some candidates are relatively aggressive (e.g., $\Omega_1^{(1)}$, an Adaptive Decision System that decomposes a single agent trajectory into nine skill granularities), while others are more conservative (e.g., $\Omega_3^{(1)}$, a Meta Memory System that stores trajectories at four levels and introduces an LLM-based meta-guardrail during retrieval to filter irrelevant information). The latter emerges as the winner of the first evolutionary round. The defining characteristic of this stage is its agentic nature: both memory encoding and decoding increasingly rely on agent-driven decisions rather than predefined pipelines. The third evolutionary round introduces two further advances. Evolving from $\Omega_3^{(2)}$ (Riva) to $\Omega_1^{(3)}$ (Cerebra), the memory system learns to distill not only textual insights but also reusable tools from past experience, while incorporating periodic maintenance of the memory database. Together, these enhancements provide faster evolutionary momentum for the underlying agentic frameworks.

Evolved Memory Systems Are Effective in Practice. We further present concrete memory examples produced by the Lightweight system during real executions, as shown in Figure 7. The results illustrate that Lightweight delivers memory content at varying levels of granularity, adaptively tailored to different task stages. During early planning, the memory provides high-level guidance, such as task decomposition strategies. As execution proceeds, it offers more fine-grained recommendations for tool-use, along with a form of working memory that highlights salient information from previous turns. Notably, Lightweight also exhibits predictive behavior by anticipating that target information may appear within image content on online travel websites, successfully guiding the agent to locate the evidence on trip.com. Together, these examples demonstrate the practical effectiveness of memory systems evolved by MemEvolve .

Conclusion

This work provides a unified implementation and design space for the rapidly growing field of self-evolving agent memory, together with a standardized codebase, termed EvolveLab, upon which we further build MemEvolve, a meta-evolutionary memory framework. Departing from the conventional paradigm of manually crafting a single self-improving memory architecture and expecting it to generalize across all domains, MemEvolve instead embraces adaptive, architecture-level evolution driven by empirical interaction feedback. Extensive experiments across diverse agentic benchmarks and backbone models demonstrate the effectiveness, robustness, and generalization of this approach. Moreover, analysis of the automatically evolved memory systems reveals several instructive design principles, including increased agentic involvement, hierarchical organization, and multi-level abstraction. We hope that MemEvolve serves as a step toward more automated, principled, and meta-evolutionary pathways for building continually improving agentic intelligence.

Contributions

• Guibin Zhang • Haotian Ren

• Chong Zhan • Zhenhong Zhou • Junhao Wang • He Zhu

If you have any questions regarding the code, paper details, or other aspects of this work, you are very welcome to contact the authors at guibinz@outlook.com or by raising a GitHub issue.

Corresponding Authors

• Wangchunshu Zhou • Shuicheng Yan


Appendix

EvolveLab Implementation

EvolveLab is designed as a modular and extensible codebase to support the systematic study of self-evolving agent memory systems. It provides a unified interface that abstracts the complexities of diverse memory architectures, enabling standardized implementation, evaluation, and meta-evolution.

Unified Interface and Abstract Base Class

The cornerstone of EvolveLab is the BaseMemoryProvider abstract base class (ABC), which defines the fundamental protocol for all memory systems. As shown in the code snippet below, the interface enforces two primary operations that map to the modular design space (Encode, Store, Retrieve, Manage):

While take_in_memory primarily integrates the Encode and Store stages, the Manage functionality, which is responsible for offline consolidation or selective forgetting, is typically implemented as auxiliary methods within the provider classes or invoked during specific lifecycle events.

```python
class BaseMemoryProvider(ABC):
    """Abstract base class for memory providers"""

    def __init__(self, memory_type: MemoryType, config: Optional[dict] = None):
        self.memory_type = memory_type
        self.config = config or {}

    @abstractmethod
    def provide_memory(self, request: MemoryRequest) -> MemoryResponse:
        """
        Retrieve relevant memories based on query, context and status

        Args:
            request: MemoryRequest containing query, context, status and optional params

        Returns:
            MemoryResponse containing relevant memories
        """
        pass

    @abstractmethod
    def take_in_memory(self, trajectory_data: TrajectoryData) -> tuple[bool, str]:
        """
        Store/ingest new memory from trajectory data

        Args:
            trajectory_data: TrajectoryData containing query, trajectory and metadata

        Returns:
            tuple[bool, str]: (Success status of memory ingestion, Description of absorbed memory)
        """
        pass

    @abstractmethod
    def initialize(self) -> bool:
        """
        Initialize the memory provider (load existing data, setup indices, etc.)

        Returns:
            bool: Success status of initialization
        """
        pass

    def get_memory_type(self) -> MemoryType:
        """Get the type of this memory provider"""
        return self.memory_type

    def get_config(self) -> dict:
        """Get the configuration of this memory provider"""
        return self.config.copy()
```

Listing 1 The Abstract Base Class of Memory Providers
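To make the protocol concrete, the sketch below implements a toy provider against simplified stand-ins for the data carriers. The stand-in fields, and the `KeywordMemoryProvider` itself, are illustrative assumptions rather than part of EvolveLab: it stores trajectories verbatim (Encode + Store), retrieves them by keyword overlap (Retrieve), and exposes a `prune` helper in the auxiliary-method style described above (Manage).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MemoryType(Enum):          # simplified stand-in
    TRAJECTORY = "trajectory"

@dataclass
class MemoryRequest:             # simplified stand-in
    query: str

@dataclass
class MemoryResponse:            # simplified stand-in
    memories: list

@dataclass
class TrajectoryData:            # simplified stand-in
    query: str
    trajectory: str

class BaseMemoryProvider(ABC):
    def __init__(self, memory_type: MemoryType, config: Optional[dict] = None):
        self.memory_type = memory_type
        self.config = config or {}

    @abstractmethod
    def provide_memory(self, request: MemoryRequest) -> MemoryResponse: ...

    @abstractmethod
    def take_in_memory(self, trajectory_data: TrajectoryData) -> tuple[bool, str]: ...

    @abstractmethod
    def initialize(self) -> bool: ...

class KeywordMemoryProvider(BaseMemoryProvider):
    """Toy provider: verbatim storage, keyword-overlap retrieval."""

    def initialize(self) -> bool:
        self._store = []                       # Store: a plain in-memory list
        return True

    def take_in_memory(self, trajectory_data: TrajectoryData) -> tuple[bool, str]:
        self._store.append(trajectory_data)    # Encode + Store: keep verbatim
        return True, f"stored trajectory for: {trajectory_data.query}"

    def provide_memory(self, request: MemoryRequest) -> MemoryResponse:
        words = set(request.query.lower().split())
        hits = [t for t in self._store         # Retrieve: keyword overlap
                if words & set(t.query.lower().split())]
        return MemoryResponse(memories=hits)

    def prune(self, max_items: int) -> None:
        # Manage: auxiliary method keeping only the most recent entries.
        self._store = self._store[-max_items:] if max_items > 0 else []
```

A provider is then driven through the same three calls regardless of its internals: `initialize()` once, `take_in_memory(...)` after each completed task, and `provide_memory(...)` before the next one.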

Standardized Data Carriers

To ensure seamless interoperability across heterogeneous memory designs and agent frameworks, EvolveLab utilizes standardized memory data carriers. These structures act as the "universal language" of the framework:
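Concretely, these carriers can be pictured as simple dataclasses. The fields below mirror those mentioned in the interface docstrings (query, context, status, and optional params; query, trajectory, and metadata), while their exact types and defaults are assumptions rather than EvolveLab's real schema:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class MemoryRequest:
    """What an agent asks the memory system at decision time."""
    query: str                       # current task or sub-task description
    context: Optional[str] = None    # recent interaction history
    status: Optional[str] = None     # execution stage, e.g. "planning"
    params: dict[str, Any] = field(default_factory=dict)

@dataclass
class MemoryResponse:
    """What the memory system hands back for conditioning."""
    memories: list[str] = field(default_factory=list)

@dataclass
class TrajectoryData:
    """A completed rollout offered for ingestion."""
    query: str
    trajectory: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```

Because every memory system speaks only these types, an agent framework can swap providers without touching its own execution loop.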

Implementation Examples: ExpeL and SkillWeaver

The versatility of the EvolveLab interface is demonstrated by our implementation of twelve distinct memory systems, of which ExpeL and SkillWeaver serve as two representative examples.

Experiment Details

Dataset Details

The four datasets used in this study are GAIA, WebWalkerQA, xBench-DS, and TaskCraft, described and summarized as follows.

Memory System Demonstration

To provide a concrete and intuitive understanding of the memory architectures evolved by MemEvolve, we visualize three representative systems discovered along different evolutionary trajectories, as shown in Figures 8 to 10. These examples highlight how MemEvolve progressively transforms simple, static memory mechanisms into more expressive and adaptive architectures by modifying memory encoding, retrieval, and management strategies. Together, they illustrate the diversity of memory designs that can emerge under the same meta-evolutionary framework.

Figure 8 Illustration of the Lightweight memory system evolved by MemEvolve. The evolutionary starting point is a minimal few-shot trajectory memory, similar to MemoryBank, where each completed trajectory is stored verbatim. For a new task, the agent retrieves the top-k most similar trajectories via vector similarity and directly conditions on them. MemEvolve progressively refines this baseline into a more structured and stage-aware memory system.

Figure 9 Illustration of the Riva memory system evolved by MemEvolve. Its evolutionary initialization follows an AgentKB-style architecture, but without inheriting the large and costly offline knowledge base. Through meta-evolution, Riva develops more agent-centric encoding and retrieval strategies while remaining lightweight and fully online.

Figure 10 Illustration of the Cerebra memory system evolved by MemEvolve. Starting from the same AgentKB-style initialization (without the offline knowledge base), Cerebra further evolves to distill both reusable tools and abstract knowledge from experience, and incorporates working memory maintenance mechanisms to support long-horizon agent evolution.

| Method | Date | Mul. | Gran. | Online | Encode | Store | Retrieve | Manage |
|---|---|---|---|---|---|---|---|---|
| I. Voyager | 2023.5 | ✗ | traj. | ✓ | Traj. & Tips | Vector DB | Semantic Search | N/A |
| II. ExpeL | 2023.8 | ✗ | traj. | ✓ | Traj. & Insights | Vector DB | Contrastive Comparison | N/A |
| III. Generative | 2023.1 | ✓ | traj. | ✓ | Traj. & Insights | Vector DB | Semantic Search | N/A |
| IV. DILU | 2024.2 | ✗ | traj. | ✓ | Traj. | Vector DB | Semantic Search | N/A |
| V. AWM | 2024.9 | ✗ | traj. | ✓/✗ | Workflows | Vector DB | Semantic Search | N/A |
| VI. Mobile-E | 2025.1 | ✗ | step | ✗ | Tips & Shortcuts | Vector DB | Semantic Search | N/A |
| VII. Cheatsheet | 2025.4 | ✗ | traj. | ✓ | Tips & Shortcuts | JSON | Semantic Search | N/A |
| VIII. SkillWeaver | 2025.4 | ✗ | traj. | ✗ | APIs | Tool Library | Function Matching | Skill Pruning |
| IX. G-Memory | 2025.6 | ✓ | traj. | ✓ | Tips & Workflow | Graph | Graph/Semantic Search | Episodic Consolidation |
| X. Agent-KB | 2025.7 | ✓ | step | ✗ | Tips & Workflow | Hybrid DB | Hybrid Search | Deduplication |
| XI. Memp | 2025.8 | ✗ | step | ✓ | Tips & Workflow | JSON | Semantic Search | Failure-driven Adjustment |
| XII. EvolveR | 2025.1 | ✗ | step | ✓ | Tips & Workflow | JSON | Contrastive Comparison | Update & Pruning |
| Framework | Model Family | WebWalkerQA | xBench-DS | TaskCraft | GAIA Avg. | GAIA Level 1 | GAIA Level 2 | GAIA Level 3 |
|---|---|---|---|---|---|---|---|---|
| *Closed-source Agent Frameworks* | | | | | | | | |
| Langfun | Claude 3.7 etc. | - | - | - | 71.52 | 83.02 | 68.60 | 57.69 |
| TraseAgent | Claude etc. | - | - | - | 70.30 | 83.02 | 69.77 | 46.15 |
| OpenAI Deep Research | o1, o3 etc. | - | - | - | 67.36 | 74.29 | 69.06 | 47.60 |
| h2oGPTe | Claude-3.5 | - | - | - | 63.64 | 67.92 | 67.44 | 42.31 |
| Desearch | GPT-4o | - | - | - | 56.97 | 71.70 | 58.14 | 23.08 |
| *Open-Source Agent Frameworks* | | | | | | | | |
| OWL Workforce (pass@3) | GPT-4o+o3-mini | 57.64 | 55.0 | 58.33 | 60.61 | 81.14 | 58.14 | 26.92 |
| OWL RP (pass@3) | GPT-4o+o3-mini | - | - | - | 58.18 | 81.14 | 54.65 | 23.08 |
| TapeAgents | Claude 3.7 etc. | - | - | - | 55.76 | 71.70 | 53.49 | 30.77 |
| AutoAgent | Claude 3.5 etc. | - | - | - | 55.15 | 71.70 | 53.40 | 26.92 |
| Smolagents | GPT-4.1 | - | - | - | 55.15 | 67.92 | 53.49 | 34.62 |
| Smolagents | GPT-5-mini | 58.82 | 51.0 | 64.00 | 55.75 | 69.81 | 54.65 | 30.77 |
| Magnetic-1 | OpenAI o1 etc. | - | - | - | 46.06 | 56.60 | 46.51 | 23.08 |
| Cognitive Kernel-Pro (pass@1) | Claude-3.7 etc. | 60.64 | 56.0 | 66.00 | 60.00 | 79.25 | 56.98 | 30.77 |
| Cognitive Kernel-Pro (pass@3) | Claude-3.7 etc. | - | - | - | 75.15 | 84.91 | 73.26 | 61.54 |
| OAgents | Claude-3.7 etc. | 58.23 | 47.0 | - | 66.67 | 77.36 | 66.28 | 46.15 |
| JoyAgents | Claude-4, o4-mini | - | - | - | 75.2 | 86.8 | 77.9 | 42.3 |
| Agent KB (pass@1) | GPT-4.1 | 60.59 | 48.0 | 61.67 | 61.21 | 79.25 | 58.14 | 34.62 |
| Agent KB (pass@2) | GPT-4.1 | 68.82 | 58.0 | 72.67 | 67.27 | 83.02 | 67.44 | 34.62 |
| Agent KB (pass@3) | GPT-4.1 | 73.53 | 68.0 | 75.33 | 73.94 | 84.91 | 73.26 | 53.85 |
| Flash-Searcher (pass@1) | GPT-5-mini | 71.18 | 69.0 | 69.67 | 69.09 | 79.25 | 69.77 | 46.15 |
| Flash-Searcher (pass@1) | Kimi K2 | 52.35 | 66.0 | 58.00 | 52.12 | 58.49 | 52.33 | 34.62 |
| Flash-Searcher (pass@1) | DeepSeek V3.2 | 69.41 | 68.0 | 69.33 | 60.61 | 79.25 | 53.49 | 46.15 |
| MemEvolve + Smolagents (pass@1) | GPT-5-mini | 61.18 | 57.0 | 67.67 | 64.24 | 83.02 | 58.14 | 46.15 |
| MemEvolve + Smolagents (pass@2) | GPT-5-mini | 67.06 | 63.0 | 75.00 | 67.88 | 84.91 | 63.95 | 46.15 |
| MemEvolve + Smolagents (pass@3) | GPT-5-mini | 71.18 | 68.0 | 77.00 | 72.12 | 88.68 | 68.60 | 50.00 |
| MemEvolve + Flash-Searcher (pass@1) | GPT-5-mini | 74.71 | 74.0 | 72.00 | 73.33 | 83.02 | 73.26 | 53.85 |
| MemEvolve + Flash-Searcher (pass@2) | GPT-5-mini | 79.41 | 77.0 | 75.00 | 77.58 | 92.45 | 74.42 | 57.69 |
| MemEvolve + Flash-Searcher (pass@3) | GPT-5-mini | 81.18 | 78.0 | 79.33 | 80.61 | 94.34 | 79.07 | 57.69 |
| MemEvolve + Flash-Searcher (pass@1) | Kimi K2 | 69.41 | 68.0 | 68.00 | 61.21 | 67.92 | 63.95 | 38.46 |
| MemEvolve + Flash-Searcher (pass@1) | DeepSeek V3.2 | 72.35 | 70.0 | 72.67 | 67.88 | 83.02 | 63.95 | 50.00 |
| Memory Setting | GAIA Perf. | GAIA Cost ($) | GAIA Delay (s) | GAIA #Steps | xBench Perf. | xBench Cost ($) | xBench Delay (s) | xBench #Steps | WebWalkerQA Perf. | WebWalkerQA Cost ($) | WebWalkerQA Delay (s) | WebWalkerQA #Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No-Memory | 69.09 | 0.086 | 505.46 | 10.44 | 69.00 | 0.141 | 523.05 | 14.69 | 71.18 | 0.048 | 251.57 | 6.91 |
| Generative | 66.67 | 0.061 | 436.26 | 8.87 | 70.00 | 0.131 | 818.37 | 13.45 | 72.35 | 0.045 | 268.56 | 6.64 |
| Voyager | 69.70 | 0.060 | 499.89 | 9.25 | 68.00 | 0.117 | 553.46 | 12.71 | 73.53 | 0.049 | 333.69 | 6.99 |
| DILU | 66.67 | 0.059 | 444.62 | 8.91 | 69.00 | 0.134 | 500.72 | 13.83 | 72.94 | 0.046 | 272.16 | 6.96 |
| ExpeL | 66.06 | 0.059 | 500.11 | 8.68 | 64.00 | 0.123 | 710.32 | 13.05 | 69.41 | 0.076 | 385.28 | 10.96 |
| AWM | 67.27 | 0.062 | 584.88 | 10.23 | 71.00 | 0.138 | 761.33 | 14.12 | 72.35 | 0.068 | 397.20 | 11.40 |
| Mobile-E | 69.09 | 0.065 | 321.80 | 9.35 | 68.00 | 0.120 | 537.18 | 13.16 | 71.76 | 0.059 | 296.01 | 6.52 |
| Cheatsheet | 68.48 | 0.069 | 559.81 | 9.72 | 65.00 | 0.174 | 818.07 | 15.99 | 72.94 | 0.057 | 367.13 | 7.59 |
| MemEvolve | 73.33 | 0.085 | 693.33 | 10.14 | 74.00 | 0.136 | 773.06 | 14.20 | 74.71 | 0.040 | 332.49 | 6.64 |

$$ a_t = \pi_{\mu(t)}(s_t, \mathcal{H}_t, \mathcal{Q}, c_t), \quad c_t \sim \Omega(M_t, s_t, \mathcal{H}_t, \mathcal{Q}). $$

$$ M_{t+1} = \Omega(M_t, \epsilon), $$

$$ M_{t+1,j}^{(k)} = \Omega_j^{(k)}\big(M_{t,j}^{(k)}, \epsilon_{\tau}\big), \quad \epsilon_{\tau} \in \mathcal{E}_j^{(k)}(\tau). $$

$$ \big\{\Omega_{j'}^{(k+1)}\big\}_{j' \in \mathcal{J}^{(k+1)}} = \mathcal{F}\!\left( \big\{\Omega_j^{(k)}\big\}_{j \in \mathcal{J}^{(k)}},\; \big\{\mathbf{F}_j^{(k)}\big\}_{j \in \mathcal{J}^{(k)}} \right). $$

$$ \big(\{\varnothing\}_{j \in \mathcal{J}^{(k)}},\, \{\Omega_j^{(k)}\}_{j \in \mathcal{J}^{(k)}}\big) \;\xrightarrow{\text{inner}}\; \big(\{M_{t+1,j}^{(k)}\}_{j \in \mathcal{J}^{(k)}},\, \{\Omega_j^{(k)}\}_{j \in \mathcal{J}^{(k)}}\big) \;\xrightarrow{\text{outer}}\; \big(\{M_{t+1,j}^{(k)}\}_{j \in \mathcal{J}^{(k)}},\, \{\Omega_{j'}^{(k+1)}\}_{j' \in \mathcal{J}^{(k+1)}}\big). $$

$$ \mathbf{F}_j^{(k)} \triangleq \big( \text{Perf}_j^{(k)},\; -\text{Cost}_j^{(k)},\; -\text{Delay}_j^{(k)} \big), $$

$$ \mathcal{P}^{(k)} = \operatorname*{Top\text{-}K}_{j \in \mathcal{J}^{(k)}} \Big(\rho_j^{(k)},\; \text{Perf}_j^{(k)}\Big). $$
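Taken together, the equations above describe a two-timescale loop: an inner pass where each candidate memory system $\Omega_j$ ingests experiences, and an outer pass that scores candidates with the fitness tuple $(\text{Perf}, -\text{Cost}, -\text{Delay})$ and keeps the Top-K parents. A hedged, self-contained Python sketch follows; the toy operators, the fitness numbers, and the treatment of $\rho_j$ as a scalar reliability weight are all illustrative assumptions, not the paper's implementation:

```python
def inner_loop(omega, experiences):
    """Inner update: M_{t+1,j} = Omega_j(M_{t,j}, eps) over a task stream."""
    memory = []
    for eps in experiences:
        memory = omega(memory, eps)
    return memory

def append_op(memory, eps):          # a naive Omega: store everything
    return memory + [eps]

def dedup_op(memory, eps):           # a thriftier Omega: skip duplicates
    return memory if eps in memory else memory + [eps]

def fitness(perf, cost, delay):
    """F_j = (Perf, -Cost, -Delay): larger is better in every slot."""
    return (perf, -cost, -delay)

def top_k_parents(scored, k):
    """P = Top-K over (rho_j, Perf_j); ties fall through to later slots."""
    ranked = sorted(scored.items(),
                    key=lambda kv: (kv[1][0], kv[1][1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Inner loop: each candidate operator builds its own memory.
experiences = ["plan", "search", "search", "answer"]
memories = {op.__name__: inner_loop(op, experiences)
            for op in (append_op, dedup_op)}

# Outer loop: reliability rho plus measured fitness pick the parents.
scored = {
    "append_op": (1.0, fitness(69.1, 0.086, 505.5)),
    "dedup_op":  (1.0, fitness(73.3, 0.085, 693.3)),
}
parents = top_k_parents(scored, k=1)   # dedup_op wins on Perf
```

The surviving operators would then seed the next round's candidate set $\mathcal{J}^{(k+1)}$, with the proposal step $\mathcal{F}$ mutating them into new architectures.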



[openproblems] Daniel Burkhardt, Andrew Benz, Richard Lieberman, Scott Gigante, Ashley Chow, Ryan Holbrook, Robrecht Cannoodt, Malte Luecken. (2023). Open Problems – Single-Cell Perturbations.

[edwards2022translation] Edwards, Carl, Lai, Tuan, Ros, Kevin, Honke, Garrett, Ji, Heng. (2022). Translation between molecules and natural language. arXiv preprint arXiv:2204.11817.

[peidli2024scperturb] Peidli, Stefan, Green, Tessa D, Shen, Ciyue, Gross, Torsten, Min, Joseph, Garda, Samuele, Yuan, Bo, Schumacher, Linus J, Taylor-King, Jake P, Marks, Debora S, others. (2024). {scPerturb. Nature Methods.

[bendidi2024benchmarking] Bendidi, Ihab, Whitfield, Shawn, Kenyon-Dean, Kian, Yedder, Hanene Ben, Mesbahi, Yassir El, Noutahi, Emmanuel, Denton, Alisandra K.. (2024). Benchmarking Transcriptomics Foundation Models for Perturbation Analysis: one PCA still rules them all. doi:10.48550/arXiv.2410.13956.

[lu2024ai] Lu, Chris, Lu, Cong, Lange, Robert Tjarko, Foerster, Jakob, Clune, Jeff, Ha, David. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. doi:10.48550/arXiv.2408.06292.

[swanson2024virtual] Swanson, Kyle, Wu, Wesley, Bulaong, Nash L., Pak, John E., Zou, James. (2024). The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation. doi:10.1101/2024.11.11.623004.

[cui2024scgpt] Cui, Yuhan, Li, Yifan, Bai, Xiaohan, Liu, Xiaohui. (2024). scGPT: Large language model for single-cell omics data analysis. arXiv preprint arXiv:2401.03456.

[theodoris2023transfer] Theodoris, Panagiotis, Li, Yifan, Bai, Xiaohan, Liu, Xiaohui. (2023). Transfer learning for predicting drug responses across cell lines using single-cell rna-seq. Briefings in bioinformatics.

[lewis2020retrieval] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, others. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11280.

[reimers2019sbert] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084 [cs].

[zheng2023judging] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, others. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. doi:10.48550/arXiv.2306.05685.

[openhands] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig. (2024). {OpenHands: An Open Platform for AI Software Developers as Generalist Agents.

[guo2024large] Guo, Taicheng, Chen, Xiuying, Wang, Yaqi, Chang, Ruidi, Pei, Shichao, Chawla, Nitesh V., Wiest, Olaf, Zhang, Xiangliang. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. doi:10.48550/arXiv.2402.01680.

[majumder2024discoverybench] Majumder, Bodhisattwa Prasad, Surana, Harshit, Agarwal, Dhruv, Mishra, Bhavana Dalvi, Meena, Abhijeetsingh, Prakhar, Aryan, Khot, Tirth, Sabharwal, Ashish, Clark, Peter. (2024). DiscoveryBench: Towards Data-Driven Discovery with Large Language Models. arXiv preprint arXiv:2407.01725.

[chen2024scienceagentbench] Chen, Ziru, Chen, Shijie, Ning, Yuting, Zhang, Qianheng, Wang, Boshi, Yu, Botao, Li, Yifei, Liao, Zeyi, Wei, Chen, Lu, Zitong, others. (2024). ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery. arXiv preprint arXiv:2410.05080.

[tian2024scicode] Tian, Minyang, Gao, Luyu, Zhang, Shizhuo Dylan, Chen, Xinan, Fan, Cunwei, Guo, Xuefei, Haas, Roland, Ji, Pan, Krongchon, Kittithat, Li, Yao, others. (2024). SciCode: A Research Coding Benchmark Curated by Scientists. arXiv preprint arXiv:2407.13168.

[gu2024blade] Gu, Ken, Shang, Ruoxi, Jiang, Ruien, Kuang, Keying, Lin, Richard-John, Lyu, Donghe, Mao, Yue, Pan, Youran, Wu, Teng, Yu, Jiaqian, others. (2024). BLADE: Benchmarking Language Model Agents for Data-Driven Science. arXiv preprint arXiv:2408.09667.

[ghafarollahi2024b] Ghafarollahi, Alireza, Buehler, Markus J. (2024). ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning. arXiv preprint arXiv:2402.04268.

[roohani2024biodiscoveryagent] Roohani, Yusuf, Lee, Andrew, Huang, Qian, Vora, Jian, Steinhart, Zachary, Huang, Kexin, Marson, Alexander, Liang, Percy, Leskovec, Jure. (2024). BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments. arXiv preprint arXiv:2405.17631.

[liu2024tais] Liu, Haoyang, Li, Yijiang, Jian, Jinglin, Cheng, Yuxuan, Lu, Jianrong, Guo, Shuyi, Zhu, Jinglei, Zhang, Mianchen, Zhang, Miantong, Wang, Haohan. (2024). Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data. arXiv preprint arXiv:2402.12391.

[jin2024agentmd] Jin, Qiao, Wang, Zhizheng, Yang, Yifan, Zhu, Qingqing, Wright, Donald, Huang, Thomas, Wilbur, W John, He, Zhe, Taylor, Andrew, Chen, Qingyu, others. (2024). AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning. arXiv preprint arXiv:2402.13225.

[chen2023chemistx] Chen, Kexin, Li, Junyou, Wang, Kunyi, Du, Yuyang, Yu, Jiahui, Lu, Jiamin, Li, Lanqing, Qiu, Jiezhong, Pan, Jianzhang, Heng, Pheng Ann, others. (2023). Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis. arXiv preprint arXiv:2311.10776.

[bran2024chemcrow] Bran, Andres M, Cox, Sam, Schilter, Oliver, Baldassari, Carlo, White, Andrew D, Schwaller, Philippe. (2024). Augmenting large language models with chemistry tools. Nature Machine Intelligence.

[kang2023chatmof] Kang, Yeonghun, Kim, Jihan. (2023). ChatMOF: An Autonomous AI System for Predicting and Generating Metal-Organic Frameworks. arXiv preprint arXiv:2308.01423.

[ghafarollahi2024a] Ghafarollahi, Alireza, Buehler, Markus J. (2024). AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence. arXiv preprint arXiv:2407.10022.

[sun2024mephisto] Sun, Zechang, Ting, Yuan-Sen, Liang, Yaobo, Duan, Nan, Huang, Song, Cai, Zheng. (2024). Interpreting Multi-band Galaxy Observations with Large Language Model-Based Agents. arXiv preprint arXiv:2409.14807.

[baek2024researchagent] Baek, Jinheon, Jauhar, Sujay Kumar, Cucerzan, Silviu, Hwang, Sung Ju. (2024). ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models. arXiv preprint arXiv:2404.07738.

[jin2024b] Jin, Yiqiao, Zhao, Qinlin, Wang, Yiyang, Chen, Hao, Zhu, Kaijie, Xiao, Yijia, Wang, Jindong. (2024). AgentReview: Exploring Peer Review Dynamics with LLM Agents. arXiv preprint arXiv:2406.12708.

[li2024b] Li, Ruochen, Patel, Teerth, Wang, Qingyun, Du, Xinya. (2024). MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents. arXiv preprint arXiv:2408.14033.

[futurehouse2024] . Future House Platform: AI Agents for Scientific Research. (2024).

[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems.

[yao2023react] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations.

[fourney2024magentic] Fourney, Adam, Bansal, Gagan, Mozannar, Hussein, Tan, Cheng, Salinas, Eduardo, Niedtner, Friederike, Proebsting, Grace, Bassman, Griffin, Gerrits, Jack, Alber, Jacob, others. (2024). Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468.

[wu2024copilot] Wu, Zhiyong, Han, Chengcheng, Ding, Zichen, Weng, Zhenmin, Liu, Zhoumianze, Yao, Shunyu, Yu, Tao, Kong, Lingpeng. (2024). Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456.

[chen2024tree] Chen, Ziru, White, Michael, Mooney, Raymond, Payani, Ali, Su, Yu, Sun, Huan. (2024). When is tree search useful for llm planning? it depends on the discriminator. arXiv preprint arXiv:2402.10890.

[koh2024tree] Koh, Jing Yu, McAleer, Stephen, Fried, Daniel, Salakhutdinov, Ruslan. (2024). Tree search for language model agents. arXiv preprint arXiv:2407.01476.

[song2024trial] Song, Yifan, Yin, Da, Yue, Xiang, Huang, Jie, Li, Sujian, Lin, Bill Yuchen. (2024). Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502.

[pan2024autonomous] Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, Suhr, Alane. (2024). Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474.

[paul2023refiner] Paul, Debjit, Ismayilzada, Mete, Peyrard, Maxime, Borges, Beatriz, Bosselut, Antoine, West, Robert, Faltings, Boi. (2023). Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.

[fu2025agentrefine] Fu, Dayuan, He, Keqing, Wang, Yejie, Hong, Wentao, Gongque, Zhuoma, Zeng, Weihao, Wang, Wei, Wang, Jingang, Cai, Xunliang, Xu, Weiran. (2025). AgentRefine: Enhancing Agent Generalization through Refinement Tuning. arXiv preprint arXiv:2501.01702.

[zhang2024towards] Zhang, Yang, Yang, Shixin, Bai, Chenjia, Wu, Fei, Li, Xiu, Wang, Zhen, Li, Xuelong. (2024). Towards efficient llm grounding for embodied multi-agent collaboration. arXiv preprint arXiv:2405.14314.

[qin2024tool] Qin, Yujia, Hu, Shengding, Lin, Yankai, Chen, Weize, Ding, Ning, Cui, Ganqu, Zeng, Zheni, Zhou, Xuanhe, Huang, Yufei, Xiao, Chaojun, others. (2024). Tool learning with foundation models. ACM Computing Surveys.

[wang2024executable] Wang, Xingyao, Chen, Yangyi, Yuan, Lifan, Zhang, Yizhe, Li, Yunzhu, Peng, Hao, Ji, Heng. (2024). Executable code actions elicit better llm agents. Forty-first International Conference on Machine Learning.

[song2023llm] Song, Chan Hee, Wu, Jiaman, Washington, Clayton, Sadler, Brian M, Chao, Wei-Lun, Su, Yu. (2023). Llm-planner: Few-shot grounded planning for embodied agents with large language models. Proceedings of the IEEE/CVF international conference on computer vision.

[yang2023how2comm] Yang, Dingkang, Yang, Kun, Wang, Yuzheng, Liu, Jing, Xu, Zhi, Yin, Rongbin, Zhai, Peng, Zhang, Lihua. (2023). How2comm: Communication-efficient and collaboration-pragmatic multi-agent perception. Advances in Neural Information Processing Systems.

[yang2023what2comm] Yang, Kun, Yang, Dingkang, Zhang, Jingyu, Wang, Hanqi, Sun, Peng, Song, Liang. (2023). What2comm: Towards communication-efficient collaborative perception via feature decoupling. Proceedings of the 31st ACM international conference on multimedia.

[tordesillas2021mader] Tordesillas, Jesus, How, Jonathan P. (2021). MADER: Trajectory planner in multiagent and dynamic environments. IEEE Transactions on Robotics.

[trase2024trase] Trase. (2024). Meet trase systems..

[tang2025autoagent] Tang, Jiabin, Fan, Tianyu, Huang, Chao. (2025). AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents. arXiv e-prints.

[wei2025browsecomp] Wei, Jason, Sun, Zhiqing, Papay, Spencer, McKinney, Scott, Han, Jeffrey, Fulford, Isa, Chung, Hyung Won, Passos, Alex Tachard, Fedus, William, Glaese, Amelia. (2025). Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516.

[mialon2023gaia] Mialon, Gr{'e. (2023). Gaia: a benchmark for general ai assistants. The Twelfth International Conference on Learning Representations.

[hu2024automated] Hu, Shengran, Lu, Cong, Clune, Jeff. (2024). Automated design of agentic systems. arXiv preprint arXiv:2408.08435.

[schick2023toolformer] Schick, Timo, Dwivedi-Yu, Jane, Dess{`\i. (2023). Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems.

[li2025search] Li, Xiaoxi, Dong, Guanting, Jin, Jiajie, Zhang, Yuyao, Zhou, Yujia, Zhu, Yutao, Zhang, Peitian, Dou, Zhicheng. (2025). Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.

[li2025webthinker] Li, Xiaoxi, Jin, Jiajie, Dong, Guanting, Qian, Hongjin, Zhu, Yutao, Wu, Yongkang, Wen, Ji-Rong, Dou, Zhicheng. (2025). Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776.

[aworld2025] Agent Team at Ant Group. (2025). AWorld: A Unified Agent Playground for Computer and Phone Use Tasks.

[h2oGPTe2024h2oGPTe] H2O.ai. (2024). Autonomous agentic ai: execute multi-step workflows autonomously.

[opendeepresearch] LangChain. (2024). Open Deep Research.

[bahdanau2024tapeagentsholisticframeworkagent] Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, Quaizar Vohra. (2024). TapeAgents: a Holistic Framework for Agent Development and Optimization.

[owl2025] Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Ping Luo, Guohao Li. (2025). OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation.

[Peng_Langfun_2023] Peng, Daiyi. (2023). {Langfun.

[cosight] ZTE AICloud. (2025). {Co-Sight.

[desearch] Desearch AI. (2024). Desearch.

[deepresearch] OpenAI. (2024). deepresearch.

[xu2025mem] Xu, Wujiang, Liang, Zujie, Mei, Kai, Gao, Hang, Tan, Juntao, Zhang, Yongfeng. (2025). A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.

[zhang2024survey-memory] Zhang, Zeyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Dai, Quanyu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.

[openaiGPT5System] OpenAI. (2025). {G.

[Aho:72] Alfred V. Aho, Jeffrey D. Ullman. (1972). The Theory of Parsing, Translation and Compiling.

[zhou2024symbolic] Zhou, Wangchunshu, Ou, Yixin, Ding, Shengwei, Li, Long, Wu, Jialong, Wang, Tiannan, Chen, Jiamin, Wang, Shuai, Xu, Xiaohua, Zhang, Ningyu, others. (2024). Symbolic learning enables self-evolving agents. arXiv preprint arXiv:2406.18532.

[Hugging_Face_2024] Cui, Ganqu, Yuan, Lifan, Ding, Ning, Yao, Guanming, He, Bingxiang, Zhu, Wei, Ni, Yuan, Xie, Guotong, Xie, Ruobing, Lin, Yankai, others. (2023). Ultrafeedback: Boosting language models with scaled ai feedback. arXiv preprint arXiv:2310.01377.

[frick2025p2l] Frick, Evan, Chen, Connor, Tennyson, Joseph, Li, Tianle, Chiang, Wei-Lin, Angelopoulos, Anastasios N, Stoica, Ion. (2025). Prompt-to-leaderboard. arXiv preprint arXiv:2502.14855.

[liu2023human+] Liu, Jiawei, Xia, Chunqiu Steven, Wang, Yuyao, Zhang, Lingming. (2023). Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems.

[chen2024optima] Chen, Weize, Yuan, Jiarui, Qian, Chen, Yang, Cheng, Liu, Zhiyuan, Sun, Maosong. (2024). Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. arXiv preprint arXiv:2410.08115.

[liang2024self] Liang, Xuechen, Tao, Meiling, Xia, Yinghui, Shi, Tianyu, Wang, Jun, Yang, JingSong. (2024). Self-evolving Agents with reflective and memory-augmented abilities. arXiv preprint arXiv:2409.00872.

[wang2024moa] Wang, Junlin, Wang, Jue, Athiwaratkun, Ben, Zhang, Ce, Zou, James. (2024). Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692.

[chen2024routerdc] Chen, Shuhao, Jiang, Weisen, Lin, Baijiong, Kwok, James T, Zhang, Yu. (2024). RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models. arXiv preprint arXiv:2409.19886.

[han2024wildguard] Han, Seungju, Rao, Kavel, Ettinger, Allyson, Jiang, Liwei, Lin, Bill Yuchen, Lambert, Nathan, Choi, Yejin, Dziri, Nouha. (2024). Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495.

[APA:83] {American Psychological Association. (1983). Publications Manual.

[Chandra:81] Ashok K. Chandra, Dexter C. Kozen, Larry J. Stockmeyer. (1981). Alternation. Journal of the Association for Computing Machinery. doi:10.1145/322234.322243.

[andrew2007scalable] Andrew, Galen, Gao, Jianfeng. (2007). Scalable training of {L1. Proceedings of the 24th International Conference on Machine Learning.

[Gusfield:97] Dan Gusfield. (1997). Algorithms on Strings, Trees and Sequences.

[rasooli-tetrault-2015] Mohammad Sadegh Rasooli, Joel R. Tetreault. (2015). Yara Parser: {A. Computing Research Repository.

[Ando2005] Ando, Rie Kubota, Zhang, Tong. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.

[AMSTrans = "American Mathematical Society Translations" } @String{AMSTrans = "Amer. Math. Soc. Transl." } @String{BullAMS = "Bulletin of the American Mathematical Society" } @String{BullAMS = "Bull. Amer. Math. Soc." } @String{ProcAMS = "Proceedings of the American Mathematical Society" } @String{ProcAMS = "Proc. Amer. Math. Soc." } @String{TransAMS = "Transactions of the American Mathematical Society" } @String{TransAMS = "Trans. Amer. Math. Soc." }

%ACM @String{CACM = "Communications of the {ACM}" } @String{CACM = "Commun. {ACM}" } @String{CompServ = "Comput. Surveys" } @String{JACM = "J. ACM" } @String{ACMMathSoft = "{ACM} Transactions on Mathematical Software" } @String{ACMMathSoft = "{ACM} Trans. Math. Software" } @String{SIGNUM = "{ACM} {SIGNUM} Newsletter" } @String{SIGNUM = "{ACM} {SIGNUM} Newslett." }

@String{AmerSocio = "American Journal of Sociology" } @String{AmerStatAssoc = "Journal of the American Statistical Association" } @String{AmerStatAssoc = "J. Amer. Statist. Assoc." } @String{ApplMathComp = "Applied Mathematics and Computation" } @String{ApplMathComp = "Appl. Math. Comput." } @String{AmerMathMonthly = "American Mathematical Monthly" } @String{AmerMathMonthly = "Amer. Math. Monthly" } @String{BIT = "{BIT}" } @String{BritStatPsych = "British Journal of Mathematical and Statistical Psychology" } @String{BritStatPsych = "Brit. J. Math. Statist. Psych." } @String{CanMathBull = "Canadian Mathematical Bulletin" } @String{CanMathBull = "Canad. Math. Bull." } @String{CompApplMath = "Journal of Computational and Applied Mathematics" } @String{CompApplMath = "J. Comput. Appl. Math." } @String{CompPhys = "Journal of Computational Physics" } @String{CompPhys = "J. Comput. Phys." } @String{CompStruct = "Computers and Structures" } @String{CompStruct = "Comput. & Structures" } @String{CompJour = "The Computer Journal" } @String{CompJour = "Comput. J." } @String{CompSysSci = "Journal of Computer and System Sciences" } @String{CompSysSci = "J. Comput. System Sci." } @String{Computing = "Computing" } @String{ContempMath = "Contemporary Mathematics" } @String{ContempMath = "Contemp. Math." } @String{Crelle = "Crelle's Journal" } @String{GiornaleMath = "Giornale di Mathematiche" } @String{GiornaleMath = "Giorn. Mat." } % didn't find in AMS MR.] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards {AI. Large Scale Kernel Machines.

[li2024gslb] Li, Zhixun, Sun, Xin, Luo, Yifan, Zhu, Yanqiao, Chen, Dingshuo, Luo, Yingtao, Zhou, Xiangxin, Liu, Qiang, Wu, Shu, Wang, Liang, others. (2024). GSLB: the graph structure learning benchmark. Advances in Neural Information Processing Systems.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, Bengio, Yoshua. (2016). Deep learning.

[Tang:12KDDCross] Jie Tang, Sen Wu, Jimeng Sun, Hang Su. (2012). Cross-domain Collaboration Recommendation. KDD'2012.

[sankar2020dysat] Sankar, Aravind, Wu, Yanhong, Gou, Liang, Zhang, Wei, Yang, Hao. (2020). Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. Proceedings of the 13th International Conference on Web Search and Data Mining.

[wu2022handling] Wu, Qitian, Zhang, Hengrui, Yan, Junchi, Wipf, David. (2022). Handling Distribution Shifts on Graphs: An Invariance Perspective. International Conference on Learning Representations.

[mikolov2013efficient] Tom{'{a. (2013). Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations.

[wu2022discovering] Yingxin Wu, Xiang Wang, An Zhang, Xiangnan He, Tat{-. (2022). Discovering Invariant Rationales for Graph Neural Networks. The Tenth International Conference on Learning Representations.

[zhu2021shift] Zhu, Qi, Ponomareva, Natalia, Han, Jiawei, Perozzi, Bryan. (2021). Shift-robust gnns: Overcoming the limitations of localized graph training data. Advances in Neural Information Processing Systems.

[gagnon2022woods] Gagnon-Audet, Jean-Christophe, Ahuja, Kartik, Darvishi-Bayazi, Mohammad-Javad, Dumas, Guillaume, Rish, Irina. (2022). WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series Tasks. arXiv preprint arXiv:2203.09978.

[du2021adarnn] Du, Yuntao, Wang, Jindong, Feng, Wenjie, Pan, Sinno, Qin, Tao, Xu, Renjun, Wang, Chongjun. (2021). Adarnn: Adaptive learning and forecasting of time series. Proceedings of the 30th ACM International Conference on Information & Knowledge Management.

[kim2021reversible] Kim, Taesung, Kim, Jinhee, Tae, Yunwon, Park, Cheonbok, Choi, Jang-Ho, Choo, Jaegul. (2021). Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift. International Conference on Learning Representations.

[venkateswaran2021environment] Venkateswaran, Praveen, Muthusamy, Vinod, Isahagian, Vatche, Venkatasubramanian, Nalini. (2021). Environment agnostic invariant risk minimization for classification of sequential datasets. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.

[driess2023palm-e] Driess, Danny, Xia, Fei, Sajjadi, Mehdi SM, Lynch, Corey, Chowdhery, Aakanksha, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, Huang, Wenlong, others. (2023). Palm-e: An embodied multimodal language model.

[huang2024understanding] Huang, Xu, Liu, Weiwen, Chen, Xiaolong, Wang, Xingmei, Wang, Hao, Lian, Defu, Wang, Yasheng, Tang, Ruiming, Chen, Enhong. (2024). Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716.

[putta2024agentq] Putta, Pranav, Mills, Edmund, Garg, Naman, Motwani, Sumeet, Finn, Chelsea, Garg, Divyansh, Rafailov, Rafael. (2024). Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199.

[masterman2024landscape] Masterman, Tula, Besen, Sandi, Sawtell, Mason, Chao, Alex. (2024). The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584.

[lu2021diversify] Lu, Wang, Wang, Jindong, Chen, Yiqiang, Sun, Xinwei. (2021). DIVERSIFY to Generalize: Learning Generalized Representations for Time Series Classification. arXiv preprint.

[zhu2024knowagent] Zhu, Yuqi, Qiao, Shuofei, Ou, Yixin, Deng, Shumin, Lyu, Shiwei, Shen, Yue, Liang, Lei, Gu, Jinjie, Chen, Huajun, Zhang, Ningyu. (2024). Knowagent: Knowledge-augmented planning for llm-based agents. arXiv preprint arXiv:2403.03101.

[skarding2021foundations] Skarding, Joakim, Gabrys, Bogdan, Musial, Katarzyna. (2021). Foundations and Modeling of Dynamic Networks Using Dynamic Graph Neural Networks: A Survey. IEEE Access.

[zhu2022learnable] Zhu, Yuecai, Lyu, Fuyuan, Hu, Chengming, Chen, Xi, Liu, Xue. (2022). Learnable Encoder-Decoder Architecture for Dynamic Graph: A Survey. arXiv preprint arXiv:2203.10480.

[zhang2024cut] Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen. (2024). Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems. arXiv preprint arXiv:2410.02506.

[wang2021inductive] Yanbang Wang, Yen{-. (2021). Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. 9th International Conference on Learning Representations.

[cong2021dynamic] Cong, Weilin, Wu, Yanhong, Tian, Yuandong, Gu, Mengting, Xia, Yinglong, Mahdavi, Mehrdad, Chen, Chun-cheng Jason. (2021). Dynamic Graph Representation Learning via Graph Transformer Networks. arXiv preprint arXiv:2111.10447.

[hong2024data-interpreter] Hong, Sirui, Lin, Yizhang, Liu, Bang, Liu, Bangbang, Wu, Binhao, Zhang, Ceyao, Wei, Chenxing, Li, Danyang, Chen, Jiaqi, Zhang, Jiayi, others. (2024). Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679.

[yang2021discrete] Yang, Menglin, Zhou, Min, Kalander, Marcus, Huang, Zengfeng, King, Irwin. (2021). Discrete-time Temporal Network Embedding via Implicit Hierarchical Learning in Hyperbolic Space. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.

[sun2021hyperbolic] Sun, Li, Zhang, Zhongbao, Zhang, Jiawei, Wang, Feiyang, Peng, Hao, Su, Sen, Yu, Philip S. (2021). Hyperbolic variational graph neural network for modeling dynamic graphs. Proceedings of the AAAI Conference on Artificial Intelligence.

[xu2020inductive] Da Xu, Chuanwei Ruan, Evren K{. (2020). Inductive representation learning on temporal graphs. 8th International Conference on Learning Representations.

[wang2021tcl] Wang, Lu, Chang, Xiaofu, Li, Shuang, Chu, Yunfei, Li, Hui, Zhang, Wei, He, Xiaofeng, Song, Le, Zhou, Jingren, Yang, Hongxia. (2021). Tcl: Transformer-based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944.

[rossi2020temporal] Rossi, Emanuele, Chamberlain, Ben, Frasca, Fabrizio, Eynard, Davide, Monti, Federico, Bronstein, Michael. (2020). Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.

[hajiramezanali2019variational] Hajiramezanali, Ehsan, Hasanzadeh, Arman, Narayanan, Krishna, Duffield, Nick, Zhou, Mingyuan, Qian, Xiaoning. (2019). Variational graph recurrent neural networks. Advances in neural information processing systems.

[pareja2020evolvegcn] Pareja, Aldo, Domeniconi, Giacomo, Chen, Jie, Ma, Tengfei, Suzumura, Toyotaro, Kanezashi, Hiroki, Kaler, Tim, Schardl, Tao, Leiserson, Charles. (2020). Evolvegcn: Evolving graph convolutional networks for dynamic graphs. Proceedings of the AAAI Conference on Artificial Intelligence.

[seo2018structured] Seo, Youngjoo, Defferrard, Micha{. (2018). Structured sequence modeling with graph convolutional recurrent networks. International Conference on Neural Information Processing.

[kipf2016variational] Kipf, Thomas N, Welling, Max. (2016). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

[arjovsky2019invariant] Arjovsky, Martin, Bottou, L{'e. (2019). Invariant risk minimization. arXiv preprint.

[sagawa2019distributionally] Sagawa, Shiori, Koh, Pang Wei, Hashimoto, Tatsunori B, Liang, Percy. Distributionally Robust Neural Networks. International Conference on Learning Representations.

[krueger2021out] Krueger, David, Caballero, Ethan, Jacobsen, Joern-Henrik, Zhang, Amy, Binas, Jonathan, Zhang, Dinghuai, Le Priol, Remi, Courville, Aaron. (2021). Out-of-distribution generalization via risk extrapolation (rex). International Conference on Machine Learning.

[cadene2019rubi] Cadene, Remi, Dancette, Corentin, Cord, Matthieu, Parikh, Devi, others. (2019). Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems.

[aggarwal2014evolutionary] Aggarwal, Charu, Subbian, Karthik. (2014). Evolutionary network analysis: A survey. ACM Computing Surveys (CSUR).

[qiu2020temporal] Qiu, Zhenyu, Hu, Wenbin, Wu, Jia, Liu, Weiwei, Du, Bo, Jia, Xiaohua. (2020). Temporal network embedding with high-order nonlinear information. Proceedings of the AAAI Conference on Artificial Intelligence.

[huang2020motif] Huang, Hong, Fang, Zixuan, Wang, Xiao, Miao, Youshan, Jin, Hai. (2020). Motif-Preserving Temporal Network Embedding.. IJCAI.

[zhou2018dynamic] Zhou, Lekui, Yang, Yang, Ren, Xiang, Wu, Fei, Zhuang, Yueting. (2018). Dynamic network embedding by modeling triadic closure process. Proceedings of the AAAI conference on artificial intelligence.

[trivedi2019dyrep] Trivedi, Rakshit, Farajtabar, Mehrdad, Biswal, Prasenjeet, Zha, Hongyuan. (2019). Dyrep: Learning representations over dynamic graphs. International conference on learning representations.

[ding2021closer] Ding, Mucong, Kong, Kezhi, Chen, Jiuhai, Kirchenbauer, John, Goldblum, Micah, Wipf, David, Huang, Furong, Goldstein, Tom. (2021). A Closer Look at Distribution Shifts and Out-of-Distribution Generalization on Graphs.

[kovanen2011temporal] Kovanen, Lauri, Karsai, M{'a. (2011). Temporal motifs in time-dependent networks. Journal of Statistical Mechanics: Theory and Experiment.

[benson2016higher] Benson, Austin R, Gleich, David F, Leskovec, Jure. (2016). Higher-order organization of complex networks. Science.

[paranjape2017motifs] Paranjape, Ashwin, Benson, Austin R, Leskovec, Jure. (2017). Motifs in temporal networks. Proceedings of the tenth ACM international conference on web search and data mining.

[zitnik2019evolution] Zitnik, Marinka, Sosi{\v{c. (2019). Evolution of resilience in protein interactomes across the tree of life. Proceedings of the National Academy of Sciences.

[coleman1994foundations] Coleman, James S. (1994). Foundations of social theory.

[huang2015triadic] Huang, Hong, Tang, Jie, Liu, Lu, Luo, JarDer, Fu, Xiaoming. (2015). Triadic closure pattern analysis and prediction in social networks. IEEE Transactions on Knowledge and Data Engineering.

[kovanen2013temporal] Kovanen, Lauri, Kaski, Kimmo, Kert{'e. (2013). Temporal motifs reveal homophily, gender-specific patterns, and group talk in call sequences. Proceedings of the National Academy of Sciences.

[glymour2016causal] Glymour, Madelyn, Pearl, Judea, Jewell, Nicholas P. (2016). Causal inference in statistics: A primer.

[pearl2000models] Pearl, Judea, others. (2000). Models, reasoning and inference. Cambridge, UK: CambridgeUniversityPress.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, {\L. (2017). Attention is all you need. Advances in neural information processing systems.

[tian2006characterization] Tian, Jin, Kang, Changsung, Pearl, Judea. (2006). A characterization of interventional distributions in semi-Markovian causal models.

[brown1992survivorship] Brown, Stephen J, Goetzmann, William, Ibbotson, Roger G, Ross, Stephen A. (1992). Survivorship bias in performance studies. The Review of Financial Studies.

[berk1983introduction] Berk, Richard A. (1983). An introduction to sample selection bias in sociological data. American sociological review.

[simmel1950sociology] Simmel, Georg. (1950). The sociology of georg simmel.

[shen2021towards] Shen, Zheyan, Liu, Jiashuo, He, Yue, Zhang, Xingxuan, Xu, Renzhe, Yu, Han, Cui, Peng. (2021). Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624.

[nascimento2021dynamic] Nascimento, Diego C, Pimentel, Bruno A, Souza, Renata MCR, Costa, Lilia, Gon{\c{c. (2021). Dynamic graph in a symbolic data framework: An account of the causal relation using COVID-19 reports and some reflections on the financial world. Chaos, Solitons & Fractals.

[zhang2021dyngraphtrans] Zhang, Shilei, Suzumura, Toyotaro, Zhang, Li. (2021). DynGraphTrans: Dynamic Graph Embedding via Modified Universal Transformer Networks for Financial Transaction Data. 2021 IEEE International Conference on Smart Data Services (SMDS).

[berger2006framework] Berger-Wolf, Tanya Y, Saia, Jared. (2006). A framework for analysis of dynamic social networks. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.

[greene2010tracking] Greene, Derek, Doyle, Donal, Cunningham, Padraig. (2010). Tracking the evolution of communities in dynamic social networks. 2010 international conference on advances in social networks analysis and mining.

[peng2021dynamic] Peng, Hao, Du, Bowen, Liu, Mingsheng, Liu, Mingzhe, Ji, Shumei, Wang, Senzhang, Zhang, Xu, He, Lifang. (2021). Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning. Information Sciences.

[peng2020spatial] Peng, Hao, Wang, Hongfei, Du, Bowen, Bhuiyan, Md Zakirul Alam, Ma, Hongyuan, Liu, Jianwei, Wang, Lihong, Yang, Zeyu, Du, Linfeng, Wang, Senzhang, others. (2020). Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Information Sciences.

[wang2022causal] Wang, Wenjie, Lin, Xinyu, Feng, Fuli, He, Xiangnan, Lin, Min, Chua, Tat-Seng. (2022). Causal Representation Learning for Out-of-Distribution Recommendation. Proceedings of the ACM Web Conference 2022.

[jin2021community] Jin, Tian, Wu, Qiong, Ou, Xuan, Yu, Jianjun. (2021). Community detection and co-author recommendation in co-author networks. International Journal of Machine Learning and Cybernetics.

[ahuja2020empirical] Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, Kush R. Varshney. (2021). Empirical or Invariant Risk Minimization? {A. 9th International Conference on Learning Representations.

[huang2020graph] Huang, Kexin, Zitnik, Marinka. (2020). Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems.

[li2022out] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Zhu, Wenwu. (2022). Out-Of-Distribution Generalization on Graphs: A Survey. arXiv preprint.

[kingma2014adam] Diederik P. Kingma, Jimmy Ba. (2015). Adam: {A. 3rd International Conference on Learning Representations.

[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.

[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[Fey/Lenssen/2019] Fey, Matthias, Lenssen, Jan E.. (2019). Fast Graph Representation Learning with {PyTorch Geometric. ICLR Workshop on Representation Learning on Graphs and Manifolds.

[chang2020invariant] Chang, Shiyu, Zhang, Yang, Yu, Mo, Jaakkola, Tommi. (2020). Invariant rationalization. International Conference on Machine Learning.

[ahuja2020invariant] Ahuja, Kartik, Shanmugam, Karthikeyan, Varshney, Kush, Dhurandhar, Amit. (2020). Invariant risk minimization games. International Conference on Machine Learning.

[rosenfeld2020risks] Elan Rosenfeld, Pradeep Kumar Ravikumar, Andrej Risteski. (2021). The Risks of Invariant Risk Minimization. 9th International Conference on Learning Representations.

[mitrovic2020representation] Jovana Mitrovic, Brian McWilliams, Jacob C. Walker, Lars Holger Buesing, Charles Blundell. (2021). Representation Learning via Invariant Causal Mechanisms. 9th International Conference on Learning Representations.

[barrat2004architecture] Barrat, Alain, Barthelemy, Marc, Pastor-Satorras, Romualdo, Vespignani, Alessandro. (2004). The architecture of complex weighted networks. Proceedings of the national academy of sciences.

[Cho2014LearningPR] Kyunghyun Cho, Bart van Merrienboer, Çaglar G{. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, J{. (1997). Long short-term memory. Neural computation.

[qin2022graph] Qin, Yijian, Wang, Xin, Zhang, Ziwei, Xie, Pengtao, Zhu, Wenwu. (2022). Graph Neural Architecture Search Under Distribution Shifts. International Conference on Machine Learning.

[li2022ood] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Zhu, Wenwu. (2022). Ood-gnn: Out-of-distribution generalized graph neural network. IEEE Transactions on Knowledge and Data Engineering.

[zhang2022learning] Zeyang Zhang, Ziwei Zhang, Xin Wang, Wenwu Zhu. (2022). Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum. Thirty-Sixth {AAAI.

[bengio2013representation] Bengio, Yoshua, Courville, Aaron, Vincent, Pascal. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence.

[hsieh2018learning] Hsieh, Jun-Ting, Liu, Bingbin, Huang, De-An, Fei-Fei, Li F, Niebles, Juan Carlos. (2018). Learning to decompose and disentangle representations for video prediction. Advances in neural information processing systems.

[ma2018disentangled] Ma, Liqian, Sun, Qianru, Georgoulis, Stamatios, Van Gool, Luc, Schiele, Bernt, Fritz, Mario. (2018). Disentangled person image generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[ma2019disentangled] Ma, Jianxin, Cui, Peng, Kuang, Kun, Wang, Xin, Zhu, Wenwu. (2019). Disentangled graph convolutional networks. International conference on machine learning.

[yang2020factorizable] Yang, Yiding, Feng, Zunlei, Song, Mingli, Wang, Xinchao. (2020). Factorizable graph convolutional networks. Advances in Neural Information Processing Systems.

[wang2022disentangled] Wang, Xin, Chen, Hong, Zhou, Yuwei, Ma, Jianxin, Zhu, Wenwu. (2022). Disentangled Representation Learning for Recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[li2021disentangled] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Yuan, Zehuan, Li, Hang, Zhu, Wenwu. (2021). Disentangled contrastive learning on graphs. Advances in Neural Information Processing Systems.

[li2022disentangled] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2022). Disentangled Graph Contrastive Learning With Independence Promotion. IEEE Transactions on Knowledge and Data Engineering.

[chen2021curriculum] Chen, Hong, Chen, Yudong, Wang, Xin, Xie, Ruobing, Wang, Rui, Xia, Feng, Zhu, Wenwu. (2021). Curriculum Disentangled Recommendation with Noisy Multi-feedback. Advances in Neural Information Processing Systems.

[wang2021multimodal] Wang, Xin, Chen, Hong, Zhu, Wenwu. (2021). Multimodal disentangled representation for recommendation. 2021 IEEE International Conference on Multimedia and Expo (ICME).

[ma2020disentangled] Ma, Jianxin, Zhou, Chang, Yang, Hongxia, Cui, Peng, Wang, Xin, Zhu, Wenwu. (2020). Disentangled self-supervision in sequential recommenders. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[ma2019learning] Ma, Jianxin, Zhou, Chang, Cui, Peng, Yang, Hongxia, Zhu, Wenwu. (2019). Learning disentangled representations for recommendation. Advances in neural information processing systems.

[wang2020disenhan] Wang, Yifan, Tang, Suyao, Lei, Yuntong, Song, Weiping, Wang, Sheng, Zhang, Ming. (2020). Disenhan: Disentangled heterogeneous graph attention network for recommendation. Proceedings of the 29th ACM International Conference on Information & Knowledge Management.

[chen2016infogan] Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, Abbeel, Pieter. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems.

[denton2017unsupervised] Denton, Emily L, others. (2017). Unsupervised learning of disentangled representations from video. Advances in neural information processing systems.

[tran2017disentangled] Tran, Luan, Yin, Xi, Liu, Xiaoming. (2017). Disentangled representation learning gan for pose-invariant face recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.

[liu2020independence] Liu, Yanbei, Wang, Xiao, Wu, Shu, Xiao, Zhitao. (2020). Independence promoted graph disentangled networks. Proceedings of the AAAI Conference on Artificial Intelligence.

[zhang2021disentangled] Zhang, Wenbin, Zhang, Liming, Pfoser, Dieter, Zhao, Liang. (2021). Disentangled dynamic graph deep generation. Proceedings of the 2021 SIAM International Conference on Data Mining (SDM).

[du2022disentangled] Yuanqi Du, Xiaojie Guo, Hengning Cao, Yanfang Ye, Liang Zhao. (2022). Disentangled Spatiotemporal Graph Generative Models. Thirty-Sixth {AAAI.

[chen2022invariance] Chen, Yongqiang, Zhang, Yonggang, Yang, Han, Ma, Kaili, Xie, Binghui, Liu, Tongliang, Han, Bo, Cheng, James. (2022). Invariance Principle Meets Out-of-Distribution Generalization on Graphs. arXiv preprint.

[fan2021generalizing] Fan, Shaohua, Wang, Xiao, Shi, Chuan, Cui, Peng, Wang, Bai. (2021). Generalizing Graph Neural Networks on Out-Of-Distribution Graphs. arXiv preprint arXiv:2111.10657.

[chang2020continuous] Chang, Xiaofu, Liu, Xuqin, Wen, Jianfeng, Li, Shuang, Fang, Yanming, Song, Le, Qi, Yuan. (2020). Continuous-time dynamic graph learning via neural interaction processes. Proceedings of the 29th ACM International Conference on Information & Knowledge Management.

[huang2021coupled] Huang, Zijie, Sun, Yizhou, Wang, Wei. (2021). Coupled Graph ODE for Learning Interacting System Dynamics.. KDD.

[li2021intention] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Ma, Jianxin, Cui, Peng, Zhu, Wenwu. (2021). Intention-aware sequential recommendation with structured intent transition. IEEE Transactions on Knowledge and Data Engineering.

[cai2021structural] Cai, Lei, Chen, Zhengzhang, Luo, Chen, Gui, Jiaping, Ni, Jingchao, Li, Ding, Chen, Haifeng. (2021). Structural temporal graph neural networks for anomaly detection in dynamic graphs. Proceedings of the 30th ACM international conference on Information & Knowledge Management.

[deng2020dynamic] Deng, Songgaojun, Rangwala, Huzefa, Ning, Yue. (2020). Dynamic knowledge graph based multi-event forecasting. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[yao2021interpretable] Yao, Yuhang, Joe-Wong, Carlee. (2021). Interpretable clustering on dynamic graphs with recurrent graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence.

[you2019hierarchical] You, Jiaxuan, Wang, Yichen, Pal, Aditya, Eksombatchai, Pong, Rosenburg, Chuck, Leskovec, Jure. (2019). Hierarchical temporal convolutional networks for dynamic recommender systems. The world wide web conference.

[wang2021tedic] Wang, Yanbang, Li, Pan, Bai, Chongyang, Leskovec, Jure. (2021). TEDIC: Neural modeling of behavioral patterns in dynamic social interaction networks. Proceedings of the Web Conference 2021.

[wu2020temp] Jiapeng Wu, Meng Cao, Jackie Chi Kit Cheung, William L. Hamilton. (2020). TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, {EMNLP.

[li2019fates] Li, Haoyang, Cui, Peng, Zang, Chengxi, Zhang, Tianyang, Zhu, Wenwu, Lin, Yishi. (2019). Fates of Microscopic Social Ecosystems: Keep Alive or Dead?. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[yao2022wildtime] Huaxiu Yao, Caroline Choi, Yoonho Lee, Pang Wei Koh, Chelsea Finn. (2022). Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[yao2022improving] Yao, Huaxiu, Wang, Yu, Li, Sai, Zhang, Linjun, Liang, Weixin, Zou, James, Finn, Chelsea. (2022). Improving Out-of-Distribution Robustness via Selective Augmentation. Proceeding of the Thirty-ninth International Conference on Machine Learning.

[li2022gil] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2022). Learning Invariant Graph Representations for Out-of-Distribution Generalization. Thirty-Sixth Conference on Neural Information Processing Systems.

[zhang2022dynamic] Zhang, Zeyang, Wang, Xin, Zhang, Ziwei, Li, Haoyang, Qin, Zhou, Zhu, Wenwu. (2022). Dynamic graph neural networks under spatio-temporal distribution shift. Advances in Neural Information Processing Systems.

[hu2020open] Hu, Weihua, Fey, Matthias, Zitnik, Marinka, Dong, Yuxiao, Ren, Hongyu, Liu, Bowen, Catasta, Michele, Leskovec, Jure. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems.

[Tang:08KDD] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, Zhong Su. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. KDD'08.

[sinha2015overview] Sinha, Arnab, Shen, Zhihong, Song, Yang, Ma, Hao, Eide, Darrin, Hsu, Bo-june Paul, Wang, Kuansan. (2015). An overview of microsoft academic service (mas) and applications. Proceedings of the 24th international conference on world wide web.

[wang2020microsoft] Wang, Kuansan, Shen, Zhihong, Huang, Chiyuan, Wu, Chieh-Han, Dong, Yuxiao, Kanakia, Anshul. (2020). Microsoft academic graph: When experts are not enough. Quantitative Science Studies.

[mikolov2013distributed] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, Dean, Jeff. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.

[kipf2016semi] Thomas N. Kipf, Max Welling. (2017). Semi-Supervised Classification with Graph Convolutional Networks. 5th International Conference on Learning Representations.

[velivckovicgraph] Veli{\v{c. Graph Attention Networks. International Conference on Learning Representations.

[9847099] Bevilacqua, Beatrice, Zhou, Yangze, Ribeiro, Bruno. (2021). Size-invariant graph representations for graph classification extrapolations. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2021.3086060.

[han2022g] Han, Xiaotian, Jiang, Zhimeng, Liu, Ninghao, Hu, Xia. (2022). G-mixup: Graph data augmentation for graph classification. International Conference on Machine Learning.

[9780235] Wang, Shihao, Yu, Zhiding, Jiang, Xiaohui, Lan, Shiyi, Shi, Min, Chang, Nadine, Kautz, Jan, Li, Ying, Alvarez, Jose M. (2024). Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533. doi:10.1109/TPAMI.2021.3122444.

[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[packer2023memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir_G, Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph_E. (2023). MemGPT: Towards LLMs as Operating Systems..

[hu2023chatdb] Hu, Chenxu, Fu, Jie, Du, Chenzhuang, Luo, Simian, Zhao, Junbo, Zhao, Hang. (2023). Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901.

[lu2023memochat] Lu, Junru, An, Siyu, Lin, Mingbao, Pergola, Gabriele, He, Yulan, Yin, Di, Sun, Xing, Wu, Yunsheng. (2023). Memochat: Tuning llms to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.

[li2023metaagents] Li, Yuan, Zhang, Yixuan, Sun, Lichao. (2023). Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500.

[gao2023s3] Gao, Chen, Lan, Xiaochong, Lu, Zhihong, Mao, Jinzhu, Piao, Jinghua, Wang, Huandong, Jin, Depeng, Li, Yong. (2023). S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984.

[wang2023recmind] Wang, Yancheng, Jiang, Ziyan, Chen, Zheng, Yang, Fan, Zhou, Yingxue, Cho, Eunah, Fan, Xing, Huang, Xiaojiang, Lu, Yanbin, Yang, Yingzhen. (2023). Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296.

[zhao2024expel] Zhao, Andrew, Huang, Daniel, Xu, Quentin, Lin, Matthieu, Liu, Yong-Jin, Huang, Gao. (2024). Expel: Llm agents are experiential learners. Proceedings of the AAAI Conference on Artificial Intelligence.

[modarressi2024memllm] Modarressi, Ali, K{. (2024). Memllm: Finetuning llms to use an explicit read-write memory. arXiv preprint arXiv:2404.11672.

[chen2024driving] Chen, Long, Sinavski, Oleg, H{. (2024). Driving with llms: Fusing object-level vector modality for explainable autonomous driving. 2024 IEEE International Conference on Robotics and Automation (ICRA).

[sun2024optimizing] Sun, Yuan, Salami Pargoo, Navid, Jin, Peter, Ortiz, Jorge. (2024). Optimizing autonomous driving for safety: A human-centric approach with llm-enhanced rlhf. Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing.

[yang2024embodied] Yang, Yijun, Zhou, Tianyi, Li, Kanxue, Tao, Dapeng, Li, Lusong, Shen, Li, He, Xiaodong, Jiang, Jing, Shi, Yuhui. (2024). Embodied multi-modal agent trained by an llm from a parallel textworld. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[li2024embodied] Li, Manling, Zhao, Shiyu, Wang, Qineng, Wang, Kangrui, Zhou, Yu, Srivastava, Sanjana, Gokmen, Cem, Lee, Tony, Li, Erran Li, Zhang, Ruohan, others. (2024). Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems.

[zheng2023steve] Zheng, Sipeng, Liu, Jiazheng, Feng, Yicheng, Lu, Zongqing. (2023). Steve-eye: Equipping llm-based embodied agents with visual perception in open worlds. arXiv preprint arXiv:2310.13255.

[wei2024editable] Wei, Yuxi, Wang, Zi, Lu, Yifan, Xu, Chenxin, Liu, Changxing, Zhao, Hao, Chen, Siheng, Wang, Yanfeng. (2024). Editable scene simulation for autonomous driving via collaborative llm-agents. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[liu2021heterogeneous] Liu, Jiashuo, Hu, Zheyuan, Cui, Peng, Li, Bo, Shen, Zheyan. (2021). Heterogeneous risk minimization. International Conference on Machine Learning.

[yue2025masrouter] Yue, Yanwei, Zhang, Guibin, Liu, Boyang, Wan, Guancheng, Wang, Kun, Cheng, Dawei, Qi, Yiyan. (2025). Masrouter: Learning to route llms for multi-agent systems. arXiv preprint arXiv:2502.11133.

[wang2024battleagentbench] Wang, Wei, Zhang, Dan, Feng, Tao, Wang, Boyan, Tang, Jie. (2024). Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971.

[10.1145/3604427] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2023). Invariant Node Representation Learning under Distribution Shifts with Multiple Latent Environments. ACM Transactions on Information Systems (TOIS). doi:10.1145/3604427.

[cai2023user] Cai, Desheng, Qian, Shengsheng, Fang, Quan, Hu, Jun, Xu, Changsheng. (2023). User cold-start recommendation via inductive heterogeneous graph neural network. ACM Transactions on Information Systems (TOIS).

[chen2020neural] Chen, Xu, Xiong, Kun, Zhang, Yongfeng, Xia, Long, Yin, Dawei, Huang, Jimmy Xiangji. (2020). Neural feature-aware recommendation with signed hypergraph convolutional network. ACM Transactions on Information Systems (TOIS).

[huang2023position] Huang, Liwei, Ma, Yutao, Liu, Yanbo, Danny Du, Bohong, Wang, Shuliang, Li, Deyi. (2023). Position-enhanced and time-aware graph convolutional network for sequential recommendations. ACM Transactions on Information Systems (TOIS).

[ma2023kr] Ma, Ting, Huang, Longtao, Lu, Qianqian, Hu, Songlin. (2023). Kr-gcn: Knowledge-aware reasoning with graph convolution network for explainable recommendation. ACM Transactions on Information Systems (TOIS).

[yang2021hgat] Yang, Tianchi, Hu, Linmei, Shi, Chuan, Ji, Houye, Li, Xiaoli, Nie, Liqiang. (2021). HGAT: Heterogeneous graph attention networks for semi-supervised short text classification. ACM Transactions on Information Systems (TOIS).

[zhang2022efraudcom] Zhang, Ge, Li, Zhao, Huang, Jiaming, Wu, Jia, Zhou, Chuan, Yang, Jian, Gao, Jianliang. (2022). efraudcom: An e-commerce fraud detection system via competitive graph neural networks. ACM Transactions on Information Systems (TOIS).

[zitnik2018modeling] Zitnik, Marinka, Agrawal, Monica, Leskovec, Jure. (2018). Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics.

[li2022graph] Li, Michelle M, Huang, Kexin, Zitnik, Marinka. (2022). Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering.

[li2023preference] Li, Yakun, Hou, Lei, Li, Juanzi. (2023). Preference-aware Graph Attention Networks for Cross-Domain Recommendations with Collaborative Knowledge Graph. ACM Transactions on Information Systems (TOIS).

[xie2021graph] Xie, Qianqian, Zhu, Yutao, Huang, Jimin, Du, Pan, Nie, Jian-Yun. (2021). Graph neural collaborative topic model for citation recommendation. ACM Transactions on Information Systems (TOIS).

[wang2021combining] Wang, Hongwei, Leskovec, Jure. (2021). Combining graph convolutional neural networks and label propagation. ACM Transactions on Information Systems (TOIS).

[bi2023predicting] Bi, Wendong, Xu, Bingbing, Sun, Xiaoqian, Xu, Li, Shen, Huawei, Cheng, Xueqi. (2023). Predicting the silent majority on graphs: Knowledge transferable graph neural network. Proceedings of the ACM Web Conference 2023.

[wu2020dynamic] Wu, Junshuang, Zhang, Richong, Mao, Yongyi, Guo, Hongyu, Soflaei, Masoumeh, Huai, Jinpeng. (2020). Dynamic graph convolutional networks for entity linking. Proceedings of The ACM Web Conference 2020.

[taheri2019learning] Taheri, Aynaz, Gimpel, Kevin, Berger-Wolf, Tanya. (2019). Learning to represent the evolution of dynamic graphs with recurrent models. Proceedings of the ACM Web Conference 2019.

[bai2023hgwavenet] Bai, Qijie, Nie, Changli, Zhang, Haiwei, Zhao, Dongming, Yuan, Xiaojie. (2023). HGWaveNet: A Hyperbolic Graph Neural Network for Temporal Link Prediction. Proceedings of the ACM Web Conference 2023.

[liu2022confidence] Liu, Hongrui, Hu, Binbin, Wang, Xiao, Shi, Chuan, Zhang, Zhiqiang, Zhou, Jun. (2022). Confidence may cheat: Self-training on graph neural networks under distribution shift. Proceedings of the ACM Web Conference 2022.

[tang2023dynamic] Tang, Haoran, Wu, Shiqing, Xu, Guandong, Li, Qing. (2023). Dynamic Graph Evolution Learning for Recommendation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[fu2021sdg] Fu, Dongqi, He, Jingrui. (2021). Sdg: A simplified and dynamic graph neural network. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[wang2022disenctr] Wang, Yifan, Qin, Yifang, Sun, Fang, Zhang, Bo, Hou, Xuyang, Hu, Ke, Cheng, Jia, Lei, Jun, Zhang, Ming. (2022). DisenCTR: Dynamic graph-based disentangled representation for click-through rate prediction. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[zhao2023time] Zhao, Ziwei, Zhu, Xi, Xu, Tong, Lizhiyu, Aakas, Yu, Yu, Li, Xueying, Yin, Zikai, Chen, Enhong. (2023). Time-interval Aware Share Recommendation via Bi-directional Continuous Time Dynamic Graphs. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[yang2023generic] Yang, Zhengyi, He, Xiangnan, Zhang, Jizhi, Wu, Jiancan, Xin, Xin, Chen, Jiawei, Wang, Xiang. (2023). A Generic Learning Framework for Sequential Recommendation with Distribution Shifts. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[gao2023alleviating] Gao, Yuan, Wang, Xiang, He, Xiangnan, Liu, Zhenguang, Feng, Huamin, Zhang, Yongdong. (2023). Alleviating structural distribution shift in graph anomaly detection. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.

[liu2023good] Liu, Yixin, Ding, Kaize, Liu, Huan, Pan, Shirui. (2023). Good-d: On unsupervised graph out-of-distribution detection. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.

[yang2023interpretable] Yang, Qiang, Ma, Changsheng, Zhang, Qiannan, Gao, Xin, Zhang, Chuxu, Zhang, Xiangliang. (2023). Interpretable Research Interest Shift Detection with Temporal Heterogeneous Graphs. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.

[wang2023tutorial] Wang, Jindong, Li, Haoliang, Pan, Sinno, Xie, Xing. (2023). A Tutorial on Domain Generalization. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.

[chen2022learning] Chen, Cen, Ye, Tiandi, Wang, Li, Gao, Ming. (2022). Learning to generalize in heterogeneous federated networks. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.

[wang2022adagcl] Wang, Yili, Zhou, Kaixiong, Miao, Rui, Liu, Ninghao, Wang, Xin. (2022). AdaGCL: Adaptive Subgraph Contrastive Learning to Generalize Large-scale Graph Training. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.

[wang2022imbalanced] Wang, Yu, Zhao, Yuying, Shah, Neil, Derr, Tyler. (2022). Imbalanced graph classification via graph-of-graph neural networks. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.

[wang2019heterogeneous] Wang, Xiao, Ji, Houye, Shi, Chuan, Wang, Bai, Ye, Yanfang, Cui, Peng, Yu, Philip S. (2019). Heterogeneous graph attention network. The world wide web conference.

[wang2017community] Wang, Xiao, Cui, Peng, Wang, Jing, Pei, Jian, Zhu, Wenwu, Yang, Shiqiang. (2017). Community preserving network embedding. Proceedings of the AAAI conference on artificial intelligence.

[zhou2020graph] Zhou, Jie, Cui, Ganqu, Hu, Shengding, Zhang, Zhengyan, Yang, Cheng, Liu, Zhiyuan, Wang, Lifeng, Li, Changcheng, Sun, Maosong. (2020). Graph neural networks: A review of methods and applications. AI open.

[wu2020comprehensive] Wu, Zonghan, Pan, Shirui, Chen, Fengwen, Long, Guodong, Zhang, Chengqi, Philip, S Yu. (2020). A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems.

[xu2018powerful] Xu, Keyulu, Hu, Weihua, Leskovec, Jure, Jegelka, Stefanie. (2018). How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826.

[zhang2023graph] Zhang, Ziwei, Li, Haoyang, Zhang, Zeyang, Qin, Yijian, Wang, Xin, Zhu, Wenwu. (2023). Graph meets llms: Towards large graph models. NeurIPS 2023 Workshop: New Frontiers in Graph Learning.

[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards {AI. Large Scale Kernel Machines.

[debate2-thu] Liang, Tian, He, Zhiwei, Jiao, Wenxiang, Wang, Xing, Wang, Yan, Wang, Rui, Yang, Yujiu, Tu, Zhaopeng, Shi, Shuming. (2023). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.

[PHPrompting] Zheng, Chuanyang, Liu, Zhengying, Xie, Enze, Li, Zhenguo, Li, Yu. (2023). Progressive-Hint Prompting Improves Reasoning in Large Language Models.

[blender] Jiang, Dongfu, Ren, Xiang, Lin, Bill Yuchen. (2023). {LLM. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[bondy1976graph] Bondy, John Adrian, Murty, Uppaluri Siva Ramachandra, others. (1976). Graph theory with applications.

[wang2023selfconsistency] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.

[spielman2008graph] Spielman, Daniel A, Srivastava, Nikhil. (2008). Graph sparsification by effective resistances. Proceedings of the fortieth annual ACM symposium on Theory of computing.

[chen2021unified] Chen, Tianlong, Sui, Yongduo, Chen, Xuxi, Zhang, Aston, Wang, Zhangyang. (2021). A unified lottery ticket hypothesis for graph neural networks. International conference on machine learning.

[zhang2024graph] Zhang, Guibin, Wang, Kun, Huang, Wei, Yue, Yanwei, Wang, Yang, Zimmermann, Roger, Zhou, Aojun, Cheng, Dawei, Zeng, Jin, Liang, Yuxuan. (2024). Graph lottery ticket automated. The Twelfth International Conference on Learning Representations.

[achille2018critical] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2018). Critical learning periods in deep networks. International Conference on Learning Representations.

[entezari2020all] Entezari, Negin, Al-Sayouri, Saba A, Darvishzadeh, Amirali, Papalexakis, Evangelos E. (2020). All you need is low (rank) defending against adversarial attacks on graphs. Proceedings of the 13th international conference on web search and data mining.

[ennadir2024simple] Ennadir, Sofiane, Abbahaddou, Yassine, Lutzeyer, Johannes F, Vazirgiannis, Michalis, Bostr{. (2024). A Simple and Yet Fairly Effective Defense for Graph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence.

[alchihabi2023efficient] Alchihabi, Abdullah, En, Qing, Guo, Yuhong. (2023). Efficient Low-Rank GNN Defense Against Structural Attacks. 2023 IEEE International Conference on Knowledge Graph (ICKG).

[you2019drawing] You, Haoran, Li, Chaojian, Xu, Pengfei, Fu, Yonggan, Wang, Yue, Chen, Xiaohan, Baraniuk, Richard G, Wang, Zhangyang, Lin, Yingyan. (2019). Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957.

[zhang2024two] Zhang, Guibin, Yue, Yanwei, Wang, Kun, Fang, Junfeng, Sui, Yongduo, Wang, Kai, Liang, Yuxuan, Cheng, Dawei, Pan, Shirui, Chen, Tianlong. (2024). Two heads are better than one: Boosting graph sparse training via semantic and topological awareness. arXiv preprint arXiv:2402.01242.

[williams1992simple] Williams, Ronald J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning.

[li2023api] Li, Minghao, Zhao, Yingxiu, Yu, Bowen, Song, Feifan, Li, Hangyu, Yu, Haiyang, Li, Zhoujun, Huang, Fei, Li, Yongbin. (2023). Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.

[wang2023brave] Wang, Kun, Liang, Yuxuan, Li, Xinglin, Li, Guohao, Ghanem, Bernard, Zimmermann, Roger, Yi, Huahui, Zhang, Yudong, Wang, Yang, others. (2023). Brave the wind and the waves: Discovering robust and generalizable graph lottery tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[chen2023demystifying] Chen, Yuhan, Ye, Haojie, Vedula, Sanketh, Bronstein, Alex, Dreslinski, Ronald, Mudge, Trevor, Talati, Nishil. (2023). Demystifying graph sparsification algorithms in graph properties preservation. Proceedings of the VLDB Endowment.

[augmented-lm-survey] Mialon, Grégoire, others. (2023). Augmented Language Models: a Survey. arXiv e-prints.

[audio-gpt] Huang, Rongjie, Li, Mingze, Yang, Dongchao, Shi, Jiatong, Chang, Xuankai, Ye, Zhenhui, Wu, Yuning, Hong, Zhiqing, Huang, Jiawei, Liu, Jinglin, Ren, Yi, Zhao, Zhou, Watanabe, Shinji. (2023). AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

[visual-gpt] Chen, Jun, Guo, Han, Yi, Kai, Li, Boyang, Elhoseiny, Mohamed. (2021). VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning.

[neural-sequence] Dabagia, Max, others. (2023). Computation with Sequences in the Brain. arXiv e-prints.

[khattab2023dspy] Khattab, Omar, Singhvi, Arnav, Maheshwari, Paridhi, Zhang, Zhiyuan, Santhanam, Keshav, Vardhamanan, Sri, Haq, Saiful, Sharma, Ashutosh, Joshi, Thomas T, Moazam, Hanna, others. (2023). Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.

[RUC2024agent-memory] Zhang, Zeyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Dai, Quanyu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.

[yang2023investlm] Yang, Yi, Tang, Yixuan, Tam, Kar Yan. (2023). Investlm: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064.

[li2024survey-mas] Li, Xinyi, Wang, Sai, Zeng, Siqi, Wu, Yu, Yang, Yi. (2024). A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth.

[zhu2023ghost] Zhu, Xizhou, Chen, Yuntao, Tian, Hao, Tao, Chenxin, Su, Weijie, Yang, Chenyu, Huang, Gao, Li, Bin, Lu, Lewei, Wang, Xiaogang, others. (2023). Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.

[zheng2023synapse] Zheng, Longtao, Wang, Rundong, Wang, Xinrun, An, Bo. (2023). Synapse: Trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863.

[zelikman2023self] Zelikman, Eric, Lorch, Eliana, Mackey, Lester, Kalai, Adam Tauman. (2023). Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint arXiv:2310.02304.

[zhang2025maas] Zhang, Guibin, Niu, Luyang, Fang, Junfeng, Wang, Kun, Bai, Lei, Wang, Xiang. (2025). Multi-agent Architecture Search via Agentic Supernet. arXiv preprint arXiv:2502.04180.

[yuan2024evoagent] Yuan, Siyu, Song, Kaitao, Chen, Jiangjie, Tan, Xu, Li, Dongsheng, Yang, Deqing. (2024). EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms. arXiv preprint arXiv:2406.14228.

[hu2024adas] Hu, Shengran, Lu, Cong, Clune, Jeff. (2024). Automated design of agentic systems. arXiv preprint arXiv:2408.08435.

[shang2024agentsquare] Shang, Yu, Li, Yu, Zhao, Keyu, Ma, Likai, Liu, Jiahe, Xu, Fengli, Li, Yong. (2024). AgentSquare: Automatic LLM Agent Search in Modular Design Space. arXiv preprint arXiv:2410.06153.

[liu2024evaluating] Liu, Jiawei, Xie, Songrun, Wang, Junhao, Wei, Yuxiang, Ding, Yifeng, Zhang, Lingming. (2024). Evaluating Language Models for Efficient Code Generation. arXiv preprint arXiv:2408.06450.

[ling2017program] Ling, Wang, Yogatama, Dani, Dyer, Chris, Blunsom, Phil. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

[software-dev] Qian, Chen, Cong, Xin, Yang, Cheng, Chen, Weize, Su, Yusheng, Xu, Juyuan, Liu, Zhiyuan, Sun, Maosong. (2023). Communicative Agents for Software Development.

[debate3-multi-models] Xiong, Kai, Ding, Xiao, Cao, Yixin, Liu, Ting, Qin, Bing. (2023). Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate.

[chen2024compundLLM] Chen, Lingjiao, Davis, Jared Quincy, Hanin, Boris, Bailis, Peter, Stoica, Ion, Zaharia, Matei, Zou, James. (2024). Are more llm calls all you need? towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419.

[piatti2024cooperate] Piatti, Giorgio, Jin, Zhijing, Kleiman-Weiner, Max, Schölkopf, Bernhard, others. (2024). Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents. arXiv preprint arXiv:2404.16698.

[zhang2006introduction] Zhang, Ping, Chartrand, Gary. (2006). Introduction to graph theory. Tata McGraw-Hill.

[ma2023laser] Ma, Kaixin, Zhang, Hongming, Wang, Hongwei, Pan, Xiaoman, Yu, Wenhao, Yu, Dong. (2023). Laser: Llm agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172.

[bouzenia2024repairagent] Bouzenia, Islem, Devanbu, Premkumar, Pradel, Michael. (2024). Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134.

[thorne2018fever] Thorne, James, Vlachos, Andreas, Christodoulopoulos, Christos, Mittal, Arpit. (2018). FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355.

[yang2018hotpotqa] Yang, Zhilin, Qi, Peng, Zhang, Saizheng, Bengio, Yoshua, Cohen, William W, Salakhutdinov, Ruslan, Manning, Christopher D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

[zhou2025reso] Zhou, Heng, Geng, Hejia, Xue, Xiangyuan, Yin, Zhenfei, Bai, Lei. (2025). ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks. arXiv preprint arXiv:2503.02390.

[zhao2025sirius] Zhao, Wanjia, Yuksekgonul, Mert, Wu, Shirley, Zou, James. (2025). SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning. arXiv preprint arXiv:2502.04780.

[ishibashi2024selforganize-mother] Ishibashi, Yoichi, Nishimura, Yoshimasa. (2024). Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization. arXiv preprint arXiv:2404.02183.

[BOOK_1991organizational_memory] Walsh, James P, Ungson, Gerardo Rivera. (1991). Organizational memory. Academy of management review.

[mmlu] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. (2021). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

[nie2025weak4strong] Nie, Fan, Feng, Lan, Ye, Haotian, Liang, Weixin, Lu, Pan, Yao, Huaxiu, Alahi, Alexandre, Zou, James. (2025). Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors. arXiv preprint arXiv:2504.04785.

[lu2024morphagent] Lu, Siyuan, Shao, Jiaqi, Luo, Bing, Lin, Tao. (2024). MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration. arXiv preprint arXiv:2410.15048.

[wang2025agentdropout] Wang, Zhexuan, Wang, Yutong, Liu, Xuebo, Ding, Liang, Zhang, Miao, Liu, Jie, Zhang, Min. (2025). AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration. arXiv preprint arXiv:2503.18891.

[chen2023evoprompting] Chen, Angelica, Dohan, David, So, David. (2023). Evoprompting: Language models for code-level neural architecture search. Advances in neural information processing systems.

[sachdev2024evolutionary] Sachdev, Rithik, Wang, Zhong-Qiu, Yang, Chao-Han Huck. (2024). Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction. arXiv preprint arXiv:2407.16370.

[ethayarajh2024kto] Ethayarajh, Kawin, Xu, Winnie, Muennighoff, Niklas, Jurafsky, Dan, Kiela, Douwe. (2024). Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

[hendrycks2021ethics] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt. (2021). Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR).

[subramaniam2025multiagent-ft] Subramaniam, Vighnesh, Du, Yilun, Tenenbaum, Joshua B, Torralba, Antonio, Li, Shuang, Mordatch, Igor. (2025). Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint arXiv:2501.05707.

[park2025maporl] Park, Chanwoo, Han, Seungju, Guo, Xingzhi, Ozdaglar, Asuman, Zhang, Kaiqing, Kim, Joo-Kyung. (2025). Maporl: Multi-agent post-co-training for collaborative large language models with reinforcement learning. arXiv preprint arXiv:2502.18439.

[hendrycksmath2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.

[generative-agents-simulacra] Park, Joon Sung, O'Brien, Joseph C., Cai, Carrie J., Ringel Morris, Meredith, Liang, Percy, Bernstein, Michael S.. (2023). Generative Agents: Interactive Simulacra of Human Behavior.

[zhang2024classroom] Zhang, Zheyuan, Zhang-Li, Daniel, Yu, Jifan, Gong, Linlu, Zhou, Jinchang, Liu, Zhiyuan, Hou, Lei, Li, Juanzi. (2024). Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226.

[zhao2023competeai] Zhao, Qinlin, Wang, Jindong, Zhang, Yixuan, Jin, Yiqiao, Zhu, Kaijie, Chen, Hao, Xie, Xing. (2023). Competeai: Understanding the competition behaviors in large language model-based agents. arXiv preprint arXiv:2310.17512.

[multi-persona] Wang, Zhenhailong, Mao, Shaoguang, Wu, Wenshan, Ge, Tao, Wei, Furu, Ji, Heng. (2023). Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration.

[bargaining-feedback] Fu, Yao, Peng, Hao, Khot, Tushar, Lapata, Mirella. (2023). Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback.

[sot] Ning, Xuefei, Lin, Zinan, Zhou, Zixuan, Yang, Huazhong, Wang, Yu. (2023). Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding.

[hua2023war] Hua, Wenyue, Fan, Lizhou, Li, Lingyao, Mei, Kai, Ji, Jianchao, Ge, Yingqiang, Hemphill, Libby, Zhang, Yongfeng. (2023). War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227.

[chen2023gamegpt] Chen, Dake, Wang, Hanbin, Huo, Yunhao, Li, Yuzhao, Zhang, Haoyang. (2023). Gamegpt: Multi-agent collaborative framework for game development. arXiv preprint arXiv:2310.08067.

[cohen2023lm] Cohen, Roi, Hamri, May, Geva, Mor, Globerson, Amir. (2023). Lm vs lm: Detecting factual errors via cross examination. arXiv preprint arXiv:2305.13281.

[self-correction] Pan, Liangming, Saxon, Michael, Xu, Wenda, Nathani, Deepak, Wang, Xinyi, Wang, William Yang. (2023). Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies.

[qian2024scaling] Qian, Chen, Xie, Zihao, Wang, Yifei, Liu, Wei, Dang, Yufan, Du, Zhuoyun, Chen, Weize, Yang, Cheng, Liu, Zhiyuan, Sun, Maosong. (2024). Scaling Large-Language-Model-based Multi-Agent Collaboration. arXiv preprint arXiv:2406.07155.

[zhuge2024gptswarm] Zhuge, Mingchen, Wang, Wenyi, Kirsch, Louis, Faccio, Francesco, Khizbullin, Dmitrii, Schmidhuber, Jürgen. (2024). GPTSwarm: Language Agents as Optimizable Graphs. Forty-first International Conference on Machine Learning.

[openai2023gpt4] OpenAI. (2023). GPT-4 Technical Report.

[coalitional-game] Saad, Walid, Han, Zhu, Debbah, Merouane, Hjorungnes, Are, Basar, Tamer. (2009). Coalitional game theory for communication networks. IEEE Signal Processing Magazine.

[held-yang-2023-shapley] Held, William, Yang, Diyi. (2023). Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.

[imp-score] Chen, Mark, Tworek, Jerry, Jun, Heewoo, Yuan, Qiming, Ponde de Oliveira Pinto, Henrique, Kaplan, Jared, Edwards, Harri, Burda, Yuri, Joseph, Nicholas, Brockman, Greg, Ray, Alex, Puri, Raul, Krueger, Gretchen, Petrov, Michael, Khlaaf, Heidy, Sastry, Girish, Mishkin, Pamela, Chan, Brooke, Gray, Scott, Ryder, Nick, Pavlov, Mikhail, Power, Alethea, Kaiser, Lukasz, Bavarian, Mohammad, Winter, Clemens, Tillet, Philippe, Petroski Such, Felipe, Cummings, Dave, Plappert, Matthias, Chantzis, Fotios, Barnes, Elizabeth, Herbert-Voss, Ariel, Hebgen Guss, William, Nichol, Alex, Paino, Alex, Tezak, Nikolas, Tang, Jie, Babuschkin, Igor, Balaji, Suchir, Jain, Shantanu, Saunders, William, Hesse, Christopher, Carr, Andrew N., Leike, Jan, Achiam, Josh, Misra, Vedant, Morikawa, Evan, Radford, Alec, Knight, Matthew, Brundage, Miles, Murati, Mira, Mayer, Katie, Welinder, Peter, McGrew, Bob, Amodei, Dario, McCandlish, Sam, Sutskever, Ilya, Zaremba, Wojciech. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[minecraft-agent] Zhu, Xizhou, others. (2023). Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory. arXiv e-prints.

[meta-gpt] Hong, Sirui, Zheng, Xiawu, Chen, Jonathan, Cheng, Yuheng, Wang, Jinlin, Zhang, Ceyao, Wang, Zili, Yau, Steven Ka Shing, Lin, Zijuan, Zhou, Liyang, Ran, Chenyu, Xiao, Lingfeng, Wu, Chenglin. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework.

[self-debug] Olausson, Theo X., Priya Inala, Jeevana, Wang, Chenglong, Gao, Jianfeng, Solar-Lezama, Armando. (2023). Demystifying GPT Self-Repair for Code Generation.

[zhang2024g-designer] Zhang, Guibin, Yue, Yanwei, Sun, Xiangguo, Wan, Guancheng, Yu, Miao, Fang, Junfeng, Wang, Kun, Chen, Tianlong, Cheng, Dawei. (2024). G-designer: Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782.

[chen2023codet] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen. (2023). CodeT: Code Generation with Generated Tests. The Eleventh International Conference on Learning Representations.

[self-evaluation-decode] Xie, Yuxi, others. (2023). Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. arXiv e-prints.

[llm-overconfident] Xiong, Miao, others. (2023). Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. arXiv e-prints.

[llm-as-a-judge] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric. P, Zhang, Hao, Gonzalez, Joseph E., Stoica, Ion. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.

[sliding-window] Qin, Zhen, Jagerman, Rolf, Hui, Kai, Zhuang, Honglei, Wu, Junru, Shen, Jiaming, Liu, Tianqi, Liu, Jialu, Metzler, Donald, Wang, Xuanhui, Bendersky, Michael. (2023). Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.

[SHAP] Lundberg, Scott M., Lee, Su-In. (2017). A Unified Approach to Interpreting Model Predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems.

[shapley-origin] Stan Lipovetsky, Michael Conklin. (2001). Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry.

[got] Besta, Maciej, Blach, Nils, Kubicek, Ales, Gerstenberger, Robert, Gianinazzi, Lukas, Gajda, Joanna, Lehmann, Tomasz, Podstawski, Michal, Niewiadomski, Hubert, Nyczyk, Piotr, Hoefler, Torsten. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models.

[agent-bench] Liu, Xiao, Yu, Hao, Zhang, Hanchen, Xu, Yifan, Lei, Xuanyu, Lai, Hanyu, Gu, Yu, Ding, Hangliang, Men, Kaiwen, Yang, Kejuan, Zhang, Shudan, Deng, Xiang, Zeng, Aohan, Du, Zhengxiao, Zhang, Chenhui, Shen, Sheng, Zhang, Tianjun, Su, Yu, Sun, Huan, Huang, Minlie, Dong, Yuxiao, Tang, Jie. (2023). AgentBench: Evaluating LLMs as Agents.

[cot] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Ichter, Brian, Xia, Fei, Chi, Ed, Le, Quoc, Zhou, Denny. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.

[kim2014convolutional] Kim, Yoon. (2014). Convolutional neural networks for sentence classification. arXiv:1408.5882.

[nair2010rectified] Nair, Vinod, Hinton, Geoffrey E. (2010). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th international conference on machine learning (ICML-10).

[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

[listmle] Xia, Fen, Liu, Tie-Yan, Wang, Jue, Zhang, Wensheng, Li, Hang. (2008). Listwise Approach to Learning to Rank: Theory and Algorithm. Proceedings of the 25th International Conference on Machine Learning.

[autogpt] Toran Bruce Richards, et al. (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub repository.

[autogen] Wu, Qingyun, Bansal, Gagan, Zhang, Jieyu, Wu, Yiran, Zhang, Shaokun, Zhu, Erkang, Li, Beibin, Jiang, Li, Zhang, Xiaoyun, Wang, Chi. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv:2308.08155.

[voyager] Wang, Guanzhi, others. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv e-prints.

[liu2024apigen] Liu, Zuxin, Hoang, Thai, Zhang, Jianguo, Zhu, Ming, Lan, Tian, Tan, Juntao, Yao, Weiran, Liu, Zhiwei, Feng, Yihao, RN, Rithesh, others. (2024). Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets. Advances in Neural Information Processing Systems.

[jin2023surrealdriver] Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong. (2023). SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model.

[human-team-optimization] Lykourentzou, Ioanna, Vinella, Federica Lucia, Ahmed, Faez, Papastathis, Costas, Papangelis, Konstantinos, Khan, Vassilis-Javed, Masthoff, Judith. (2022). Self-Organization in Online Collaborative Work Settings. Collective Intelligence.

[babyagi] Yohei Nakajima. (2023). BabyAGI. GitHub repository.

[agentgpt] Reworkd. (2023). AgentGPT. GitHub repository.

[liu2024toolace] Liu, Weiwen, Huang, Xu, Zeng, Xingshan, Hao, Xinlong, Yu, Shuai, Li, Dexun, Wang, Shuai, Gan, Weinan, Liu, Zhengying, Yu, Yuanqing, others. (2024). Toolace: Winning the points of llm function calling. arXiv preprint arXiv:2409.00920.

[li2023reflection] Li, Ming, Chen, Lichang, Chen, Jiuhai, He, Shwai, Huang, Heng, Gu, Jiuxiang, Zhou, Tianyi. (2023). Reflection-tuning: Data recycling improves llm instruction-tuning. arXiv preprint arXiv:2310.11716.

[gulcehre2023reinforced] Gulcehre, Caglar, Paine, Tom Le, Srinivasan, Srivatsan, Konyushkova, Ksenia, Weerts, Lotte, Sharma, Abhishek, Siddhant, Aditya, Ahern, Alex, Wang, Miaosen, Gu, Chenjie, others. (2023). Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.

[asai2023self] Asai, Akari, Wu, Zeqiu, Wang, Yizhong, Sil, Avirup, Hajishirzi, Hannaneh. (2023). Self-rag: Learning to retrieve, generate, and critique through self-reflection. The Twelfth International Conference on Learning Representations.

[xu2024sayself] Xu, Tianyang, Wu, Shujin, Diao, Shizhe, Liu, Xiaoze, Wang, Xingyao, Chen, Yangyi, Gao, Jing. (2024). Sayself: Teaching llms to express confidence with self-reflective rationales. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[chen2024agent-flan] Chen, Zehui, Liu, Kuikun, Wang, Qiuchen, Zhang, Wenwei, Liu, Jiangning, Lin, Dahua, Chen, Kai, Zhao, Feng. (2024). Agent-flan: Designing data and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881.

[qin2025ui-tars] Qin, Yujia, Ye, Yining, Fang, Junjie, Wang, Haoming, Liang, Shihao, Tian, Shizuo, Zhang, Junda, Li, Jiahao, Li, Yunxin, Huang, Shijue, others. (2025). UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326.

[rafailov2023dpo] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[motwani2024malt] Motwani, Sumeet Ramesh, Smith, Chandler, Das, Rocktim Jyoti, Rafailov, Rafael, Laptev, Ivan, Torr, Philip HS, Pizzati, Fabio, Clark, Ronald, de Witt, Christian Schroeder. (2024). Malt: Improving reasoning with multi-agent llm training. arXiv preprint arXiv:2412.01928.

[liao2025marft] Liao, Junwei, Wen, Muning, Wang, Jun, Zhang, Weinan. (2025). MARFT: Multi-Agent Reinforcement Fine-Tuning. arXiv preprint arXiv:2504.16129.

[zhao2023slic] Zhao, Yao, Joshi, Rishabh, Liu, Tianqi, Khalman, Misha, Saleh, Mohammad, Liu, Peter J. (2023). Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.

[liu2023rso] Liu, Tianqi, Zhao, Yao, Joshi, Rishabh, Khalman, Misha, Saleh, Mohammad, Liu, Peter J, Liu, Jialu. (2023). Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657.

[xu2024cpo] Xu, Haoran, Sharaf, Amr, Chen, Yunmo, Tan, Weiting, Shen, Lingfeng, Van Durme, Benjamin, Murray, Kenton, Kim, Young Jin. (2024). Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417.

[schulman2017ppo] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, Klimov, Oleg. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[chen2025agent-atlas] Chen, Zhixun, Li, Ming, Huang, Yuxuan, Du, Yali, Fang, Meng, Zhou, Tianyi. (2025). Atlas: Agent tuning via learning critical steps. arXiv preprint arXiv:2503.02197.

[hu2024agentgen] Hu, Mengkang, Zhao, Pu, Xu, Can, Sun, Qingfeng, Lou, Jianguang, Lin, Qingwei, Luo, Ping, Rajmohan, Saravan. (2024). Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation. arXiv preprint arXiv:2408.00764.

[zhang2023astools] Zhang, Jintian, Xu, Xin, Deng, Shumin. (2023). Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124.

[amini2024variational-bon] Amini, Afra, Vieira, Tim, Ash, Elliott, Cotterell, Ryan. (2024). Variational best-of-n alignment. arXiv preprint arXiv:2407.06057.

[kapoor2024ai-agent-matter] Kapoor, Sayash, Stroebl, Benedikt, Siegel, Zachary S, Nadgir, Nitya, Narayanan, Arvind. (2024). Ai agents that matter. arXiv preprint arXiv:2407.01502.

[cemri2025mas-fail] Cemri, Mert, Pan, Melissa Z, Yang, Shuyi, Agrawal, Lakshya A, Chopra, Bhavya, Tiwari, Rishabh, Keutzer, Kurt, Parameswaran, Aditya, Klein, Dan, Ramchandran, Kannan, others. (2025). Why Do Multi-Agent LLM Systems Fail?. arXiv preprint arXiv:2503.13657.

[pesce2023learning] Pesce, Emanuele, Montana, Giovanni. (2023). Learning multi-agent coordination through connectivity-driven communication. Machine Learning.

[liu2022temporal] Liu, Yuntao, Dou, Yong, Li, Yuan, Xu, Xinhai, Liu, Donghong. (2022). Temporal dynamic weighted graph convolution for multi-agent reinforcement learning. Proceedings of the Annual Meeting of the Cognitive Science Society.

[hu2024magraph] Hu, Shengchao, Shen, Li, Zhang, Ya, Tao, Dacheng. (2024). Learning multi-agent communication from graph modeling perspective. arXiv preprint arXiv:2405.08550.

[bolaa] Liu, Zhiwei, Yao, Weiran, Zhang, Jianguo, Xue, Le, Heinecke, Shelby, Murthy, Rithesh, Feng, Yihao, Chen, Zeyuan, Niebles, Juan Carlos, Arpit, Devansh, Xu, Ran, Mui, Phil, Wang, Huan, Xiong, Caiming, Savarese, Silvio. (2023). BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents.

[cumulative-cr] Zhang, Yifan, Yang, Jingqin, Yuan, Yang, Chi-Chih Yao, Andrew. (2023). Cumulative Reasoning with Large Language Models.

[tptu] Ruan, Jingqing, Chen, Yihong, Zhang, Bin, Xu, Zhiwei, Bao, Tianpeng, Du, Guoqing, Shi, Shiwei, Mao, Hangyu, Zeng, Xingyu, Zhao, Rui. (2023). TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents.

[lu2023chameleon] Lu, Pan, Peng, Baolin, Cheng, Hao, Galley, Michel, Chang, Kai-Wei, Wu, Ying Nian, Zhu, Song-Chun, Gao, Jianfeng. (2023). Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv preprint arXiv:2304.09842.

[deep-network] Sordoni, Alessandro, Yuan, Xingdi, Côté, Marc-Alexandre, Pereira, Matheus, Trischler, Adam, Xiao, Ziang, Hosseini, Arian, Niedtner, Friederike, Le Roux, Nicolas. (2023). Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference.

[jain2024livecodebench] Jain, Naman, Han, King, Gu, Alex, Li, Wen-Ding, Yan, Fanjia, Zhang, Tianjun, Wang, Sida, Solar-Lezama, Armando, Sen, Koushik, Stoica, Ion. (2024). Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

[hurst2024gpt4o] Hurst, Aaron, Lerer, Adam, Goucher, Adam P, Perelman, Adam, Ramesh, Aditya, Clark, Aidan, Ostrow, AJ, Welihinda, Akila, Hayes, Alan, Radford, Alec, others. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.

[hu2022lora] Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu, others. (2022). LoRA: Low-rank adaptation of large language models. ICLR.

[guo2025deepseek-r1] Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[yang2024qwen2.5] Yang, An, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, Wei, Haoran, others. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[rasal2024llm] Rasal, Sumedh. (2024). Llm harmony: Multi-agent communication for problem solving. arXiv preprint arXiv:2401.01312.

[chu2025sft-memorize] Chu, Tianzhe, Zhai, Yuexiang, Yang, Jihan, Tong, Shengbang, Xie, Saining, Schuurmans, Dale, Le, Quoc V, Levine, Sergey, Ma, Yi. (2025). Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161.

[madaan2023selfrefine] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, Peter Clark. (2023). Self-Refine: Iterative Refinement with Self-Feedback.

[aggarwal2023adaptive-consistency] Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam. (2023). Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs.

[tot] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Thomas L., Cao, Yuan, Narasimhan, Karthik. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.

[embodied] Zhang, Hongxin, others. (2023). Building Cooperative Embodied Agents Modularly with Large Language Models. arXiv e-prints.

[chateval] Chan, Chi-Min, others. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv e-prints.

[llm-dba] Zhou, Xuanhe, others. (2023). LLM As DBA. arXiv e-prints.

[gpt-iot] Nascimento, Nathalia, others. (2023). GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems. arXiv e-prints.

[ma2024agentboard] Ma, Chang, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2024). Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178.

[post-2018-sacrebleu] Post, Matt. (2018). A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers.

[wang2022scienceworld] Wang, Ruoyao, Jansen, Peter, Côté, Marc-Alexandre, Ammanabrolu, Prithviraj. (2022). ScienceWorld: Is your agent smarter than a 5th grader? arXiv preprint arXiv:2203.07540.

[Ren2020CodeBLEUAM] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Ambrosio Blanco, Shuai Ma. (2020). CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. ArXiv.

[shridhar2020alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, others. (2020). ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

[trueskill] Herbrich, Ralf, Minka, Tom, Graepel, Thore. (2006). TrueSkill™: A Bayesian Skill Rating System. Advances in Neural Information Processing Systems.

[self-collab-codegen] Dong, Yihong, others. (2023). Self-collaboration Code Generation via ChatGPT. arXiv e-prints.

[human-collaboration] Arthur C. Graesser, Stephen M. Fiore, Samuel Greiff, Jessica Andrews-Todd, Peter W. Foltz, Friedrich W. Hesse. (2018). Advancing the Science of Collaborative Problem Solving. Psychological Science in the Public Interest.

[multi-agent] Hao, Rui, Hu, Linmei, Qi, Weijian, Wu, Qingliu, Zhang, Yirui, Nie, Liqiang. (2023). ChatLLM Network: More brains, More intelligence. IEEE Access.

[human-team-building] Zhang, Jintian, Xu, Xin, Deng, Shumin. (2023). Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124.

[arXiv2023_Survey-LLM] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen. (2023). A Survey of Large Language Models. arXiv preprint. doi:10.48550/arXiv.2303.18223.

[arXiv2023_Survey-MLLM] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen. (2023). A Survey on Multimodal Large Language Models. arXiv preprint. doi:10.48550/arXiv.2306.13549.

[arXiv2023_Survey-LLM-KGCR] Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang. (2023). LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities. CoRR. doi:10.48550/arXiv.2305.13168.

[arXiv2024_Survey-LLM-Psychology-Applications] Luoma Ke, Song Tong, Peng Chen, Kaiping Peng. (2024). Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. CoRR.

[FCS2024_Survey-Agent] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen. (2024). A Survey on Large Language Model based Autonomous Agents. Front. Comput. Sci.

[arXiv2023_Survey-Agent_2] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint.

[arXiv2023_Survey-Agent_3] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, Yong Li. (2023). Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives. CoRR.

[arXiv2024_Survey-Agent_4] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, Xiuqiang He. (2024). Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects. CoRR.

[arXiv2024_Survey-Agents-CompExp] Qun Ma, Xiao Xue, Deyu Zhou, Xiangning Yu, Donghua Liu, Xuwen Zhang, Zihan Zhao, Yifan Shen, Peilin Ji, Juanjuan Li, Gang Wang, Wanpeng Ma. (2024). Computational Experiments Meet Large Language Model Based Agents: A Survey and Perspective. CoRR.

[arXiv2024_Survey-MultiAgent] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. CoRR.

[arXiv2024_Survey-MultiAgent_2] Pouya Pezeshkpour, Eser Kandogan, Nikita Bhutani, Sajjadur Rahman, Tom Mitchell, Estevam Hruschka. (2024). Reasoning Capacity in Multi-Agent Systems: Limitations, Challenges and Human-Centered Solutions. CoRR.

[arXiv2024_Survey-MultiAgent-System] Hung Du, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis. (2024). A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions. CoRR.

[arXiv2024_Survey-MultiAgent-System_2] Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, Chaoyang He. (2024). LLM Multi-Agent Systems: Challenges and Open Problems. CoRR.

[J2024_Survey-AI-SocialScience] Ruoxi Xu, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, Xianpei Han. (2024). AI for social science and social science of AI: A Survey. Information Processing & Management. doi:10.1016/j.ipm.2024.103665.

[arXiv2023_Survey-MultiAgentCooperation] Yali Du, Joel Z. Leibo, Usman Islam, Richard Willis, Peter Sunehag. (2023). A Review of Cooperation in Multi-agent Learning. CoRR. doi:10.48550/ARXIV.2312.05162.

[arXiv2024_Survey-AgentAI_MMInteraction] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao. (2024). Agent AI: Surveying the Horizons of Multimodal Interaction. CoRR.

[arXiv2024_Survey-CooperativeAgent-RL] Jiechuan Jiang, Kefan Su, Zongqing Lu. (2024). Fully Decentralized Cooperative Multi-Agent Reinforcement Learning: A Survey. CoRR.

[ICLR2024_Sycophancy-LLM] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR.

[J2008_Survey-MultiAgent-Reinforce] Lucian Busoniu, Robert Babuska, Bart De Schutter. (2008). A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C. doi:10.1109/TSMCC.2007.913919.

[arXiv2023_Survey-MultiAgent-Reinforce] Dom Huh, Prasant Mohapatra. (2023). Multi-agent Reinforcement Learning: A Comprehensive Survey. arXiv preprint.

[arXiv2023_Survey-Hallucination_LFM] Vipula Rawte, Amit P. Sheth, Amitava Das. (2023). A Survey of Hallucination in Large Foundation Models. CoRR. doi:10.48550/arXiv.2309.05922.

[J2023_Survey-Hallucination_NLG] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, Pascale Fung. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. doi:10.1145/3571730.

[arXiv2023_Survey-Hallucination_LLM] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi. (2023). Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR. doi:10.48550/ARXIV.2309.01219.

[arXiv2024_Survey-Hallucination] Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, Gregory Dudek. (2024). Hallucination Detection and Hallucination Mitigation: An Investigation. CoRR.

[arXiv2024_Analysis-Hallucination] Ziwei Xu, Sanjay Jain, Mohan Kankanhalli. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. CoRR.

[1995_MultiAgent-System] Gerhard Weiß. (1995). Adaptation and Learning in Multi-Agent Systems: Some Remarks and a Bibliography. Adaption and Learning in Multi-Agent Systems. doi:10.1007/3-540-60923-7_16.

[J2000_MultiAgent-System] Peter Stone, Manuela M. Veloso. (2000). Multiagent Systems: A Survey from a Machine Learning Perspective. Auton. Robots. doi:10.1023/A:1008942012299.

[Book2006_MultiAgent-System] José M. Vidal. (2006). Fundamentals of Multiagent Systems: Using NetLogo Models.

[Book2009_MultiAgent-System] Michael J. Wooldridge. (2009). An Introduction to MultiAgent Systems, Second Edition.

[arXiv2024_Formal-LLM] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, Yongfeng Zhang. (2024). Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. CoRR.

[arXiv2023_Agents] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan. (2023). Agents: An Open-source Framework for Autonomous Language Agents. CoRR. doi:10.48550/arXiv.2309.07870.

[arXiv2023_OpenAgents] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu. (2023). OpenAgents: An Open Platform for Language Agents in the Wild. CoRR. doi:10.48550/ARXIV.2310.10634.

[arXiv2023_AutoAgents] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, Yemin Shi. (2023). AutoAgents: A Framework for Automatic Agent Generation. CoRR. doi:10.48550/ARXIV.2309.17288.

[arXiv2023_CGMI] Jinxin Shi, Jiabao Zhao, Yilei Wang, Xingjiao Wu, Jiawen Li, Liang He. (2023). CGMI: Configurable General Multi-Agent Interaction Framework. CoRR. doi:10.48550/ARXIV.2308.12503.

[arXiv2023_Believability-Agents] Yang Xiao, Yi Cheng, Jinlan Fu, Jiashuo Wang, Wenjie Li, Pengfei Liu. (2023). How Far Are We from Believable AI Agents? A Framework for Evaluating the Believability of Human Behavior Simulation. CoRR.

[arXiv2023_MAgIC] Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See-Kiong Ng, Jiashi Feng. (2023). MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration. CoRR. doi:10.48550/ARXIV.2311.08562.

[arXiv2024_AgentBoard] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He. (2024). AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents. CoRR.

[arXiv2024_Evaluate-Agents] Lize Alberts, Geoff Keeling, Amanda McCroskery. (2024). What makes for a 'good' social actor? Using respect as a lens to evaluate interactions with language agents. CoRR.

[arXiv2023_InterAct] Po-Lin Chen, Cheng-Shang Chang. (2023). InterAct: Exploring the Potentials of ChatGPT as a Cooperative Agent. CoRR. doi:10.48550/ARXIV.2308.01552.

[arXiv2023_Multi-Agent-Collaboration_Intelligent] Yashar Talebirad, Amirhossein Nadiri. (2023). Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. CoRR. doi:10.48550/ARXIV.2306.03314.

[UIST2023_Agent-Simulate-Interaction] Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST.

[AAAI2024_CooperativeAgents_ProAgent] Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. (2024). ProAgent: Building Proactive Cooperative Agents with Large Language Models. AAAI. doi:10.1609/AAAI.V38I16.29710.

[sun2023corex] Sun, Qiushi, Yin, Zhangyue, Li, Xiang, Wu, Zhiyong, Qiu, Xipeng, Kong, Lingpeng. (2023). Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.

[chen2024comm] Chen, Pei, Han, Boran, Zhang, Shuai. (2024). CoMM: Collaborative multi-agent, multi-reasoning-path prompting for complex problem solving. arXiv preprint arXiv:2404.17729.

[ICLR2024_MultiAgent_AgentVerse] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, et al. (2024). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents. ICLR.

[arXiv2023_Dynamic-LLM-Agent] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, Diyi Yang. (2023). Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization. CoRR.

[EMNLP2023-Demo_CollaborativeLLMs] Kai Lv, Shuo Zhang, Tianle Gu, Shuhao Xing, Jiawei Hong, Keyu Chen, Xiaoran Liu, Yuqing Yang, Honglin Guo, Tengxiao Liu, Yu Sun, Qipeng Guo, Hang Yan, Xipeng Qiu. (2023). CoLLiE: Collaborative Training of Large Language Models in an Efficient Way. EMNLP.

[J2024_MechAgents-MultiAgentCollaborations] Bo Ni, Markus J. Buehler. (2024). MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge. Extreme Mechanics Letters. doi:10.1016/j.eml.2024.102131.

[arXiv2023_MultiAgent-Coordination-Eval] Saaket Agashe, Yue Fan, Xin Eric Wang. (2023). Evaluating Multi-Agent Coordination Abilities in Large Language Models. CoRR. doi:10.48550/ARXIV.2310.03903.

[ICLR2024_Agent-Interactive-Eval] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap. (2024). SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. ICLR.

[AAMAS2024_AgentInteraction-Quantifying] Yuxin Chen, Chen Tang, Ran Tian, Chenran Li, Jinning Li, Masayoshi Tomizuka, Wei Zhan. (2024). Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization. AAMAS. doi:10.5555/3635637.3663107.

[arXiv2023_LLM-Deliberation] Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, Mario Fritz. (2023). LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games. CoRR. doi:10.48550/ARXIV.2309.17234.

[arXiv2023_MultiAgent-Cooperation] Rafael Pina, Varuna De Silva, Corentin Artaud. (2023). Discovering Causality for Efficient Cooperation in Multi-Agent Environments. CoRR. doi:10.48550/ARXIV.2306.11846.

[arXiv2023_MultiAgent-Algorithms] Lin Yang, Xuchuang Wang, Mohammad Hajiesmaili, Lijun Zhang, John C. S. Lui, Don Towsley. (2023). Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs. CoRR. doi:10.48550/ARXIV.2308.04314.

[Book1971_Rhetoric] Perelman, Chaim. (1971). The new rhetoric.

[Book2005_Society-Dissent] Cass R Sunstein. (2005). Why societies need dissent.

[J2009_DecisionMaking] Leila Amgoud, Henri Prade. (2009). Using arguments for making and explaining decisions. Artif. Intell.. doi:10.1016/j.artint.2008.11.006.

[arXiv2023_MultiAgent-Debate] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. CoRR.

[yan2024depending] Yan, Yikuan, Zhang, Yaolun, Huang, Keman. (2024). Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games. arXiv preprint arXiv:2403.17674.

[holt2024l2mac] Holt, Samuel, Luyten, Max Ruiz, van der Schaar, Mihaela. (2024). L2MAC: Large Language Model Automatic Computer for Extensive Code Generation. The Twelfth International Conference on Learning Representations.

[zhou2023large] Zhou, Zihao, Hu, Bin, Zhao, Chenyang, Zhang, Pu, Liu, Bin. (2023). Large language model as a policy teacher for training reinforcement learning agents. arXiv preprint arXiv:2311.13373.

[arXiv2023_MultiAgent-Debate_2] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi. (2023). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. CoRR.

[ICLR2024_Multiagent-Debate-Embeddings] Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang. (2024). Let Models Speak Ciphers: Multiagent Debate through Embeddings. ICLR.

[ICLR2024_Multiagent-Debate-Eval] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu. (2024). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. ICLR.

[J1985_Reflection] R. J. Bogumil. (1985). The reflective practitioner: How professionals think in action. Proc. IEEE. doi:10.1109/PROC.1985.13210.

[Book2010_Reflection] Bolton, Gillie. (2010). Reflective practice: Writing and professional development.

[zelikman2023parsel] Zelikman, Eric, Huang, Qian, Poesia, Gabriel, Goodman, Noah, Haber, Nick. (2023). Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions. Advances in Neural Information Processing Systems.

[huang2024anpl] Huang, Di, Nan, Ziyuan, Hu, Xing, Jin, Pengwei, Peng, Shaohui, Wen, Yuanbo, Zhang, Rui, Du, Zidong, Guo, Qi, Pu, Yewen, others. (2024). ANPL: towards natural programming with interactive decomposition. Advances in Neural Information Processing Systems.

[reflexion] Noah Shinn, Beck Labash, Ashwin Gopinath. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint. doi:10.48550/arXiv.2303.11366.

[NeurIPS2023_Self-Refine] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.

[J2003_Reflection-ContinuingEducation] Mezirow, Jack. (2003). How critical reflection triggers transformative learning. Adult and Continuing Education: Teaching, learning and research.

[arXiv2024_Self-Contrast] Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu. (2024). Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives. CoRR.

[arXiv2023_InteractiveNLP] Zekun Wang, Ge Zhang, Kexin Yang, Ning Shi, Wangchunshu Zhou, Shaochun Hao, Guangzheng Xiong, Yizhi Li, Mong Yuan Sim, Xiuying Chen, Qingqing Zhu, Zhenzhu Yang, Adam Nik, Qi Liu, Chenghua Lin, Shi Wang, Ruibo Liu, Wenhu Chen, Ke Xu, Dayiheng Liu, Yike Guo, Jie Fu. (2023). Interactive Natural Language Processing. CoRR. doi:10.48550/arXiv.2305.13246.

[J2023_InteractiveNLP] Guanhua Zhang, Matteo Bortoletto, Zhiming Hu, Lei Shi, Mihai Bâce, Andreas Bulling. (2023). Exploring Natural Language Processing Methods for Interactive Behaviour Modelling. INTERACT. doi:10.1007/978-3-031-42286-7_1.

[NMI2023_Human-Like-AI] Edgar A. Duéñez-Guzmán, et al. (2023). A social path to human-like artificial intelligence. Nat. Mach. Intell.. doi:10.1038/S42256-023-00754-X.

[ICLR2024_LLM-Simulate-Society] Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi. (2024). Training Socially Aligned Language Models in Simulated Human Society. {ICLR.

[arXiv2023_LLMAgents-Simulate-Society_S3] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, Yong Li. (2023). S3: Social-network Simulation System with Large Language Model-Empowered Agents. CoRR. doi:10.48550/ARXIV.2307.14984.

[UIST2022_SocialSimulacra] Joon Sung Park, Lindsay Popowski, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2022). Social Simulacra: Creating Populated Prototypes for Social Computing Systems. UIST. doi:10.1145/3526113.3545616.

[arXiv2024_AgentAlignment-SocialNorms] Shimin Li, Tianxiang Sun, Xipeng Qiu. (2024). Agent Alignment in Evolving Social Norms. CoRR.

[NAACL2022-Findings_Align-GLMs] Ruibo Liu, Ge Zhang, Xinyu Feng, Soroush Vosoughi. (2022). Aligning Generative Language Models with Human Values. Findings of NAACL-HLT. doi:10.18653/V1/2022.FINDINGS-NAACL.18.

[NeurIPS2022_Re-Align] Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, Soroush Vosoughi. (2022). Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits. NeurIPS.

[arXiv2023_Align-Chatbot] Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu. (2023). Align on the Fly: Adapting Chatbot Behavior to Established Norms. CoRR.

[arXiv2023_Human-AI_Collaboration] Andrew Fuchs, Andrea Passarella, Marco Conti. (2023). Optimizing delegation between human and AI collaborative agents. CoRR. doi:10.48550/ARXIV.2309.14718.

[ICLR2024_Human-Agent-Collaboration] Yiming Gao, Feiyu Liu, Liang Wang, Zhenjie Lian, Dehua Zheng, Weixuan Wang, Wenjin Yang, Siqin Li, Xianliang Wang, Wenhui Chen, Jing Dai, Qiang Fu, Wei Yang, Lanxiao Huang, Wei Liu. (2024). Enhancing Human Experience in Human-Agent Collaboration: A Human-Centered Modeling Approach Based on Positive Human Gain. {ICLR.

[yoran2024assistantbenchwebagentssolve] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant. (2024). AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? arXiv preprint.

[arXiv2024_Human-Agent-Collaboration] Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Ji-Rong Wen. (2024). Large Language Model-based Human-Agent Collaboration for Complex Task Solving. CoRR.

[TEVC2023_MultiAgent-Collaboration-SocialRoles] Yaqing Hou, Mingyang Sun, Yifeng Zeng, Yew-Soon Ong, Yaochu Jin, Hongwei Ge, Qiang Zhang. (2023). A Multi-agent Cooperative Learning System with Evolution of Social Roles. IEEE Transactions on Evolutionary Computation. doi:10.1109/TEVC.2023.3268076.

[EMNLP2023-Findings_LEGO] Zhitao He, Pengfei Cao, Yubo Chen, Kang Liu, Ruopeng Li, Mengshu Sun, Jun Zhao. (2023). LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation. Findings of EMNLP.

[Nature2023_Role-Play-LLM] Murray Shanahan, Kyle McDonell, Laria Reynolds. (2023). Role play with large language models. Nature. doi:10.1038/S41586-023-06647-8.

[arXiv2023_PlayGames-LLM] Elif Akata, Lion Schulz, Julian Coda-Forno, et al. (2023). Playing repeated games with Large Language Models. CoRR. doi:10.48550/arXiv.2305.16867.

[arXiv2023_Agent-Simulate-OpinionDynamics] Yun-Shiuan Chuang, et al. (2023). Simulating Opinion Dynamics with Networks of LLM-based Agents. CoRR. doi:10.48550/ARXIV.2311.09618.

[arXiv2023_Agent-Simulate-OpinionDynamics_Survey] Yun-Shiuan Chuang, et al. (2023). Computational Agent-based Models in Opinion Dynamics: A Survey. CoRR. doi:10.48550/ARXIV.2306.03446.

[arXiv2024_AI-Human-Creative] Haonan Wang, James Zou, Michael Mozer, Anirudh Goyal, Alex Lamb, Linjun Zhang, Weijie J Su, Zhun Deng, Michael Qizhe Xie, Hannah Brown, Kenji Kawaguchi. (2024). Can AI Be as Creative as Humans?. CoRR.

[arXiv2023_LLM-Simulator] Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, Benyou Wang. (2023). Large Language Model as a User Simulator. CoRR. doi:10.48550/ARXIV.2308.11534.

[arXiv2023_MetaAgents] Yuan Li, Yixuan Zhang, Lichao Sun. (2023). MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents. CoRR. doi:10.48550/ARXIV.2310.06500.

[meyer2021entspann] Meyer, B, Zill, A, Dilba, D, Voermans, S. (2021). Entspann dich, Deutschland! TK-Stressstudie 2021.

[wang2024tokeneconomy] Wang, Junlin, Jain, Siddhartha, Zhang, Dejiao, Ray, Baishakhi, Kumar, Varun, Athiwaratkun, Ben. (2024). Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies. arXiv preprint arXiv:2406.06461.

[PNAS2024_TuringTest_Chatbots-Humans] Qiaozhu Mei, Yutong Xie, Walter Yuan, Matthew O. Jackson. (2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2313925121.

[arXiv2024_BehavioralSimulation] Cheng Wang, Chuwen Wang, Yu Zhao, Shirong Zeng, Wang Zhang, Ronghui Ning. (2024). Behavioral Simulation: Exploring A Possible Next Paradigm for Science. CoRR.

[arXiv2023_Agent-BehaviorExplanation] Xijia Zhang, Yue Guo, Simon Stepputtis, Katia P. Sycara, Joseph Campbell. (2023). Understanding Your Agent: Leveraging Large Language Models for Behavior Explanation. CoRR. doi:10.48550/ARXIV.2311.18062.

[arXiv2023_Agent-BehaviorExplaining] Xijia Zhang, Yue Guo, Simon Stepputtis, Katia P. Sycara, Joseph Campbell. (2023). Explaining Agent Behavior with Large Language Models. CoRR. doi:10.48550/ARXIV.2309.10346.

[arXiv2023_Agents-High-Level-Behavior] Maxwell Crouse, Ibrahim Abdelaziz, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, Luis A. Lastras. (2023). Formally Specifying the High-Level Behavior of LLM-Based Agents. CoRR. doi:10.48550/ARXIV.2310.08535.

[arXiv2024_Agents-Simulate-Trust] Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip H. S. Torr, Bernard Ghanem, Guohao Li. (2024). Can Large Language Model Agents Simulate Human Trust Behaviors?. CoRR. doi:10.48550/ARXIV.2402.04559.

[arXiv2024_Reward-Socially-RL-Agent] Zhaoyue Wang. (2024). Towards Socially and Morally Aware RL agent: Reward Design With LLM. CoRR.

[ICLR2024_LLM-Simulate-CognitiveModel] Marcel Binz, Eric Schulz. (2024). Turning large language models into cognitive models. ICLR.

[arXiv2023_Agent-Cognitive] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, Thomas L. Griffiths. (2023). Cognitive Architectures for Language Agents. CoRR. doi:10.48550/ARXIV.2309.02427.

[J2024_Interactions-Cognitive-LLMs] Youzhi Qu, Penghui Du, Wenxin Che, Chen Wei, Chi Zhang, Wanli Ouyang, Yatao Bian, Feiyang Xu, Bin Hu, Kai Du, Haiyan Wu, Jia Liu, Quanying Liu. (2024). Promoting interactions between cognitive science and large language models. The Innovation. doi:10.1016/j.xinn.2024.100579.

[arXiv2024_Multi-Agent_Conversation_CognitiveBias] Yu He Ke, Rui Yang, Sui An Lie, Taylor Xin Yi Lim, Hairil Rizal Abdullah, Daniel Shu Wei Ting, Nan Liu. (2024). Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias. CoRR.

[arXiv2023_AI-Agent] Sungwoo Lee, Younghyun Oh, Hyunhoe An, Hyebhin Yoon, Karl J. Friston, Seok Jun Hong, Choong-Wan Woo. (2023). Life-inspired Interoceptive Artificial Intelligence for Autonomous and Adaptive Agents. CoRR. doi:10.48550/ARXIV.2309.05999.

[arXiv2024_Agent-Preference] Zihong He, Changwang Zhang. (2024). AFSPP: Agent Framework for Shaping Preference and Personality with Large Language Models. CoRR.

[PANS2022_Agent-Cooperation-Competition] Euel Elliott, L. Douglas Kiel. (2002). Exploring cooperation and competition using agent-based modeling. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.102079099.

[arXiv2023_LyfeAgents] Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, Andrew Ahn. (2023). Lyfe Agents: Generative agents for low-cost real-time social interactions. CoRR. doi:10.48550/ARXIV.2310.02172.

[ICLR2023_Reasoning-Simulation] Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yun Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai. (2023). Mind's Eye: Grounded Language Model Reasoning through Simulation. ICLR.

[arXiv2024_Human-AI-Interaction_Gesture] Philipp Wicke. (2024). Probing Language Models' Gesture Understanding for Enhanced Human-AI Interaction. CoRR.

[arXiv2023_Human-like-Agents] Thuy Ngoc Nguyen, et al. (2023). Credit Assignment: Challenges and Opportunities in Developing Human-like AI Agents. CoRR. doi:10.48550/ARXIV.2307.08171.

[arXiv2024_Human-Centered-LM] Nikita Soni, Niranjan Balasubramanian, H. Andrew Schwartz, Dirk Hovy. (2024). Comparing Human-Centered Language Modeling: Is it Better to Model Groups, Individual Traits, or Both?. CoRR.

[Book1973_Theory_NLP] Alfred V. Aho, Jeffrey D. Ullman. (1973). The theory of parsing, translation, and compiling. 2: Compiling.

[2018_Theory] Mezirow, Jack. (2018). Transformative learning theory. Contemporary theories of learning.

[USENIX1999_Theory-FaultTolerance] Miguel Castro, Barbara Liskov. (1999). Practical Byzantine Fault Tolerance. OSDI.

[Book2013_InfoProcess-Collaboration] Noreen M Webb. (2013). Information processing approaches to collaborative learning. The international handbook of collaborative learning.

[Book1999_Personality] Friedman, Howard S, Schustack, Miriam W. (1999). Personality: Classic theories and modern research.

[J2008_Overconfidence] Moore, Don A, Healy, Paul J. (2008). The trouble with overconfidence. Psychological review.

[Book1985_MedievalPoliticalTheology] Ernst H. Kantorowicz. (1985). The King's Two Bodies: A Study in Medieval Political Theology.

[J2009_SocialPsychology] Johnson, David W, Johnson, Roger T. (2009). An educational psychology success story: Social interdependence theory and cooperative learning. Educational researcher.

[1982_SocialPsychology] Tajfel, Henri. (1982). Social psychology of intergroup relations. Annual review of psychology.

[2004_SocialPsychology] Tajfel, Henri, Turner, John C. (2004). The social identity theory of intergroup behavior. Political psychology.

[Book1988_SoM] Minsky, Marvin. (1988). Society of mind.

[J2003_SoM] Push Singh. (2003). Examining the Society of Mind. Comput. Artif. Intell..

[hu2024evomac] Hu, Yue, Cai, Yuzhu, Du, Yaxin, Zhu, Xinyu, Liu, Xiangrui, Yu, Zijie, Hou, Yuchen, Tang, Shuo, Chen, Siheng. (2024). Self-evolving multi-agent collaboration networks for software development. arXiv preprint arXiv:2410.16946.

[NeurIPS2023_Agent-SoM] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. NeurIPS.

[arXiv2023_SoM-NL] Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, et al. (2023). Mindstorms in Natural Language-Based Societies of Mind. CoRR. doi:10.48550/arXiv.2305.17066.

[J2002_TheoryOfMind] Siegal, Michael, Varley, Rosemary. (2002). Neural systems involved in 'theory of mind'. Nature Reviews Neuroscience. doi:10.1038/nrn844.

[J2004_TheoryOfMind] Leslie, Alan M, Friedman, Ori, German, Tim P. (2004). Core mechanisms in 'theory of mind'. Trends in cognitive sciences. doi:10.1016/j.tics.2004.10.001.

[EMNLP2022_TheoryOfMind] Maarten Sap, Ronan Le Bras, Daniel Fried, Yejin Choi. (2022). Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs. EMNLP. doi:10.18653/V1/2022.EMNLP-MAIN.248.

[EMNLP2023_TheoryOfMind-Agents] Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana T. Hughes, Charles Lewis, Katia P. Sycara. (2023). Theory of Mind for Multi-Agent Collaboration via Large Language Models. EMNLP.

[EACL2024_TheoryOfMind] Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz. (2024). Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models. EACL.

[arXiv2023_TheoryOfMind-Agents] Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui. (2023). How FaR Are Large Language Models From Agents with Theory-of-Mind?. CoRR. doi:10.48550/ARXIV.2310.03051.

[arXiv2023_TheoryofMind-Game] Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, Yutaka Matsuo. (2023). Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4. CoRR. doi:10.48550/ARXIV.2309.17277.

[Science2010_HumanDynamics] Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, Thomas W. Malone. (2010). Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science. doi:10.1126/science.1193147.

[1968_GroupDynamics] Dorwin Cartwright, Alvin Zander. (1968). Group dynamics.

[J2018_GroupDynamics] Wilfred R Bion. (2018). Group dynamics: A re-view. New directions in psychoanalysis.

[Book2014_GroupDynamics] Donelson R Forsyth. (2014). Group dynamics.

[Book2018_GroupDynamics] Donelson R Forsyth. (2018). Group dynamics.

[J1987_GroupDynamics] Clayton P Alderfer. (1987). An intergroup perspective on group dynamics. Handbook of organizational behavior.

[yin2023exchange] Yin, Zhangyue, Sun, Qiushi, Chang, Cheng, Guo, Qipeng, Dai, Junqi, Huang, Xuan-Jing, Qiu, Xipeng. (2023). Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[chen2024benchmarking] Chen, Jiawei, Lin, Hongyu, Han, Xianpei, Sun, Le. (2024). Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence.

[zhuang2023toolchain] Zhuang, Yuchen, Chen, Xiang, Yu, Tong, Mitra, Saayan, Bursztyn, Victor, Rossi, Ryan A, Sarkhel, Somdeb, Zhang, Chao. (2023). Toolchain*: Efficient Action Space Navigation in Large Language Models with A* Search. arXiv preprint arXiv:2310.13227.

[shen2024smallllms] Shen, Weizhou, Li, Chenliang, Chen, Hongzhan, Yan, Ming, Quan, Xiaojun, Chen, Hehong, Zhang, Ji, Huang, Fei. (2024). Small LLMs Are Weak Tool Learners: A Multi-LLM Agent. arXiv preprint arXiv:2401.07324.

[J1998_SmallGroupDynamics-Discussions] David Wyatt Seal, Laura M Bogart, Anke A Ehrhardt. (1998). Small group dynamics: The utility of focus group discussions as a research method. Group Dynamics: Theory, Research, and Practice.

[Book1972_Groupthink] Janis, Irving L. (1972). Victims of Groupthink: A psychological study of foreign-policy decisions and fiascoes..

[J2015_Group] Iyengar, Shanto, Westwood, Sean J. (2015). Fear and loathing across party lines: New evidence on group polarization. American journal of political science.

[J2003_Intergroup-Intragroup_Culture] Masaki Yuki. (2003). Intergroup Comparison versus Intragroup Relationships: A Cross-Cultural Examination of Social Identity Theory in North American and East Asian Cultural Contexts. Social Psychology Quarterly.

[J1995_IntragroupConflict] Karen A Jehn. (1995). A multimethod examination of the benefits and detriments of intragroup conflict. Administrative science quarterly. doi:10.2307/2393638.

[J2003_SmartGroups] Brigid Barron. (2003). When Smart Groups Fail. Journal of the Learning Sciences. doi:10.1207/S15327809JLS1203_1.

[Book2005_CrowdWisdom] Surowiecki, James. (2005). The Wisdom of Crowds.

[J1996_Intelligence] Neisser, Ulric, Boodoo, Gwyneth, Bouchard Jr, Thomas J, Boykin, A Wade, Brody, Nathan, Ceci, Stephen J, Halpern, Diane F, Loehlin, John C, Perloff, Robert, Sternberg, Robert J, others. (1996). Intelligence: knowns and unknowns.. American psychologist.

[1961_Intelligence] Spearman, Charles. (1961).

[J2004_Conformity] Robert B. Cialdini, Noah J. Goldstein. (2004). Social Influence: Compliance and Conformity. Annual Review of Psychology. doi:10.1146/annurev.psych.55.090902.142015.

[J1969_Conformity] Vernon L. Allen, John M. Levine. (1969). Consensus and conformity. Journal of Experimental Social Psychology. doi:10.1016/0022-1031(69)90032-8.

[Book2011_Negotiation] Fisher, Roger, Ury, William L, Patton, Bruce. (2011). Getting to yes: Negotiating agreement without giving in.

[J2015_Conformity] Julie C Coultas, Edwin JC van Leeuwen. (2015). Conformity: Definitions, types, and evolutionary grounding. Evolutionary perspectives on social psychology.

[J1967_Consensus-Sociological] Scheff, Thomas J. (1967). Toward a sociological model of consensus. American Sociological Review. doi:10.2307/2091716.

[J1974_ConsensusReaching] Morris H. Degroot. (1974). Reaching a Consensus. Journal of the American Statistical Association. doi:10.1080/01621459.1974.10480137.

[J2018_Emergence-Consensus] Baronchelli, Andrea. (2018). The emergence of consensus: a primer. Royal Society open science. doi:10.1098/rsos.172189.

[arXiv2023_Agent-SocialChoiceTheory] Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas W. Anthony, Brian Tanner, Anna Koop. (2023). Evaluating Agents using Social Choice Theory. CoRR. doi:10.48550/ARXIV.2312.03121.

[J2000_Agent-SocialScience] Nigel Gilbert, Pietro Terna. (2000). How to build and use agent-based models in social science. Mind & Society.

[Book2012_Agent-SocialScience] Joshua M Epstein. (2012). Generative social science: Studies in agent-based computational modeling.

[Book2023_Agent-SocialDynamics-Culture] Paul Smaldino. (2023). Modeling social behavior: Mathematical and agent-based models of social dynamics and cultural evolution.

[J2017_SocialInfluence_OpinionDynamics] Andreas Flache, Michael Mäs, et al. (2017). Models of Social Influence: Towards the Next Frontiers. J. Artif. Soc. Soc. Simul.. doi:10.18564/JASSS.3521.

[J2021_SocietalDynamics_OpinionDynamics] Jan Lorenz, Martin Neumann, Tobias Schröder. (2021). Individual attitude change and societal dynamics: Computational experiments with psychological theories. Psychological Review.

[PNAS2023_CognitivePsychology_LLM] Marcel Binz, Eric Schulz. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2218523120.

[arXiv2023_MachinePsychology] Thilo Hagendorff. (2023). Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods. CoRR. doi:10.48550/ARXIV.2303.13988.

[J2023_LLM-Psychology] Dorottya Demszky, Diyi Yang, David S. Yeager, Christopher J. Bryan, Margarett Clapper, Susannah Chandhok, Johannes C. Eichstaedt, Cameron Hecht, Jeremy Jamieson, Meghann Johnson, Michaela Jones, Danielle Krettek-Cobb, Leslie Lai, Nirel JonesMitchell, Desmond C. Ong, Carol S. Dweck, James J. Gross, James W. Pennebaker. (2023). Using large language models in psychology. Nature Reviews Psychology. doi:10.1038/s44159-023-00241-5.

[NAACL2024-Findings_PsychometricLLM] Tatsuki Kuribayashi, Yohei Oseki, Timothy Baldwin. (2024). Psychometric Predictive Power of Large Language Models. Findings of NAACL.

[arXiv2024_Multi-Agent_PsySafe] Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao. (2024). PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety. CoRR.

[Book2006_DemocracyModel] David Held. (2006). Models of democracy.

[Book2006_Deliberative-Democracy] Diana C Mutz. (2006). Hearing the other side: Deliberative versus participatory democracy.

[arXiv2023_SocialPsychology-Vehicles] Xiao Li, Kaiwen Liu, H. Eric Tseng, Anouck Girard, Ilya V. Kolmanovsky. (2023). Interaction-Aware Decision-Making for Autonomous Vehicles in Forced Merging Scenario Leveraging Social Psychology Factors. CoRR. doi:10.48550/ARXIV.2309.14497.

[Book2019_CriticalThinking] Paul, Richard, Elder, Linda. (2019). The miniature guide to critical thinking concepts and tools.

[Book1994_Framework] Popper, Karl Raimund. (1994). The myth of the framework: In defence of science and rationality.

[J2012_Circulations] Munro, Iain. (2012). The management of circulations: Biopolitical variations after Foucault. International Journal of Management Reviews.

[NeurIPS2021_Dataset-MATH] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS Datasets and Benchmarks.

[arXiv2022_Dataset-ChessMoveValidity] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv preprint. doi:10.48550/arXiv.2206.04615.

[arXiv2023_Dataset-BOLAA] Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese. (2023). BOLAA: Benchmarking and Orchestrating LLM-Augmented Autonomous Agents. CoRR. doi:10.48550/arXiv.2308.05960.

[PGN] fsmosca. pgn-standard. GitHub repository.

[ChatGPT-OpenAI] OpenAI. (2022). ChatGPT: Optimizing Language Models for Dialogue.

[NeurIPS2022_InstructGPT] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback. NeurIPS.

[arXiv2023_LLaMA] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. (2023). LLaMA: Open and Efficient Foundation Language Models. CoRR. doi:10.48550/ARXIV.2302.13971.

[arXiv2023_Qwen] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu. (2023). Qwen Technical Report. CoRR. doi:10.48550/ARXIV.2309.16609.

[arXiv2023_Mistral] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, et al. (2023). Mistral 7B. CoRR. doi:10.48550/ARXIV.2310.06825.

[arXiv2024_Mixtral] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. (2024). Mixtral of Experts. CoRR.

[J2023_Prompt-Survey] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. doi:10.1145/3560815.

[WWW2022_KnowPrompt] Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, Huajun Chen. (2022). KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. WWW. doi:10.1145/3485447.3511998.

[ACL2022-Short_P-Tuning] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, Jie Tang. (2022). P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. ACL. doi:10.18653/v1/2022.acl-short.8.

[arXiv2024_MoreAgents] Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye. (2024). More Agents Is All You Need. CoRR.

[reimers2019sentence] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.

[wang2020minilm] Wang, Wenhui, Wei, Furu, Dong, Li, Bao, Hangbo, Yang, Nan, Zhou, Ming. (2020). Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems.

[shirzad2023exphormer] Shirzad, Hamed, Velingker, Ameya, Venkatachalam, Balaji, Sutherland, Danica J, Sinop, Ali Kemal. (2023). Exphormer: Sparse transformers for graphs. International Conference on Machine Learning.

[zhang2025evoflow] Zhang, Guibin, Chen, Kaijie, Wan, Guancheng, Chang, Heng, Cheng, Hong, Wang, Kun, Hu, Shuyue, Bai, Lei. (2025). EvoFlow: Evolving Diverse Agentic Workflows On The Fly. arXiv preprint arXiv:2502.07373.

[tan2023virtual] Tan, Zhen, Guo, Ruocheng, Ding, Kaize, Liu, Huan. (2023). Virtual node tuning for few-shot node classification. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[zhao2024causality] Zhao, Kesen, Zhang, Liang. (2024). Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks. The Twelfth International Conference on Learning Representations.

[chen2024internet] Chen, Weize, You, Ziming, Li, Ran, Guan, Yitong, Qian, Chen, Zhao, Chenyang, Yang, Cheng, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong. (2024). Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061.

[rosenbluth2024distinguished] Rosenbluth, Eran, Tönshoff, Jan, et al. (2024). Distinguished In Uniform: Self Attention Vs. Virtual Nodes. arXiv preprint arXiv:2405.11951.

[ICLR2023_Self-Consistency] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR.

[ACL2023-Findings_Eval-LLM-Behavior] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. Findings of ACL. doi:10.18653/V1/2023.FINDINGS-ACL.847.

[arXiv2023_Analyze-LLM-Behavior] Lingjiao Chen, Matei Zaharia, James Zou. (2023). How is ChatGPT's behavior changing over time?. CoRR. doi:10.48550/ARXIV.2307.09009.

[arXiv2024_AntEval] Yuanzhi Liang, Linchao Zhu, Yi Yang. (2024). AntEval: Quantitatively Evaluating Informativeness and Expressiveness of Agent Social Interactions. CoRR.

[ICLR2024_LLM-Bias-MCS] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang. (2024). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.

[Science2022_AlphaCode] Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, et al. (2022). Competition-level code generation with AlphaCode. Science. doi:10.1126/science.abq1158.

[liu2023deja] Liu, Zichang, Wang, Jue, Dao, Tri, Zhou, Tianyi, Yuan, Binhang, Song, Zhao, Shrivastava, Anshumali, Zhang, Ce, Tian, Yuandong, Re, Christopher, others. (2023). Deja vu: Contextual sparsity for efficient llms at inference time. International Conference on Machine Learning.

[arXiv2021_Verifier-Math] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint.

[roy2016solving] Roy, Subhro, Roth, Dan. (2016). Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.

[fu2022complexity] Fu, Yao, Peng, Hao, Sabharwal, Ashish, Clark, Peter, Khot, Tushar. (2022). Complexity-based prompting for multi-step reasoning. The Eleventh International Conference on Learning Representations.

[patel2021nlp] Patel, Arkil, Bhattamishra, Satwik, Goyal, Navin. (2021). Are NLP models really able to solve simple math word problems?. arXiv preprint arXiv:2103.07191.

[arXiv2023_ReConcile] Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal. (2023). ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. arXiv preprint.

[arXiv2023_Multi-Agent-Consensus] Huaben Chen, Wenkang Ji, Lufeng Xu, Shiyu Zhao. (2023). Multi-Agent Consensus Seeking via Large Language Models. CoRR. doi:10.48550/ARXIV.2310.20151.

[arXiv2023_LLMs-Simulate_SocialMedia] Petter Törnberg, Diliara Valeeva, Justus Uitermark, Christopher Bail. (2023). Simulating Social Media Using Large Language Models to Evaluate Alternative News Feed Algorithms. CoRR. doi:10.48550/ARXIV.2310.05984.

[J1981_Alternation] Ashok K. Chandra, Dexter Kozen, Larry J. Stockmeyer. (1981). Alternation. J. ACM. doi:10.1145/322234.322243.

[ICML2007_L1Regular] Galen Andrew, Jianfeng Gao. (2007). Scalable training of L1-regularized log-linear models. ICML. doi:10.1145/1273496.1273501.

[Book1997_Algorithms-DataStruct] Dan Gusfield. (1997). Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. doi:10.1017/CBO9780511574931.

[arXiv2015_YaraParser] Mohammad Sadegh Rasooli, Joel R. Tetreault. (2015). Yara Parser: A Fast and Accurate Dependency Parser. CoRR.

[JMLR2005_LearningStruct] Rie Kubota Ando, Tong Zhang. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. J. Mach. Learn. Res..

[J1965_FourierSeries-Comp] Cooley, James W., Tukey, John W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation.

[arXiv2023_Agent-DUMA] Xiaoyu Tian, Liangyu Chen, Na Liu, Yaxuan Liu, Wei Zou, Kaijiang Chen, Ming Cui. (2023). DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking. CoRR. doi:10.48550/ARXIV.2310.18075.

[feng2024agile] Feng, Peiyuan, He, Yichen, Huang, Guanhua, Lin, Yuan, Zhang, Hanchong, Zhang, Yuchen, Li, Hang. (2024). AGILE: A Novel Reinforcement Learning Framework of LLM Agents. arXiv preprint arXiv:2405.14751.

[islam2024open-rag] Islam, Shayekh Bin, Rahman, Md Asib, Hossain, KSM, Hoque, Enamul, Joty, Shafiq, Parvez, Md Rizwan. (2024). Open-rag: Enhanced retrieval-augmented reasoning with open-source large language models. arXiv preprint arXiv:2410.01782.

[gou2023tora] Gou, Zhibin, Shao, Zhihong, Gong, Yeyun, Shen, Yelong, Yang, Yujiu, Huang, Minlie, Duan, Nan, Chen, Weizhu. (2023). Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452.

[qian2025toolrl] Qian, Cheng, Acikgoz, Emre Can, He, Qi, Wang, Hongru, Chen, Xiusi, Hakkani-Tür, Dilek, et al. (2025). ToolRL: Reward is All Tool Learning Needs. arXiv preprint arXiv:2504.13958.

[wang2025otc] Wang, Hongru, Qian, Cheng, Zhong, Wanjun, Chen, Xiusi, Qiu, Jiahao, Huang, Shijue, Jin, Bowen, Wang, Mengdi, Wong, Kam-Fai, Ji, Heng. (2025). OTC: Optimal Tool Calls via Reinforcement Learning. arXiv preprint arXiv:2504.14870.

[finlayson2025post-train-rag] Finlayson, Matthew, Kulikov, Ilia, Bikel, Daniel M, Oguz, Barlas, Chen, Xilun, Pappu, Aasish. (2025). Post-training an LLM for RAG? Train on Self-Generated Demonstrations. arXiv preprint arXiv:2502.10596.

[yuan2023skill] Yuan, Haoqi, Zhang, Chi, Wang, Hongcheng, Xie, Feiyang, Cai, Penglin, Dong, Hao, Lu, Zongqing. (2023). Skill reinforcement learning and planning for open-world long-horizon tasks. arXiv preprint arXiv:2303.16563.

[sun2023adaplanner] Sun, Haotian, Zhuang, Yuchen, Kong, Lingkai, Dai, Bo, Zhang, Chao. (2023). Adaplanner: Adaptive planning from feedback with language models. Advances in neural information processing systems.

[fu2024preact] Fu, Dayuan, Huang, Jianzhao, Lu, Siyuan, Dong, Guanting, Wang, Yejie, He, Keqing, Xu, Weiran. (2024). PreAct: Prediction Enhances Agent's Planning Ability. arXiv preprint arXiv:2402.11534.

[qiao2025agentic-knowself] Qiao, Shuofei, Qiu, Zhisong, Ren, Baochang, Wang, Xiaobin, Ru, Xiangyuan, Zhang, Ningyu, Chen, Xiang, Jiang, Yong, Xie, Pengjun, Huang, Fei, others. (2025). Agentic Knowledgeable Self-awareness. arXiv preprint arXiv:2504.03553.

[modarressi2024memllm-finetune-memory] Modarressi, Ali, Köksal, Abdullatif, et al. (2024). MemLLM: Finetuning LLMs to use an explicit read-write memory. arXiv preprint arXiv:2404.11672.

[liu2024memory-agent-tuning-business] Liu, Jiale, Zeng, Yifan, et al. (2024). Memory-Augmented Agent Training for Business Document Understanding. arXiv preprint arXiv:2412.15274.

[modarressi2023ret-llm] Modarressi, Ali, Imani, Ayyoob, Fayyaz, Mohsen, Schütze, Hinrich. (2023). RET-LLM: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322.

[arXiv2023_AgentTuning] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, Jie Tang. (2023). AgentTuning: Enabling Generalized Agent Abilities for LLMs. CoRR. doi:10.48550/ARXIV.2310.12823.

[arXiv2023_FireAct] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao. (2023). FireAct: Toward Language Agent Fine-tuning. CoRR. doi:10.48550/ARXIV.2310.05915.

[agashe2023evaluating] Agashe, Saaket, Fan, Yue, Wang, Xin Eric. (2023). Evaluating multi-agent coordination abilities in large language models. arXiv preprint arXiv:2310.03903.

[NAACL2024_Agent-Self-Collaboration] Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, Heng Ji. (2024). Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. NAACL.

[arXiv2023_Benchmarking-Agents] Qian Huang, Jian Vora, Percy Liang, Jure Leskovec. (2023). Benchmarking Large Language Models As AI Research Agents. CoRR. doi:10.48550/ARXIV.2310.03302.

[arXiv2023_Agents-SampleEfficiency] Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang. (2023). Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency. CoRR. doi:10.48550/ARXIV.2309.17382.

[arXiv2023_UnifiedAgent-FM] Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin A. Riedmiller. (2023). Towards A Unified Agent with Foundation Models. CoRR. doi:10.48550/ARXIV.2307.09668.

[arXiv2023_Universal-Agent] Anees Aslam. (2023). Universal Language Modelling agent. CoRR. doi:10.48550/ARXIV.2306.06521.

[arXiv2023_Hinder-Agents] Sukai Huang, Nir Lipovetzky, Trevor Cohn. (2023). A Reminder of its Brittleness: Language Reward Shaping May Hinder Learning for Instruction Following Agents. CoRR. doi:10.48550/ARXIV.2305.16621.

[CoLLAs2023_Autotelic-Agents] Cédric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, Marc-Alexandre Côté. (2023). Augmenting Autotelic Agents with Large Language Models. CoLLAs.

[EMNLP2023_ConceptualStructure_LLM] Siddharth Suresh, Kushin Mukherjee, Xizheng Yu, Wei-Chun Huang, et al. (2023). Conceptual structure coheres in human cognition but not in large language models. EMNLP.

[Games2021_SocialStrategies-CooperativeBehaviour] Robin Watson, Thomas J. H. Morgan, Rachel L. Kendal, Julie Van de Vyver, Jeremy Kendal. (2021). Social Learning Strategies and Cooperative Behaviour: Evidence of Payoff Bias, but Not Prestige or Conformity, in a Social Dilemma Game. Games. doi:10.3390/G12040089.

[ACL2024_AUTOACT-Self-Planning] Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, Huajun Chen. (2024). AUTOACT: Automatic Agent Learning from Scratch via Self-Planning. ACL.

[arXiv2024_Self-Rewarding-LMs] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. (2024). Self-Rewarding Language Models. CoRR.

[serrano2003topology] Serrano, M. Ángeles, Boguñá, Marián. (2003). Topology of the world trade web. Physical Review E.

[fagiolo2010evolution] Fagiolo, Giorgio, Reyes, Javier, Schiavo, Stefano. (2010). The evolution of the world trade web: a weighted-network analysis. Journal of Evolutionary Economics.

[garlaschelli2007interplay] Garlaschelli, Diego, Di Matteo, Ticiana, Aste, Tomaso, Caldarelli, Guido, Loffredo, Maria I. (2007). Interplay between topology and dynamics in the World Trade Web. The European Physical Journal B.

[pbft] Castro, Miguel, Liskov, Barbara. (1999). Practical Byzantine Fault Tolerance. Proceedings of the Third Symposium on Operating Systems Design and Implementation.

[farahani2013review] Farahani, Reza Zanjirani, Miandoabchi, Elnaz, Szeto, Wai Yuen, Rashidi, Hannaneh. (2013). A review of urban transportation network design problems. European journal of operational research.

[bell1997transportation] Bell, Michael GH, Iida, Yasunori, others. (1997). Transportation network analysis.

[fan2019graph] Fan, Wenqi, Ma, Yao, Li, Qing, He, Yuan, Zhao, Eric, Tang, Jiliang, Yin, Dawei. (2019). Graph neural networks for social recommendation. The world wide web conference.

[belkin2006manifold] Belkin, Mikhail, Niyogi, Partha, Sindhwani, Vikas. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples.. Journal of machine learning research.

[zhou2005learning] Zhou, Dengyong, Huang, Jiayuan, Schölkopf, Bernhard. (2005). Learning from labeled and unlabeled data on a directed graph. Proceedings of the 22nd international conference on Machine learning.

[wang2019kgat] Wang, Xiang, He, Xiangnan, Cao, Yixin, Liu, Meng, Chua, Tat-Seng. (2019). Kgat: Knowledge graph attention network for recommendation. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining.

[nabti2017querying] Nabti, Chemseddine, Seba, Hamida. (2017). Querying massive graph data: A compress and search approach. Future Generation Computer Systems.

[ribeiro2021survey] Ribeiro, Pedro, Paredes, Pedro, Silva, Miguel EP, Aparicio, David, Silva, Fernando. (2021). A survey on subgraph counting: concepts, algorithms, and applications to network motifs and graphlets. ACM Computing Surveys (CSUR).

[bouhenni2021survey] Bouhenni, Sarra, Yahiaoui, Said, Nouali-Taboudjemat, Nadia, Kheddouci, Hamamache. (2021). A survey on distributed graph pattern matching in massive graphs. ACM Computing Surveys (CSUR).

[fulber2020network] Fulber-Garcia, Vinicius, Duarte Jr, Elias P, Huff, Alexandre, dos Santos, Carlos RP. (2020). Network service topology: Formalization, taxonomy and the custom specification model. Computer Networks.

[arnold2020cloud] Arnold, Todd, He, Jia, Jiang, Weifan, Calder, Matt, Cunha, Italo, Giotsas, Vasileios, Katz-Bassett, Ethan. (2020). Cloud provider connectivity in the flat internet. Proceedings of the ACM Internet Measurement Conference.

[fu2020topology] Fu, Xiuwen, Pace, Pasquale, Aloi, Gianluca, Yang, Lin, Fortino, Giancarlo. (2020). Topology optimization against cascading failures on wireless sensor networks using a memetic algorithm. Computer Networks.

[zhu2024llms] Zhu, Yuqi, Wang, Xiaohan, Chen, Jing, Qiao, Shuofei, Ou, Yixin, Yao, Yunzhi, Deng, Shumin, Chen, Huajun, Zhang, Ningyu. (2024). Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web.

[zhu2022multi] Zhu, Xiangru, Li, Zhixu, Wang, Xiaodan, Jiang, Xueyao, Sun, Penglei, Wang, Xuwu, Xiao, Yanghua, Yuan, Nicholas Jing. (2022). Multi-modal knowledge graph construction and application: A survey. IEEE Transactions on Knowledge and Data Engineering.

[zhu2021network] Zhu, Hang, Gupta, Varun, Ahuja, Satyajeet Singh, Tian, Yuandong, Zhang, Ying, Jin, Xin. (2021). Network planning with deep reinforcement learning. Proceedings of the 2021 ACM SIGCOMM 2021 Conference.

[huang2022language] Huang, Wenlong, Abbeel, Pieter, Pathak, Deepak, Mordatch, Igor. (2022). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. International conference on machine learning.

[sun2023pearl] Sun, Simeng, Liu, Yang, Wang, Shuohang, Zhu, Chenguang, Iyyer, Mohit. (2023). Pearl: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564.

[ruan2023tptu] Ruan, Jingqing, Chen, Yihong, Zhang, Bin, Xu, Zhiwei, Bao, Tianpeng, Mao, Hangyu, Li, Ziyue, Zeng, Xingyu, Zhao, Rui, others. (2023). Tptu: Task planning and tool usage of large language model-based ai agents. NeurIPS 2023 Foundation Models for Decision Making Workshop.

[sun2023self] Sun, Xiangguo, Cheng, Hong, Liu, Bo, Li, Jia, Chen, Hongyang, Xu, Guandong, Yin, Hongzhi. (2023). Self-supervised hypergraph representation learning for sociological analysis. IEEE Transactions on Knowledge and Data Engineering.

[li2023survey] Li, Yuhan, Li, Zhixun, Wang, Peisong, Li, Jia, Sun, Xiangguo, Cheng, Hong, Yu, Jeffrey Xu. (2023). A survey of graph meets large language model: Progress and future directions. arXiv preprint arXiv:2311.12399.

[sun2023all] Sun, Xiangguo, Cheng, Hong, Li, Jia, Liu, Bo, Guan, Jihong. (2023). All in One: Multi-Task Prompting for Graph Neural Networks. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[shu2024llm] Shu, Zhiyao, Sun, Xiangguo, Cheng, Hong. (2024). When llm meets hypergraph: A sociological analysis on personality via online social networks. arXiv preprint arXiv:2407.03568.

[park2023generativeagentsinteractivesimulacra] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior.

[Drori_2022] Melanie Swan, Takashi Kido, Eric Roland, Renato P. dos Santos. (2023). Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics. arXiv preprint.

[guo2024deepseekcoderlargelanguagemodel] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence.

[yu2024metamathbootstrapmathematicalquestions] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu. (2024). MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models.

[chen2023frugalgptuselargelanguage] Lingjiao Chen, Matei Zaharia, James Zou. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.

[ding2024hybridllmcostefficientqualityaware] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing.

[hu2024routerbenchbenchmarkmultillmrouting] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay. (2024). RouterBench: A Benchmark for Multi-LLM Routing System.

[shnitzer2023largelanguagemodelrouting] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin. (2023). Large Language Model Routing with Benchmark Datasets.

[_akota_2024] Zhang, Zhuosheng, Zhang, Aston, Li, Mu, Smola, Alex. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv preprint. doi:10.48550/arXiv.2210.03493.

[zhang_aflow_2024] Zhang, Jiayi, Xiang, Jinyu, Yu, Zhaoyang, Teng, Fengwei, Chen, Xionghui, Chen, Jiaqi, Zhuge, Mingchen, Cheng, Xin, Hong, Sirui, Wang, Jinlin, Zheng, Bingnan, Liu, Bang, Luo, Yuyu, Wu, Chenglin. (2024). AFlow: Automating Agentic Workflow Generation.

[ADAS] Hu, Shengran, Lu, Cong, Clune, Jeff. (2024). Automated Design of Agentic Systems.

[eot] Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, Xipeng Qiu. (2023). Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication.

[dai2024costeffectiveonlinemultillmselection] Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui. (2024). Cost-Effective Online Multi-LLM Selection with Versatile Reward Models.

[medagentslargelanguagemodels] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein. (2024). MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.

[ong2024routellmlearningroutellms] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica. (2024). RouteLLM: Learning to Route LLMs with Preference Data.

[mohammadshahi2024routoolearningroutelarge] Alireza Mohammadshahi, Arshad Rafiq Shaikh, Majid Yazdani. (2024). Routoo: Learning to Route to Large Language Models Effectively.

[feng2024graphroutergraphbasedrouterllm] Tao Feng, Yanzhen Shen, Jiaxuan You. (2024). GraphRouter: A Graph-based Router for LLM Selections.

[devlin2019bertpretrainingdeepbidirectional] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[Meta-structure_Discovery_Chen_2024] Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang. (2024). AutoFlow: Automated Workflow Generation for Large Language Model Agents. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3637528.3671965.

[abdin2024phi3technicalreporthighly] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.

[lepagnol-etal-2024-small] Lepagnol, Pierre, Gerald, Thomas, Ghannay, Sahar, Servan, Christophe, Rosset, Sophie. (2024). Small Language Models Are Good Too: An Empirical Study of Zero-Shot Classification. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

[srivatsa2024harnessingpowermultipleminds] KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar. (2024). Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing.

[stripelis-etal-2024-tensoropera] Stripelis, Dimitris, Xu, Zhaozhuo, Hu, Zijian, Shah, Alay Dilipbhai, Jin, Han, Yao, Yuhang, Zhang, Jipeng, Zhang, Tong, Avestimehr, Salman, He, Chaoyang. (2024). TensorOpera Router: A Multi-Model Router for Efficient LLM Inference. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. doi:10.18653/v1/2024.emnlp-industry.34.

[erdogan2025plan-and-act] Erdogan, Lutfi Eren, Lee, Nicholas, Kim, Sehoon, Moon, Suhong, Furuta, Hiroki, Anumanchipalli, Gopala, Keutzer, Kurt, Gholami, Amir. (2025). Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.

[huang2024hardertasksneedexperts] Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng. (2024). Harder Tasks Need More Experts: Dynamic Routing in MoE Models.

[aghdam2024damoedynamicexpertallocation] Maryam Akhavan Aghdam, Hongpeng Jin, Yanzhao Wu. (2024). DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models.

[zhang2024cutcrapeconomicalcommunication] Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen. (2024). Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems.

[shang2024agentsquareautomaticllmagent] Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li. (2024). AgentSquare: Automatic LLM Agent Search in Modular Design Space.

[chang2023surveyevaluationlargelanguage] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie. (2023). A Survey on Evaluation of Large Language Models.

[minaee2024largelanguagemodelssurvey] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao. (2024). Large Language Models: A Survey.

[zingg2023detectingoptimisingteaminteractions] Christian Zingg, Alexander von Gernler, Carsten Arzig, Frank Schweitzer, Christoph Gote. (2023). Detecting and Optimising Team Interactions in Software Development.

[morethancode] Ramin, Frederike, Matthies, Christoph, Teusner, Ralf. (2020). More than Code: Contributions in Scrum Software Engineering Teams. Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops. doi:10.1145/3387940.3392241.

[zhang-etal-2024-exploring] Zhang, Jintian, Xu, Xin, Zhang, Ningyu, Liu, Ruibo, Hooi, Bryan, Deng, Shumin. (2024). Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.782.

[reimers2019sentencebertsentenceembeddingsusing] Nils Reimers, Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

[mbpp] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton. (2021). Program Synthesis with Large Language Models.

[yuksekgonul2024textgradautomaticdifferentiationtext] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou. (2024). TextGrad: Automatic "Differentiation" via Text.

[marro2024scalablecommunicationprotocolnetworks] Samuele Marro, Emanuele La Malfa, Jesse Wright, Guohao Li, Nigel Shadbolt, Michael Wooldridge, Philip Torr. (2024). A Scalable Communication Protocol for Networks of Large Language Models.

[gilardi2023chatgptoutperformcrowd] Gilardi, Fabrizio, Alizadeh, Meysam, Kubli, Maël. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences.

[latif2025canllmaidinannotating] Latif, Siddique, Usama, Muhammad, Malik, Muhammad Ibrahim, Schuller, Björn W. (2025). Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers [Research Frontier]. IEEE Computational Intelligence Magazine.

[shen2024smallllmsweaktool] Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang. (2024). Small LLMs Are Weak Tool Learners: A Multi-LLM Agent.

[wu2025optimas] Wu, Shirley, Sarthi, Parth, Zhao, Shiyu, Lee, Aaron, Shandilya, Herumb, Grobelnik, Adrian Mladenic, Choudhary, Nurendra, Huang, Eddie, Subbian, Karthik, Zhang, Linjun, others. (2025). Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards. arXiv preprint arXiv:2507.03041.

[gao2025flowreasonerreinforcingquerylevelmetaagents] Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang. (2025). FlowReasoner: Reinforcing Query-Level Meta-Agents.

[nie2025weakforstrongtrainingweakmetaagent] Fan Nie, Lan Feng, Haotian Ye, Weixin Liang, Pan Lu, Huaxiu Yao, Alexandre Alahi, James Zou. (2025). Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors.

[wang2025scoreflowmasteringllmagent] Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam. (2025). ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization.

[ghareeb2025robinmultiagentautomatingscientific] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, Samuel G. Rodriques. (2025). Robin: A multi-agent system for automating scientific discovery.

[chen2025optimizingmodelselectioncompound] Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica. (2025). Optimizing Model Selection for Compound AI Systems.

[zhang2025agent-who-and-when] Zhang, Shaokun, Yin, Ming, Zhang, Jieyu, Liu, Jiale, Han, Zhiguang, Zhang, Jingyang, Li, Beibin, Wang, Chi, Wang, Huazheng, Chen, Yiran, others. (2025). Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212.

[comanici2025gemini] Comanici, Gheorghe, Bieber, Eric, Schaekermann, Mike, Pasupat, Ice, Sachdeva, Noveen, Dhillon, Inderjit, Blistein, Marcel, Ram, Ori, Zhang, Dan, Rosen, Evan, others. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[grattafiori2024llama] Grattafiori, Aaron, Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Vaughan, Alex, others. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[wei2025swerladvancingllmreasoning] Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang. (2025). SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution.

[wang2025openhandsopenplatformai] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig. (2025). OpenHands: An Open Platform for AI Software Developers as Generalist Agents.

[gou2024critic] Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, Weizhu Chen. (2024). CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. The Twelfth International Conference on Learning Representations.

[anthropicClaudeSonnet] Anthropic. (2025). Claude Sonnet 4.

[gpt4.1] OpenAI. (2025). GPT-4.1 Model Card.

[wu2025humanmemoryaimemory] Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, Yong Liu. (2025). From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs.

[zhang2025darwingodelmachineopenended] Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune. (2025). Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents.

[zhang2025agentorchestraorchestratinghierarchicalmultiagent] Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An. (2025). AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol.

[qiu2025agentdistilltrainingfreeagentdistillation] Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, Mengdi Wang. (2025). AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes.

[fang2025mempexploringagentprocedural] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang. (2025). Memp: Exploring Agent Procedural Memory.

[fang2025cognitivekernelproframeworkdeep] Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu. (2025). Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training.

[qin2025flashsearcherfasteffectiveweb] Tianrui Qin, Qianben Chen, Sinuo Wang, He Xing, King Zhu, He Zhu, Dingfeng Shi, Xinxin Liu, Ge Zhang, Jiaheng Liu, Yuchen Eleanor Jiang, Xitong Gao, Wangchunshu Zhou. (2025). Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution.

[wu2025webwalkerbenchmarkingllmsweb] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang. (2025). WebWalker: Benchmarking LLMs in Web Traversal.

[wu2025evolverselfevolvingllmagents] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi. (2025). EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle.

[rasmussen2025zeptemporalknowledgegraph] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory.

[ruc-memory-survey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2025). A Survey on the Memory Mechanism of Large Language Model-based Agents. ACM Trans. Inf. Syst. doi:10.1145/3748302.

[zhao2025pyvisionagenticvisiondynamic] Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei. (2025). PyVision: Agentic Vision with Dynamic Tooling.

[sun2025hierarchicalmemoryhighefficiencylongterm] Haoran Sun, Shaoning Zeng. (2025). Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents.

[yang2025learningjobexperiencedrivenselfevolving] Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, Yu Qiao, Haifeng Li. (2025). Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks.

[wang2025mobileagenteselfevolvingmobileassistant] Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji. (2025). Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks.

[yin2023exchangeofthoughtenhancinglargelanguage] Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, Xipeng Qiu. (2023). Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication.

[fang2025comprehensivesurveyselfevolvingai] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng. (2025). A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems.

[tran2025multiagentcollaborationmechanismssurvey] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, Hoang D. Nguyen. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs.

[shi2025aime] Shi, Yexuan, Wang, Mingyu, Cao, Yunxiang, Lai, Hongjie, Lan, Junjian, Han, Xin, Wang, Yu, Geng, Jie, Li, Zhenan, Xia, Zihao, others. (2025). Aime: Towards Fully-Autonomous Multi-Agent Framework. arXiv preprint arXiv:2507.11988.

[phan2025hle] Phan, Long, Gatti, Alice, Han, Ziwen, Li, Nathaniel, Hu, Josephina, Zhang, Hugh, Zhang, Chen Bo Calvin, Shaaban, Mohamed, Ling, John, Shi, Sean, others. (2025). Humanity's last exam. arXiv preprint arXiv:2501.14249.

[suzgun2025dynamiccheatsheettesttimelearning] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou. (2025). Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory.

[orhan2023recognitionrecallretentionfewshot] A. Emin Orhan. (2023). Recognition, recall, and retention of few-shot memories in large language models.

[wang2024agentworkflowmemory] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig. (2024). Agent Workflow Memory.

[tang2025agentkbleveragingcrossdomain] Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou. (2025). Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving.

[zhang2025memengineunifiedmodularlibrary] Zeyu Zhang, Quanyu Dai, Xu Chen, Rui Li, Zhongyang Li, Zhenhua Dong. (2025). MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents.

[zhang2025agentracerinducingfailurellm] Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan. (2025). AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?.

[wang2025huxleygodelmachinehumanlevelcoding] Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jürgen Schmidhuber. (2025). Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine.

[hu2025memoryageaiagentssurvey] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan. (2025). Memory in the Age of AI Agents.

[openaiIntroducingGPT5] OpenAI. (2025). Introducing GPT-5.

[deepseekai2025deepseekv32pushingfrontieropen] DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. 
Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu. (2025). DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models.

[cai2025trainingfreegrouprelativepolicy] Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun. (2025). Training-Free Group Relative Policy Optimization.

[zheng2025skillweaverwebagentsselfimprove] Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, Yu Su. (2025). SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills.

[tang2025chemagentselfupdatinglibrarylarge] Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein. (2025). ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning.

[ye2025h2r] Ye, Shicheng, Yu, Chao, Ke, Kaiqiang, Xu, Chengdong, Wei, Yinqi. (2025). H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents. arXiv preprint arXiv:2509.12810.

[zhang2025gmemorytracinghierarchicalmemory] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan. (2025). G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems.

[ouyang2025reasoningbankscalingagentselfevolving] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister. (2025). ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory.

[wen2024diluknowledgedrivenapproachautonomous] Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao. (2024). DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models.

[yang2025qwen3technicalreport] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu. (2025). Qwen3 Technical Report.

[githubGitHubAg2aiag2] Microsoft. (2024). AG2 (formerly AutoGen): GitHub repository, ag2ai/ag2.

[fourney2024magenticonegeneralistmultiagentsolving] Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi. (2024). Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.

[ghafarollahi2024sciagentsautomatingscientificdiscovery] Alireza Ghafarollahi, Markus J. Buehler. (2024). SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning.

[liu2023codegeneratedchatgptreally] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang. (2023). Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.

[xu2025kodcodediversechallengingverifiable] Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran. (2025). KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding.

[liu2025guardreasoner-vl] Liu, Yue, Zhai, Shengfang, Du, Mingzhe, Chen, Yulin, Cao, Tri, Gao, Hongcheng, Wang, Cheng, Li, Xinfeng, Wang, Kun, Fang, Junfeng, others. (2025). Guardreasoner-vl: Safeguarding vlms via reinforced reasoning. arXiv preprint arXiv:2505.11049.

[sutton1988learning] Sutton, Richard S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.

[pignatelli2023surveycreditassignment] Pignatelli, Eduardo, Ferret, Johan, Geist, Matthieu, Mesnard, Thomas, van Hasselt, Hado, Pietquin, Olivier, Toni, Laura. (2023). A survey of temporal credit assignment in deep reinforcement learning. arXiv preprint arXiv:2312.01072.

[he2025collabuiagent] Zhitao He, Zijun Liu, Peng Li, Yi R Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu. (2025). Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization.

[arjona2019rudder] Arjona-Medina, Jose A, Gillhofer, Michael, Widrich, Michael, Unterthiner, Thomas, Brandstetter, Johannes, Hochreiter, Sepp. (2019). Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems.

[yin2023distributionalmegtagradient] Yin, Haiyan, Yan, Shuicheng, Xu, Zhongwen. (2023). Distributional Meta-Gradient Reinforcement Learning. The Eleventh International Conference on Learning Representations.

[narayan2018don] Narayan, Shashi, Cohen, Shay B, Lapata, Mirella. (2018). Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. arXiv preprint arXiv:1808.08745.

[ji2025pkusaferlhfmultilevelsafetyalignment] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, Yaodong Yang. (2025). PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference.

[wang2025improvingmodelalignmentcollective] Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou. (2025). Improving Model Alignment Through Collective Intelligence of Open-Source LLMS.

[lambert2024rewardbenchevaluatingrewardmodels] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi. (2024). RewardBench: Evaluating Reward Models for Language Modeling.

[mahan2024generativerewardmodels] Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak. (2024). Generative Reward Models.

[sengupta2025magvmultiagentframeworksynthetic] Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis, Akshay Mallipeddi, Abhinav Mathur, Joseph Ross, Liang Gou. (2025). MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification.

[jiang2024multiagentvqaexploringmultiagent] Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar, Dan Roth, Camillo J. Taylor. (2024). Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering.

[zhang2024chainagentslargelanguage] Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, Sercan Ö. Arik. (2024). Chain of Agents: Large Language Models Collaborating on Long-Context Tasks.

[hu2025owloptimizedworkforcelearning] Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li. (2025). OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation.

[zhang2025agentorchestrahierarchicalmultiagentframework] Wentao Zhang, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An. (2025). AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving.

[OpenAI-gpt4o] OpenAI. (2024). GPT-4o System Card.

[anthropicdd2024claude] Anthropic. (2024). Model card addendum: Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet.

[geminiteam2024gemini15unlockingmultimodal] Team, Gemini, Georgiev, Petko, Lei, Ving Ian, Burnell, Ryan, Bai, Libin, Gulati, Anmol, Tanzer, Garrett, Vincent, Damien, Pan, Zhufeng, Wang, Shibo, others. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.

[deepseekai2024deepseekv3technicalreport] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. 
Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, Zizheng Pan. (2024). DeepSeek-V3 Technical Report.

[humangroup] Williams, Katherine Y., O'Reilly, Charles A. (1998). Demography and Diversity in Organizations: A Review of 40 Years of Research. Research in Organizational Behavior.

[kimiteam2025kimik2openagentic] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. 
Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, Xinxing Zu. (2025). Kimi K2: Open Agentic Intelligence.

[zhang2025deepresearchsurveyautonomous] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, Xiangyu Zhao. (2025). Deep Research: A Survey of Autonomous Research Agents.

[wei2025aiscienceagenticscience] Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou. (2025). From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery.

[bai2025interns1scientificmultimodalfoundation] Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqing Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qitan Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Jiaqi Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Yuhang Zang, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, 
Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou. (2025). Intern-S1: A Scientific Multimodal Foundation Model.

[du2025deepresearchbenchcomprehensivebenchmark] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao. (2025). DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents.

[chen2025xbenchtrackingagentsproductivity] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo. (2025). xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations.

[llamaindexLlamaIndexBuild] LlamaIndex Team. (2025). LlamaIndex: Build AI Knowledge Assistants over Your Enterprise Data.

[meituanlongcatteam2025longcatflashtechnicalreport] Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhuai Wei, Yulei Qian, Yunfan Liang, 
Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su. (2025). LongCat-Flash Technical Report.

[yang2024oasis] Yang, Ziyi, Zhang, Zaibin, Zheng, Zirui, Jiang, Yuxian, Gan, Ziyue, Wang, Zhiyu, Ling, Zijian, Chen, Jinsong, Ma, Martz, Dong, Bowen, others. (2024). Oasis: Open agents social interaction simulations on one million agents. arXiv preprint arXiv:2411.11581.

[barandoni2024automatingcustomerneedsanalysis] Simone Barandoni, Filippo Chiarello, Lorenzo Cascone, Emiliano Marrale, Salvatore Puccio. (2024). Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry.