G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems
Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan
Abstract
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory [1], which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both high-level, generalizable insights that enable the system to leverage cross-trial knowledge, and fine-grained, condensed interaction trajectories that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves the success rate in embodied action and the accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks. Our code is available at https://github.com/bingreeky/GMemory.
Introduction
As Large Language Models (LLMs) continue to redefine the frontier of artificial intelligence, LLM-driven agents have exhibited unprecedented prowess in perception [2, 3, 4, 5], planning [6, 7, 8], reasoning [9, 10], and action [11, 12], which has catalyzed remarkable progress across diverse downstream domains, including code generation [13, 14], data analysis [15], embodied tasks [16], and autonomous driving [3, 17, 18]. Building upon the impressive competencies of single agents, LLM-based Multi-Agent Systems (MAS) have been demonstrated to push the boundaries of single-model capacity [19, 20, 21]. Similar to collective intelligence arising from human social collaboration [22, 23, 24], MAS orchestrates multiple agents [25, 26, 27], whether through cooperation [28, 29, 30, 31] or competition [32, 33, 34], to transcend the cognitive and specialized limitations of solitary agents.
Self-Evolving Agents. What especially characterizes LLM agents is their self-evolving capacity, i.e., the ability to continuously adapt and improve through interactions with the environment, as seen in prior works where such adaptability has led to two- to three-fold quantitative improvements [35]. The central driving force behind this self-evolving nature is the memory mechanism of agents [36, 37, 38], which parallels human abilities to accumulate knowledge, process past experiences, and

Figure 1: ( Left ) We report the token cost of several single-agent and MAS baselines on ALFWorld benchmark; ( Right ) The overview of G-Memory 's three-tier hierarchical memory architecture, encompassing the insight graph, query graph and interaction (utterance) graph.
retrieve relevant information. Previous successful memory mechanism designs, including both inside-trial memory ( i.e. , context retained within solving one single query) and cross-trial memory ( i.e. , experience accumulated across multiple tasks) [39], have empowered agents to excel in diverse applications such as personalized chat [36, 40, 41], recommendation [42], embodied action [43, 16], and social simulation [19, 44, 45], enabling them to evolve into experiential learners that effectively leverage past experiences and world knowledge.
Self-Evolving MAS. However, such self-evolving capacity remains largely absent in multi-agent systems. Most existing MAS are still constrained by manually defined workflows, such as the Standard Operating Procedures (SOP) in MetaGPT [21] and ChatDev [46], or rely on pre-defined communication topologies in MacNet [47] and AgentPrune [30]. More recent automated MAS, such as GPTSwarm [48], ADAS [49], AFlow [50], and MaAS [51], have managed to automatically optimize inter-agent topologies or prompts, which, nevertheless, ultimately yields giant and cumbersome MAS architectures lacking the agility to self-adjust with accumulated collaboration experience.
Memory for MAS. The absence of the aforementioned self-evolving capacity is, in fact, rooted in the lack of memory mechanisms specifically tailored for MAS. One may challenge this claim from two perspectives: ❶ Do existing MAS lack memory mechanisms altogether? Not entirely. Classical MAS frameworks such as MetaGPT, ChatDev, and Exchange-of-Thought [52] incorporate memory-related designs. However, these are often limited to inside-trial memory [52], while cross-trial memory, if present, remains rudimentary, typically involving the transmission of overly condensed artifacts (e.g., final solutions or execution results) [21, 46, 47], and failing to enable meaningful learning from collaborative experience. ❷ Why not directly transfer existing single-agent memory mechanisms to MAS? Unfortunately, such a transfer is far from straightforward. The inherent nature of MAS, i.e., multi-turn orchestration across multiple agents [26, 27], leads to substantially longer task-solving trajectories compared to single-agent settings (up to 10× more tokens, as demonstrated by Figure 1 (Left)). This poses a significant challenge to traditional retrieval-based memory designs [36, 37, 16], as naively feeding the entire long-context trajectory without proper abstraction from a collaborative perspective offers little benefit. Given the aforementioned challenges, a natural question arises:

How can we design a memory mechanism capable of storing, retrieving, and managing the lengthy interaction history of multi-agent systems, such that agent teams can benefit from concise and instructive experience and insights?
The Present Work: G-Memory. In response to the above question, we introduce a Graph-based Agentic Memory Mechanism for LLM-based Multi-Agent Systems, dubbed G-Memory, which manages the complex and lengthy interaction history of MAS through a three-tier hierarchical graph structure:
- ✱ Insight Graph , which abstracts generalizable insights from historical experience;
- ✱ Query Graph , which encodes meta-information of task queries and their connectivity;
- ✱ Interaction Graph , which stores fine-grained textual communication logs among agents.
Figure 1 (Right) visualizes these structures, and their formal definitions are given in Section 3. When a new query arrives, G-Memory efficiently retrieves relevant query records by leveraging the topology of the query graph, and then traverses upward (i.e., query → insight graph) to extract associated high-level insights and downward (i.e., query → interaction graph) to identify core interaction subgraphs that are most pertinent to the task at hand, thereby mitigating information overload. Based on the
retrieved memory, G-Memory offers actionable guidance to the MAS, e.g., division of labor, task decomposition, and lessons from past failures. Upon the completion of a task, all three levels of the memory hierarchy are updated in an agentic manner with newly distilled insights, enriched query records, detailed MAS trajectories, and their cross-level associations. Through this refinement, G-Memory functions as a plug-and-play module that can be seamlessly embedded into mainstream MAS frameworks, empowering evolving inter-agent collaboration and collective intelligence.
Our contributions are summarized as follows:
- ❶ Bottleneck Identification. We conduct a thorough review of existing multi-agent systems and identify a fundamental bottleneck in their self-evolving capabilities, which is largely attributed to the oversimplified memory architectures.
- ❷ Practical Solution. We propose G-Memory , a hierarchical agentic memory architecture for MAS, which models complex and prolonged inter-agent collaboration through a three-tier structure comprising insight, query, and interaction graphs.
- ❸ Experimental Evaluation. Extensive experiments across five benchmarks show that G-Memory is (I) high-performing, improving state-of-the-art MAS by up to 20.89% and 10.12% on embodied action and knowledge QA tasks, respectively; and (II) resource-friendly, maintaining comparable or even lower token usage than mainstream memory designs.
Related Works
Single-Agent Memory. Memory serves as a primary driving force for agents to accumulate experiences and explore the world through interactions with the environment [53, 54, 55, 56]. It plays a critical role in both task-solving and social simulation LLM agents, and this work primarily focuses on the former. Early research on agent memory was confined to simple inside-trial memory, mainly addressing limitations posed by the LLM context window in chatbot applications, including MemoryBank [36], ChatDB [40], MemoChat [41], and MemGPT [37], which typically adopt retrieval-augmented generation (RAG)-style, similarity-based chunk retrieval. Subsequent developments have progressed toward more cognitively inspired memory architectures, including (1) memory scope extended to cross-trial memory like ExpeL [43] and Synapse [57]; (2) application domains broadened to include computer control [57], embodied action [58], scientific discovery [59], coding and reasoning [60]; and (3) management techniques evolved from coarse-grained textual similarity toward more sophisticated abstraction and summarization of acquired knowledge and experiences [19], as seen in A-Mem [61], Mem0 [62] and MemInsight [63]. More discussions are in Appendix D.
Memory in Multi-agent System. However, the memory mechanisms tailored for MAS remain markedly underexplored. Some representative frameworks, such as LLM-Debate [20, 33] and Mixture-of-Agents [64], omit memory components altogether. Others merely adopt simplistic inside-trial memory schemes [47, 52]. Even in frameworks that attempt cross-trial memory [46], the memory merely stores compressed final outcome artifacts, overlooking the nuanced agent interactions. Collectively, there is a pressing need for a principled memory architecture that can capture, organize, and retrieve the inherently intricate task-solving processes unique to MAS [39].
LLM-based Multi-Agent Systems. Our work focuses on task-solving MAS, which, unlike their single-agent counterparts, often lack the capacity for continual evolution through interaction with the environment [65, 66]. Early frameworks such as AutoGen [13], CAMEL [24], and AgentVerse [67] rely entirely on pre-defined workflows. More recent efforts [68, 69, 50, 49, 70, 31] introduce a degree of adaptivity by generating dynamic MAS in response to environmental feedback. However, such evolution is often one-shot : for example, AFlow [50] employs Monte Carlo Tree Search to construct a complex MAS tailored to a specific task domain, which yet lacks the capacity to evolve with increasing task exposure or transfer across domains [51, 71]. From this perspective, constructing MAS with genuine self-evolving capabilities remains an open and challenging research frontier.
Preliminary
In this section, we establish the notation and formalize key concepts of multi-agent systems and G-Memory 's hierarchical memory architecture.
Multi-agent System Formalization. Consider a multi-agent framework represented by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $|\mathcal{V}| = N$ is the number of agents and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ defines their communication channels. Each node $C_i \in \mathcal{V}$ corresponds to an individual agent described by the quadruple:
$$
C_i \triangleq \big( \mathrm{Base}_i,\; \mathrm{Role}_i,\; \mathrm{Mem}_i,\; \mathrm{Plugin}_i \big)
$$
where $\mathrm{Base}_i$ denotes the underlying large language model instance, $\mathrm{Role}_i$ specifies the agent's designated role or persona, $\mathrm{Mem}_i$ encapsulates its memory state, including past interactions or external knowledge stores, and $\mathrm{Plugin}_i$ is the set of auxiliary tools (e.g., a web-search engine).
Upon receiving a user query Q , the system evolves through T synchronous communication epochs. At each epoch t , we derive a topological ordering π = [ π 1 , . . . , π N ] of the nodes such that if there is an edge from π j to π k , then j < k , which guarantees that every agent processes its inputs only after all its predecessors have acted. For each agent C i in π , its output at iteration t is computed as:
$$
r_i^{(t)} = C_i\Big( \mathcal{P}^{(t)}_{\mathrm{sys}},\; Q,\; \big\{ r_j^{(t)} \mid C_j \in \mathcal{N}^{-}(C_i) \big\} \Big)
$$
where $r_i^{(t)}$ denotes the response generated by $C_i$ (which may include reasoning steps, intermediate analyses, or final proposals), $\mathcal{P}^{(t)}_{\mathrm{sys}}$ comprises the global instructions (including each agent's $\mathrm{Role}_i$), and $\mathcal{N}^{-}(C_i)$ is the set of in-neighbors of $C_i$, whose outputs serve as contextual inputs. After all agents have acted, a global aggregation operator $\mathcal{A}$ fuses the collection of responses into an interim solution $a^{(t)}$:
$$
a^{(t)} = \mathcal{A}\big( r_1^{(t)}, \ldots, r_N^{(t)} \big)
$$
Common implementations for A include majority voting schemes [48], hierarchical summarization via dedicated aggregator agents [13, 30], or simply adopting the final agent's output as the answer [47]. These epochs iterate for t = { 1 , . . . , T } until either a preset limit is reached or an early-stopping criterion is met [72], producing the final response a ( T ) to the query Q .
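As a concrete illustration (not the paper's actual implementation), the synchronous epoch loop above can be sketched in Python. Here `respond` is a toy stand-in for an agent's LLM call, and majority voting instantiates the aggregation operator $\mathcal{A}$:

```python
from collections import Counter

def topological_order(agents, edges):
    # Kahn's algorithm: each agent acts only after all its in-neighbors have acted.
    indeg = {a: 0 for a in agents}
    for _, dst in edges:
        indeg[dst] += 1
    queue = [a for a in agents if indeg[a] == 0]
    order = []
    while queue:
        a = queue.pop(0)
        order.append(a)
        for src, dst in edges:
            if src == a:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    queue.append(dst)
    return order

def run_epochs(agents, edges, query, respond, T=3):
    # `respond(agent, query, context)` stands in for the agent's LLM call.
    answer = None
    for _ in range(T):
        responses = {}
        for a in topological_order(agents, edges):
            context = [responses[s] for s, d in edges if d == a and s in responses]
            responses[a] = respond(a, query, context)
        # Aggregation operator A: majority vote over this epoch's responses.
        answer = Counter(responses.values()).most_common(1)[0][0]
    return answer
```

With agents `a → b` and `a → c`, `topological_order` guarantees `a` speaks first, and the epoch loop then aggregates the three responses into `a^(t)`.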
Memory Architecture. Our proposed G-Memory orchestrates and manages the memory of multiagent systems via the following three hierarchical graph structures:
[ ✱ ] Interaction Graph (Utterance Graph). For query $Q$, let $\mathcal{G}^{(Q)}_{\mathrm{inter}} = (\mathcal{U}^{(Q)}, \mathcal{E}^{(Q)}_u)$ denote its interaction trajectory, where (i) nodes $\mathcal{U}^{(Q)} = \{u_i\}$ represent atomic utterances, with each $u_i \triangleq (A_i, m_i)$ containing $A_i \in \mathcal{V}$ (speaking agent) and $m_i$ (textual content); (ii) edges $\mathcal{E}^{(Q)}_u \subseteq \mathcal{U}^{(Q)} \times \mathcal{U}^{(Q)}$ follow temporal relationships: $(u_j, u_k) \in \mathcal{E}^{(Q)}_u \iff u_j$ is transmitted to and inspires $u_k$.
[ ✱ ] Query Graph. The query graph, storing previously tackled queries and metadata, is as follows:
$$
\mathcal{G}_{\mathrm{query}} = ( \mathcal{Q}, \mathcal{E}_q )
$$
where $\mathcal{Q} = \{q_i\}$ is the node set, and each node $q_i \triangleq (Q_i, \Psi_i, \mathcal{G}^{(Q_i)}_{\mathrm{inter}})$ is composed of the original query $Q_i$, the task status $\Psi_i \in \{\mathrm{Failed}, \mathrm{Resolved}\}$, and its associated interaction graph $\mathcal{G}^{(Q_i)}_{\mathrm{inter}}$. The edges $\mathcal{E}_q \subseteq \mathcal{Q} \times \mathcal{Q}$ encode semantic relationships between queries. With its meticulous topology, the query graph enables retrieval beyond coarse metrics such as embedding similarity.
[ ✱ ] Insight Graph. The insight graph, which stores high-level knowledge distilled from past trajectories, is defined as:
$$
\mathcal{G}_{\mathrm{insight}} = ( \mathcal{I}, \mathcal{E}_i )
$$
where the node set $\mathcal{I} = \{\iota_k\}$ represents distilled insights; each node $\iota_k$ is composed of the insight content $\kappa_k$ and the set of supporting queries $\Omega_k \subseteq \mathcal{Q}$. The edges $\mathcal{E}_i \subseteq \mathcal{I} \times \mathcal{I} \times \mathcal{Q}$ form hyper-connections, where $(\iota_m, \iota_n, q_j)$ indicates that insight $\iota_m$ contextualizes $\iota_n$ through query $q_j$.
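For illustration only, the three node types of the hierarchy admit a minimal Python sketch; all field names here are our own stand-ins for the formal symbols, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:          # node of the interaction graph
    agent: str            # speaking agent A_i
    content: str          # textual message m_i

@dataclass
class QueryNode:          # node of the query graph
    query: str            # original query Q_i
    status: str           # task status: "Failed" or "Resolved"
    interaction: list = field(default_factory=list)  # utterances of G_inter

@dataclass
class Insight:            # node of the insight graph
    content: str                                  # insight content kappa_k
    supporting: set = field(default_factory=set)  # supporting query ids Omega_k
```

Edges (temporal, semantic, and hyper-connections, respectively) would then be stored as tuples over these node identifiers.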
G-Memory
This section outlines the management workflow of G-Memory , as illustrated in Figure 2. Specifically, upon the arrival of a new query Q , G-Memory first conducts coarse-grained retrieval to identify pertinent trajectory records ( ▷ Section 4.1). It then performs bi-directional hierarchical memory traversal: upward to retrieve collective cognitive insights, and downward to distill concrete procedural trajectories ( ▷ Section 4.2). After the memory-augmented MAS completes the query execution, the hierarchical memory architecture is jointly updated based on environmental feedback, thereby achieving the institutionalization of group knowledge ( ▷ Section 4.3).

Figure 2: The overview of our proposed G-Memory .
Coarse-grained Memory Retrieval
As a plug-in designed for seamless integration into mainstream MAS, G-Memory is triggered when the MAS G encounters a new user query Q . As emphasized in organizational memory theory [1], efficient knowledge retrieval typically begins with broadly relevant schemas prior to more fine-grained access. Following this principle, G-Memory first performs a coarse-grained similarity-based retrieval over the query graph G query to efficiently obtain a sketched set of queries Q S :
$$
\mathcal{Q}_S = \underset{q_i \in \mathcal{Q}}{\operatorname{arg\,top}\text{-}k} \; \operatorname{sim}\big( v(Q),\, v(Q_i) \big)
$$
where v ( · ) maps queries into fixed-length embeddings using models such as MiniLM [73]. While Equation (4) retrieves semantically similar historical queries, the similarity may be only superficial or noisy. Therefore, G-Memory further enlarges the relevant set via hop expansion on the query graph:
$$
\tilde{\mathcal{Q}}_S = \mathcal{Q}_S \cup \big\{\, q_j \mid (q_i, q_j) \in \mathcal{E}_q,\; q_i \in \mathcal{Q}_S \,\big\}
$$
where $\tilde{\mathcal{Q}}_S$ augments $\mathcal{Q}_S$ with its 1-hop neighbors on the query graph $\mathcal{G}_{\mathrm{query}}$. However, it is suboptimal to directly feed these relevant records as input, akin to certain single-agent memory systems [41, 37]. On one hand, the excessive context length may overwhelm the LLM; on the other hand, agents in MAS play distinct roles and should be assigned specialized memory tailored to their functions. To address this, the next section introduces a bi-directional processing scheme in G-Memory that operates over both abstract and fine-grained memory levels.
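A minimal sketch of this coarse-grained retrieval and hop expansion (Equations (4)–(5)), assuming a toy bag-of-words embedding in place of a sentence encoder such as MiniLM:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words stand-in for the embedding function v(.).
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_topk(query, history, k=2):
    # Equation (4): keep the k historical queries most similar to `query`.
    qv = embed(query)
    ranked = sorted(history, key=lambda h: cosine(qv, embed(h)), reverse=True)
    return ranked[:k]

def hop_expand(seed, edges, hops=1):
    # Equation (5): augment the retrieved set with its graph neighbors,
    # treating query-graph edges as undirected for expansion.
    expanded = set(seed)
    for _ in range(hops):
        frontier = {v for u, v in edges if u in expanded}
        frontier |= {u for u, v in edges if v in expanded}
        expanded |= frontier
    return expanded
```

For the running example, retrieving for 'put a clean cloth in countertop' over a small history surfaces the analogous 'put a clean egg in microwave' record, which is then expanded along its query-graph edges.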
Bi-directional Memory Traversal
Subsequent to identifying the expanded set of relevant query nodes ˜ Q S within G query , G-Memory executes a bi-directional memory traversal to furnish multi-granularity memory support. Specifically, G-Memory first performs an upward traversal ( G query →G insight ), retrieving insight nodes that may provide high-level guidance for the current task:
$$
\mathcal{I}_S = \Pi_{\mathcal{Q} \to \mathcal{I}}\big( \tilde{\mathcal{Q}}_S \big) = \big\{\, \iota_k \in \mathcal{I} \mid \Omega_k \cap \tilde{\mathcal{Q}}_S \neq \varnothing \,\big\}
$$
where $\Pi_{\mathcal{Q} \to \mathcal{I}}$ is a query-to-insight projector that identifies all the insight nodes whose supporting query sets intersect with the input query set, and the retrieved insights $\mathcal{I}_S$ encapsulate distilled, generalized knowledge potentially relevant for orienting the MAS $\mathcal{G}$'s strategic approach to $Q$.
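Under the set-intersection semantics above, the projector $\Pi_{\mathcal{Q}\to\mathcal{I}}$ admits a one-line sketch (insights are represented as plain dicts purely for illustration):

```python
def project_to_insights(query_ids, insights):
    # Equation (6): select every insight whose supporting query set Omega_k
    # intersects the retrieved query set Q_S-tilde.
    return [i for i in insights if i["supporting"] & set(query_ids)]
```

Any insight supported by at least one retrieved query is surfaced; insights whose supporters lie entirely outside the retrieved set are skipped.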
Beyond generalized insights, the fine-grained textual interaction history of the MAS is equally valuable, as it reveals the underlying reasoning patterns that led to successful or failed collaborations [68, 74, 75]. To utilize these concisely, in the downward traversal ( G query → G interaction ),
G-Memory employs an LLM-facilitated graph sparsifier S LLM ( · , · ) to extract the core subgraph that encapsulates essential inter-agent collaboration:
$$
\big\{ \hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}} \big\}_{j=1}^{M} = \Big\{ \mathcal{S}_{\mathrm{LLM}}\big( \mathcal{G}^{(Q_j)}_{\mathrm{inter}},\, Q \big) \;\Big|\; q_j \in \underset{q_j \in \tilde{\mathcal{Q}}_S}{\operatorname{arg\,top}\text{-}M} \; \mathcal{R}_{\mathrm{LLM}}(Q, q_j) \Big\}
$$
where $\mathcal{R}_{\mathrm{LLM}}(Q, q_j)$ rates the relevancy of historical queries w.r.t. $Q$, and the sparsifier $\mathcal{S}_{\mathrm{LLM}}(\mathcal{G}^{(Q_j)}_{\mathrm{inter}}, Q)$ constructs a sparsified graph $\hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}} = (\hat{\mathcal{U}}^{(Q_j)}, \hat{\mathcal{E}}^{(Q_j)}_u)$ from the original $\mathcal{G}^{(Q_j)}_{\mathrm{inter}}$ by identifying and retaining only the essential dialogue elements. Please refer to Appendix C for their implementations.
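A schematic sketch of this downward traversal, with the LLM-based rater $\mathcal{R}_{\mathrm{LLM}}$ and sparsifier $\mathcal{S}_{\mathrm{LLM}}$ replaced by caller-supplied stub functions (the real implementations are prompt-based; see Appendix C):

```python
def retrieve_core_subgraphs(query, candidates, rate, sparsify, M=2):
    # Equation (7): rate each candidate query's relevance to `query`
    # (stub for R_LLM), keep the top-M, and sparsify their interaction
    # graphs (stub for S_LLM) down to the core utterances.
    ranked = sorted(candidates, key=lambda c: rate(query, c["query"]), reverse=True)
    return [sparsify(c["interaction"], query) for c in ranked[:M]]
```

In practice, `rate` and `sparsify` would each be a single LLM call; any cheap heuristic (e.g., token overlap) can stand in for testing the surrounding plumbing.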
Upon completing the bi-directional traversal, we obtain both generalizable insights ($\mathcal{I}_S$) and detailed collaborative trajectories ($\{\hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}}\}_{j=1}^{M}$). G-Memory then proceeds to provide specialized memory support for each agent $C_i \in \mathcal{V}$ within the MAS $\mathcal{G}$:
$$
\mathrm{Mem}_i = \Phi\Big( \mathcal{I}_S,\; \big\{ \hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}} \big\}_{j=1}^{M} ;\; \mathrm{Role}_i,\; Q \Big)
$$
where the operator $\Phi(\cdot\,;\cdot)$ evaluates the utility and relevance of each insight $\iota_k \in \mathcal{I}_S$ and each sparsified interaction graph $\hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}}$ with respect to the agent's specific role $\mathrm{Role}_i$ and the task $Q$ (see Appendix C). Based on this evaluation, $\Phi$ initializes each agent's internal memory state $\mathrm{Mem}_i$ with filtered insights, interaction snippets, or summaries thereof, equipping it with pertinent historical context before it participates in the subsequent reasoning epochs of the MAS. It is worth noting that G-Memory is invoked at the onset of solving query $Q$ in our implementation. However, practitioners may flexibly configure more fine-grained invocation strategies, such as at the beginning of each MAS dialogue round or selectively for specific agents, based on their needs.
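The role-aware filtering performed by $\Phi$ can be sketched as follows; `relevant` is a hypothetical helper standing in for the LLM utility judgment, not the paper's prompt:

```python
def assign_memory(agents, insights, subgraphs, relevant):
    # Equation (8) sketch: Phi filters the retrieved memory per agent role.
    # `relevant(role, insight)` stands in for the LLM's utility judgment.
    return {
        a["role"]: {
            "insights": [i for i in insights if relevant(a["role"], i)],
            "trajectories": subgraphs,  # shared condensed trajectories
        }
        for a in agents
    }
```

Each agent thus receives only the insights judged useful for its role, plus the shared condensed trajectories, before the MAS reasoning epochs begin.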
Hierarchy Memory Update
After completing memory augmentation for each agent, the system $\mathcal{G}$ is executed as outlined in Section 3, yielding a final solution $a^{(T)}$ and receiving environmental feedback, including the execution status $\Psi \in \{\mathrm{Failed}, \mathrm{Resolved}\}$, token usage, and other performance metrics. Subsequently, G-Memory updates its hierarchical memory architecture to incorporate this new query. At the interaction level, G-Memory traces each agent's utterances to construct the interaction graph $\mathcal{G}^{(Q)}_{\mathrm{inter}}$, which is then stored. At the query level, a new query node $q_{\mathrm{new}}$ is instantiated and added to the query graph $\mathcal{G}_{\mathrm{query}}$:
$$
\mathcal{G}^{\mathrm{next}}_{\mathrm{query}} = \Big( \mathcal{Q} \cup \{ q_{\mathrm{new}} \},\;\; \mathcal{E}_q \cup \big\{ (q_{\mathrm{new}}, q_j) \;\big|\; q_j \in \mathcal{Q}_R \cup \textstyle\bigcup_{\iota_k \in \mathcal{I}_S} \Omega_k \big\} \Big)
$$
where edges are established between $q_{\mathrm{new}}$ and (i) the set $\mathcal{Q}_R$ containing the top-$M$ relevant historical queries identified in Equation (7), and (ii) the set of queries $\bigcup_{\iota_k \in \mathcal{I}_S} \Omega_k$ that support the insights $\mathcal{I}_S$ utilized for solving $Q$. $\mathcal{G}^{\mathrm{next}}_{\mathrm{query}}$ denotes the updated query graph.
Finally, at the insight level, G-Memory integrates the learnings from the completed query $Q$ into the insight graph $\mathcal{G}_{\mathrm{insight}} = (\mathcal{I}, \mathcal{E}_i)$. First, possible new insights summarizing the experience are generated and structurally linked via a summarization function $\mathcal{J}(\cdot, \cdot)$ (see prompt in Appendix C) as follows:
$$
\iota_{\mathrm{new}} = \mathcal{J}\big( \mathcal{G}^{(Q)}_{\mathrm{inter}},\, \Psi \big), \qquad \mathcal{E}^{\mathrm{next}}_i = \mathcal{E}_i \cup \big\{ (\iota_{\mathrm{new}}, \iota_k, q_{\mathrm{new}}) \mid \iota_k \in \mathcal{I}_S \big\}
$$
where edges are added to connect the previously utilized insights that inspired the completion of $Q$ in Equation (6). Afterward, the supporting query sets ($\Omega_k$) of the utilized insights ($\mathcal{I}_S$) are updated to include $q_{\mathrm{new}}$, reflecting their relevance to this successful (or failed) application:
$$
\Omega_k \leftarrow \Omega_k \cup \{ q_{\mathrm{new}} \}, \;\; \forall\, \iota_k \in \mathcal{I}_S, \qquad \mathcal{I}^{\mathrm{next}} = \mathcal{I} \cup \{ \iota_{\mathrm{new}} \}
$$
where the final node set $\mathcal{I}^{\mathrm{next}}$ incorporates the new insight and the updated versions of the utilized insights, and the resulting graph $\mathcal{G}^{\mathrm{next}}_{\mathrm{insight}}$ thus encapsulates the integrated knowledge. This continuous update cycle across all hierarchical levels enables G-Memory to learn and adaptively refine its collective memory based on ongoing experience.
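The query- and insight-level updates above can be sketched in Python; the dict-based graphs and string identifiers are our own illustrative choices, not the paper's data model:

```python
def update_memory(query_graph, insight_graph, new_query, used_insights,
                  relevant_ids, new_insight=None):
    # Query level: add q_new and link it both to the top-M relevant
    # historical queries (Q_R) and to the supporters (Omega_k) of the
    # insights used while solving it.
    qid = new_query["id"]
    query_graph["nodes"][qid] = new_query
    supporters = (set().union(*(i["supporting"] for i in used_insights))
                  if used_insights else set())
    for other in set(relevant_ids) | supporters:
        query_graph["edges"].append((qid, other))
    # Insight level: optionally add a newly distilled insight, and extend
    # the supporting sets Omega_k of the utilized insights with q_new.
    if new_insight is not None:
        insight_graph.append(new_insight)
    for i in used_insights:
        i["supporting"].add(qid)
    return query_graph, insight_graph
```

Running this once per solved query realizes the continuous update cycle: the query graph densifies around recurring task patterns, while the utilized insights accumulate evidence in their supporting sets.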
Table 1: Performance comparison with single/multi-agent memory architectures on five benchmarks. The underlying LLM backbone is GPT-4o-mini . We highlight the best and second best results.
Experiment
In this section, we conduct extensive experiments to answer: ( RQ1 ) How does G-Memory perform compared to existing single/multi-agent memory architectures? ( RQ2 ) Does G-Memory incur excessive resource overhead? ( RQ3 ) How sensitive is G-Memory to its key components and parameters?
Experiment Setup
Datasets and Benchmarks. To thoroughly evaluate the effectiveness of G-Memory , we adopt five widely-adopted benchmarks across three domains: (1) Knowledge reasoning , including HotpotQA [76] and FEVER [77]; (2) Embodied action , including ALFWorld [78] and SciWorld [79]; (3) Game , namely PDDL [80]. Details on these benchmarks are in Appendix A.1.
Baselines. We select four representative single-agent memory baselines, including non-memory, Voyager [16], MemoryBank [36], and Generative Agents [19], as well as three multi-agent memory implementations from MetaGPT [21], ChatDev [46], and MacNet [47], denoted as MetaGPT-M, ChatDev-M, and MacNet-M, respectively. Details are in Appendix A.2.
MAS and LLM Backbones. We select three representative multi-agent frameworks to integrate with G-Memory and the baselines, including AutoGen [13], DyLAN [72], and MacNet [47]. More details on the MAS setups are placed in Appendix A.3. For instantiating these MAS frameworks, we adopt two open-source LLMs, Qwen-2.5-7b and Qwen-2.5-14b , as well as one proprietary LLM, gpt-4o-mini . The deployment of Qwen series is via local instantiation using Ollama 1 , and GPT models are accessed via OpenAI APIs.
Parameter Configurations. We implement the embedding function v ( · ) in Equation (4) with ALL-MINILM-L6-V2 [81]. The number of the most relevant interaction graphs M in Equation (7) is set among { 2 , 3 , 4 , 5 } , and the number of relevant queries k in Equation (4) is set among { 1 , 2 } . The detailed ablation study on hyper-parameters is placed in Section 5.4.
Evaluation Metrics. We use exact match accuracy for FEVER and HotpotQA. For SciWorld and PDDL, we report the progress rate, and for ALFWorld, we use the success rate as the evaluation metric.
Main Results (RQ1)
Tables 1, 2 and 3 comprehensively report the performance of different memory architectures across three LLM backbones and three MAS frameworks. We summarize the key observations as follows:
1 http://github.com/ollama/ollama

Figure 3: Cost analysis of G-Memory . We showcase the performance versus the overall system token cost when combined with different memory architectures.
Takeaway ➊: G-Memory consistently improves performance across all task domains and MAS frameworks. As shown in Table 2, when integrated with AutoGen and MacNet (powered by Qwen-2.5-7b), G-Memory surpasses the best-performing single-/multi-agent memory baselines by an average of 6.8% and 5.5%, respectively. With the more capable Qwen-2.5-14b, the improvement is even more pronounced: in Table 3, G-Memory boosts MacNet's performance on ALFWorld from 58.21% to 79.10%, achieving a substantial 20.89% gain.
Takeaway ➋: Multi-agent systems demand specialized memory designs. A thorough examination of existing baselines reveals a surprising insight: most memory mechanisms fail to consistently benefit MAS settings. In Table 2, baselines such as Voyager and MemoryBank degrade AutoGen's performance on PDDL by as much as 4.17% and 1.34%, respectively. We attribute this to the inability of these methods to provide agent-role-specific memory support, which is essential in the PDDL strategic game tasks, where effective division of labor is critical to success. Even MAS-oriented designs, such as ChatDev-M, result in a 2.32% performance drop when applied to MacNet+SciWorld. We attribute this to ChatDev-M's narrow memory scope, which stores only the execution results of past queries and thus provides limited utility in embodied action environments. These findings highlight the necessity of G-Memory's core characteristics: role-specific memory cues, abstracted high-level insights, and trajectory condensation, all of which are critical for effective memory in MAS.
Cost Analysis (RQ2)
To evaluate the efficiency of G-Memory in terms of token consumption, we visualize the performance versus token cost trade-off across various settings, as shown in Figures 3 and 7. Our findings are:
Takeaway ➌: G-Memory achieves high-performing collective memory without excessive token consumption. As depicted in Figure 3, G-Memory consistently delivers the highest performance improvement (10.32% ↑ over the no-memory setting on PDDL+AutoGen) while maintaining a modest increase in token consumption (only $1.4 \times 10^6$ tokens). In contrast, MetaGPT-M incurs an additional $2.2 \times 10^6$ tokens for a mere 4.07% gain. This clearly demonstrates the token-efficiency of G-Memory.
Framework Analysis (RQ3)
Sensitivity Analysis. Regarding the hop expansion, as shown in Figure 4a, 1-hop expansion consistently yields the best or near-best performance across tasks, with peak accuracies of 85.82% (ALFWorld) and 55.24% (PDDL) in AutoGen. In contrast, 2-hop and 3-hop settings often degrade performance, e.g., PDDL drops to 49.79% (2-hop). This suggests that excessive hop expansion may introduce irrelevant insights during the upward memory traversal, impairing task-specific reasoning. Similarly, Figure 4b shows that the optimal k lies in {1, 2}. Larger k values (e.g., k = 5) can significantly degrade system performance, e.g., 7.71% ↓ on ALFWorld+AutoGen and 2.5% ↓ on FEVER+DyLAN, indicating that retrieving more queries may introduce task-irrelevant noise. Collectively, we employ 1-hop expansion and k ∈ {1, 2} throughout the experiments.
Ablation Study. Figure 4c presents an ablation of G-Memory by isolating the impact of the high-level insight module ($\mathcal{I}_S$ in Equation (6)) and the fine-grained interactions ($\{\hat{\mathcal{G}}^{(Q_j)}_{\mathrm{inter}}\}_{j=1}^{M}$ in Equation (7)). As shown, removing either part leads to a consistent performance drop. When only fine-grained interactions are enabled, the average scores drop by 4.47% ↓ for AutoGen and 3.82% ↓ for DyLAN compared to the full method. Conversely, enabling only insights leads to smaller drops of 3.95% and 3.39%. This indicates that while both components are contributive, the interactions offer a slightly greater impact, likely because they preserve more fine-grained, dialogue-level contextual grounding.

Figure 4: (a) Sensitivity analysis of the hop expansion in Equation (5); (b) sensitivity analysis of the number of selected queries k in Equation (4); (c) ablation on two variants of G-Memory: providing only high-level insights (i.e., the insights $\mathcal{I}_S$ in Equation (6)) or only fine-grained interactions (i.e., the core trajectories in Equation (7)). All experiments here are done with Qwen-2.5-14b.

Figure 5: Case study of G-Memory.
Case Study
Figure 5 illustrates concrete memory cues provided by G-Memory across diverse tasks. For example, in the ALFWorld+AutoGen setting, given the task query 'put a clean cloth in countertop', G-Memory successfully retrieves a highly analogous historical query, 'put a clean egg in microwave'; both require the object to be in a clean state. Alongside this, G-Memory surfaces a critical trajectory segment where the solver agent attempts to place the egg in the microwave before cleaning it, prompting the ground agent to intervene. This collaborative trajectory offers actionable guidance for the current task. Moreover, the high-level insights retrieved by G-Memory prove equally valuable for task execution. In HotpotQA's web search task, G-Memory retrieves an insight warning against 'mistakenly referring', which helps prevent agents from answering incorrectly based on similarly named individuals. Overall, G-Memory provides effective multi-level memory support across varied domains, including embodied action, knowledge reasoning, and game environments.
Conclusion & Limitation
In this paper, we conduct a thorough examination of existing memory architectures designed for multi-agent systems (MAS) and identify that their overly simplified designs fundamentally hinder the systems' capacity for self-evolution. To bridge this gap, we propose G-Memory, a hierarchical memory framework that organizes the complex and extended interaction trajectories of MAS into a three-tier graph hierarchy: the insight, query, and interaction graphs. G-Memory provides each agent with customized and hierarchical memory cues, ranging from abstract, generalizable insights to fine-grained, task-critical collaborative segments, and dynamically evolves its knowledge base across episodes. Extensive experiments demonstrate that G-Memory can be seamlessly integrated into state-of-the-art MAS frameworks, significantly enhancing their self-evolution capability, e.g., up to a 20.89% ↑ improvement on embodied action tasks. Limitations: Although G-Memory has been evaluated across three domains and five benchmarks, further validation on more diverse tasks (e.g., medical QA) would strengthen its soundness, which we leave for future work.
Impact Statement
G-Memory introduces a structured, hierarchical memory architecture for multi-agent systems (MAS), enabling large language model (LLM)-based agents to store, recall, and reason over past experiences with enhanced task generalization and cooperation efficiency. The broader impacts of this work include advancing the development of scalable and adaptive collective intelligence, with potential applications in long-term robotic planning, real-world decision-making systems, and collaborative AI assistants. However, if the underlying language model is compromised or adversarially manipulated, the memory mechanisms could amplify incorrect reasoning. We urge responsible deployment of this architecture with appropriate safeguards, including continual validation, adversarial robustness checks, and alignment with human values.
Experimental Details
Dataset Descriptions
In this section, we describe the datasets used in our experiments:
Evaluation Metrics. We use exact-match accuracy for FEVER and HotpotQA. For ScienceWorld and PDDL, we report the progress rate, and for ALFWorld, we use the success rate as the evaluation metric.
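The three metrics can be computed as follows; this is an illustrative sketch in which the string normalization and the subgoal-based definition of progress rate are our assumptions rather than the paper's exact protocol:

```python
def exact_match(pred: str, gold: str) -> float:
    """Exact-match accuracy for one QA pair (FEVER / HotpotQA style),
    after light normalization (lowercase, collapsed whitespace)."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return float(norm(pred) == norm(gold))

def success_rate(outcomes) -> float:
    """ALFWorld: fraction of episodes fully completed (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes)

def progress_rate(completed_subgoals: int, total_subgoals: int) -> float:
    """ScienceWorld / PDDL: fraction of subgoals achieved in an episode."""
    return completed_subgoals / total_subgoals

print(exact_match("Paris ", "paris"))  # 1.0
print(success_rate([1, 0, 1, 1]))      # 0.75
```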
Baseline Setup
In this section, we provide detailed descriptions of each baseline used in our comparison:
Multi-agent System Setup
In this section, we detail the setups of our three adopted MAS frameworks, AutoGen, DyLAN and MacNet:
AutoGen
AutoGen [13] is a popular multi-agent orchestration framework that coordinates interactions among specialized agents for problem-solving tasks. Specifically, we utilize its A3: Decision Making structure, which is composed of: (1) a Solver Agent, responsible for generating solutions, initialized with the system prompt 'You are a smart agent designed to solve problems.'; (2) a Ground Truth Agent, which critically evaluates the solver's output and identifies potential errors based on a reference standard; and (3) an Executor Agent, tasked with translating validated solutions into executable commands. This modular design enables transparent, verifiable, and actionable multi-agent collaboration.
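The three-role loop can be sketched as below. This is an illustrative sketch of the control flow only, not the actual AutoGen API; the `llm(system_prompt, user_prompt)` interface, the `APPROVED` token, and the stub LLM are our assumptions:

```python
# Illustrative A3 decision-making loop: Solver proposes, Ground Truth critiques,
# Executor translates the validated solution into commands.
def a3_round(llm, task, max_turns=3):
    # Solver proposes a solution.
    solution = llm("You are a smart agent designed to solve problems.", task)
    for _ in range(max_turns):
        # Ground Truth agent evaluates the solution against a reference standard.
        critique = llm("Critically evaluate the solver's output.", f"{task}\n{solution}")
        if "APPROVED" in critique:
            break
        solution = llm("Revise the solution given this critique.", f"{solution}\n{critique}")
    # Executor translates the validated solution into executable commands.
    return llm("Translate the validated solution into executable commands.", solution)

# Stub LLM for demonstration: approves immediately and echoes commands.
def stub_llm(system, prompt):
    if "evaluate" in system:
        return "APPROVED"
    if "Translate" in system:
        return "CMD: " + prompt
    return "solve(" + prompt + ")"

print(a3_round(stub_llm, "open door"))  # CMD: solve(open door)
```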
DyLAN
DyLAN [72] is a debate-style framework similar to LLM-Debate, but it incorporates a more efficient agent-wise early-stopping mechanism during multi-turn interactions. DyLAN utilizes an agent selection algorithm based on an unsupervised metric, the Agent Importance Score, which identifies the most contributive agents through a preliminary trial tailored to the specific task. In our implementation of DyLAN, three agents engage in the debate, while an additional ranker agent evaluates their relative importance.
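The selection step reduces to ranking agents by their scores; the sketch below assumes the importance scores have already been produced by the preliminary trial (the scoring procedure itself is DyLAN's, not reproduced here):

```python
# Keep the n_keep agents whose Agent Importance Score is highest.
def select_agents(scores, n_keep):
    """scores: {agent_name: importance}; returns the most contributive agents."""
    return [a for a, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n_keep]]

print(select_agents({"planner": 0.9, "critic": 0.4, "coder": 0.7}, 2))
# ['planner', 'coder']
```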
MacNet
MacNet [47] is a representative work that explores decentralized and scalable multi-agent systems. Its key feature lies in the absence of a central agent; instead, it introduces edge agents, which are invoked between agent interactions to provide actionable instructions to the next agent based on the previous agent's outputs. In our implementation, we adopt the random graph topology from MacNet, shown to be robust across diverse scenarios, and employ five agents in addition to the edge agents.
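The edge-agent mechanism along a single path of the topology can be sketched as follows; the callable-agent interface is a hypothetical simplification for illustration, not MacNet's actual implementation:

```python
# MacNet-style message passing: an edge agent is invoked between node agents,
# rewriting the previous agent's output into an instruction for the next agent.
def propagate(node_agents, edge_agent, task):
    msg = task
    for agent in node_agents:
        out = agent(msg)        # node agent produces its contribution
        msg = edge_agent(out)   # edge agent turns it into the next instruction
    return msg

node = lambda m: m + "|node"
edge = lambda m: m + "|edge"
print(propagate([node, node], edge, "t"))  # t|node|edge|node|edge
```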
Additional Experiment Results
RQ1 Results
Tables 2 and 3 present additional experimental results using Qwen-2.5-7b and Qwen-2.5-14b as the LLM backbones. Figure 6 illustrates the success rate curves on ALFWorld as the number of trials increases, comparing different MAS frameworks combined with various memory architectures. As shown in Figures 6b and 6c, G-Memory consistently enables MAS frameworks to achieve success with fewer trials and leads to higher final performance ceilings.

Figure 6(a): The performance trajectory of AutoGen on ALFWorld.
RQ2 Results
Figure 7 provides additional comparisons of token cost across various benchmarks and MAS frameworks when combined with different memory architectures. Overall, G-Memory incurs only a marginal or no increase in token cost compared to classical baselines such as Generative and MetaGPT-M, while consistently delivering the most significant performance improvements.
Case Study on Insight Graphs
Figure 8 visualizes the high-level insights summarized by G-Memory on the ALFWorld benchmark across different MAS frameworks and LLM backbones. Given that ALFWorld naturally consists of diverse task categories, we further examine how insight nodes corresponding to different task types are interconnected. Overall, we observe dense intra-category connections among insights derived from similar tasks, while also noting the emergence of meaningful inter-category links, reflecting transferable patterns across task domains.
Case Study on Query Graphs
Figures 9 to 11 visualize the query graphs constructed by G-Memory on the ALFWorld, PDDL, and SciWorld benchmarks. Recall that a directed edge between two query nodes indicates that the historical trajectory of one query offers useful guidance for the execution of another. We observe emergent clustering patterns, where groups of semantically similar queries form densely connected subgraphs, while sparser inter-cluster edges capture cross-task inspirations. These patterns demonstrate G-Memory 's ability to effectively organize and relate collaborative experiences through structured memory reasoning.
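The growth of these query graphs follows the query-graph update rule: after a task, the new query node is connected to the retrieved queries and to the queries covered by the retrieved insights. Below is a minimal sketch under an assumed dictionary-based graph layout (the function name and data shapes are ours, for illustration only):

```python
# Append a solved query to the query graph: connect it to retrieved queries and
# to queries in the coverage sets (Omega_k) of the retrieved insights.
def add_query_node(graph, new_query, retrieved_queries, insight_coverage):
    """graph: {'nodes': set, 'edges': set of (src, dst) pairs}."""
    connect_to = set(retrieved_queries)
    for covered in insight_coverage:   # one coverage set per retrieved insight
        connect_to |= set(covered)
    graph["nodes"].add(new_query)
    graph["edges"] |= {(q, new_query) for q in connect_to}
    return graph

g = {"nodes": {"q1", "q2"}, "edges": set()}
g = add_query_node(g, "q3", ["q1"], [["q2"]])
print(sorted(g["edges"]))  # [('q1', 'q3'), ('q2', 'q3')]
```

Because every new node links both to its nearest retrieved queries and to insight-covered queries, semantically similar queries accumulate dense intra-cluster edges, matching the clustering patterns observed in the visualizations.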
Figure 6(b): The performance trajectory of DyLAN on ALFWorld.
Table 2: Performance comparison with single/multi-agent memory architectures on five benchmarks. The underlying LLM backbone is Qwen-2.5-7b . We highlight the best and second best results.
Prompt Set
Discussion with Related Works
$$ C_i = (\mathsf{Base}_i, \mathsf{Role}_i, \mathsf{Mem}_i, \mathsf{Plugin}_i), $$
$$ \mathcal{G}_\mathsf{query} = (\mathcal{Q}, \mathcal{E}_\mathsf{q}) = \left( \bigl\{ Q_i, \Psi_i, \mathcal{G}_\mathsf{inter}^{(Q_i)} \bigr\}_{i=1}^{|\mathcal{Q}|}, \mathcal{E}_\mathsf{q} \right), $$
$$ \label{eq:similarity} \mathcal{Q}^\mathcal{S} = \underset{q_i \in \mathcal{Q} \ \text{s.t.}\ |\mathcal{Q}^\mathcal{S}| = k}{\operatorname{arg\,top\text{-}}k} \left( \frac{\mathbf{v}(Q) \cdot \mathbf{v}(q_i)}{\|\mathbf{v}(Q)\|\,\|\mathbf{v}(q_i)\|} \right), \tag{4} $$
$$ \label{eq:hop_expansion} \tilde{\mathcal{Q}}^\mathcal{S} = \mathcal{Q}^\mathcal{S} \cup \left\{ Q_k \in \mathcal{Q} \mid \exists\, Q_j \in \mathcal{Q}^\mathcal{S},\ Q_k \in \mathcal{N}^+(Q_j) \cup \mathcal{N}^-(Q_j) \right\}, \tag{5} $$
$$ \label{eq:upward_retrieval} \mathcal{I}^\mathcal{S} = \Pi_{\mathcal{Q} \to \mathcal{I}}(\tilde{\mathcal{Q}}^\mathcal{S}), \quad \Pi_{\mathcal{Q} \to \mathcal{I}}(\mathcal{S}_q) \triangleq \left\{ \iota_k \in \mathcal{I} \mid \Omega_k \cap \mathcal{S}_q \neq \emptyset \right\}, \tag{6} $$
$$ \label{eq:downward_retrieval} \bigl\{ \hat{\mathcal{G}}^{Q_i}_{\mathsf{inter}} \bigr\}_{i=1}^{|M|} = \Bigl\{ \mathcal{S}_\text{LLM}(\mathcal{G}_\mathsf{inter}^{(Q_j)}, Q) \;\Big|\; q_j \in \underset{q'_k \in \tilde{\mathcal{Q}}^\mathcal{S} \ \text{s.t.}\ |\cdot| = M}{\operatorname{arg\,top\text{-}}M} \ \mathcal{R}_\text{LLM}(Q, q'_k) \Bigr\}, \tag{7} $$
$$ \label{eq:specialized_allocation} \mathsf{Mem}_i \leftarrow \Phi\left( \mathcal{I}^\mathcal{S}, \bigl\{ \hat{\mathcal{G}}^{Q_i}_{\mathsf{inter}} \bigr\}_{i=1}^{|M|};\ \mathsf{Role}_i, Q \right), \quad \forall\, C_i = (\mathsf{Base}_i, \mathsf{Role}_i, \mathsf{Mem}_i, \mathsf{Plugin}_i) \in \mathcal{V}, $$
$$ \label{eq:query_graph_update} \begin{gathered} q_\text{new} \leftarrow (Q, \Psi, \mathcal{G}_\mathsf{inter}^{(Q)}), \quad \mathcal{N}_\text{conn} \leftarrow \mathcal{Q}^\mathcal{R} \cup \Bigl( \bigcup_{\iota_k \in \mathcal{I}^\mathcal{S}} \Omega_k \Bigr), \\ \mathcal{E}_\text{new} \leftarrow \{ (q_n, q_\text{new}) \mid q_n \in \mathcal{N}_\text{conn} \}, \quad \mathcal{G}_\mathsf{query}^\text{next} \leftarrow (\mathcal{Q} \cup \{ q_\text{new} \}, \mathcal{E}_\mathsf{q} \cup \mathcal{E}_\text{new}), \end{gathered} $$
$$ \label{eq:insight_add_simple} \begin{gathered} \iota_\text{new} = (\mathcal{J}(\mathcal{G}_\mathsf{inter}^{(Q)}, \Psi), \{ q_\text{new} \}), \quad \mathcal{E}_\text{i,new} \leftarrow \{ (\iota_k, \iota_\text{new}, q_\text{new}) \mid \iota_k \in \mathcal{I}^\mathcal{S} \}, \\ \mathcal{G}_\mathsf{insight}' \leftarrow (\mathcal{I} \cup \{ \iota_\text{new} \}, \mathcal{E}_\mathsf{i} \cup \mathcal{E}_\text{i,new}), \end{gathered} $$
$$ \label{eq:insight_update_simple} \begin{gathered} \mathcal{I}^\text{next} \leftarrow (\mathcal{I} \setminus \mathcal{I}_\text{ret}) \cup \{ (\kappa_k, \Omega_k \cup \{ q_\text{new} \}) \mid \iota_k = (\kappa_k, \Omega_k) \in \mathcal{I}_\text{ret} \} \cup \{ \iota_\text{new} \}, \\ \mathcal{G}_\mathsf{insight}^\text{next} \leftarrow (\mathcal{I}^\text{next}, \mathcal{E}_\mathsf{i} \cup \mathcal{E}_\text{i,new}), \end{gathered} $$
$$ r_i^{(t)} = C_i\Bigl( P_{\mathrm{sys}}^{(t)}, Q, \{ r_j^{(t)} : C_j \in \mathcal{N}^-(C_i) \} \Bigr), $$
In this section, we further discuss the relationship between G-Memory and several recent agent memory frameworks. For A-Mem [61]: while both A-Mem and G-Memory aim to enhance the memory capabilities of LLM agents, they differ in two key aspects. First, A-Mem is tailored for single-agent scenarios, whereas G-Memory is designed to process the lengthy and nuanced interaction trajectories of MAS. Second, A-Mem emphasizes atomic memory construction for chatbot-style interactions, while G-Memory focuses on distilling reusable strategies from collaborative task execution, where fine-grained atomicity is neither required nor beneficial. For Mem0 [62]: although it also employs a graph-based structure, it remains within the chatbot paradigm. Its graph is closer to a knowledge graph, where nodes represent factual entities and edges represent relations, fundamentally differing from G-Memory's agent-centric memory graphs that encode trajectories, decisions, and coordination patterns across agents.
| MAS | Memory | ALFWorld | SciWorld | PDDL | HotpotQA | FEVER | Avg. |
|---|---|---|---|---|---|---|---|
| AutoGen COLM 2024 | No-memory | 77.61 ↑0.00 | 54.49 ↑0.00 | 23.53 ↑0.00 | 28.57 ↑0.00 | 57.13 ↑0.00 | 48.27 ↑0.00 |
| | Voyager | 85.07 ↑7.46 | 62.36 ↑7.87 | 24.56 ↑1.03 | 32.32 ↑3.75 | 63.27 ↑6.14 | 53.52 ↑5.25 |
| | MemoryBank | 74.96 ↓2.65 | 53.11 ↓1.38 | 20.41 ↓3.12 | 33.67 ↑5.10 | 61.22 ↑4.09 | 48.67 ↑0.40 |
| | Generative | 86.36 ↑8.75 | 61.19 ↑6.70 | 25.53 ↑2.00 | 31.63 ↑3.06 | 60.20 ↑3.07 | 52.98 ↑4.71 |
| | MetaGPT-M | 81.34 ↑3.73 | 61.91 ↑7.42 | 21.63 ↓1.90 | 32.67 ↑4.10 | 62.67 ↑5.54 | 52.04 ↑3.77 |
| | ChatDev-M | 79.85 ↑2.24 | 50.96 ↓3.53 | 16.65 ↓6.88 | 24.49 ↓4.08 | 59.18 ↑2.05 | 46.23 ↓2.04 |
| | MacNet-M | 76.55 ↓1.06 | 55.44 ↑0.95 | 22.94 ↓0.59 | 28.36 ↓0.21 | 60.87 ↑3.74 | 48.83 ↑0.56 |
| | G-Memory (Ours) | 88.81 ↑11.20 | 67.40 ↑12.91 | 27.77 ↑4.24 | 35.67 ↑7.10 | 66.24 ↑9.11 | 57.18 ↑8.91 |
| DyLAN COLM 2024 | No-memory | 56.72 ↑0.00 | 55.38 ↑0.00 | 11.62 ↑0.00 | 31.69 ↑0.00 | 60.20 ↑0.00 | 43.12 ↑0.00 |
| | Voyager | 66.42 ↑9.70 | 62.83 ↑7.45 | 15.10 ↑3.48 | 32.64 ↑0.95 | 62.24 ↑2.04 | 47.85 ↑4.73 |
| | MemoryBank | 55.22 ↓1.50 | 54.74 ↓0.64 | 8.08 ↓3.54 | 29.59 ↓2.10 | 59.13 ↓1.07 | 41.35 ↓1.77 |
| | Generative | 67.91 ↑11.19 | 64.16 ↑8.78 | 13.87 ↑2.25 | 29.29 ↓2.40 | 62.30 ↑2.10 | 47.51 ↑4.39 |
| | MetaGPT-M | 69.40 ↑12.68 | 62.37 ↑6.99 | 14.45 ↑2.83 | 32.34 ↑0.65 | 60.20 ↑0.00 | 47.75 ↑4.63 |
| | ChatDev-M | 46.27 ↓10.45 | 53.35 ↓2.03 | 10.75 ↓0.87 | 22.45 ↓9.24 | 58.33 ↓1.87 | 38.23 ↓4.89 |
| | MacNet-M | 53.44 ↓3.28 | 54.32 ↓1.06 | 12.11 ↑0.49 | 30.12 ↓1.57 | 61.10 ↑0.90 | 42.22 ↓0.90 |
| | G-Memory (Ours) | 70.90 | 65.64 | 18.95 | 34.69 | 64.22 | 50.88 |
| MacNet ICLR 2025 | No-memory | 51.49 ↑0.00 | 57.53 ↑0.00 | 12.18 ↑0.00 | 28.57 ↑0.00 | 60.29 ↑0.00 | 42.01 ↑0.00 |
| | Voyager | 61.94 ↑10.45 | 64.53 ↑7.00 | 14.06 ↑1.88 | 32.65 ↑4.08 | 62.54 ↑2.25 | 47.14 ↑5.13 |
| | MemoryBank | 50.00 ↓1.49 | 60.15 ↑2.62 | 8.64 ↓3.54 | 33.67 ↑5.10 | 61.22 ↑0.93 | 42.74 ↑0.73 |
| | Generative | 62.69 ↑11.20 | 65.49 ↑7.96 | 7.92 ↓4.26 | 29.59 ↑1.02 | 63.27 ↑2.98 | 45.79 ↑3.78 |
| | MetaGPT-M | 63.70 ↑12.21 | 65.27 ↑7.74 | 16.03 ↑3.85 | 31.00 ↑2.43 | 59.33 ↓0.96 | 47.07 ↑5.06 |
| | ChatDev-M | 49.25 ↓2.24 | 56.58 ↓0.95 | 13.51 ↑1.33 | 29.00 ↑0.43 | 59.18 ↓1.11 | 41.50 ↓0.51 |
| | MacNet-M | 53.44 ↑1.95 | 56.14 ↓1.39 | 13.59 ↑1.41 | 27.89 ↓0.68 | 59.20 ↓1.09 | 42.05 ↑0.04 |
| | G-Memory (Ours) | 67.16 | 68.11 | 24.33 | 35.69 | 64.44 | 51.95 |
| MAS | Inter. | Insi. | PDDL | FEVER |
|---|---|---|---|---|
| AutoGen | ✔ | ◦ | 54.46 | 63.27 |
| AutoGen | ◦ | ✔ | 50.00 | 68.77 |
| AutoGen | ✔ | ✔ | 55.24 | 71.43 |
| DyLAN | ✔ | ◦ | 48.75 | 61.39 |
| DyLAN | ◦ | ✔ | 46.69 | 64.31 |
| DyLAN | ✔ | ✔ | 51.12 | 66.66 |
| MAS | Memory | ALFWorld | SciWorld | PDDL | HotpotQA | FEVER | Avg. |
|---|---|---|---|---|---|---|---|
| Vanilla LLM | No-memory | 37.31 ↑0.00 | 23.49 ↑0.00 | 10.86 ↑0.00 | 20.26 ↑0.00 | 48.17 ↑0.00 | 28.02 ↑0.00 |
| | Voyager | 38.19 ↑0.88 | 24.11 ↑0.62 | 12.14 ↑1.28 | 19.12 ↓1.14 | 49.68 ↑1.51 | 28.65 ↑0.63 |
| | MemoryBank | 40.30 ↑2.99 | 21.64 ↓1.85 | 14.36 ↑3.50 | 18.79 ↓1.47 | 47.66 ↓0.51 | 28.55 ↑0.53 |
| | Generative | 39.16 ↑1.85 | 26.10 ↑2.61 | 11.37 ↑0.51 | 23.48 ↑3.22 | 52.50 ↑4.33 | 30.52 ↑2.50 |
| AutoGen COLM 2024 | No-memory | 52.99 ↑0.00 | 30.27 ↑0.00 | 16.17 ↑0.00 | 33.33 ↑0.00 | 58.74 ↑0.00 | 38.30 ↑0.00 |
| | Voyager | 55.22 ↑2.23 | 26.70 ↓3.57 | 12.00 ↓4.17 | 34.29 ↑0.96 | 52.44 ↓6.30 | 36.13 ↓2.17 |
| | MemoryBank | 53.37 ↑0.38 | 27.33 ↓2.94 | 14.83 ↓1.34 | 32.67 ↓0.66 | 59.45 ↑0.71 | 37.53 ↓0.77 |
| | Generative | 62.69 ↑9.70 | 31.45 ↑1.18 | 17.88 ↑1.71 | 34.17 ↑0.84 | 61.25 ↑2.51 | 41.49 ↑3.19 |
| | MetaGPT-M | 55.52 ↑2.53 | 32.44 ↑2.17 | 17.04 ↑0.87 | 35.36 ↑2.03 | 63.33 ↑4.59 | 40.74 ↑2.44 |
| | ChatDev-M | 46.27 ↓6.72 | 28.67 ↓1.60 | 13.42 ↓2.75 | 31.11 ↓2.22 | 61.32 ↑2.58 | 36.16 ↓2.14 |
| | MacNet-M | 53.18 ↑0.19 | 31.10 ↑0.83 | 16.89 ↑0.72 | 34.29 ↑0.96 | 58.43 ↓0.31 | 38.78 ↑0.48 |
| | G-Memory (Ours) | 67.91 ↑14.92 | 34.89 ↑4.62 | 21.01 ↑4.84 | 37.34 ↑4.01 | 64.34 ↑5.60 | 45.10 ↑6.80 |
| DyLAN COLM 2024 | No-memory | 41.34 ↑0.00 | 29.84 ↑0.00 | 13.56 ↑0.00 | 24.29 ↑0.00 | 56.23 ↑0.00 | 33.05 ↑0.00 |
| | Voyager | 51.49 ↑10.15 | 26.66 ↓3.18 | 10.62 ↓2.94 | 26.23 ↑1.94 | 55.39 ↓0.84 | 34.08 ↑1.03 |
| | MemoryBank | 46.46 ↑5.12 | 26.99 ↓2.85 | 14.10 ↑0.54 | 22.44 ↓1.85 | 59.21 ↑2.98 | 33.84 ↑0.79 |
| | Generative | 48.52 ↑7.18 | 31.55 ↑1.71 | 16.31 ↑2.75 | 26.54 ↑2.25 | 50.19 ↓6.04 | 34.62 ↑1.57 |
| | MetaGPT-M | 42.54 ↑1.20 | 30.93 ↑1.09 | 14.47 ↑0.91 | 19.33 ↓4.96 | 57.22 ↑0.99 | 32.90 ↓0.15 |
| | ChatDev-M | 39.85 ↓1.49 | 28.25 ↓1.59 | 7.14 ↓6.42 | 17.32 ↓6.97 | 50.67 ↓5.56 | 28.65 ↓4.41 |
| | MacNet-M | 42.48 ↑1.14 | 28.22 ↓1.62 | 14.23 ↑0.67 | 25.12 ↑0.83 | 55.34 ↓0.89 | 33.08 ↑0.03 |
| | G-Memory (Ours) | 52.99 ↑11.65 | 33.81 ↑3.97 | 20.71 ↑7.15 | 29.33 ↑5.04 | 63.67 ↑7.44 | 40.10 ↑7.05 |
| MacNet ICLR 2025 | No-memory | 44.03 ↑0.00 | 28.76 ↑0.00 | 13.36 ↑0.00 | 22.24 ↑0.00 | 55.12 ↑0.00 | 32.70 ↑0.00 |
| | Voyager | 47.01 ↑2.98 | 28.88 ↑0.12 | 11.36 ↓2.00 | 25.67 ↑3.43 | 58.78 ↑3.66 | 34.34 ↑1.64 |
| | MemoryBank | 52.24 ↑8.21 | 27.86 ↓0.90 | 13.33 ↓0.03 | 23.97 ↑1.73 | 54.18 ↓0.94 | 34.32 ↑1.61 |
| | Generative | 48.51 ↑4.48 | 31.05 ↑2.29 | 14.04 ↑0.68 | 24.49 ↑2.25 | 56.08 ↑0.96 | 34.83 ↑2.13 |
| | MetaGPT-M | 52.99 ↑8.96 | 29.87 ↑1.11 | 16.58 ↑3.22 | 25.51 ↑3.27 | 53.88 ↓1.24 | 35.77 ↑3.06 |
| | ChatDev-M | 44.78 ↑0.75 | 26.44 ↓2.32 | 10.19 ↓3.17 | 16.32 ↓5.92 | 56.02 ↑0.90 | 30.75 ↓1.95 |
| | MacNet-M | 43.55 ↓0.48 | 30.11 ↑1.35 | 12.91 ↓0.45 | 21.77 ↓0.47 | 50.71 ↓4.41 | 31.81 ↓0.89 |
| | G-Memory (Ours) | 54.48 | 32.23 | 17.48 | 27.53 | 59.14 | 38.17 |
| MAS | Memory | ALFWorld | SciWorld | PDDL | HotpotQA | FEVER | Avg. |
|---|---|---|---|---|---|---|---|
| AutoGen COLM 2024 | No-memory | 74.63 ↑0.00 | 46.84 ↑0.00 | 44.92 ↑0.00 | 24.49 ↑0.00 | 63.27 ↑0.00 | 50.83 ↑0.00 |
| | Voyager | 76.87 ↑2.24 | 59.00 ↑12.16 | 50.21 ↑5.29 | 31.33 ↑6.84 | 61.22 ↓2.05 | 55.73 ↑4.90 |
| | MemoryBank | 70.15 ↓4.48 | 54.18 ↑7.34 | 39.54 ↓5.38 | 32.65 ↑8.16 | 64.29 ↑1.02 | 52.16 ↑1.33 |
| | Generative | 74.63 ↑0.00 | 57.37 ↑10.53 | 54.46 ↑9.54 | 33.21 ↑8.72 | 63.27 ↑0.00 | 56.59 ↑5.76 |
| | MetaGPT-M | 82.09 ↑7.46 | 58.86 ↑12.02 | 48.99 ↑4.07 | 31.63 ↑7.14 | 62.27 ↓1.00 | 56.77 ↑5.94 |
| | ChatDev-M | 67.16 ↓7.47 | 40.69 ↓6.15 | 43.11 ↓1.81 | 31.77 ↑7.28 | 61.28 ↓1.99 | 48.80 ↓2.03 |
| | MacNet-M | 73.65 ↓0.98 | 42.14 ↓4.70 | 45.94 ↑1.02 | 26.72 ↑2.23 | 64.69 ↑1.42 | 50.63 ↓0.20 |
| | G-Memory (Ours) | 85.82 ↑11.19 | 60.62 ↑13.78 | 55.24 ↑10.32 | 34.61 ↑10.12 | 71.43 ↑8.16 | 61.54 ↑10.71 |
| DyLAN COLM 2024 | No-memory | 76.12 ↑0.00 | 53.24 ↑0.00 | 41.83 ↑0.00 | 30.61 ↑0.00 | 63.34 ↑0.00 | 53.03 ↑0.00 |
| | Voyager | 72.39 ↓3.73 | 58.93 ↑5.69 | 48.54 ↑6.71 | 30.71 ↑0.10 | 65.31 ↑1.97 | 55.18 ↑2.15 |
| | MemoryBank | 76.87 ↑0.75 | 57.92 ↑4.68 | 39.65 ↓2.18 | 29.59 ↓1.02 | 63.25 ↓0.09 | 53.46 ↑0.43 |
| | Generative | 77.91 ↑1.79 | 61.52 ↑8.28 | 46.69 ↑4.86 | 31.33 ↑0.72 | 61.39 ↓1.95 | 55.77 ↑2.74 |
| | MetaGPT-M | 79.10 ↑2.98 | 61.29 ↑8.05 | 49.75 ↑7.92 | 28.61 ↓2.00 | 64.11 ↑0.77 | 56.57 ↑3.54 |
| | ChatDev-M | 74.63 ↓1.49 | 54.03 ↑0.79 | 44.44 ↑2.61 | 30.67 ↑0.06 | 62.25 ↓1.09 | 53.20 ↑0.18 |
| | MacNet-M | 72.77 ↓3.35 | 52.22 ↓1.02 | 42.98 ↑1.15 | 29.22 ↓1.39 | 62.69 ↓0.65 | 51.98 ↓1.05 |
| | G-Memory (Ours) | 81.34 ↑5.22 | 64.68 ↑11.44 | 51.12 ↑9.29 | 34.63 ↑4.02 | 66.66 ↑3.32 | 59.69 ↑6.66 |
| MacNet ICLR 2025 | No-memory | 58.21 ↑0.00 | 52.21 ↑0.00 | 41.74 ↑0.00 | 28.60 ↑0.00 | 64.65 ↑0.00 | 49.08 ↑0.00 |
| | Voyager | 63.43 ↑5.22 | 60.24 ↑8.03 | 43.95 ↑2.21 | 29.67 ↑1.07 | 62.24 ↓2.41 | 51.91 ↑2.82 |
| | MemoryBank | 62.21 ↑4.00 | 55.52 ↑3.31 | 38.26 ↓3.48 | 26.53 ↓2.07 | 65.22 ↑0.57 | 49.55 ↑0.47 |
| | Generative | 73.13 ↑14.92 | 60.83 ↑8.62 | 44.00 ↑2.26 | 30.53 ↑1.93 | 65.31 ↑0.66 | 54.76 ↑5.68 |
| | MetaGPT-M | 70.43 ↑12.22 | 59.70 ↑7.49 | 42.34 ↑0.60 | 26.26 ↓2.34 | 66.33 ↑1.68 | 53.01 ↑3.93 |
| | ChatDev-M | 68.66 ↑10.45 | 45.98 ↓6.23 | 42.19 ↑0.45 | 29.49 ↑0.89 | 59.18 ↓5.47 | 49.10 ↑0.02 |
| | MacNet-M | 60.45 ↑2.24 | 51.14 ↓1.07 | 39.22 ↓2.52 | 28.77 ↑0.17 | 62.42 ↓2.23 | 48.40 ↓0.68 |
| | G-Memory (Ours) | 79.10 ↑20.89 | 61.74 ↑9.53 | 45.76 ↑4.02 | 32.33 ↑3.73 | 70.33 ↑5.68 | 57.85 ↑8.77 |
$$ \mathcal{G}_{\mathsf{insight}} = (\mathcal{I}, \mathcal{E}_{\mathsf{i}}) = \Bigl( \bigl\{ \langle \underbrace{\kappa_k, \Omega_k}_{\iota_k} \rangle \bigr\}_{k=1}^{|\mathcal{I}|},\; \mathcal{E}_{\mathsf{i}} \Bigr), $$
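The insight-graph tier above can be sketched as a minimal data structure. This is an illustrative sketch only, not the paper's implementation: the names `InsightNode`, `kappa`, and `omega` are hypothetical, with each node $\iota_k$ pairing an insight text $\kappa_k$ with a utility score $\Omega_k$, and $\mathcal{E}_\mathsf{i}$ stored as an index-based edge set.

```python
from dataclasses import dataclass, field

@dataclass
class InsightNode:
    kappa: str    # generalizable, cross-trial insight text (kappa_k)
    omega: float  # utility / reliability score (Omega_k)

@dataclass
class InsightGraph:
    # nodes play the role of I; edges play the role of E_i (pairs of node indices)
    nodes: list = field(default_factory=list)
    edges: set = field(default_factory=set)

    def add_insight(self, kappa: str, omega: float) -> int:
        """Assimilate a new insight node and return its index."""
        self.nodes.append(InsightNode(kappa, omega))
        return len(self.nodes) - 1

    def link(self, a: int, b: int) -> None:
        """Connect two related insights (an edge in E_i)."""
        self.edges.add((a, b))

    def top_insights(self, k: int) -> list:
        """Retrieve the k highest-utility insights for a new query."""
        return sorted(self.nodes, key=lambda n: n.omega, reverse=True)[:k]

g = InsightGraph()
i0 = g.add_insight("decompose embodied tasks into subgoals before acting", 0.9)
i1 = g.add_insight("cross-check retrieved facts between agents", 0.7)
g.link(i0, i1)
best = g.top_insights(1)[0]
```

In the full hierarchy, such retrieval would be combined with traversal down to the query and interaction tiers; the utility-ranked lookup here only illustrates the node/edge layout implied by the equation.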
References
[Aho:72] Alfred V. Aho, Jeffrey D. Ullman. (1972). The Theory of Parsing, Translation and Compiling.
[zhou2024symbolic] Zhou, Wangchunshu, Ou, Yixin, Ding, Shengwei, Li, Long, Wu, Jialong, Wang, Tiannan, Chen, Jiamin, Wang, Shuai, Xu, Xiaohua, Zhang, Ningyu, others. (2024). Symbolic learning enables self-evolving agents. arXiv preprint arXiv:2406.18532.
[liang2024self] Liang, Xuechen, Tao, Meiling, Xia, Yinghui, Shi, Tianyu, Wang, Jun, Yang, JingSong. (2024). Self-evolving Agents with reflective and memory-augmented abilities. arXiv preprint arXiv:2409.00872.
[wang2024moa] Wang, Junlin, Wang, Jue, Athiwaratkun, Ben, Zhang, Ce, Zou, James. (2024). Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692.
[chen2024routerdc] Chen, Shuhao, Jiang, Weisen, Lin, Baijiong, Kwok, James T, Zhang, Yu. (2024). RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models. arXiv preprint arXiv:2409.19886.
[han2024wildguard] Han, Seungju, Rao, Kavel, Ettinger, Allyson, Jiang, Liwei, Lin, Bill Yuchen, Lambert, Nathan, Choi, Yejin, Dziri, Nouha. (2024). Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495.
[APA:83] American Psychological Association. (1983). Publications Manual.
[Chandra:81] Ashok K. Chandra, Dexter C. Kozen, Larry J. Stockmeyer. (1981). Alternation. Journal of the Association for Computing Machinery. doi:10.1145/322234.322243.
[andrew2007scalable] Andrew, Galen, Gao, Jianfeng. (2007). Scalable training of L1-regularized log-linear models. Proceedings of the 24th International Conference on Machine Learning.
[Gusfield:97] Dan Gusfield. (1997). Algorithms on Strings, Trees and Sequences.
[rasooli-tetrault-2015] Mohammad Sadegh Rasooli, Joel R. Tetreault. (2015). Yara Parser: A Fast and Accurate Dependency Parser. Computing Research Repository.
[Ando2005] Ando, Rie Kubota, Zhang, Tong. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards AI. Large Scale Kernel Machines.
[li2024gslb] Li, Zhixun, Sun, Xin, Luo, Yifan, Zhu, Yanqiao, Chen, Dingshuo, Luo, Yingtao, Zhou, Xiangxin, Liu, Qiang, Wu, Shu, Wang, Liang, others. (2024). GSLB: the graph structure learning benchmark. Advances in Neural Information Processing Systems.
[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.
[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.
[Tang:12KDDCross] Jie Tang, Sen Wu, Jimeng Sun, Hang Su. (2012). Cross-domain Collaboration Recommendation. KDD'2012.
[sankar2020dysat] Sankar, Aravind, Wu, Yanhong, Gou, Liang, Zhang, Wei, Yang, Hao. (2020). Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. Proceedings of the 13th International Conference on Web Search and Data Mining.
[wu2022handling] Wu, Qitian, Zhang, Hengrui, Yan, Junchi, Wipf, David. (2022). Handling Distribution Shifts on Graphs: An Invariance Perspective. International Conference on Learning Representations.
[mikolov2013efficient] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. (2013). Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations.
[wu2022discovering] Yingxin Wu, Xiang Wang, An Zhang, Xiangnan He, Tat-Seng Chua. (2022). Discovering Invariant Rationales for Graph Neural Networks. The Tenth International Conference on Learning Representations.
[zhu2021shift] Zhu, Qi, Ponomareva, Natalia, Han, Jiawei, Perozzi, Bryan. (2021). Shift-robust gnns: Overcoming the limitations of localized graph training data. Advances in Neural Information Processing Systems.
[gagnon2022woods] Gagnon-Audet, Jean-Christophe, Ahuja, Kartik, Darvishi-Bayazi, Mohammad-Javad, Dumas, Guillaume, Rish, Irina. (2022). WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series Tasks. arXiv preprint arXiv:2203.09978.
[du2021adarnn] Du, Yuntao, Wang, Jindong, Feng, Wenjie, Pan, Sinno, Qin, Tao, Xu, Renjun, Wang, Chongjun. (2021). Adarnn: Adaptive learning and forecasting of time series. Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
[kim2021reversible] Kim, Taesung, Kim, Jinhee, Tae, Yunwon, Park, Cheonbok, Choi, Jang-Ho, Choo, Jaegul. (2021). Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift. International Conference on Learning Representations.
[venkateswaran2021environment] Venkateswaran, Praveen, Muthusamy, Vinod, Isahagian, Vatche, Venkatasubramanian, Nalini. (2021). Environment agnostic invariant risk minimization for classification of sequential datasets. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[driess2023palm-e] Driess, Danny, Xia, Fei, Sajjadi, Mehdi SM, Lynch, Corey, Chowdhery, Aakanksha, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, Huang, Wenlong, others. (2023). Palm-e: An embodied multimodal language model.
[huang2024understanding] Huang, Xu, Liu, Weiwen, Chen, Xiaolong, Wang, Xingmei, Wang, Hao, Lian, Defu, Wang, Yasheng, Tang, Ruiming, Chen, Enhong. (2024). Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716.
[wang2024executable] Wang, Xingyao, Chen, Yangyi, Yuan, Lifan, Zhang, Yizhe, Li, Yunzhu, Peng, Hao, Ji, Heng. (2024). Executable code actions elicit better llm agents. Forty-first International Conference on Machine Learning.
[putta2024agentq] Putta, Pranav, Mills, Edmund, Garg, Naman, Motwani, Sumeet, Finn, Chelsea, Garg, Divyansh, Rafailov, Rafael. (2024). Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199.
[masterman2024landscape] Masterman, Tula, Besen, Sandi, Sawtell, Mason, Chao, Alex. (2024). The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584.
[lu2021diversify] Lu, Wang, Wang, Jindong, Chen, Yiqiang, Sun, Xinwei. (2021). DIVERSIFY to Generalize: Learning Generalized Representations for Time Series Classification. arXiv preprint.
[zhu2024knowagent] Zhu, Yuqi, Qiao, Shuofei, Ou, Yixin, Deng, Shumin, Lyu, Shiwei, Shen, Yue, Liang, Lei, Gu, Jinjie, Chen, Huajun, Zhang, Ningyu. (2024). Knowagent: Knowledge-augmented planning for llm-based agents. arXiv preprint arXiv:2403.03101.
[skarding2021foundations] Skarding, Joakim, Gabrys, Bogdan, Musial, Katarzyna. (2021). Foundations and Modeling of Dynamic Networks Using Dynamic Graph Neural Networks: A Survey. IEEE Access.
[zhu2022learnable] Zhu, Yuecai, Lyu, Fuyuan, Hu, Chengming, Chen, Xi, Liu, Xue. (2022). Learnable Encoder-Decoder Architecture for Dynamic Graph: A Survey. arXiv preprint arXiv:2203.10480.
[zhang2024cut] Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen. (2024). Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems. arXiv preprint arXiv:2410.02506.
[wang2021inductive] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, Pan Li. (2021). Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. 9th International Conference on Learning Representations.
[cong2021dynamic] Cong, Weilin, Wu, Yanhong, Tian, Yuandong, Gu, Mengting, Xia, Yinglong, Mahdavi, Mehrdad, Chen, Chun-cheng Jason. (2021). Dynamic Graph Representation Learning via Graph Transformer Networks. arXiv preprint arXiv:2111.10447.
[hong2024data-interpreter] Hong, Sirui, Lin, Yizhang, Liu, Bang, Liu, Bangbang, Wu, Binhao, Zhang, Ceyao, Wei, Chenxing, Li, Danyang, Chen, Jiaqi, Zhang, Jiayi, others. (2024). Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679.
[yang2021discrete] Yang, Menglin, Zhou, Min, Kalander, Marcus, Huang, Zengfeng, King, Irwin. (2021). Discrete-time Temporal Network Embedding via Implicit Hierarchical Learning in Hyperbolic Space. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[sun2021hyperbolic] Sun, Li, Zhang, Zhongbao, Zhang, Jiawei, Wang, Feiyang, Peng, Hao, Su, Sen, Yu, Philip S. (2021). Hyperbolic variational graph neural network for modeling dynamic graphs. Proceedings of the AAAI Conference on Artificial Intelligence.
[xu2020inductive] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, Kannan Achan. (2020). Inductive representation learning on temporal graphs. 8th International Conference on Learning Representations.
[wang2021tcl] Wang, Lu, Chang, Xiaofu, Li, Shuang, Chu, Yunfei, Li, Hui, Zhang, Wei, He, Xiaofeng, Song, Le, Zhou, Jingren, Yang, Hongxia. (2021). Tcl: Transformer-based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944.
[rossi2020temporal] Rossi, Emanuele, Chamberlain, Ben, Frasca, Fabrizio, Eynard, Davide, Monti, Federico, Bronstein, Michael. (2020). Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
[hajiramezanali2019variational] Hajiramezanali, Ehsan, Hasanzadeh, Arman, Narayanan, Krishna, Duffield, Nick, Zhou, Mingyuan, Qian, Xiaoning. (2019). Variational graph recurrent neural networks. Advances in neural information processing systems.
[pareja2020evolvegcn] Pareja, Aldo, Domeniconi, Giacomo, Chen, Jie, Ma, Tengfei, Suzumura, Toyotaro, Kanezashi, Hiroki, Kaler, Tim, Schardl, Tao, Leiserson, Charles. (2020). Evolvegcn: Evolving graph convolutional networks for dynamic graphs. Proceedings of the AAAI Conference on Artificial Intelligence.
[seo2018structured] Seo, Youngjoo, Defferrard, Michaël, Vandergheynst, Pierre, Bresson, Xavier. (2018). Structured sequence modeling with graph convolutional recurrent networks. International Conference on Neural Information Processing.
[kipf2016variational] Kipf, Thomas N, Welling, Max. (2016). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
[arjovsky2019invariant] Arjovsky, Martin, Bottou, Léon, Gulrajani, Ishaan, Lopez-Paz, David. (2019). Invariant risk minimization. arXiv preprint.
[sagawa2019distributionally] Sagawa, Shiori, Koh, Pang Wei, Hashimoto, Tatsunori B, Liang, Percy. (2020). Distributionally Robust Neural Networks. International Conference on Learning Representations.
[krueger2021out] Krueger, David, Caballero, Ethan, Jacobsen, Joern-Henrik, Zhang, Amy, Binas, Jonathan, Zhang, Dinghuai, Le Priol, Remi, Courville, Aaron. (2021). Out-of-distribution generalization via risk extrapolation (rex). International Conference on Machine Learning.
[cadene2019rubi] Cadene, Remi, Dancette, Corentin, Cord, Matthieu, Parikh, Devi, others. (2019). Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems.
[aggarwal2014evolutionary] Aggarwal, Charu, Subbian, Karthik. (2014). Evolutionary network analysis: A survey. ACM Computing Surveys (CSUR).
[qiu2020temporal] Qiu, Zhenyu, Hu, Wenbin, Wu, Jia, Liu, Weiwei, Du, Bo, Jia, Xiaohua. (2020). Temporal network embedding with high-order nonlinear information. Proceedings of the AAAI Conference on Artificial Intelligence.
[huang2020motif] Huang, Hong, Fang, Zixuan, Wang, Xiao, Miao, Youshan, Jin, Hai. (2020). Motif-Preserving Temporal Network Embedding.. IJCAI.
[zhou2018dynamic] Zhou, Lekui, Yang, Yang, Ren, Xiang, Wu, Fei, Zhuang, Yueting. (2018). Dynamic network embedding by modeling triadic closure process. Proceedings of the AAAI conference on artificial intelligence.
[trivedi2019dyrep] Trivedi, Rakshit, Farajtabar, Mehrdad, Biswal, Prasenjeet, Zha, Hongyuan. (2019). Dyrep: Learning representations over dynamic graphs. International conference on learning representations.
[ding2021closer] Ding, Mucong, Kong, Kezhi, Chen, Jiuhai, Kirchenbauer, John, Goldblum, Micah, Wipf, David, Huang, Furong, Goldstein, Tom. (2021). A Closer Look at Distribution Shifts and Out-of-Distribution Generalization on Graphs.
[kovanen2011temporal] Kovanen, Lauri, Karsai, Márton, Kaski, Kimmo, Kertész, János, Saramäki, Jari. (2011). Temporal motifs in time-dependent networks. Journal of Statistical Mechanics: Theory and Experiment.
[benson2016higher] Benson, Austin R, Gleich, David F, Leskovec, Jure. (2016). Higher-order organization of complex networks. Science.
[paranjape2017motifs] Paranjape, Ashwin, Benson, Austin R, Leskovec, Jure. (2017). Motifs in temporal networks. Proceedings of the tenth ACM international conference on web search and data mining.
[zitnik2019evolution] Zitnik, Marinka, Sosič, Rok, Feldman, Marcus W., Leskovec, Jure. (2019). Evolution of resilience in protein interactomes across the tree of life. Proceedings of the National Academy of Sciences.
[coleman1994foundations] Coleman, James S. (1994). Foundations of social theory.
[huang2015triadic] Huang, Hong, Tang, Jie, Liu, Lu, Luo, JarDer, Fu, Xiaoming. (2015). Triadic closure pattern analysis and prediction in social networks. IEEE Transactions on Knowledge and Data Engineering.
[kovanen2013temporal] Kovanen, Lauri, Kaski, Kimmo, Kertész, János, Saramäki, Jari. (2013). Temporal motifs reveal homophily, gender-specific patterns, and group talk in call sequences. Proceedings of the National Academy of Sciences.
[glymour2016causal] Glymour, Madelyn, Pearl, Judea, Jewell, Nicholas P. (2016). Causal inference in statistics: A primer.
[pearl2000models] Pearl, Judea, others. (2000). Models, Reasoning and Inference. Cambridge, UK: Cambridge University Press.
[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in neural information processing systems.
[tian2006characterization] Tian, Jin, Kang, Changsung, Pearl, Judea. (2006). A characterization of interventional distributions in semi-Markovian causal models.
[brown1992survivorship] Brown, Stephen J, Goetzmann, William, Ibbotson, Roger G, Ross, Stephen A. (1992). Survivorship bias in performance studies. The Review of Financial Studies.
[berk1983introduction] Berk, Richard A. (1983). An introduction to sample selection bias in sociological data. American sociological review.
[simmel1950sociology] Simmel, Georg. (1950). The sociology of georg simmel.
[shen2021towards] Shen, Zheyan, Liu, Jiashuo, He, Yue, Zhang, Xingxuan, Xu, Renzhe, Yu, Han, Cui, Peng. (2021). Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624.
[nascimento2021dynamic] Nascimento, Diego C, Pimentel, Bruno A, Souza, Renata MCR, Costa, Lilia, Gonçalves, et al. (2021). Dynamic graph in a symbolic data framework: An account of the causal relation using COVID-19 reports and some reflections on the financial world. Chaos, Solitons & Fractals.
[zhang2021dyngraphtrans] Zhang, Shilei, Suzumura, Toyotaro, Zhang, Li. (2021). DynGraphTrans: Dynamic Graph Embedding via Modified Universal Transformer Networks for Financial Transaction Data. 2021 IEEE International Conference on Smart Data Services (SMDS).
[berger2006framework] Berger-Wolf, Tanya Y, Saia, Jared. (2006). A framework for analysis of dynamic social networks. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
[greene2010tracking] Greene, Derek, Doyle, Donal, Cunningham, Padraig. (2010). Tracking the evolution of communities in dynamic social networks. 2010 international conference on advances in social networks analysis and mining.
[peng2021dynamic] Peng, Hao, Du, Bowen, Liu, Mingsheng, Liu, Mingzhe, Ji, Shumei, Wang, Senzhang, Zhang, Xu, He, Lifang. (2021). Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning. Information Sciences.
[peng2020spatial] Peng, Hao, Wang, Hongfei, Du, Bowen, Bhuiyan, Md Zakirul Alam, Ma, Hongyuan, Liu, Jianwei, Wang, Lihong, Yang, Zeyu, Du, Linfeng, Wang, Senzhang, others. (2020). Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Information Sciences.
[wang2022causal] Wang, Wenjie, Lin, Xinyu, Feng, Fuli, He, Xiangnan, Lin, Min, Chua, Tat-Seng. (2022). Causal Representation Learning for Out-of-Distribution Recommendation. Proceedings of the ACM Web Conference 2022.
[jin2021community] Jin, Tian, Wu, Qiong, Ou, Xuan, Yu, Jianjun. (2021). Community detection and co-author recommendation in co-author networks. International Journal of Machine Learning and Cybernetics.
[ahuja2020empirical] Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, Kush R. Varshney. (2021). Empirical or Invariant Risk Minimization? A Sample Complexity Perspective. 9th International Conference on Learning Representations.
[huang2020graph] Huang, Kexin, Zitnik, Marinka. (2020). Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems.
[li2022out] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Zhu, Wenwu. (2022). Out-Of-Distribution Generalization on Graphs: A Survey. arXiv preprint.
[kingma2014adam] Diederik P. Kingma, Jimmy Ba. (2015). Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations.
[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[Fey/Lenssen/2019] Fey, Matthias, Lenssen, Jan E.. (2019). Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop on Representation Learning on Graphs and Manifolds.
[chang2020invariant] Chang, Shiyu, Zhang, Yang, Yu, Mo, Jaakkola, Tommi. (2020). Invariant rationalization. International Conference on Machine Learning.
[ahuja2020invariant] Ahuja, Kartik, Shanmugam, Karthikeyan, Varshney, Kush, Dhurandhar, Amit. (2020). Invariant risk minimization games. International Conference on Machine Learning.
[rosenfeld2020risks] Elan Rosenfeld, Pradeep Kumar Ravikumar, Andrej Risteski. (2021). The Risks of Invariant Risk Minimization. 9th International Conference on Learning Representations.
[mitrovic2020representation] Jovana Mitrovic, Brian McWilliams, Jacob C. Walker, Lars Holger Buesing, Charles Blundell. (2021). Representation Learning via Invariant Causal Mechanisms. 9th International Conference on Learning Representations.
[barrat2004architecture] Barrat, Alain, Barthelemy, Marc, Pastor-Satorras, Romualdo, Vespignani, Alessandro. (2004). The architecture of complex weighted networks. Proceedings of the national academy of sciences.
[Cho2014LearningPR] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP.
[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural computation.
[qin2022graph] Qin, Yijian, Wang, Xin, Zhang, Ziwei, Xie, Pengtao, Zhu, Wenwu. (2022). Graph Neural Architecture Search Under Distribution Shifts. International Conference on Machine Learning.
[li2022ood] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Zhu, Wenwu. (2022). Ood-gnn: Out-of-distribution generalized graph neural network. IEEE Transactions on Knowledge and Data Engineering.
[zhang2022learning] Zeyang Zhang, Ziwei Zhang, Xin Wang, Wenwu Zhu. (2022). Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum. Thirty-Sixth AAAI Conference on Artificial Intelligence.
[bengio2013representation] Bengio, Yoshua, Courville, Aaron, Vincent, Pascal. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence.
[hsieh2018learning] Hsieh, Jun-Ting, Liu, Bingbin, Huang, De-An, Fei-Fei, Li F, Niebles, Juan Carlos. (2018). Learning to decompose and disentangle representations for video prediction. Advances in neural information processing systems.
[ma2018disentangled] Ma, Liqian, Sun, Qianru, Georgoulis, Stamatios, Van Gool, Luc, Schiele, Bernt, Fritz, Mario. (2018). Disentangled person image generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[ma2019disentangled] Ma, Jianxin, Cui, Peng, Kuang, Kun, Wang, Xin, Zhu, Wenwu. (2019). Disentangled graph convolutional networks. International conference on machine learning.
[yang2020factorizable] Yang, Yiding, Feng, Zunlei, Song, Mingli, Wang, Xinchao. (2020). Factorizable graph convolutional networks. Advances in Neural Information Processing Systems.
[wang2022disentangled] Wang, Xin, Chen, Hong, Zhou, Yuwei, Ma, Jianxin, Zhu, Wenwu. (2022). Disentangled Representation Learning for Recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[li2021disentangled] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Yuan, Zehuan, Li, Hang, Zhu, Wenwu. (2021). Disentangled contrastive learning on graphs. Advances in Neural Information Processing Systems.
[li2022disentangled] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2022). Disentangled Graph Contrastive Learning With Independence Promotion. IEEE Transactions on Knowledge and Data Engineering.
[chen2021curriculum] Chen, Hong, Chen, Yudong, Wang, Xin, Xie, Ruobing, Wang, Rui, Xia, Feng, Zhu, Wenwu. (2021). Curriculum Disentangled Recommendation with Noisy Multi-feedback. Advances in Neural Information Processing Systems.
[wang2021multimodal] Wang, Xin, Chen, Hong, Zhu, Wenwu. (2021). Multimodal disentangled representation for recommendation. 2021 IEEE International Conference on Multimedia and Expo (ICME).
[ma2020disentangled] Ma, Jianxin, Zhou, Chang, Yang, Hongxia, Cui, Peng, Wang, Xin, Zhu, Wenwu. (2020). Disentangled self-supervision in sequential recommenders. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[ma2019learning] Ma, Jianxin, Zhou, Chang, Cui, Peng, Yang, Hongxia, Zhu, Wenwu. (2019). Learning disentangled representations for recommendation. Advances in neural information processing systems.
[wang2020disenhan] Wang, Yifan, Tang, Suyao, Lei, Yuntong, Song, Weiping, Wang, Sheng, Zhang, Ming. (2020). Disenhan: Disentangled heterogeneous graph attention network for recommendation. Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
[chen2016infogan] Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, Abbeel, Pieter. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems.
[denton2017unsupervised] Denton, Emily L, others. (2017). Unsupervised learning of disentangled representations from video. Advances in neural information processing systems.
[tran2017disentangled] Tran, Luan, Yin, Xi, Liu, Xiaoming. (2017). Disentangled representation learning gan for pose-invariant face recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.
[liu2020independence] Liu, Yanbei, Wang, Xiao, Wu, Shu, Xiao, Zhitao. (2020). Independence promoted graph disentangled networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[zhang2021disentangled] Zhang, Wenbin, Zhang, Liming, Pfoser, Dieter, Zhao, Liang. (2021). Disentangled dynamic graph deep generation. Proceedings of the 2021 SIAM International Conference on Data Mining (SDM).
[du2022disentangled] Yuanqi Du, Xiaojie Guo, Hengning Cao, Yanfang Ye, Liang Zhao. (2022). Disentangled Spatiotemporal Graph Generative Models. Thirty-Sixth AAAI Conference on Artificial Intelligence.
[chen2022invariance] Chen, Yongqiang, Zhang, Yonggang, Yang, Han, Ma, Kaili, Xie, Binghui, Liu, Tongliang, Han, Bo, Cheng, James. (2022). Invariance Principle Meets Out-of-Distribution Generalization on Graphs. arXiv preprint.
[fan2021generalizing] Fan, Shaohua, Wang, Xiao, Shi, Chuan, Cui, Peng, Wang, Bai. (2021). Generalizing Graph Neural Networks on Out-Of-Distribution Graphs. arXiv preprint arXiv:2111.10657.
[chang2020continuous] Chang, Xiaofu, Liu, Xuqin, Wen, Jianfeng, Li, Shuang, Fang, Yanming, Song, Le, Qi, Yuan. (2020). Continuous-time dynamic graph learning via neural interaction processes. Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
[huang2021coupled] Huang, Zijie, Sun, Yizhou, Wang, Wei. (2021). Coupled Graph ODE for Learning Interacting System Dynamics.. KDD.
[li2021intention] Li, Haoyang, Wang, Xin, Zhang, Ziwei, Ma, Jianxin, Cui, Peng, Zhu, Wenwu. (2021). Intention-aware sequential recommendation with structured intent transition. IEEE Transactions on Knowledge and Data Engineering.
[cai2021structural] Cai, Lei, Chen, Zhengzhang, Luo, Chen, Gui, Jiaping, Ni, Jingchao, Li, Ding, Chen, Haifeng. (2021). Structural temporal graph neural networks for anomaly detection in dynamic graphs. Proceedings of the 30th ACM international conference on Information & Knowledge Management.
[deng2020dynamic] Deng, Songgaojun, Rangwala, Huzefa, Ning, Yue. (2020). Dynamic knowledge graph based multi-event forecasting. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[yao2021interpretable] Yao, Yuhang, Joe-Wong, Carlee. (2021). Interpretable clustering on dynamic graphs with recurrent graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[you2019hierarchical] You, Jiaxuan, Wang, Yichen, Pal, Aditya, Eksombatchai, Pong, Rosenburg, Chuck, Leskovec, Jure. (2019). Hierarchical temporal convolutional networks for dynamic recommender systems. The world wide web conference.
[wang2021tedic] Wang, Yanbang, Li, Pan, Bai, Chongyang, Leskovec, Jure. (2021). TEDIC: Neural modeling of behavioral patterns in dynamic social interaction networks. Proceedings of the Web Conference 2021.
[wu2020temp] Jiapeng Wu, Meng Cao, Jackie Chi Kit Cheung, William L. Hamilton. (2020). TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[li2019fates] Li, Haoyang, Cui, Peng, Zang, Chengxi, Zhang, Tianyang, Zhu, Wenwu, Lin, Yishi. (2019). Fates of Microscopic Social Ecosystems: Keep Alive or Dead?. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[yao2022wildtime] Huaxiu Yao, Caroline Choi, Yoonho Lee, Pang Wei Koh, Chelsea Finn. (2022). Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[yao2022improving] Yao, Huaxiu, Wang, Yu, Li, Sai, Zhang, Linjun, Liang, Weixin, Zou, James, Finn, Chelsea. (2022). Improving Out-of-Distribution Robustness via Selective Augmentation. Proceeding of the Thirty-ninth International Conference on Machine Learning.
[li2022gil] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2022). Learning Invariant Graph Representations for Out-of-Distribution Generalization. Thirty-Sixth Conference on Neural Information Processing Systems.
[zhang2022dynamic] Zhang, Zeyang, Wang, Xin, Zhang, Ziwei, Li, Haoyang, Qin, Zhou, Zhu, Wenwu. (2022). Dynamic graph neural networks under spatio-temporal distribution shift. Advances in Neural Information Processing Systems.
[hu2020open] Hu, Weihua, Fey, Matthias, Zitnik, Marinka, Dong, Yuxiao, Ren, Hongyu, Liu, Bowen, Catasta, Michele, Leskovec, Jure. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems.
[Tang:08KDD] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, Zhong Su. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. KDD'08.
[sinha2015overview] Sinha, Arnab, Shen, Zhihong, Song, Yang, Ma, Hao, Eide, Darrin, Hsu, Bo-june Paul, Wang, Kuansan. (2015). An overview of microsoft academic service (mas) and applications. Proceedings of the 24th international conference on world wide web.
[wang2020microsoft] Wang, Kuansan, Shen, Zhihong, Huang, Chiyuan, Wu, Chieh-Han, Dong, Yuxiao, Kanakia, Anshul. (2020). Microsoft academic graph: When experts are not enough. Quantitative Science Studies.
[mikolov2013distributed] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, Dean, Jeff. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.
[kipf2016semi] Thomas N. Kipf, Max Welling. (2017). Semi-Supervised Classification with Graph Convolutional Networks. 5th International Conference on Learning Representations.
[velivckovicgraph] Veličković, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Liò, Pietro, Bengio, Yoshua. (2018). Graph Attention Networks. International Conference on Learning Representations.
[9847099] Bevilacqua, Beatrice, Zhou, Yangze, Ribeiro, Bruno. (2021). Size-invariant graph representations for graph classification extrapolations. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2021.3086060.
[han2022g] Han, Xiaotian, Jiang, Zhimeng, Liu, Ninghao, Hu, Xia. (2022). G-mixup: Graph data augmentation for graph classification. International Conference on Machine Learning.
[9780235] Wang, Shihao, Yu, Zhiding, Jiang, Xiaohui, Lan, Shiyi, Shi, Min, Chang, Nadine, Kautz, Jan, Li, Ying, Alvarez, Jose M. (2024). OmniDrive: A holistic LLM-agent framework for autonomous driving with 3D perception, reasoning and planning. arXiv preprint arXiv:2405.01533.
[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.
[packer2023memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G, Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems.
[hu2023chatdb] Hu, Chenxu, Fu, Jie, Du, Chenzhuang, Luo, Simian, Zhao, Junbo, Zhao, Hang. (2023). Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901.
[lu2023memochat] Lu, Junru, An, Siyu, Lin, Mingbao, Pergola, Gabriele, He, Yulan, Yin, Di, Sun, Xing, Wu, Yunsheng. (2023). Memochat: Tuning llms to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.
[li2023metaagents] Li, Yuan, Zhang, Yixuan, Sun, Lichao. (2023). Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500.
[gao2023s3] Gao, Chen, Lan, Xiaochong, Lu, Zhihong, Mao, Jinzhu, Piao, Jinghua, Wang, Huandong, Jin, Depeng, Li, Yong. (2023). S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984.
[wang2023recmind] Wang, Yancheng, Jiang, Ziyan, Chen, Zheng, Yang, Fan, Zhou, Yingxue, Cho, Eunah, Fan, Xing, Huang, Xiaojiang, Lu, Yanbin, Yang, Yingzhen. (2023). Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296.
[zhao2024expel] Zhao, Andrew, Huang, Daniel, Xu, Quentin, Lin, Matthieu, Liu, Yong-Jin, Huang, Gao. (2024). Expel: Llm agents are experiential learners. Proceedings of the AAAI Conference on Artificial Intelligence.
[modarressi2024memllm] Modarressi, Ali, Köksal, Abdullatif, Imani, Ayyoob, Fayyaz, Mohsen, Schütze, Hinrich. (2024). Memllm: Finetuning llms to use an explicit read-write memory. arXiv preprint arXiv:2404.11672.
[chen2024driving] Chen, Long, Sinavski, Oleg, Hünermann, Jan, others. (2024). Driving with llms: Fusing object-level vector modality for explainable autonomous driving. 2024 IEEE International Conference on Robotics and Automation (ICRA).
[sun2024optimizing] Sun, Yuan, Salami Pargoo, Navid, Jin, Peter, Ortiz, Jorge. (2024). Optimizing autonomous driving for safety: A human-centric approach with llm-enhanced rlhf. Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing.
[yang2024embodied] Yang, Yijun, Zhou, Tianyi, Li, Kanxue, Tao, Dapeng, Li, Lusong, Shen, Li, He, Xiaodong, Jiang, Jing, Shi, Yuhui. (2024). Embodied multi-modal agent trained by an llm from a parallel textworld. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[li2024embodied] Li, Manling, Zhao, Shiyu, Wang, Qineng, Wang, Kangrui, Zhou, Yu, Srivastava, Sanjana, Gokmen, Cem, Lee, Tony, Li, Erran Li, Zhang, Ruohan, others. (2024). Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems.
[zheng2023steve] Zheng, Sipeng, Liu, Jiazheng, Feng, Yicheng, Lu, Zongqing. (2023). Steve-eye: Equipping llm-based embodied agents with visual perception in open worlds. arXiv preprint arXiv:2310.13255.
[wei2024editable] Wei, Yuxi, Wang, Zi, Lu, Yifan, Xu, Chenxin, Liu, Changxing, Zhao, Hao, Chen, Siheng, Wang, Yanfeng. (2024). Editable scene simulation for autonomous driving via collaborative llm-agents. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[liu2021heterogeneous] Liu, Jiashuo, Hu, Zheyuan, Cui, Peng, Li, Bo, Shen, Zheyan. (2021). Heterogeneous risk minimization. International Conference on Machine Learning.
[yue2025masrouter] Yue, Yanwei, Zhang, Guibin, Liu, Boyang, Wan, Guancheng, Wang, Kun, Cheng, Dawei, Qi, Yiyan. (2025). Masrouter: Learning to route llms for multi-agent systems. arXiv preprint arXiv:2502.11133.
[wang2024battleagentbench] Wang, Wei, Zhang, Dan, Feng, Tao, Wang, Boyan, Tang, Jie. (2024). Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971.
[10.1145/3604427] Li, Haoyang, Zhang, Ziwei, Wang, Xin, Zhu, Wenwu. (2023). Invariant Node Representation Learning under Distribution Shifts with Multiple Latent Environments. ACM Transactions on Information Systems (TOIS). doi:10.1145/3604427.
[cai2023user] Cai, Desheng, Qian, Shengsheng, Fang, Quan, Hu, Jun, Xu, Changsheng. (2023). User cold-start recommendation via inductive heterogeneous graph neural network. ACM Transactions on Information Systems (TOIS).
[chen2020neural] Chen, Xu, Xiong, Kun, Zhang, Yongfeng, Xia, Long, Yin, Dawei, Huang, Jimmy Xiangji. (2020). Neural feature-aware recommendation with signed hypergraph convolutional network. ACM Transactions on Information Systems (TOIS).
[huang2023position] Huang, Liwei, Ma, Yutao, Liu, Yanbo, Danny Du, Bohong, Wang, Shuliang, Li, Deyi. (2023). Position-enhanced and time-aware graph convolutional network for sequential recommendations. ACM Transactions on Information Systems (TOIS).
[ma2023kr] Ma, Ting, Huang, Longtao, Lu, Qianqian, Hu, Songlin. (2023). Kr-gcn: Knowledge-aware reasoning with graph convolution network for explainable recommendation. ACM Transactions on Information Systems (TOIS).
[yang2021hgat] Yang, Tianchi, Hu, Linmei, Shi, Chuan, Ji, Houye, Li, Xiaoli, Nie, Liqiang. (2021). HGAT: Heterogeneous graph attention networks for semi-supervised short text classification. ACM Transactions on Information Systems (TOIS).
[zhang2022efraudcom] Zhang, Ge, Li, Zhao, Huang, Jiaming, Wu, Jia, Zhou, Chuan, Yang, Jian, Gao, Jianliang. (2022). efraudcom: An e-commerce fraud detection system via competitive graph neural networks. ACM Transactions on Information Systems (TOIS).
[zitnik2018modeling] Zitnik, Marinka, Agrawal, Monica, Leskovec, Jure. (2018). Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics.
[li2022graph] Li, Michelle M, Huang, Kexin, Zitnik, Marinka. (2022). Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering.
[li2023preference] Li, Yakun, Hou, Lei, Li, Juanzi. (2023). Preference-aware Graph Attention Networks for Cross-Domain Recommendations with Collaborative Knowledge Graph. ACM Transactions on Information Systems (TOIS).
[xie2021graph] Xie, Qianqian, Zhu, Yutao, Huang, Jimin, Du, Pan, Nie, Jian-Yun. (2021). Graph neural collaborative topic model for citation recommendation. ACM Transactions on Information Systems (TOIS).
[wang2021combining] Wang, Hongwei, Leskovec, Jure. (2021). Combining graph convolutional neural networks and label propagation. ACM Transactions on Information Systems (TOIS).
[bi2023predicting] Bi, Wendong, Xu, Bingbing, Sun, Xiaoqian, Xu, Li, Shen, Huawei, Cheng, Xueqi. (2023). Predicting the silent majority on graphs: Knowledge transferable graph neural network. Proceedings of the ACM Web Conference 2023.
[wu2020dynamic] Wu, Junshuang, Zhang, Richong, Mao, Yongyi, Guo, Hongyu, Soflaei, Masoumeh, Huai, Jinpeng. (2020). Dynamic graph convolutional networks for entity linking. Proceedings of The ACM Web Conference 2020.
[taheri2019learning] Taheri, Aynaz, Gimpel, Kevin, Berger-Wolf, Tanya. (2019). Learning to represent the evolution of dynamic graphs with recurrent models. Proceedings of the ACM Web Conference 2019.
[bai2023hgwavenet] Bai, Qijie, Nie, Changli, Zhang, Haiwei, Zhao, Dongming, Yuan, Xiaojie. (2023). HGWaveNet: A Hyperbolic Graph Neural Network for Temporal Link Prediction. Proceedings of the ACM Web Conference 2023.
[liu2022confidence] Liu, Hongrui, Hu, Binbin, Wang, Xiao, Shi, Chuan, Zhang, Zhiqiang, Zhou, Jun. (2022). Confidence may cheat: Self-training on graph neural networks under distribution shift. Proceedings of the ACM Web Conference 2022.
[tang2023dynamic] Tang, Haoran, Wu, Shiqing, Xu, Guandong, Li, Qing. (2023). Dynamic Graph Evolution Learning for Recommendation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[fu2021sdg] Fu, Dongqi, He, Jingrui. (2021). Sdg: A simplified and dynamic graph neural network. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[wang2022disenctr] Wang, Yifan, Qin, Yifang, Sun, Fang, Zhang, Bo, Hou, Xuyang, Hu, Ke, Cheng, Jia, Lei, Jun, Zhang, Ming. (2022). DisenCTR: Dynamic graph-based disentangled representation for click-through rate prediction. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[zhao2023time] Zhao, Ziwei, Zhu, Xi, Xu, Tong, Lizhiyu, Aakas, Yu, Yu, Li, Xueying, Yin, Zikai, Chen, Enhong. (2023). Time-interval Aware Share Recommendation via Bi-directional Continuous Time Dynamic Graphs. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[yang2023generic] Yang, Zhengyi, He, Xiangnan, Zhang, Jizhi, Wu, Jiancan, Xin, Xin, Chen, Jiawei, Wang, Xiang. (2023). A Generic Learning Framework for Sequential Recommendation with Distribution Shifts. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[gao2023alleviating] Gao, Yuan, Wang, Xiang, He, Xiangnan, Liu, Zhenguang, Feng, Huamin, Zhang, Yongdong. (2023). Alleviating structural distribution shift in graph anomaly detection. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.
[liu2023good] Liu, Yixin, Ding, Kaize, Liu, Huan, Pan, Shirui. (2023). Good-d: On unsupervised graph out-of-distribution detection. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.
[yang2023interpretable] Yang, Qiang, Ma, Changsheng, Zhang, Qiannan, Gao, Xin, Zhang, Chuxu, Zhang, Xiangliang. (2023). Interpretable Research Interest Shift Detection with Temporal Heterogeneous Graphs. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.
[wang2023tutorial] Wang, Jindong, Li, Haoliang, Pan, Sinno, Xie, Xing. (2023). A Tutorial on Domain Generalization. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.
[chen2022learning] Chen, Cen, Ye, Tiandi, Wang, Li, Gao, Ming. (2022). Learning to generalize in heterogeneous federated networks. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.
[wang2022adagcl] Wang, Yili, Zhou, Kaixiong, Miao, Rui, Liu, Ninghao, Wang, Xin. (2022). AdaGCL: Adaptive Subgraph Contrastive Learning to Generalize Large-scale Graph Training. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.
[wang2022imbalanced] Wang, Yu, Zhao, Yuying, Shah, Neil, Derr, Tyler. (2022). Imbalanced graph classification via graph-of-graph neural networks. Proceedings of the 31st ACM International Conference on Information & Knowledge Management.
[wang2019heterogeneous] Wang, Xiao, Ji, Houye, Shi, Chuan, Wang, Bai, Ye, Yanfang, Cui, Peng, Yu, Philip S. (2019). Heterogeneous graph attention network. The world wide web conference.
[wang2017community] Wang, Xiao, Cui, Peng, Wang, Jing, Pei, Jian, Zhu, Wenwu, Yang, Shiqiang. (2017). Community preserving network embedding. Proceedings of the AAAI conference on artificial intelligence.
[zhou2020graph] Zhou, Jie, Cui, Ganqu, Hu, Shengding, Zhang, Zhengyan, Yang, Cheng, Liu, Zhiyuan, Wang, Lifeng, Li, Changcheng, Sun, Maosong. (2020). Graph neural networks: A review of methods and applications. AI open.
[wu2020comprehensive] Wu, Zonghan, Pan, Shirui, Chen, Fengwen, Long, Guodong, Zhang, Chengqi, Philip, S Yu. (2020). A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems.
[xu2018powerful] Xu, Keyulu, Hu, Weihua, Leskovec, Jure, Jegelka, Stefanie. (2018). How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
[zhang2023graph] Zhang, Ziwei, Li, Haoyang, Zhang, Zeyang, Qin, Yijian, Wang, Xin, Zhu, Wenwu. (2023). Graph meets llms: Towards large graph models. NeurIPS 2023 Workshop: New Frontiers in Graph Learning.
[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards AI. Large Scale Kernel Machines.
[debate2-thu] Liang, Tian, He, Zhiwei, Jiao, Wenxiang, Wang, Xing, Wang, Yan, Wang, Rui, Yang, Yujiu, Tu, Zhaopeng, Shi, Shuming. (2023). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.
[PHPrompting] Zheng, Chuanyang, Liu, Zhengying, Xie, Enze, Li, Zhenguo, Li, Yu. (2023). Progressive-Hint Prompting Improves Reasoning in Large Language Models.
[blender] Jiang, Dongfu, Ren, Xiang, Lin, Bill Yuchen. (2023). LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[bondy1976graph] Bondy, John Adrian, Murty, Uppaluri Siva Ramachandra, others. (1976). Graph theory with applications.
[wang2023selfconsistency] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.
[spielman2008graph] Spielman, Daniel A, Srivastava, Nikhil. (2008). Graph sparsification by effective resistances. Proceedings of the fortieth annual ACM symposium on Theory of computing.
[chen2021unified] Chen, Tianlong, Sui, Yongduo, Chen, Xuxi, Zhang, Aston, Wang, Zhangyang. (2021). A unified lottery ticket hypothesis for graph neural networks. International conference on machine learning.
[zhang2024graph] Zhang, Guibin, Wang, Kun, Huang, Wei, Yue, Yanwei, Wang, Yang, Zimmermann, Roger, Zhou, Aojun, Cheng, Dawei, Zeng, Jin, Liang, Yuxuan. (2024). Graph lottery ticket automated. The Twelfth International Conference on Learning Representations.
[achille2018critical] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2018). Critical learning periods in deep networks. International Conference on Learning Representations.
[entezari2020all] Entezari, Negin, Al-Sayouri, Saba A, Darvishzadeh, Amirali, Papalexakis, Evangelos E. (2020). All you need is low (rank) defending against adversarial attacks on graphs. Proceedings of the 13th international conference on web search and data mining.
[ennadir2024simple] Ennadir, Sofiane, Abbahaddou, Yassine, Lutzeyer, Johannes F, Vazirgiannis, Michalis, Boström, Henrik. (2024). A Simple and Yet Fairly Effective Defense for Graph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[alchihabi2023efficient] Alchihabi, Abdullah, En, Qing, Guo, Yuhong. (2023). Efficient Low-Rank GNN Defense Against Structural Attacks. 2023 IEEE International Conference on Knowledge Graph (ICKG).
[you2019drawing] You, Haoran, Li, Chaojian, Xu, Pengfei, Fu, Yonggan, Wang, Yue, Chen, Xiaohan, Baraniuk, Richard G, Wang, Zhangyang, Lin, Yingyan. (2019). Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957.
[zhang2024two] Zhang, Guibin, Yue, Yanwei, Wang, Kun, Fang, Junfeng, Sui, Yongduo, Wang, Kai, Liang, Yuxuan, Cheng, Dawei, Pan, Shirui, Chen, Tianlong. (2024). Two heads are better than one: Boosting graph sparse training via semantic and topological awareness. arXiv preprint arXiv:2402.01242.
[williams1992simple] Williams, Ronald J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning.
[li2023api] Li, Minghao, Zhao, Yingxiu, Yu, Bowen, Song, Feifan, Li, Hangyu, Yu, Haiyang, Li, Zhoujun, Huang, Fei, Li, Yongbin. (2023). Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.
[wang2023brave] Wang, Kun, Liang, Yuxuan, Li, Xinglin, Li, Guohao, Ghanem, Bernard, Zimmermann, Roger, Yi, Huahui, Zhang, Yudong, Wang, Yang, others. (2023). Brave the wind and the waves: Discovering robust and generalizable graph lottery tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[chen2023demystifying] Chen, Yuhan, Ye, Haojie, Vedula, Sanketh, Bronstein, Alex, Dreslinski, Ronald, Mudge, Trevor, Talati, Nishil. (2023). Demystifying graph sparsification algorithms in graph properties preservation. Proceedings of the VLDB Endowment.
[augmented-lm-survey] Mialon, Grégoire, others. (2023). Augmented Language Models: a Survey. arXiv e-prints.
[audio-gpt] Huang, Rongjie, Li, Mingze, Yang, Dongchao, Shi, Jiatong, Chang, Xuankai, Ye, Zhenhui, Wu, Yuning, Hong, Zhiqing, Huang, Jiawei, Liu, Jinglin, Ren, Yi, Zhao, Zhou, Watanabe, Shinji. (2023). AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.
[visual-gpt] Chen, Jun, Guo, Han, Yi, Kai, Li, Boyang, Elhoseiny, Mohamed. (2021). VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning.
[neural-sequence] Dabagia, Max, Papadimitriou, Christos H, Vempala, Santosh S. Computation with Sequences in the Brain. arXiv e-prints.
[khattab2023dspy] Khattab, Omar, Singhvi, Arnav, Maheshwari, Paridhi, Zhang, Zhiyuan, Santhanam, Keshav, Vardhamanan, Sri, Haq, Saiful, Sharma, Ashutosh, Joshi, Thomas T, Moazam, Hanna, others. (2023). Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
[RUC2024agent-memory] Zhang, Zeyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Dai, Quanyu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.
[yang2023investlm] Yang, Yi, Tang, Yixuan, Tam, Kar Yan. (2023). Investlm: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064.
[li2024survey-mas] Li, Xinyi, Wang, Sai, Zeng, Siqi, Wu, Yu, Yang, Yi. (2024). A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth.
[zhu2023ghost] Zhu, Xizhou, Chen, Yuntao, Tian, Hao, Tao, Chenxin, Su, Weijie, Yang, Chenyu, Huang, Gao, Li, Bin, Lu, Lewei, Wang, Xiaogang, others. (2023). Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.
[zheng2023synapse] Zheng, Longtao, Wang, Rundong, Wang, Xinrun, An, Bo. (2023). Synapse: Trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863.
[zelikman2023self] Zelikman, Eric, Lorch, Eliana, Mackey, Lester, Kalai, Adam Tauman. (2023). Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint arXiv:2310.02304.
[zhang2025maas] Zhang, Guibin, Niu, Luyang, Fang, Junfeng, Wang, Kun, Bai, Lei, Wang, Xiang. (2025). Multi-agent Architecture Search via Agentic Supernet. arXiv preprint arXiv:2502.04180.
[yuan2024evoagent] Yuan, Siyu, Song, Kaitao, Chen, Jiangjie, Tan, Xu, Li, Dongsheng, Yang, Deqing. (2024). EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms. arXiv preprint arXiv:2406.14228.
[hu2024adas] Hu, Shengran, Lu, Cong, Clune, Jeff. (2024). Automated design of agentic systems. arXiv preprint arXiv:2408.08435.
[shang2024agentsquare] Shang, Yu, Li, Yu, Zhao, Keyu, Ma, Likai, Liu, Jiahe, Xu, Fengli, Li, Yong. (2024). AgentSquare: Automatic LLM Agent Search in Modular Design Space. arXiv preprint arXiv:2410.06153.
[liu2024evaluating] Liu, Jiawei, Xie, Songrun, Wang, Junhao, Wei, Yuxiang, Ding, Yifeng, Zhang, Lingming. (2024). Evaluating Language Models for Efficient Code Generation. arXiv preprint arXiv:2408.06450.
[ling2017program] Ling, Wang, Yogatama, Dani, Dyer, Chris, Blunsom, Phil. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.
[software-dev] Qian, Chen, Cong, Xin, Yang, Cheng, Chen, Weize, Su, Yusheng, Xu, Juyuan, Liu, Zhiyuan, Sun, Maosong. (2023). Communicative Agents for Software Development.
[debate3-multi-models] Xiong, Kai, Ding, Xiao, Cao, Yixin, Liu, Ting, Qin, Bing. (2023). Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate.
[chen2024compundLLM] Chen, Lingjiao, Davis, Jared Quincy, Hanin, Boris, Bailis, Peter, Stoica, Ion, Zaharia, Matei, Zou, James. (2024). Are more llm calls all you need? towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419.
[piatti2024cooperate] Piatti, Giorgio, Jin, Zhijing, Kleiman-Weiner, Max, Schölkopf, Bernhard, others. (2024). Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents. arXiv preprint arXiv:2404.16698.
[zhang2006introduction] Zhang, Ping, Chartrand, Gary. (2006). Introduction to graph theory. Tata McGraw-Hill.
[ma2023laser] Ma, Kaixin, Zhang, Hongming, Wang, Hongwei, Pan, Xiaoman, Yu, Wenhao, Yu, Dong. (2023). Laser: Llm agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172.
[bouzenia2024repairagent] Bouzenia, Islem, Devanbu, Premkumar, Pradel, Michael. (2024). Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134.
[thorne2018fever] Thorne, James, Vlachos, Andreas, Christodoulopoulos, Christos, Mittal, Arpit. (2018). FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355.
[yang2018hotpotqa] Yang, Zhilin, Qi, Peng, Zhang, Saizheng, Bengio, Yoshua, Cohen, William W, Salakhutdinov, Ruslan, Manning, Christopher D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
[zhou2025reso] Zhou, Heng, Geng, Hejia, Xue, Xiangyuan, Yin, Zhenfei, Bai, Lei. (2025). ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks. arXiv preprint arXiv:2503.02390.
[zhao2025sirius] Zhao, Wanjia, Yuksekgonul, Mert, Wu, Shirley, Zou, James. (2025). SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning. arXiv preprint arXiv:2502.04780.
[ishibashi2024selforganize-mother] Ishibashi, Yoichi, Nishimura, Yoshimasa. (2024). Self-organized agents: A llm multi-agent framework toward ultra large-scale code generation and optimization. arXiv preprint arXiv:2404.02183.
[BOOK_1991organizational_memory] Walsh, James P, Ungson, Gerardo Rivera. (1991). Organizational memory. Academy of management review.
[mmlu] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. (2021). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).
[hendrycks2021ethics] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt. (2021). Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR).
[hendrycksmath2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.
[generative-agents-simulacra] Park, Joon Sung, O'Brien, Joseph C., Cai, Carrie J., Ringel Morris, Meredith, Liang, Percy, Bernstein, Michael S.. (2023). Generative Agents: Interactive Simulacra of Human Behavior.
[zhang2024classroom] Zhang, Zheyuan, Zhang-Li, Daniel, Yu, Jifan, Gong, Linlu, Zhou, Jinchang, Liu, Zhiyuan, Hou, Lei, Li, Juanzi. (2024). Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226.
[zhao2023competeai] Zhao, Qinlin, Wang, Jindong, Zhang, Yixuan, Jin, Yiqiao, Zhu, Kaijie, Chen, Hao, Xie, Xing. (2023). Competeai: Understanding the competition behaviors in large language model-based agents. arXiv preprint arXiv:2310.17512.
[multi-persona] Wang, Zhenhailong, Mao, Shaoguang, Wu, Wenshan, Ge, Tao, Wei, Furu, Ji, Heng. (2023). Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration.
[bargaining-feedback] Fu, Yao, Peng, Hao, Khot, Tushar, Lapata, Mirella. (2023). Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback.
[sot] Ning, Xuefei, Lin, Zinan, Zhou, Zixuan, Yang, Huazhong, Wang, Yu. (2023). Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding.
[hua2023war] Hua, Wenyue, Fan, Lizhou, Li, Lingyao, Mei, Kai, Ji, Jianchao, Ge, Yingqiang, Hemphill, Libby, Zhang, Yongfeng. (2023). War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227.
[chen2023gamegpt] Chen, Dake, Wang, Hanbin, Huo, Yunhao, Li, Yuzhao, Zhang, Haoyang. (2023). Gamegpt: Multi-agent collaborative framework for game development. arXiv preprint arXiv:2310.08067.
[cohen2023lm] Cohen, Roi, Hamri, May, Geva, Mor, Globerson, Amir. (2023). Lm vs lm: Detecting factual errors via cross examination. arXiv preprint arXiv:2305.13281.
[self-correction] Pan, Liangming, Saxon, Michael, Xu, Wenda, Nathani, Deepak, Wang, Xinyi, Wang, William Yang. (2023). Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies.
[qian2024scaling] Qian, Chen, Xie, Zihao, Wang, Yifei, Liu, Wei, Dang, Yufan, Du, Zhuoyun, Chen, Weize, Yang, Cheng, Liu, Zhiyuan, Sun, Maosong. (2024). Scaling Large-Language-Model-based Multi-Agent Collaboration. arXiv preprint arXiv:2406.07155.
[zhuge2024gptswarm] Zhuge, Mingchen, Wang, Wenyi, Kirsch, Louis, Faccio, Francesco, Khizbullin, Dmitrii, Schmidhuber, Jürgen. (2024). GPTSwarm: Language Agents as Optimizable Graphs. Forty-first International Conference on Machine Learning.
[openai2023gpt4] OpenAI. (2023). GPT-4 Technical Report.
[coalitional-game] Saad, Walid, Han, Zhu, Debbah, Merouane, Hjorungnes, Are, Basar, Tamer. (2009). Coalitional game theory for communication networks. IEEE Signal Processing Magazine.
[held-yang-2023-shapley] Held, William, Yang, Diyi. (2023). Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
[imp-score] Chen, Mark, Tworek, Jerry, Jun, Heewoo, Yuan, Qiming, Ponde de Oliveira Pinto, Henrique, Kaplan, Jared, Edwards, Harri, Burda, Yuri, Joseph, Nicholas, Brockman, Greg, Ray, Alex, Puri, Raul, Krueger, Gretchen, Petrov, Michael, Khlaaf, Heidy, Sastry, Girish, Mishkin, Pamela, Chan, Brooke, Gray, Scott, Ryder, Nick, Pavlov, Mikhail, Power, Alethea, Kaiser, Lukasz, Bavarian, Mohammad, Winter, Clemens, Tillet, Philippe, Petroski Such, Felipe, Cummings, Dave, Plappert, Matthias, Chantzis, Fotios, Barnes, Elizabeth, Herbert-Voss, Ariel, Hebgen Guss, William, Nichol, Alex, Paino, Alex, Tezak, Nikolas, Tang, Jie, Babuschkin, Igor, Balaji, Suchir, Jain, Shantanu, Saunders, William, Hesse, Christopher, Carr, Andrew N., Leike, Jan, Achiam, Josh, Misra, Vedant, Morikawa, Evan, Radford, Alec, Knight, Matthew, Brundage, Miles, Murati, Mira, Mayer, Katie, Welinder, Peter, McGrew, Bob, Amodei, Dario, McCandlish, Sam, Sutskever, Ilya, Zaremba, Wojciech. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
[minecraft-agent] Zhu, Xizhou, others. (2023). Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory. arXiv preprint arXiv:2305.17144.
[meta-gpt] Hong, Sirui, Zheng, Xiawu, Chen, Jonathan, Cheng, Yuheng, Wang, Jinlin, Zhang, Ceyao, Wang, Zili, Yau, Steven Ka Shing, Lin, Zijuan, Zhou, Liyang, Ran, Chenyu, Xiao, Lingfeng, Wu, Chenglin. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework.
[self-debug] Olausson, Theo X., Priya Inala, Jeevana, Wang, Chenglong, Gao, Jianfeng, Solar-Lezama, Armando. (2023). Demystifying GPT Self-Repair for Code Generation.
[zhang2024g-designer] Zhang, Guibin, Yue, Yanwei, Sun, Xiangguo, Wan, Guancheng, Yu, Miao, Fang, Junfeng, Wang, Kun, Chen, Tianlong, Cheng, Dawei. (2024). G-designer: Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782.
[chen2023codet] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen. (2023). CodeT: Code Generation with Generated Tests. The Eleventh International Conference on Learning Representations.
[self-evaluation-decode] Xie, Yuxi, others. (2023). Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. arXiv e-prints.
[llm-overconfident] Xiong, Miao, others. (2023). Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. arXiv e-prints.
[llm-as-a-judge] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric. P, Zhang, Hao, Gonzalez, Joseph E., Stoica, Ion. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
[sliding-window] Qin, Zhen, Jagerman, Rolf, Hui, Kai, Zhuang, Honglei, Wu, Junru, Shen, Jiaming, Liu, Tianqi, Liu, Jialu, Metzler, Donald, Wang, Xuanhui, Bendersky, Michael. (2023). Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.
[SHAP] Lundberg, Scott M., Lee, Su-In. (2017). A Unified Approach to Interpreting Model Predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems.
[shapley-origin] Lipovetsky, Stan, Conklin, Michael. (2001). Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry.
[got] Besta, Maciej, Blach, Nils, Kubicek, Ales, Gerstenberger, Robert, Gianinazzi, Lukas, Gajda, Joanna, Lehmann, Tomasz, Podstawski, Michal, Niewiadomski, Hubert, Nyczyk, Piotr, Hoefler, Torsten. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models.
[agent-bench] Liu, Xiao, Yu, Hao, Zhang, Hanchen, Xu, Yifan, Lei, Xuanyu, Lai, Hanyu, Gu, Yu, Ding, Hangliang, Men, Kaiwen, Yang, Kejuan, Zhang, Shudan, Deng, Xiang, Zeng, Aohan, Du, Zhengxiao, Zhang, Chenhui, Shen, Sheng, Zhang, Tianjun, Su, Yu, Sun, Huan, Huang, Minlie, Dong, Yuxiao, Tang, Jie. (2023). AgentBench: Evaluating LLMs as Agents.
[cot] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Ichter, Brian, Xia, Fei, Chi, Ed, Le, Quoc, Zhou, Denny. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903.
[kim2014convolutional] Kim, Yoon. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
[nair2010rectified] Nair, Vinod, Hinton, Geoffrey E. (2010). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th international conference on machine learning (ICML-10).
[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
[listmle] Xia, Fen, Liu, Tie-Yan, Wang, Jue, Zhang, Wensheng, Li, Hang. (2008). Listwise Approach to Learning to Rank: Theory and Algorithm. Proceedings of the 25th International Conference on Machine Learning.
[autogpt] Richards, Toran Bruce, others. (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub repository.
[autogen] Wu, Qingyun, Bansal, Gagan, Zhang, Jieyu, Wu, Yiran, Zhang, Shaokun, Zhu, Erkang, Li, Beibin, Jiang, Li, Zhang, Xiaoyun, Wang, Chi. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv preprint arXiv:2308.08155.
[yao2023react] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations.
[voyager] Wang, Guanzhi, others. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv e-prints.
[jin2023surrealdriver] Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong. (2023). SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model.
[human-team-optimization] Lykourentzou, Ioanna, Vinella, Federica Lucia, Ahmed, Faez, Papastathis, Costas, Papangelis, Konstantinos, Khan, Vassilis-Javed, Masthoff, Judith. (2022). Self-Organization in Online Collaborative Work Settings. Collective Intelligence.
[babyagi] Yohei Nakajima. (2023). BabyAGI. GitHub repository.
[agentgpt] Reworkd. (2023). AgentGPT. GitHub repository.
[zhang2023astools] Zhang, Jintian, Xu, Xin, Deng, Shumin. (2023). Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124.
[pesce2023learning] Pesce, Emanuele, Montana, Giovanni. (2023). Learning multi-agent coordination through connectivity-driven communication. Machine Learning.
[liu2022temporal] Liu, Yuntao, Dou, Yong, Li, Yuan, Xu, Xinhai, Liu, Donghong. (2022). Temporal dynamic weighted graph convolution for multi-agent reinforcement learning. Proceedings of the Annual Meeting of the Cognitive Science Society.
[hu2024magraph] Hu, Shengchao, Shen, Li, Zhang, Ya, Tao, Dacheng. (2024). Learning multi-agent communication from graph modeling perspective. arXiv preprint arXiv:2405.08550.
[bolaa] Liu, Zhiwei, Yao, Weiran, Zhang, Jianguo, Xue, Le, Heinecke, Shelby, Murthy, Rithesh, Feng, Yihao, Chen, Zeyuan, Niebles, Juan Carlos, Arpit, Devansh, Xu, Ran, Mui, Phil, Wang, Huan, Xiong, Caiming, Savarese, Silvio. (2023). BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents.
[cumulative-cr] Zhang, Yifan, Yang, Jingqin, Yuan, Yang, Yao, Andrew Chi-Chih. (2023). Cumulative Reasoning with Large Language Models.
[tptu] Ruan, Jingqing, Chen, Yihong, Zhang, Bin, Xu, Zhiwei, Bao, Tianpeng, Du, Guoqing, Shi, Shiwei, Mao, Hangyu, Zeng, Xingyu, Zhao, Rui. (2023). TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents.
[lu2023chameleon] Lu, Pan, Peng, Baolin, Cheng, Hao, Galley, Michel, Chang, Kai-Wei, Wu, Ying Nian, Zhu, Song-Chun, Gao, Jianfeng. (2023). Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv preprint arXiv:2304.09842.
[deep-network] Sordoni, Alessandro, Yuan, Xingdi, Côté, Marc-Alexandre, Pereira, Matheus, Trischler, Adam, Xiao, Ziang, Hosseini, Arian, Niedtner, Friederike, Le Roux, Nicolas. (2023). Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference.
[chen2023agentverse] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou. (2023). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents.
[rasal2024llm] Rasal, Sumedh. (2024). Llm harmony: Multi-agent communication for problem solving. arXiv preprint arXiv:2401.01312.
[madaan2023selfrefine] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, Peter Clark. (2023). Self-Refine: Iterative Refinement with Self-Feedback.
[aggarwal2023adaptive-consistency] Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam. (2023). Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs.
[tot] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Thomas L., Cao, Yuan, Narasimhan, Karthik. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
[embodied] Hongxin Zhang, et al. (2023). Building Cooperative Embodied Agents Modularly with Large Language Models. arXiv e-prints.
[chateval] Chi-Min Chan, et al. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv e-prints.
[llm-dba] Xuanhe Zhou, et al. (2023). LLM As DBA. arXiv e-prints.
[gpt-iot] Nathalia Nascimento, et al. (2023). GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems. arXiv e-prints.
[ma2024agentboard] Ma, Chang, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2024). Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178.
[post-2018-sacrebleu] Post, Matt. (2018). A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers.
[wang2022scienceworld] Wang, Ruoyao, Jansen, Peter, Côté, Marc-Alexandre, et al. (2022). ScienceWorld: Is Your Agent Smarter than a 5th Grader? arXiv preprint arXiv:2203.07540.
[Ren2020CodeBLEUAM] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Ambrosio Blanco, Shuai Ma. (2020). CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. ArXiv.
[shridhar2020alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, et al. (2020). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768.
[trueskill] Herbrich, Ralf, Minka, Tom, Graepel, Thore. (2006). TrueSkill™: A Bayesian Skill Rating System. Advances in Neural Information Processing Systems.
[self-collab-codegen] Yihong Dong, et al. (2023). Self-collaboration Code Generation via ChatGPT. arXiv e-prints.
[human-collaboration] Arthur C. Graesser, Stephen M. Fiore, Samuel Greiff, Jessica Andrews-Todd, Peter W. Foltz, Friedrich W. Hesse. (2018). Advancing the Science of Collaborative Problem Solving. Psychological Science in the Public Interest.
[multi-agent] Hao, Rui, Hu, Linmei, Qi, Weijian, Wu, Qingliu, Zhang, Yirui, Nie, Liqiang. (2023). ChatLLM Network: More brains, More intelligence. IEEE Access.
[human-team-building] Zhang, Jintian, Xu, Xin, Deng, Shumin. (2023). Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124.
[arXiv2023_Survey-LLM] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen. (2023). A Survey of Large Language Models. arXiv preprint. doi:10.48550/arXiv.2303.18223.
[arXiv2023_Survey-MLLM] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen. (2023). A Survey on Multimodal Large Language Models. arXiv preprint. doi:10.48550/arXiv.2306.13549.
[arXiv2023_Survey-LLM-KGCR] Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang. (2023). LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities. CoRR. doi:10.48550/arXiv.2305.13168.
[arXiv2024_Survey-LLM-Psychology-Applications] Luoma Ke, Song Tong, Peng Chen, Kaiping Peng. (2024). Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. CoRR.
[FCS2024_Survey-Agent] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen. (2024). A Survey on Large Language Model based Autonomous Agents. Frontiers of Computer Science.
[arXiv2023_Survey-Agent_2] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint.
[arXiv2023_Survey-Agent_3] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, Yong Li. (2023). Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives. CoRR.
[arXiv2024_Survey-Agent_4] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, Xiuqiang He. (2024). Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects. CoRR.
[arXiv2024_Survey-Agents-CompExp] Qun Ma, Xiao Xue, Deyu Zhou, Xiangning Yu, Donghua Liu, Xuwen Zhang, Zihan Zhao, Yifan Shen, Peilin Ji, Juanjuan Li, Gang Wang, Wanpeng Ma. (2024). Computational Experiments Meet Large Language Model Based Agents: A Survey and Perspective. CoRR.
[arXiv2024_Survey-MultiAgent] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. CoRR.
[arXiv2024_Survey-MultiAgent_2] Pouya Pezeshkpour, Eser Kandogan, Nikita Bhutani, Sajjadur Rahman, Tom Mitchell, Estevam Hruschka. (2024). Reasoning Capacity in Multi-Agent Systems: Limitations, Challenges and Human-Centered Solutions. CoRR.
[arXiv2024_Survey-MultiAgent-System] Hung Du, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis. (2024). A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions. CoRR.
[arXiv2024_Survey-MultiAgent-System_2] Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, Chaoyang He. (2024). LLM Multi-Agent Systems: Challenges and Open Problems. CoRR.
[J2024_Survey-AI-SocialScience] Ruoxi Xu, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, Xianpei Han. (2024). AI for social science and social science of AI: A Survey. Information Processing & Management. doi:10.1016/j.ipm.2024.103665.
[arXiv2023_Survey-MultiAgentCooperation] Yali Du, Joel Z. Leibo, Usman Islam, Richard Willis, Peter Sunehag. (2023). A Review of Cooperation in Multi-agent Learning. CoRR. doi:10.48550/ARXIV.2312.05162.
[arXiv2024_Survey-AgentAI_MMInteraction] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao. (2024). Agent AI: Surveying the Horizons of Multimodal Interaction. CoRR.
[arXiv2024_Survey-CooperativeAgent-RL] Jiechuan Jiang, Kefan Su, Zongqing Lu. (2024). Fully Decentralized Cooperative Multi-Agent Reinforcement Learning: A Survey. CoRR.
[ICLR2024_Sycophancy-LLM] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR.
[J2008_Survey-MultiAgent-Reinforce] Lucian Busoniu, Robert Babuska, Bart De Schutter. (2008). A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C. doi:10.1109/TSMCC.2007.913919.
[arXiv2023_Survey-MultiAgent-Reinforce] Dom Huh, Prasant Mohapatra. (2023). Multi-agent Reinforcement Learning: A Comprehensive Survey. arXiv preprint.
[arXiv2023_Survey-Hallucination_LFM] Vipula Rawte, Amit P. Sheth, Amitava Das. (2023). A Survey of Hallucination in Large Foundation Models. CoRR. doi:10.48550/arXiv.2309.05922.
[J2023_Survey-Hallucination_NLG] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, Pascale Fung. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. doi:10.1145/3571730.
[arXiv2023_Survey-Hallucination_LLM] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi. (2023). Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR. doi:10.48550/ARXIV.2309.01219.
[arXiv2024_Survey-Hallucination] Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, Gregory Dudek. (2024). Hallucination Detection and Hallucination Mitigation: An Investigation. CoRR.
[arXiv2024_Analysis-Hallucination] Ziwei Xu, Sanjay Jain, Mohan Kankanhalli. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. CoRR.
[1995_MultiAgent-System] Gerhard Weiß. (1995). Adaptation and Learning in Multi-Agent Systems: Some Remarks and a Bibliography. Adaption and Learning in Multi-Agent Systems. doi:10.1007/3-540-60923-7_16.
[J2000_MultiAgent-System] Peter Stone, Manuela M. Veloso. (2000). Multiagent Systems: A Survey from a Machine Learning Perspective. Autonomous Robots. doi:10.1023/A:1008942012299.
[Book2006_MultiAgent-System] José M. Vidal. (2006). Fundamentals of Multiagent Systems: Using NetLogo Models.
[Book2009_MultiAgent-System] Michael J. Wooldridge. (2009). An Introduction to MultiAgent Systems, Second Edition.
[arXiv2024_Formal-LLM] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, Yongfeng Zhang. (2024). Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. CoRR.
[arXiv2023_Agents] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, Mrinmaya Sachan. (2023). Agents: An Open-source Framework for Autonomous Language Agents. CoRR. doi:10.48550/arXiv.2309.07870.
[arXiv2023_OpenAgents] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu. (2023). OpenAgents: An Open Platform for Language Agents in the Wild. CoRR. doi:10.48550/ARXIV.2310.10634.
[arXiv2023_AutoAgents] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, et al. (2023). AutoAgents: A Framework for Automatic Agent Generation. CoRR. doi:10.48550/ARXIV.2309.17288.
[arXiv2023_CGMI] Jinxin Shi, Jiabao Zhao, Yilei Wang, Xingjiao Wu, Jiawen Li, Liang He. (2023). CGMI: Configurable General Multi-Agent Interaction Framework. CoRR. doi:10.48550/ARXIV.2308.12503.
[arXiv2023_Believability-Agents] Yang Xiao, Yi Cheng, Jinlan Fu, Jiashuo Wang, Wenjie Li, Pengfei Liu. (2023). How Far Are We from Believable AI Agents? A Framework for Evaluating the Believability of Human Behavior Simulation. CoRR.
[arXiv2023_MAgIC] Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See-Kiong Ng, Jiashi Feng. (2023). MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration. CoRR. doi:10.48550/ARXIV.2311.08562.
[arXiv2024_AgentBoard] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He. (2024). AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents. CoRR.
[arXiv2024_Evaluate-Agents] Lize Alberts, Geoff Keeling, Amanda McCroskery. (2024). What makes for a 'good' social actor? Using respect as a lens to evaluate interactions with language agents. CoRR.
[arXiv2023_InterAct] Po-Lin Chen, Cheng-Shang Chang. (2023). InterAct: Exploring the Potentials of ChatGPT as a Cooperative Agent. CoRR. doi:10.48550/ARXIV.2308.01552.
[arXiv2023_Multi-Agent-Collaboration_Intelligent] Yashar Talebirad, Amirhossein Nadiri. (2023). Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. CoRR. doi:10.48550/ARXIV.2306.03314.
[UIST2023_Agent-Simulate-Interaction] Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST.
[AAAI2024_CooperativeAgents_ProAgent] Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. (2024). ProAgent: Building Proactive Cooperative Agents with Large Language Models. AAAI. doi:10.1609/AAAI.V38I16.29710.
[sun2023corex] Sun, Qiushi, Yin, Zhangyue, Li, Xiang, Wu, Zhiyong, Qiu, Xipeng, Kong, Lingpeng. (2023). Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.
[chen2024comm] Chen, Pei, Han, Boran, Zhang, Shuai. (2024). CoMM: Collaborative multi-agent, multi-reasoning-path prompting for complex problem solving. arXiv preprint arXiv:2404.17729.
[ICLR2024_MultiAgent_AgentVerse] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou. (2024). AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents. ICLR.
[arXiv2023_Dynamic-LLM-Agent] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, Diyi Yang. (2023). Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization. CoRR.
[EMNLP2023-Demo_CollaborativeLLMs] Kai Lv, Shuo Zhang, Tianle Gu, Shuhao Xing, Jiawei Hong, Keyu Chen, Xiaoran Liu, Yuqing Yang, Honglin Guo, Tengxiao Liu, Yu Sun, Qipeng Guo, Hang Yan, Xipeng Qiu. (2023). CoLLiE: Collaborative Training of Large Language Models in an Efficient Way. EMNLP.
[J2024_MechAgents-MultiAgentCollaborations] Bo Ni, Markus J. Buehler. (2024). MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge. Extreme Mechanics Letters. doi:10.1016/j.eml.2024.102131.
[arXiv2023_MultiAgent-Coordination-Eval] Saaket Agashe, Yue Fan, Xin Eric Wang. (2023). Evaluating Multi-Agent Coordination Abilities in Large Language Models. CoRR. doi:10.48550/ARXIV.2310.03903.
[ICLR2024_Agent-Interactive-Eval] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, et al. (2024). SOTOPIA: Interactive Evaluation of Social Intelligence in Language Agents. ICLR.
[AAMAS2024_AgentInteraction-Quantifying] Yuxin Chen, Chen Tang, Ran Tian, Chenran Li, Jinning Li, Masayoshi Tomizuka, Wei Zhan. (2024). Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization. AAMAS. doi:10.5555/3635637.3663107.
[arXiv2023_LLM-Deliberation] Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, et al. (2023). LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games. CoRR. doi:10.48550/ARXIV.2309.17234.
[arXiv2023_MultiAgent-Cooperation] Rafael Pina, Varuna De Silva, Corentin Artaud. (2023). Discovering Causality for Efficient Cooperation in Multi-Agent Environments. CoRR. doi:10.48550/ARXIV.2306.11846.
[arXiv2023_MultiAgent-Algorithms] Lin Yang, Xuchuang Wang, Mohammad Hajiesmaili, Lijun Zhang, John C. S. Lui, Don Towsley. (2023). Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs. CoRR. doi:10.48550/ARXIV.2308.04314.
[Book1971_Rhetoric] Perelman, Chaim. (1971). The new rhetoric.
[Book2005_Society-Dissent] Cass R Sunstein. (2005). Why societies need dissent.
[J2009_DecisionMaking] Leila Amgoud, Henri Prade. (2009). Using arguments for making and explaining decisions. Artificial Intelligence. doi:10.1016/j.artint.2008.11.006.
[arXiv2023_MultiAgent-Debate] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. CoRR.
[yan2024depending] Yan, Yikuan, Zhang, Yaolun, Huang, Keman. (2024). Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games. arXiv preprint arXiv:2403.17674.
[holt2024l2mac] Holt, Samuel, Luyten, Max Ruiz, van der Schaar, Mihaela. (2024). L2MAC: Large Language Model Automatic Computer for Extensive Code Generation. The Twelfth International Conference on Learning Representations.
[zhou2023large] Zhou, Zihao, Hu, Bin, Zhao, Chenyang, Zhang, Pu, Liu, Bin. (2023). Large language model as a policy teacher for training reinforcement learning agents. arXiv preprint arXiv:2311.13373.
[arXiv2023_MultiAgent-Debate_2] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi. (2023). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. CoRR.
[ICLR2024_Multiagent-Debate-Embeddings] Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang. (2024). Let Models Speak Ciphers: Multiagent Debate through Embeddings. {ICLR.
[ICLR2024_Multiagent-Debate-Eval] Chi-Min Chan, et al. (2024). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. ICLR.
[J1985_Reflection] R. J. Bogumil. (1985). The reflective practitioner: How professionals think in action. Proceedings of the IEEE. doi:10.1109/PROC.1985.13210.
[Book2010_Reflection] Bolton, Gillie. (2010). Reflective practice: Writing and professional development.
[zelikman2023parsel] Zelikman, Eric, Huang, Qian, Poesia, Gabriel, Goodman, Noah, Haber, Nick. (2023). Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions. Advances in Neural Information Processing Systems.
[huang2024anpl] Huang, Di, Nan, Ziyuan, Hu, Xing, Jin, Pengwei, Peng, Shaohui, Wen, Yuanbo, Zhang, Rui, Du, Zidong, Guo, Qi, Pu, Yewen, others. (2024). ANPL: towards natural programming with interactive decomposition. Advances in Neural Information Processing Systems.
[reflexion] Noah Shinn, Beck Labash, Ashwin Gopinath. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint. doi:10.48550/arXiv.2303.11366.
[NeurIPS2023_Self-Refine] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.
[J2003_Reflection-ContinuingEducation] Mezirow, Jack. (2003). How critical reflection triggers transformative learning. Adult and Continuing Education: Teaching, learning and research.
[arXiv2024_Self-Contrast] Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu. (2024). Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives. CoRR.
[arXiv2023_InteractiveNLP] Zekun Wang, Ge Zhang, Kexin Yang, Ning Shi, Wangchunshu Zhou, Shaochun Hao, Guangzheng Xiong, Yizhi Li, Mong Yuan Sim, Xiuying Chen, Qingqing Zhu, Zhenzhu Yang, Adam Nik, Qi Liu, Chenghua Lin, Shi Wang, Ruibo Liu, Wenhu Chen, Ke Xu, Dayiheng Liu, Yike Guo, Jie Fu. (2023). Interactive Natural Language Processing. CoRR. doi:10.48550/arXiv.2305.13246.
[J2023_InteractiveNLP] Guanhua Zhang, Matteo Bortoletto, Zhiming Hu, Lei Shi, Mihai Bâce, et al. (2023). Exploring Natural Language Processing Methods for Interactive Behaviour Modelling. INTERACT. doi:10.1007/978-3-031-42286-7_1.
[NMI2023_Human-Like-AI] Edgar A. Duéñez-Guzmán, et al. (2023). A social path to human-like artificial intelligence. Nature Machine Intelligence. doi:10.1038/S42256-023-00754-X.
[ICLR2024_LLM-Simulate-Society] Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi. (2024). Training Socially Aligned Language Models in Simulated Human Society. ICLR.
[arXiv2023_LLMAgents-Simulate-Society_S3] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, Yong Li. (2023). S³: Social-network Simulation System with Large Language Model-Empowered Agents. CoRR. doi:10.48550/ARXIV.2307.14984.
[UIST2022_SocialSimulacra] Joon Sung Park, Lindsay Popowski, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2022). Social Simulacra: Creating Populated Prototypes for Social Computing Systems. UIST. doi:10.1145/3526113.3545616.
[arXiv2024_AgentAlignment-SocialNorms] Shimin Li, Tianxiang Sun, Xipeng Qiu. (2024). Agent Alignment in Evolving Social Norms. CoRR.
[NAACL2022-Findings_Align-GLMs] Ruibo Liu, Ge Zhang, Xinyu Feng, Soroush Vosoughi. (2022). Aligning Generative Language Models with Human Values. Findings of NAACL-HLT. doi:10.18653/V1/2022.FINDINGS-NAACL.18.
[NeurIPS2022_Re-Align] Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, Soroush Vosoughi. (2022). Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits. NeurIPS.
[arXiv2023_Align-Chatbot] Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu. (2023). Align on the Fly: Adapting Chatbot Behavior to Established Norms. CoRR.
[arXiv2023_Human-AI_Collaboration] Andrew Fuchs, Andrea Passarella, Marco Conti. (2023). Optimizing delegation between human and AI collaborative agents. CoRR. doi:10.48550/ARXIV.2309.14718.
[ICLR2024_Human-Agent-Collaboration] Yiming Gao, Feiyu Liu, Liang Wang, Zhenjie Lian, Dehua Zheng, Weixuan Wang, Wenjin Yang, Siqin Li, Xianliang Wang, Wenhui Chen, Jing Dai, Qiang Fu, Wei Yang, Lanxiao Huang, Wei Liu. (2024). Enhancing Human Experience in Human-Agent Collaboration: A Human-Centered Modeling Approach Based on Positive Human Gain. ICLR.
[arXiv2024_Human-Agent-Collaboration] Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Ji-Rong Wen. (2024). Large Language Model-based Human-Agent Collaboration for Complex Task Solving. CoRR.
[TEVC2023_MultiAgent-Collaboration-SocialRoles] Yaqing Hou, Mingyang Sun, Yifeng Zeng, Yew-Soon Ong, Yaochu Jin, Hongwei Ge, Qiang Zhang. (2023). A Multi-agent Cooperative Learning System with Evolution of Social Roles. IEEE Transactions on Evolutionary Computation. doi:10.1109/TEVC.2023.3268076.
[EMNLP2023-Findings_LEGO] Zhitao He, Pengfei Cao, Yubo Chen, Kang Liu, Ruopeng Li, Mengshu Sun, Jun Zhao. (2023). LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation. EMNLP.
[Nature2023_Role-Play-LLM] Murray Shanahan, Kyle McDonell, Laria Reynolds. (2023). Role play with large language models. Nature. doi:10.1038/S41586-023-06647-8.
[arXiv2023_PlayGames-LLM] Elif Akata, Lion Schulz, Julian Coda-Forno, et al. (2023). Playing repeated games with Large Language Models. CoRR. doi:10.48550/arXiv.2305.16867.
[arXiv2023_Agent-Simulate-OpinionDynamics] Yun-Shiuan Chuang, et al. (2023). Simulating Opinion Dynamics with Networks of LLM-based Agents. CoRR. doi:10.48550/ARXIV.2311.09618.
[arXiv2023_Agent-Simulate-OpinionDynamics_Survey] Yun-Shiuan Chuang, et al. (2023). Computational Agent-based Models in Opinion Dynamics: A Survey. CoRR. doi:10.48550/ARXIV.2306.03446.
[arXiv2024_AI-Human-Creative] Haonan Wang, James Zou, Michael Mozer, Anirudh Goyal, Alex Lamb, Linjun Zhang, Weijie J Su, Zhun Deng, Michael Qizhe Xie, Hannah Brown, Kenji Kawaguchi. (2024). Can AI Be as Creative as Humans?. CoRR.
[arXiv2023_LLM-Simulator] Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, Benyou Wang. (2023). Large Language Model as a User Simulator. CoRR. doi:10.48550/ARXIV.2308.11534.
[arXiv2023_MetaAgents] Yuan Li, Yixuan Zhang, Lichao Sun. (2023). MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents. CoRR. doi:10.48550/ARXIV.2310.06500.
[meyer2021entspann] Meyer, B, Zill, A, Dilba, D, Voermans, S. (2021). Entspann dich, Deutschland! TK-Stressstudie 2021.
[wang2024tokeneconomy] Wang, Junlin, Jain, Siddhartha, Zhang, Dejiao, Ray, Baishakhi, Kumar, Varun, Athiwaratkun, Ben. (2024). Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies. arXiv preprint arXiv:2406.06461.
[PNAS2024_TuringTest_Chatbots-Humans] Qiaozhu Mei, Yutong Xie, Walter Yuan, Matthew O. Jackson. (2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2313925121.
[arXiv2024_BehavioralSimulation] Cheng Wang, Chuwen Wang, Yu Zhao, Shirong Zeng, Wang Zhang, Ronghui Ning. (2024). Behavioral Simulation: Exploring A Possible Next Paradigm for Science. CoRR.
[arXiv2023_Agent-BehaviorExplanation] Xijia Zhang, Yue Guo, Simon Stepputtis, Katia P. Sycara, Joseph Campbell. (2023). Understanding Your Agent: Leveraging Large Language Models for Behavior Explanation. CoRR. doi:10.48550/ARXIV.2311.18062.
[arXiv2023_Agent-BehaviorExplaining] Xijia Zhang, Yue Guo, Simon Stepputtis, Katia P. Sycara, Joseph Campbell. (2023). Explaining Agent Behavior with Large Language Models. CoRR. doi:10.48550/ARXIV.2309.10346.
[arXiv2023_Agents-High-Level-Behavior] Maxwell Crouse, Ibrahim Abdelaziz, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, Luis A. Lastras. (2023). Formally Specifying the High-Level Behavior of LLM-Based Agents. CoRR. doi:10.48550/ARXIV.2310.08535.
[arXiv2024_Agents-Simulate-Trust] Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip H. S. Torr, Bernard Ghanem, Guohao Li. (2024). Can Large Language Model Agents Simulate Human Trust Behaviors?. CoRR. doi:10.48550/ARXIV.2402.04559.
[arXiv2024_Reward-Socially-RL-Agent] Zhaoyue Wang. (2024). Towards Socially and Morally Aware RL agent: Reward Design With LLM. CoRR.
[ICLR2024_LLM-Simulate-CognitiveModel] Marcel Binz, Eric Schulz. (2024). Turning large language models into cognitive models. ICLR.
[arXiv2023_Agent-Cognitive] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, Thomas L. Griffiths. (2023). Cognitive Architectures for Language Agents. CoRR. doi:10.48550/ARXIV.2309.02427.
[J2024_Interactions-Cognitive-LLMs] Youzhi Qu, Penghui Du, Wenxin Che, Chen Wei, Chi Zhang, Wanli Ouyang, Yatao Bian, Feiyang Xu, Bin Hu, Kai Du, Haiyan Wu, Jia Liu, Quanying Liu. (2024). Promoting interactions between cognitive science and large language models. The Innovation. doi:10.1016/j.xinn.2024.100579.
[arXiv2024_Multi-Agent_Conversation_CognitiveBias] Yu He Ke, Rui Yang, Sui An Lie, Taylor Xin Yi Lim, Hairil Rizal Abdullah, Daniel Shu Wei Ting, Nan Liu. (2024). Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias. CoRR.
[arXiv2023_AI-Agent] Sungwoo Lee, Younghyun Oh, Hyunhoe An, Hyebhin Yoon, Karl J. Friston, Seok Jun Hong, Choong-Wan Woo. (2023). Life-inspired Interoceptive Artificial Intelligence for Autonomous and Adaptive Agents. CoRR. doi:10.48550/ARXIV.2309.05999.
[arXiv2024_Agent-Preference] Zihong He, Changwang Zhang. (2024). AFSPP: Agent Framework for Shaping Preference and Personality with Large Language Models. CoRR.
[PANS2022_Agent-Cooperation-Competition] Euel Elliott, L. Douglas Kiel. (2002). Exploring cooperation and competition using agent-based modeling. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.102079099.
[arXiv2023_LyfeAgents] Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, Andrew Ahn. (2023). Lyfe Agents: Generative agents for low-cost real-time social interactions. CoRR. doi:10.48550/ARXIV.2310.02172.
[ICLR2023_Reasoning-Simulation] Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, et al. (2023). Mind's Eye: Grounded Language Model Reasoning through Simulation. ICLR.
[arXiv2024_Human-AI-Interaction_Gesture] Philipp Wicke. (2024). Probing Language Models' Gesture Understanding for Enhanced Human-AI Interaction. CoRR.
[arXiv2023_Human-like-Agents] Thuy Ngoc Nguyen, et al. (2023). Credit Assignment: Challenges and Opportunities in Developing Human-like AI Agents. CoRR. doi:10.48550/ARXIV.2307.08171.
[arXiv2024_Human-Centered-LM] Nikita Soni, Niranjan Balasubramanian, H. Andrew Schwartz, Dirk Hovy. (2024). Comparing Human-Centered Language Modeling: Is it Better to Model Groups, Individual Traits, or Both?. CoRR.
[Book1973_Theory_NLP] Alfred V. Aho, Jeffrey D. Ullman. (1973). The theory of parsing, translation, and compiling. 2: Compiling.
[2018_Theory] Mezirow, Jack. (2018). Transformative learning theory. Contemporary theories of learning.
[USENIX1999_Theory-FaultTolerance] Miguel Castro, Barbara Liskov. (1999). Practical Byzantine Fault Tolerance. OSDI.
[Book2013_InfoProcess-Collaboration] Noreen M Webb. (2013). Information processing approaches to collaborative learning. The international handbook of collaborative learning.
[Book1999_Personality] Friedman, Howard S, Schustack, Miriam W. (1999). Personality: Classic theories and modern research.
[J2008_Overconfidence] Moore, Don A, Healy, Paul J. (2008). The trouble with overconfidence. Psychological review.
[Book1985_MedievalPoliticalTheology] Ernst H. Kantorowicz. (1985). The King's Two Bodies: A Study in Medieval Political Theology.
[J2009_SocialPsychology] Johnson, David W, Johnson, Roger T. (2009). An educational psychology success story: Social interdependence theory and cooperative learning. Educational researcher.
[1982_SocialPsychology] Tajfel, Henri. (1982). Social psychology of intergroup relations. Annual review of psychology.
[2004_SocialPsychology] Tajfel, Henri, Turner, John C. (2004). The social identity theory of intergroup behavior. Political psychology.
[Book1988_SoM] Minsky, Marvin. (1988). Society of mind.
[J2003_SoM] Push Singh. (2003). Examining the Society of Mind. Computers and Artificial Intelligence.
[hu2024evomac] Hu, Yue, Cai, Yuzhu, Du, Yaxin, Zhu, Xinyu, Liu, Xiangrui, Yu, Zijie, Hou, Yuchen, Tang, Shuo, Chen, Siheng. (2024). Self-evolving multi-agent collaboration networks for software development. arXiv preprint arXiv:2410.16946.
[NeurIPS2023_Agent-SoM] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. NeurIPS.
[arXiv2023_SoM-NL] Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, et al. (2023). Mindstorms in Natural Language-Based Societies of Mind. CoRR. doi:10.48550/arXiv.2305.17066.
[J2002_TheoryOfMind] Siegal, Michael, Varley, Rosemary. (2002). Neural systems involved in 'theory of mind'. Nature Reviews Neuroscience. doi:10.1038/nrn844.
[J2004_TheoryOfMind] Leslie, Alan M, Friedman, Ori, German, Tim P. (2004). Core mechanisms in 'theory of mind'. Trends in cognitive sciences. doi:10.1016/j.tics.2004.10.001.
[EMNLP2022_TheoryOfMind] Maarten Sap, Ronan Le Bras, Daniel Fried, Yejin Choi. (2022). Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs. EMNLP. doi:10.18653/V1/2022.EMNLP-MAIN.248.
[EMNLP2023_TheoryOfMind-Agents] Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana T. Hughes, Charles Lewis, Katia P. Sycara. (2023). Theory of Mind for Multi-Agent Collaboration via Large Language Models. EMNLP.
[EACL2024_TheoryOfMind] Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz. (2024). Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models. EACL.
[arXiv2023_TheoryOfMind-Agents] Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui. (2023). How FaR Are Large Language Models From Agents with Theory-of-Mind?. CoRR. doi:10.48550/ARXIV.2310.03051.
[arXiv2023_TheoryofMind-Game] Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, Yutaka Matsuo. (2023). Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4. CoRR. doi:10.48550/ARXIV.2309.17277.
[Science2010_HumanDynamics] Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, Thomas W. Malone. (2010). Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science. doi:10.1126/science.1193147.
[1968_GroupDynamics] Dorwin Cartwright, Alvin Zander. (1968). Group dynamics.
[J2018_GroupDynamics] Wilfred R Bion. (2018). Group dynamics: A re-view. New directions in psychoanalysis.
[Book2014_GroupDynamics] Donelson R Forsyth. (2014). Group dynamics.
[Book2018_GroupDynamics] Donelson R Forsyth. (2018). Group dynamics.
[J1987_GroupDynamics] Clayton P Alderfer. (1987). An intergroup perspective on group dynamics. Handbook of organizational behavior.
[yin2023exchange] Yin, Zhangyue, Sun, Qiushi, Chang, Cheng, Guo, Qipeng, Dai, Junqi, Huang, Xuan-Jing, Qiu, Xipeng. (2023). Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
[chen2024benchmarking] Chen, Jiawei, Lin, Hongyu, Han, Xianpei, Sun, Le. (2024). Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence.
[zhuang2023toolchain] Zhuang, Yuchen, Chen, Xiang, Yu, Tong, Mitra, Saayan, Bursztyn, Victor, Rossi, Ryan A, Sarkhel, Somdeb, Zhang, Chao. (2023). Toolchain: Efficient action space navigation in large language models with a* search*. arXiv preprint arXiv:2310.13227.
[shen2024smallllms] Shen, Weizhou, Li, Chenliang, Chen, Hongzhan, Yan, Ming, Quan, Xiaojun, Chen, Hehong, Zhang, Ji, Huang, Fei. (2024). Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324.
[J1998_SmallGroupDynamics-Discussions] David Wyatt Seal, Laura M Bogart, Anke A Ehrhardt. (1998). Small group dynamics: The utility of focus group discussions as a research method. Group Dynamics: Theory, Research, and Practice.
[Book1972_Groupthink] Janis, Irving L. (1972). Victims of Groupthink: A psychological study of foreign-policy decisions and fiascoes..
[J2015_Group] Iyengar, Shanto, Westwood, Sean J. (2015). Fear and loathing across party lines: New evidence on group polarization. American journal of political science.
[J2003_Intergroup-Intragroup_Culture] Masaki Yuki. (2003). Intergroup Comparison versus Intragroup Relationships: A Cross-Cultural Examination of Social Identity Theory in North American and East Asian Cultural Contexts. Social Psychology Quarterly.
[J1995_IntragroupConflict] Karen A Jehn. (1995). A multimethod examination of the benefits and detriments of intragroup conflict. Administrative science quarterly. doi:10.2307/2393638.
[J2003_SmartGroups] Brigid Barron. (2003). When Smart Groups Fail. Journal of the Learning Sciences. doi:10.1207/S15327809JLS1203_1.
[Book2005_CrowdWisdom] Surowiecki, James. (2005). The Wisdom of Crowds.
[J1996_Intelligence] Neisser, Ulric, Boodoo, Gwyneth, Bouchard Jr, Thomas J, Boykin, A Wade, Brody, Nathan, Ceci, Stephen J, Halpern, Diane F, Loehlin, John C, Perloff, Robert, Sternberg, Robert J, others. (1996). Intelligence: knowns and unknowns.. American psychologist.
[1961_Intelligence] Spearman, Charles. (1961). "General Intelligence," Objectively Determined and Measured.
[J2004_Conformity] Robert B. Cialdini, Noah J. Goldstein. (2004). Social Influence: Compliance and Conformity. Annual Review of Psychology. doi:10.1146/annurev.psych.55.090902.142015.
[J1969_Conformity] Vernon L. Allen, John M. Levine. (1969). Consensus and conformity. Journal of Experimental Social Psychology. doi:https://doi.org/10.1016/0022-1031(69)90032-8.
[Book2011_Negotiation] Fisher, Roger, Ury, William L, Patton, Bruce. (2011). Getting to yes: Negotiating agreement without giving in.
[J2015_Conformity] Julie C Coultas, Edwin JC van Leeuwen. (2015). Conformity: Definitions, types, and evolutionary grounding. Evolutionary perspectives on social psychology.
[J1967_Consensus-Sociological] Scheff, Thomas J. (1967). Toward a sociological model of consensus. American Sociological Review. doi:10.2307/2091716.
[J1974_ConsensusReaching] Morris H. Degroot. (1974). Reaching a Consensus. Journal of the American Statistical Association. doi:10.1080/01621459.1974.10480137.
[J2018_Emergence-Consensus] Baronchelli, Andrea. (2018). The emergence of consensus: a primer. Royal Society open science. doi:10.1098/rsos.172189.
[arXiv2023_Agent-SocialChoiceTheory] Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas W. Anthony, Brian Tanner, Anna Koop. (2023). Evaluating Agents using Social Choice Theory. CoRR. doi:10.48550/ARXIV.2312.03121.
[J2000_Agent-SocialScience] Nigel Gilbert, Pietro Terna. (2000). How to build and use agent-based models in social science. Mind & Society.
[Book2012_Agent-SocialScience] Joshua M Epstein. (2012). Generative social science: Studies in agent-based computational modeling.
[Book2023_Agent-SocialDynamics-Culture] Paul Smaldino. (2023). Modeling social behavior: Mathematical and agent-based models of social dynamics and cultural evolution.
[J2017_SocialInfluence_OpinionDynamics] Andreas Flache, Michael Mäs, et al. (2017). Models of Social Influence: Towards the Next Frontiers. J. Artif. Soc. Soc. Simul.. doi:10.18564/JASSS.3521.
[J2021_SocietalDynamics_OpinionDynamics] Jan Lorenz, Martin Neumann, Tobias Schröder. (2021). Individual attitude change and societal dynamics: Computational experiments with psychological theories. Psychological Review.
[PNAS2023_CognitivePsychology_LLM] Marcel Binz, Eric Schulz. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2218523120.
[arXiv2023_MachinePsychology] Thilo Hagendorff. (2023). Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods. CoRR. doi:10.48550/ARXIV.2303.13988.
[J2023_LLM-Psychology] Dorottya Demszky, Diyi Yang, David S. Yeager, Christopher J. Bryan, Margarett Clapper, Susannah Chandhok, Johannes C. Eichstaedt, Cameron Hecht, Jeremy Jamieson, Meghann Johnson, Michaela Jones, Danielle Krettek-Cobb, Leslie Lai, Nirel JonesMitchell, Desmond C. Ong, Carol S. Dweck, James J. Gross, James W. Pennebaker. (2023). Using large language models in psychology. Nature Reviews Psychology. doi:10.1038/s44159-023-00241-5.
[NAACL2024-Findings_PsychometricLLM] Tatsuki Kuribayashi, Yohei Oseki, Timothy Baldwin. (2024). Psychometric Predictive Power of Large Language Models. NAACL Findings.
[arXiv2024_Multi-Agent_PsySafe] Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao. (2024). PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety. CoRR.
[Book2006_DemocracyModel] David Held. (2006). Models of democracy.
[Book2006_Deliberative-Democracy] Diana C Mutz. (2006). Hearing the other side: Deliberative versus participatory democracy.
[arXiv2023_SocialPsychology-Vehicles] Xiao Li, Kaiwen Liu, H. Eric Tseng, Anouck Girard, Ilya V. Kolmanovsky. (2023). Interaction-Aware Decision-Making for Autonomous Vehicles in Forced Merging Scenario Leveraging Social Psychology Factors. CoRR. doi:10.48550/ARXIV.2309.14497.
[Book2019_CriticalThinking] Paul, Richard, Elder, Linda. (2019). The miniature guide to critical thinking concepts and tools.
[Book1994_Framework] Popper, Karl Raimund. (1994). The myth of the framework: In defence of science and rationality.
[J2012_Circulations] Munro, Iain. (2012). The management of circulations: Biopolitical variations after Foucault. International Journal of Management Reviews.
[NeurIPS2021_Dataset-MATH] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS Datasets and Benchmarks.
[arXiv2022_Dataset-ChessMoveValidity] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv preprint. doi:10.48550/arXiv.2206.04615.
[arXiv2023_Dataset-BOLAA] Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese. (2023). BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents. CoRR. doi:10.48550/arXiv.2308.05960.
[PGN] fsmosca. pgn-standard. GitHub repository.
[ChatGPT-OpenAI] OpenAI. (2022). ChatGPT: Optimizing Language Models for Dialogue.
[NeurIPS2022_InstructGPT] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback. NeurIPS.
[arXiv2023_LLaMA] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. (2023). LLaMA: Open and Efficient Foundation Language Models. CoRR. doi:10.48550/ARXIV.2302.13971.
[arXiv2023_Qwen] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu. (2023). Qwen Technical Report. CoRR. doi:10.48550/ARXIV.2309.16609.
[arXiv2023_Mistral] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, et al. (2023). Mistral 7B. CoRR. doi:10.48550/ARXIV.2310.06825.
[arXiv2024_Mixtral] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. (2024). Mixtral of Experts. CoRR.
[J2023_Prompt-Survey] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv.. doi:10.1145/3560815.
[WWW2022_KnowPrompt] Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, Huajun Chen. (2022). KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. WWW. doi:10.1145/3485447.3511998.
[ACL2022-Short_P-Tuning] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, Jie Tang. (2022). P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. ACL. doi:10.18653/v1/2022.acl-short.8.
[arXiv2024_MoreAgents] Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye. (2024). More Agents Is All You Need. CoRR.
[reimers2019sentence] Reimers, N. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
[wang2020minilm] Wang, Wenhui, Wei, Furu, Dong, Li, Bao, Hangbo, Yang, Nan, Zhou, Ming. (2020). Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems.
[shirzad2023exphormer] Shirzad, Hamed, Velingker, Ameya, Venkatachalam, Balaji, Sutherland, Danica J, Sinop, Ali Kemal. (2023). Exphormer: Sparse transformers for graphs. International Conference on Machine Learning.
[zhang2025evoflow] Zhang, Guibin, Chen, Kaijie, Wan, Guancheng, Chang, Heng, Cheng, Hong, Wang, Kun, Hu, Shuyue, Bai, Lei. (2025). EvoFlow: Evolving Diverse Agentic Workflows On The Fly. arXiv preprint arXiv:2502.07373.
[tan2023virtual] Tan, Zhen, Guo, Ruocheng, Ding, Kaize, Liu, Huan. (2023). Virtual node tuning for few-shot node classification. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[zhao2024causality] Zhao, Kesen, Zhang, Liang. (2024). Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks. The Twelfth International Conference on Learning Representations.
[chen2024internet] Chen, Weize, You, Ziming, Li, Ran, Guan, Yitong, Qian, Chen, Zhao, Chenyang, Yang, Cheng, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong. (2024). Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061.
[rosenbluth2024distinguished] Rosenbluth, Eran, et al. (2024). Distinguished In Uniform: Self Attention Vs. Virtual Nodes. arXiv preprint arXiv:2405.11951.
[ICLR2023_Self-Consistency] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. {ICLR.
[ACL2023-Findings_Eval-LLM-Behavior] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. ACL Findings. doi:10.18653/V1/2023.FINDINGS-ACL.847.
[arXiv2023_Analyze-LLM-Behavior] Lingjiao Chen, Matei Zaharia, James Zou. (2023). How is ChatGPT's behavior changing over time?. CoRR. doi:10.48550/ARXIV.2307.09009.
[arXiv2024_AntEval] Yuanzhi Liang, Linchao Zhu, Yi Yang. (2024). AntEval: Quantitatively Evaluating Informativeness and Expressiveness of Agent Social Interactions. CoRR.
[ICLR2024_LLM-Bias-MCS] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang. (2024). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
[Science2022_AlphaCode] Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, et al. (2022). Competition-level code generation with AlphaCode. Science. doi:10.1126/science.abq1158.
[liu2023deja] Liu, Zichang, Wang, Jue, Dao, Tri, Zhou, Tianyi, Yuan, Binhang, Song, Zhao, Shrivastava, Anshumali, Zhang, Ce, Tian, Yuandong, Re, Christopher, others. (2023). Deja vu: Contextual sparsity for efficient llms at inference time. International Conference on Machine Learning.
[arXiv2021_Verifier-Math] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint.
[roy2016solving] Roy, Subhro, Roth, Dan. (2016). Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.
[fu2022complexity] Fu, Yao, Peng, Hao, Sabharwal, Ashish, Clark, Peter, Khot, Tushar. (2022). Complexity-based prompting for multi-step reasoning. The Eleventh International Conference on Learning Representations.
[patel2021nlp] Patel, Arkil, Bhattamishra, Satwik, Goyal, Navin. (2021). Are NLP models really able to solve simple math word problems?. arXiv preprint arXiv:2103.07191.
[arXiv2023_ReConcile] Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal. (2023). ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. arXiv preprint.
[arXiv2023_Multi-Agent-Consensus] Huaben Chen, Wenkang Ji, Lufeng Xu, Shiyu Zhao. (2023). Multi-Agent Consensus Seeking via Large Language Models. CoRR. doi:10.48550/ARXIV.2310.20151.
[arXiv2023_LLMs-Simulate_SocialMedia] Petter Törnberg, et al. (2023). Simulating Social Media Using Large Language Models to Evaluate Alternative News Feed Algorithms. CoRR. doi:10.48550/ARXIV.2310.05984.
[J1981_Alternation] Ashok K. Chandra, Dexter Kozen, Larry J. Stockmeyer. (1981). Alternation. J. ACM. doi:10.1145/322234.322243.
[ICML2007_L1Regular] Galen Andrew, Jianfeng Gao. (2007). Scalable training of L1-regularized log-linear models. ICML. doi:10.1145/1273496.1273501.
[Book1997_Algorithms-DataStruct] Dan Gusfield. (1997). Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. doi:10.1017/CBO9780511574931.
[arXiv2015_YaraParser] Mohammad Sadegh Rasooli, Joel R. Tetreault. (2015). Yara Parser: A Fast and Accurate Dependency Parser. CoRR.
[JMLR2005_LearningStruct] Rie Kubota Ando, Tong Zhang. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. J. Mach. Learn. Res..
[J1965_FourierSeries-Comp] Cooley, James W., Tukey, John W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation.
[arXiv2023_Agent-DUMA] Xiaoyu Tian, Liangyu Chen, Na Liu, Yaxuan Liu, Wei Zou, Kaijiang Chen, Ming Cui. (2023). DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking. CoRR. doi:10.48550/ARXIV.2310.18075.
[arXiv2023_AgentTuning] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, Jie Tang. (2023). AgentTuning: Enabling Generalized Agent Abilities for LLMs. CoRR. doi:10.48550/ARXIV.2310.12823.
[arXiv2023_FireAct] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao. (2023). FireAct: Toward Language Agent Fine-tuning. CoRR. doi:10.48550/ARXIV.2310.05915.
[agashe2023evaluating] Agashe, Saaket, Fan, Yue, Wang, Xin Eric. (2023). Evaluating multi-agent coordination abilities in large language models. arXiv preprint arXiv:2310.03903.
[NAACL2024_Agent-Self-Collaboration] Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, Heng Ji. (2024). Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. NAACL.
[arXiv2023_Benchmarking-Agents] Qian Huang, Jian Vora, Percy Liang, Jure Leskovec. (2023). Benchmarking Large Language Models As AI Research Agents. CoRR. doi:10.48550/ARXIV.2310.03302.
[arXiv2023_Agents-SampleEfficiency] Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang. (2023). Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency. CoRR. doi:10.48550/ARXIV.2309.17382.
[arXiv2023_UnifiedAgent-FM] Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin A. Riedmiller. (2023). Towards A Unified Agent with Foundation Models. CoRR. doi:10.48550/ARXIV.2307.09668.
[arXiv2023_Universal-Agent] Anees Aslam. (2023). Universal Language Modelling agent. CoRR. doi:10.48550/ARXIV.2306.06521.
[arXiv2023_Hinder-Agents] Sukai Huang, Nir Lipovetzky, Trevor Cohn. (2023). A Reminder of its Brittleness: Language Reward Shaping May Hinder Learning for Instruction Following Agents. CoRR. doi:10.48550/ARXIV.2305.16621.
[CoLLAs2023_Autotelic-Agents] Cédric Colas, et al. (2023). Augmenting Autotelic Agents with Large Language Models. CoLLAs.
[EMNLP2023_ConceptualStructure_LLM] Siddharth Suresh, Kushin Mukherjee, Xizheng Yu, et al. (2023). Conceptual structure coheres in human cognition but not in large language models. EMNLP.
[Games2021_SocialStrategies-CooperativeBehaviour] Robin Watson, Thomas J. H. Morgan, Rachel L. Kendal, Julie Van de Vyver, Jeremy Kendal. (2021). Social Learning Strategies and Cooperative Behaviour: Evidence of Payoff Bias, but Not Prestige or Conformity, in a Social Dilemma Game. Games. doi:10.3390/G12040089.
[ACL2024_AUTOACT-Self-Planning] Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, Huajun Chen. (2024). AUTOACT: Automatic Agent Learning from Scratch via Self-Planning. ACL.
[arXiv2024_Self-Rewarding-LMs] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. (2024). Self-Rewarding Language Models. CoRR.
[serrano2003topology] Serrano, M. Ángeles, Boguñá, Marián. (2003). Topology of the world trade web. Physical Review E.
[fagiolo2010evolution] Fagiolo, Giorgio, Reyes, Javier, Schiavo, Stefano. (2010). The evolution of the world trade web: a weighted-network analysis. Journal of Evolutionary Economics.
[garlaschelli2007interplay] Garlaschelli, Diego, Di Matteo, Ticiana, Aste, Tomaso, Caldarelli, Guido, Loffredo, Maria I. (2007). Interplay between topology and dynamics in the World Trade Web. The European Physical Journal B.
[pbft] Castro, Miguel, Liskov, Barbara. (1999). Practical Byzantine Fault Tolerance. Proceedings of the Third Symposium on Operating Systems Design and Implementation.
[farahani2013review] Farahani, Reza Zanjirani, Miandoabchi, Elnaz, Szeto, Wai Yuen, Rashidi, Hannaneh. (2013). A review of urban transportation network design problems. European journal of operational research.
[bell1997transportation] Bell, Michael GH, Iida, Yasunori, others. (1997). Transportation network analysis.
[fan2019graph] Fan, Wenqi, Ma, Yao, Li, Qing, He, Yuan, Zhao, Eric, Tang, Jiliang, Yin, Dawei. (2019). Graph neural networks for social recommendation. The world wide web conference.
[belkin2006manifold] Belkin, Mikhail, Niyogi, Partha, Sindhwani, Vikas. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples.. Journal of machine learning research.
[zhou2005learning] Zhou, Dengyong, Huang, Jiayuan, Schölkopf, Bernhard. (2005). Learning from labeled and unlabeled data on a directed graph. Proceedings of the 22nd international conference on Machine learning.
[wang2019kgat] Wang, Xiang, He, Xiangnan, Cao, Yixin, Liu, Meng, Chua, Tat-Seng. (2019). Kgat: Knowledge graph attention network for recommendation. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining.
[nabti2017querying] Nabti, Chemseddine, Seba, Hamida. (2017). Querying massive graph data: A compress and search approach. Future Generation Computer Systems.
[ribeiro2021survey] Ribeiro, Pedro, Paredes, Pedro, Silva, Miguel EP, Aparicio, David, Silva, Fernando. (2021). A survey on subgraph counting: concepts, algorithms, and applications to network motifs and graphlets. ACM Computing Surveys (CSUR).
[bouhenni2021survey] Bouhenni, Sarra, Yahiaoui, Said, Nouali-Taboudjemat, Nadia, Kheddouci, Hamamache. (2021). A survey on distributed graph pattern matching in massive graphs. ACM Computing Surveys (CSUR).
[fulber2020network] Fulber-Garcia, Vinicius, Duarte Jr, Elias P, Huff, Alexandre, dos Santos, Carlos RP. (2020). Network service topology: Formalization, taxonomy and the custom specification model. Computer Networks.
[arnold2020cloud] Arnold, Todd, He, Jia, Jiang, Weifan, Calder, Matt, Cunha, Italo, Giotsas, Vasileios, Katz-Bassett, Ethan. (2020). Cloud provider connectivity in the flat internet. Proceedings of the ACM Internet Measurement Conference.
[fu2020topology] Fu, Xiuwen, Pace, Pasquale, Aloi, Gianluca, Yang, Lin, Fortino, Giancarlo. (2020). Topology optimization against cascading failures on wireless sensor networks using a memetic algorithm. Computer Networks.
[zhu2024llms] Zhu, Yuqi, Wang, Xiaohan, Chen, Jing, Qiao, Shuofei, Ou, Yixin, Yao, Yunzhi, Deng, Shumin, Chen, Huajun, Zhang, Ningyu. (2024). Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web.
[zhu2022multi] Zhu, Xiangru, Li, Zhixu, Wang, Xiaodan, Jiang, Xueyao, Sun, Penglei, Wang, Xuwu, Xiao, Yanghua, Yuan, Nicholas Jing. (2022). Multi-modal knowledge graph construction and application: A survey. IEEE Transactions on Knowledge and Data Engineering.
[zhu2021network] Zhu, Hang, Gupta, Varun, Ahuja, Satyajeet Singh, Tian, Yuandong, Zhang, Ying, Jin, Xin. (2021). Network planning with deep reinforcement learning. Proceedings of the ACM SIGCOMM 2021 Conference.
[huang2022language] Huang, Wenlong, Abbeel, Pieter, Pathak, Deepak, Mordatch, Igor. (2022). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. International conference on machine learning.
[sun2023pearl] Sun, Simeng, Liu, Yang, Wang, Shuohang, Zhu, Chenguang, Iyyer, Mohit. (2023). Pearl: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564.
[ruan2023tptu] Ruan, Jingqing, Chen, Yihong, Zhang, Bin, Xu, Zhiwei, Bao, Tianpeng, Mao, Hangyu, Li, Ziyue, Zeng, Xingyu, Zhao, Rui, others. (2023). Tptu: Task planning and tool usage of large language model-based ai agents. NeurIPS 2023 Foundation Models for Decision Making Workshop.
[sun2023self] Sun, Xiangguo, Cheng, Hong, Liu, Bo, Li, Jia, Chen, Hongyang, Xu, Guandong, Yin, Hongzhi. (2023). Self-supervised hypergraph representation learning for sociological analysis. IEEE Transactions on Knowledge and Data Engineering.
[li2023survey] Li, Yuhan, Li, Zhixun, Wang, Peisong, Li, Jia, Sun, Xiangguo, Cheng, Hong, Yu, Jeffrey Xu. (2023). A survey of graph meets large language model: Progress and future directions. arXiv preprint arXiv:2311.12399.
[sun2023all] Sun, Xiangguo, Cheng, Hong, Li, Jia, Liu, Bo, Guan, Jihong. (2023). All in One: Multi-Task Prompting for Graph Neural Networks. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[shu2024llm] Shu, Zhiyao, Sun, Xiangguo, Cheng, Hong. (2024). When llm meets hypergraph: A sociological analysis on personality via online social networks. arXiv preprint arXiv:2407.03568.
[park2023generativeagentsinteractivesimulacra] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. (2023). Generative Agents: Interactive Simulacra of Human Behavior.
[Drori_2022] Iddo Drori, et al. (2022). A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2123433119.
[guo2024deepseekcoderlargelanguagemodel] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence.
[yu2024metamathbootstrapmathematicalquestions] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu. (2024). MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models.
[chen2023frugalgptuselargelanguage] Lingjiao Chen, Matei Zaharia, James Zou. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.
[ding2024hybridllmcostefficientqualityaware] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing.
[hu2024routerbenchbenchmarkmultillmrouting] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay. (2024). RouterBench: A Benchmark for Multi-LLM Routing System.
[shnitzer2023largelanguagemodelrouting] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin. (2023). Large Language Model Routing with Benchmark Datasets.
[xu2025a-mem] Xu, Wujiang, Mei, Kai, Gao, Hang, Tan, Juntao, Liang, Zujie, Zhang, Yongfeng. (2025). A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.
[_akota_2024] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413.
[salama2025meminsight] Salama, Rana, Cai, Jason, Yuan, Michelle, Currey, Anna, Sunkara, Monica, Zhang, Yi, Benajiba, Yassine. (2025). MemInsight: Autonomous Memory Augmentation for LLM Agents. arXiv preprint arXiv:2503.21760.
[tang2025chemagent] Tang, Xiangru, Hu, Tianyu, Ye, Muyang, Shao, Yanjun, Yin, Xunjian, Ouyang, Siru, Zhou, Wangchunshu, Lu, Pan, Zhang, Zhuosheng, Zhao, Yilun, others. (2025). ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning. arXiv preprint arXiv:2501.06590.
[zhang_automatic_2022] Zhang, Zhuosheng, Zhang, Aston, Li, Mu, Smola, Alex. (2022). Automatic Chain of Thought Prompting in Large Language Models. doi:10.48550/arXiv.2210.03493.
[zhang_aflow_2024] Zhang, Jiayi, Xiang, Jinyu, Yu, Zhaoyang, Teng, Fengwei, Chen, Xionghui, Chen, Jiaqi, Zhuge, Mingchen, Cheng, Xin, Hong, Sirui, Wang, Jinlin, Zheng, Bingnan, Liu, Bang, Luo, Yuyu, Wu, Chenglin. (2024). AFlow: Automating Agentic Workflow Generation.
[ADAS] Hu, Shengran, Lu, Cong, Clune, Jeff. (2024). Automated Design of Agentic Systems.
[eot] Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, Xipeng Qiu. (2023). Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication.
[dai2024costeffectiveonlinemultillmselection] Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui. (2024). Cost-Effective Online Multi-LLM Selection with Versatile Reward Models.
[medagentslargelanguagemodels] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein. (2024). MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.
[ong2024routellmlearningroutellms] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica. (2024). RouteLLM: Learning to Route LLMs with Preference Data.
[mohammadshahi2024routoolearningroutelarge] Alireza Mohammadshahi, Arshad Rafiq Shaikh, Majid Yazdani. (2024). Routoo: Learning to Route to Large Language Models Effectively.
[feng2024graphroutergraphbasedrouterllm] Tao Feng, Yanzhen Shen, Jiaxuan You. (2024). GraphRouter: A Graph-based Router for LLM Selections.
[devlin2019bertpretrainingdeepbidirectional] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[Meta-structure_Discovery_Chen_2024] Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang. (2024). AutoFlow: Automated Workflow Generation for Large Language Model Agents. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3637528.3671965.
[abdin2024phi3technicalreporthighly] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
[lepagnol-etal-2024-small] Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset. (2024). Small Language Models Are Good Too: An Empirical Study of Zero-Shot Classification. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
[srivatsa2024harnessingpowermultipleminds] KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar. (2024). Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing.
[stripelis-etal-2024-tensoropera] Dimitris Stripelis, Zhaozhuo Xu, Zijian Hu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Jipeng Zhang, Tong Zhang, Salman Avestimehr, Chaoyang He. (2024). TensorOpera Router: A Multi-Model Router for Efficient LLM Inference. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. doi:10.18653/v1/2024.emnlp-industry.34.
[erdogan2025plan-and-act] Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami. (2025). Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks. arXiv preprint arXiv:2503.09572.
[huang2024hardertasksneedexperts] Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng. (2024). Harder Tasks Need More Experts: Dynamic Routing in MoE Models.
[aghdam2024damoedynamicexpertallocation] Maryam Akhavan Aghdam, Hongpeng Jin, Yanzhao Wu. (2024). DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models.
[zhang2024cutcrapeconomicalcommunication] Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen. (2024). Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems.
[shang2024agentsquareautomaticllmagent] Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li. (2024). AgentSquare: Automatic LLM Agent Search in Modular Design Space.
[chang2023surveyevaluationlargelanguage] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie. (2023). A Survey on Evaluation of Large Language Models.
[minaee2024largelanguagemodelssurvey] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao. (2024). Large Language Models: A Survey.
[zingg2023detectingoptimisingteaminteractions] Christian Zingg, Alexander von Gernler, Carsten Arzig, Frank Schweitzer, Christoph Gote. (2023). Detecting and Optimising Team Interactions in Software Development.
[morethancode] Frederike Ramin, Christoph Matthies, Ralf Teusner. (2020). More than Code: Contributions in Scrum Software Engineering Teams. Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops. doi:10.1145/3387940.3392241.
[zhang-etal-2024-exploring] Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, Shumin Deng. (2024). Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.782.
[reimers2019sentencebertsentenceembeddingsusing] Nils Reimers, Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
[mbpp] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton. (2021). Program Synthesis with Large Language Models.
[OpenAI-gpt4o] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
[anthropicdd2024claude] Anthropic. (2024). Model card addendum: Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet.
[geminiteam2024gemini15unlockingmultimodal] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. (2024). Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context.
[deepseekai2024deepseekv3technicalreport] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, et al. (2024). DeepSeek-V3 Technical Report.
[humangroup] Katherine Y. Williams, Charles A. O'Reilly. (1998). Demography and Diversity in Organizations: A Review of 40 Years of Research. Research in Organizational Behavior.
[barandoni2024automatingcustomerneedsanalysis] Simone Barandoni, Filippo Chiarello, Lorenzo Cascone, Emiliano Marrale, Salvatore Puccio. (2024). Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry.