Skip to main content

Formatting Instructions For NeurIPS 2024

% Xiaoxin He$^{1}$ \quad Yijun Tian$^{2}$ \quad Yifei Sun$^{1}$ \quad Nitesh V. Chawla$^{2}$ \quad Thomas Laurent$^3$ 0.10cm, Yann LeCun$^{4,5$ \quad Xavier Bresson$^1$ \quad Bryan Hooi$^1$} 0.15cm, {\normalfont $^1$National University of Singapore 0.18cm $^2$University of Notre Dame\quad $^3$Loyola Marymount University\quad }, $^4$New York University\quad $^5$Meta AI

Abstract

% Given a graph with textual attributes, we enable users to `chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop a Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination.~Our codes and datasets are available at: \url{https://anonymous.4open.science/r/G-Retriever} %

Introduction

Graphs and Large Language Models (LLMs). The advent of LLMs has significantly shaped the artificial intelligence landscape. As these models are applied to increasingly diverse tasks, their ability to process complex structured data will be increasingly vital. In particular, in our interconnected world, a significant portion of real-world data inherently possesses a graph structure, such as the Web, e-commerce, recommendation systems, knowledge graphs, and many others. Moreover, many of these involve graphs with textual attributes ( i.e., textual graphs ), making them well-suited for LLM-centric methods. This has spurred interest in combining graph-based technologies, particularly graph neural networks (GNNs), with LLMs to enhance their reasoning on graphs [44, 15, 24].

The Present Work: Enabling 'Chat With Your Graph'. While existing works integrate LLMs and GNNs in various ways, they mostly focus on conventional graph tasks such as node, edge and graph

1 Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

classification [8], or answering simple questions on small or synthetic graphs [44, 31]. In contrast, we develop a flexible question-answering framework targeting complex and real-world graphs. This framework enables users to 'chat with their graph' via a unified conversational interface, representing a leap towards intuitive interaction with graph data, as demonstrated in Figure 1.

The Need for a Comprehensive GraphQA Benchmark. Question answering (QA) is a fundamentally important task in natural language processing, serving as a key benchmark for assessing LLMs and providing a unified interface for various capabilities. Despite extensive research in QA, a comprehensive benchmark specifically tailored for the graph modality is lacking. In contrast to existing benchmarks that focus on basic graph-based reasoning tasks such as node degree, edge existence, and shortest path [6, 44], our benchmark addresses complex and real-world graph applications including common sense reasoning, scene understanding, and knowledge graph reasoning (refer to Figure 2). This is vital for measuring progress toward a model capable of answering a wide range of questions about graphs from diverse applications.

New Architecture for GraphQA. To enable effective and efficient graph QA, even on large graphs, we propose G-Retriever , a new framework combining the strengths of GNNs, LLMs, and RAG (Figure 3). Next, we will discuss the motivation, strengths, and details of our model.

Tackling Hallucination in Graph LLMs. LLMs are prone to hallucination, a phenomenon where the generated content is factually inaccurate or nonsensical [12]. We validate the presence of this issue in graph settings. In particular, we employ a baseline method that adapts MiniGPT-4 [57] to graphs, where a frozen LLM interacts with a trainable GNN that encodes graph data as a soft prompt, as in GraphToken [31]. Our findings, shown in Table 1, indicate that hallucination, an important problem in text-based LLMs, is also prevalent in Graph LLMs. This may be attributed to the baseline's inability to recall the entire graph structure from a single graph embedding, leading to the generation of incorrect nodes or edges during the QA task. In contrast, by employing RAG for direct information retrieval from the actual graph, our G-Retriever mitigates this issue, as substantiated by Table 1.

Enhancing Scalability and Efficiency in Graph LLMs. Recent research endeavors have explored translating graphs into natural language, such as by flattening nodes and edges into a text sequence, enabling their processing by LLMs for graph-based tasks [56, 6]. However, this method faces critical scalability issues. Converting a graph with thousands of nodes and edges into a text sequence results in an excessive number of tokens, surpassing the input capacity of many LLMs. An alternative of truncating the graph text sequence to fit the LLM's input token limit leads to loss of information and response quality. G-Retriever overcomes these issues with its RAG component, which allows for effective scaling to larger graphs by selectively retrieving only relevant parts of the graph.

Tailoring the RAG Approach to Graphs. Existing RAG methodologies are primarily designed for simpler data types or knowledge graphs, where information is retrieved in a manner isolated from the graph structure [7, 1, 36, 16]. Hence, we introduce a new retrieval approach for general textual graphs. Notably, we formulate subgraph retrieval as a Prize-Collecting Steiner Tree (PCST)

Table 1: Observation and mitigation of hallucination in graph LLMs.

Table 1: Observation and mitigation of hallucination in graph LLMs.

optimization problem, which takes the neighborhood information into account during retrieval. This also allows the return of a subgraph most relevant to a query, thereby improving explainability.

The contributions of this paper are outlined as follows:

· Pioneering the integration of Graph RAG. We present the first retrieval approach for general textual graph tasks, which greatly enhances scalability and efficiency. · Enabling 'Chat with Your Graph'. We develop a flexible question answering framework to handle complex and real-world textual graphs through a unified conversational interface. · Introducing A Novel GraphQA Benchmark. We introduce a diverse benchmark targeted at real-world graph question answering, filling a crucial research gap. · Empirical Findings. We demonstrate the efficiency and effectiveness of G-Retriever in multiple domains and present the significant finding of hallucination in graph LLMs.

Graphs and Large Language Models. Asignificant body of research has emerged at the intersection of graph-based techniques and LLMs [30, 24, 15, 44, 54]. This exploration spans diverse aspects, ranging from the design of general graph models [47, 25, 51, 19, 40, 31], and multi-modal architectures [23, 49] to practical applications. Noteworthy applications include fundamental graph reasoning [52, 3, 56], node classification [8, 11, 39, 5, 50, 4, 33], graph classification/regression [32, 55], and leveraging LLMs for knowledge graph-related tasks [41, 14, 29].

Retrieval-Augmented Generation (RAG). The concept of Retrieval-Augmented Generation, initially proposed by Lewis et al. [21], has gained increased attention for its ability to mitigate the issue of hallucination within LLMs and enhance trustworthiness and explainability [7]. Despite its success in language-related tasks, the application of retrieval-based approaches to general graph tasks remains largely unexplored. Most existing work focuses primarily on the knowledge graph [38, 1, 36, 16]. Our research is the first to apply a retrieval-based approach to general graph tasks, marking a novel advancement in the field and demonstrating the versatility of RAG beyond language processing.

Parameter-Efficient Fine-Tuning (PEFT). The field of LLMs has witnessed significant advancements through various parameter-efficient fine-tuning techniques. These methodologies have played a crucial role in refining LLMs, boosting their performance while minimizing the need for extensive parameter training. Notable among these techniques are prompt tuning, as introduced by Lester et al. [20], and prefix tuning, proposed by Li and Liang [22]. Furthermore, methods like LoRA [10],

and the LLaMA-adapter [53], have been influential. These advancements in PEFT have laid the foundation for the development of sophisticated multimodal models. Prominent examples in this domain include MiniGPT-4 [57], LLaVA [26], and NExT-Chat [46]. There are also emerging efforts in applying PEFT to graph LLMs, such as GraphLLM [3] and GraphToken [31] for basic graph reasoing tasks and GNP [41] for multi-option QA on knowledge graphs.

Formalization

This section establishes the notation and formalizes key concepts related to textual graphs, language models for text encoding, and large language models and prompt tuning.

Textual Graphs. A textual graph is a graph where nodes and edges possess textual attributes. Formally, it can be defined as G = ( V, E, { x n } n ∈ V , { x e } e ∈ E ) , where V and E represent the sets of nodes and edges, respectively. Additionally, x n ∈ D L n and x e ∈ D L e denote sequential text associate with a node n ∈ V or an edge e ∈ E , where D represents the vocabulary, and L n and L e signify the length of the text associated with the respective node or edge.

Language Models for Text Encoding. In the context of textual graphs, language models (LMs) are essential for encoding the text attributes associated with nodes and edges, thereby learning representations that capture their semantic meaning. For a node n with text attributes x n ∈ D L n , an LM encodes these attributes as:

$$

$$

where z n is the output of the LM, and d is the dimension of the output vector.

Large Language Models and Prompt Tuning. LLMs have introduced a new paradigm for taskadaptation known as 'pre-train, prompt, and predict', replacing the traditional 'pre-train, fine-tune' paradigm. In this paradigm, the LLM is first pre-trained on a large corpus of text data to learn general language representations. Then, rather than fine-tuning the model on task-specific labeled data, the model is prompted with a textual prompt that specifies the task and context. Subsequently, the model generates the output directly based on the prompt and the input.

The LLM, parameterized by weights θ , takes a sequence of tokens X , and a prompt P as input, and generates a sequence of tokens Y = { y 1 , y 2 , . . . , y r } as output. Formally, the probability distribution of the output sequence given the concatenated input sequence and prompt, i.e., [ P ; X ] , is defined as

$$

$$

Here, y <i represents the prefix of sequence y up to position i -1 , and p ( y i | y <i , [ P ; X ]) represents the probability of generating token y i given y <i and [ P ; X ] .

Soft prompt tuning eliminates the need for manual prompt design. Given a series of p tokens X = { x 1 , x 2 , . . . , x p } , after being processed by the text embedder, it forms a matrix X e ∈ R p × d l , where d l is the dimension of the embedding space. Soft prompts can be represented as parameters P e ∈ R q × d l , where q is the length of the prompt. The prompt is then concatenated with the embedded input, forming a single matrix [ P e ; X e ] ∈ R ( q + p ) × d l . This combined matrix is processed by the self-attention layers in LLM as usual. Training involves maximizing the likelihood of Y through backpropagation, with gradient updates applied solely to P e , while θ remains fixed.

Proposed GraphQA Benchmark

Our GraphQA represents a comprehensive and diverse benchmark for graph question-answering. It is tailored to assess the capabilities of models in answering a wide range of questions about graphs across diverse domains.

Data Format

Each entry in the GraphQA benchmark consists of a textual graph, a question related to the graph, and one or more corresponding answers, as illustrated in Figure 2.

Table 2: Summary of datasets used in GraphQA benchmark.

Figure 2: Illustrative examples from the GraphQA benchmark datasets.

Textual Graphs. The textual graph is converted into a natural language format, resulting in a list of nodes and edges, akin to a CSV file format. It is important to note that while multiple methods exist for textualizing a graph, our focus is not on identifying the optimal solution. Instead, we prioritize a straightforward yet empirically effective approach for representing graphs in natural language, facilitating the benchmark's use in diverse GraphQA scenarios.

Questions and Answers. Questions are designed to explore specific elements or relationships within the graph. Answers, residing within the attributes of nodes or edges, often require multi-hop reasoning for accurate identification.

Description of Datasets

The GraphQA benchmark integrates three existing datasets: ExplaGraphs , SceneGraphs , and WebQSP . Table 2 presents the summary statistics of these datasets. It is important to note that these datasets were not originally developed for this work. However, a significant contribution of our research is the standardization and processing of these diverse datasets into a uniform data format suitable for the GraphQA benchmark. These datasets, previously utilized in different contexts, are reintroduced with a new focus tailored for GraphQA. For a detailed comparison with the original datasets, see the Appendix C.

ExplaGraphs is a dataset for generative commonsense reasoning, focusing on creating explanation graphs for stance prediction in debates. It offers detailed, unambiguous commonsense-augmented graphs to evaluate arguments supporting or refuting a belief. The primary task is to assess whether arguments are supportive or contradictory, using accuracy as the metric. We have converted the triplet-form provided in Saha et al. [35] into a standard graph format.

SceneGraphs , a visual question answering dataset, includes 100,000 scene graphs. Each graph details objects, attributes, and relations within an image. This dataset challenges users with tasks requiring spatial understanding and multi-step inference. The task is to answer open-ended questions based on a textual description of a scene graph, evaluated on accuracy. We have sampled from the GQA dataset [13] and constructed standard graphs from the provided JSON files.

WebQSP is a large-scale multi-hop knowledge graph QA dataset consisting of 4,737 questions. It was proposed by Yih et al. [48] and, following Luo et al. [28], utilizes a subset of Freebase, encompassing facts within 2-hops of entities mentioned in the questions. The task involves answering questions that

Figure 3: Overview of the proposed G-Retriever : 1) Indexing: Graphs are indexed for efficient query processing; 2) Retrieval: The most semantically relevant nodes and edges are retrieved, conditioned on the query; 3) Subgraph Construction: A connected subgraph is extracted, covering as many relevant nodes and edges as possible while maintaining a manageable graph size; 4) Generation: An answer is generated using a 'graph prompt', a textualized graph, and the query.

Figure 3: Overview of the proposed G-Retriever : 1) Indexing: Graphs are indexed for efficient query processing; 2) Retrieval: The most semantically relevant nodes and edges are retrieved, conditioned on the query; 3) Subgraph Construction: A connected subgraph is extracted, covering as many relevant nodes and edges as possible while maintaining a manageable graph size; 4) Generation: An answer is generated using a 'graph prompt', a textualized graph, and the query.

require multi-hop reasoning. Given the possibility of multiple answers for the same question, the hit@1 metric is used to assess the precision of the top returned answer.

G-Retriever

In this section, we introduce G-Retriever , a new architecture tailored for GraphQA, which integrates the strengths of GNNs, LLMs, and RAG. To allow efficient fine-tuning while preserving the LLM's pretrained language capabilities, we freeze the LLM and use a soft prompting approach on the output of the GNN. Our RAG-based design mitigates hallucinations through direct retrieval of the graph, while allowing our approach to scale to graphs exceeding the LLM's context window size. To adapt RAG to graphs, we formulate subgraph retrieval as a PCST optimization problem. This approach also allows us to enhance explainability by returning the retrieved subgraph.

G-Retriever comprises four main steps: indexing, retrieval, subgraph construction and generation, as depicted in Figure 3. The implementation details of each step are elaborated in the following sections.

Indexing

We initiate the RAG approach by generating node and graph embeddings using a pre-trained LM. These embeddings are then stored in a nearest neighbor data structure.

To elaborate, consider x n ∈ D L n as the text attributes of node n . Utilizing a pre-trained LM, such as SentenceBert [34], we apply the LM to x n , yielding the representation z n :

$$

$$

where d denotes the dimension of the output vector. Similar preprocessing steps are applied to edges. Refer to Figure 3, Step 1 for an illustrative representation.

Retrieval

For retrieval, we employ the same encoding strategy to the query x q , to ensure consistent treatment of textual information:

$$

$$

Next, to identify the most relevant nodes and edges for the current query, we use a k-nearest neighbors retrieval approach. This method yields a set of 'relevant nodes/edges' based on the similarity between the query and each node or edge. The retrieval operation is defined as:

$$

$$

where z n and z e are the embeddings of node n and edge e , respectively. We use the cosine similarity function, cos ( · , · ) , to measure the similarity between the query representation and the node/edge embeddings. The argtopk operation retrieves the top-k elements based on this similarity, providing a set of nodes V k and edges E k considered most relevant to the query. See Step 2 of Figure 3.

Subgraph Construction

This step aims to construct a subgraph that encompasses as many relevant nodes and edges as possible, while keeping the graph size manageable. This approach offers two key benefits: Firstly, it helps to filter out nodes and edges that are not pertinent to the query. This is crucial because irrelevant information can overshadow the useful data, potentially diverting the focus of the subsequent LLM from the information of interest. Secondly, it enhances efficiency; by keeping the graph size manageable, it becomes feasible to translate the graph into natural language and then input it into the LLM for processing. The Prize-Collecting Steiner Tree algorithm [2] serves as our primary method for identifying such optimally sized and relevant subgraphs. See Step 3 in Figure 3.

Prize-Collecting Steiner Tree (PCST). The PCST problem aims to find a connected subgraph that maximizes the total prize values of its nodes while minimizing the total costs of its edges. Our approach assigns higher prize values to nodes and edges more relevant to the query, as measured by cosine similarity. Specifically, the top k nodes/edges are assigned descending prize values from k down to 1, with the rest assigned zero. The node prize assignment is as follows:

$$

$$

Edge prizes are assigned similarly. The objective is to identify a subgraph, S ∗ = ( V ∗ , E ∗ ) , that optimizes the total prize of nodes and edges, minus the costs associated with the size of the subgraph:

$$

$$

where

$$

$$

and C e denotes a predefined cost per edge, which is adjustable to control the subgraph size.

The original PCST algorithm is designed for node prizes only. However, given the significance of edge semantics in certain scenarios, we adapt the algorithm to accommodate edge prizes as follows: Consider an edge e with a cost C e and a prize P e . If C e > P e , it can be treated as a reduced edge cost of C e -P e . However, if P e > C e , negative edge costs are not allowed in the original algorithm. Our solution involves replacing edge e with a 'virtual node' v e , connected to both endpoints of e . This virtual node is assigned a prize of P e -C e , and the cost of the two new edges leading to the virtual node is set to zero. This modification effectively mirrors the original problem, as including edge e in the original graph is analogous to including the virtual node in the modified graph. Finally, we optimize the PCST problem using a near-linear time approach [9].

Answer Generation

Graph Encoder. Let S ∗ = ( V ∗ , E ∗ ) represent the retrieved subgraph. We use a graph encoder to model the structure of this graph, specifically using a standard Graph Attention Network (GAT) [43]. Our approach for encoding the retrieved subgraph is defined as follows:

$$

$$

Here, POOL denotes the mean pooling operation, and d g is the dimension of the graph encoder.

Projection Layer. We incorporate a multilayer perceptron (MLP) to align the graph token with the vector space of the LLM:

$$

$$

Text Embedder. To leverage the text-reasoning capabilities of LLMs, we transform the retrieved subgraph S ∗ into a textual format. This transformation involves flattening the textual attributes of the

nodes and edges, as illustrated in the green box in Figure 2. We refer to this operation as textualize ( · ) . Subsequently, we combine the textualized graph with the query to generate a response. Let x q denote the query; we concatenate it with the textualized graph textualize ( S ∗ ) . We then map the result to an embedding h t using a text embedder, which is the first layer of a pretrained and frozen LLM:

$$

$$

LLMGeneration with Graph Prompt Tuning. The final stage involves generating the answer Y given the graph token ˆ h g , acting as a soft prompt, and the text embedder output h t . These inputs are fed through the self-attention layers of a pretrained frozen LLM, with parameter θ . The generation process is represented as follows:

$$

$$

where [ ˆ h g ; h t ] concatenates the graph token ˆ h g and the text embedder output h t . While θ is frozen, the graph token ˆ h g receives gradients, enabling the optimization of the parameters of the graph encoder ϕ 1 and the projection layer ϕ 2 through standard backpropagation.

Experiments

Experiment Setup

In the indexing step, we use SentenceBert [34] as the LM to encode all node and edge attributes. In the generation step, we use the open-source Llama2-7b [42] as the LLM and Graph Transformer [37] as the graph encoder. Additional details are provided in Appendix B.1.

Model Configurations

In our experiments, we consider three model configurations:

  1. Inference-only : Using a frozen LLM for direct question answering with textual graph and question, see Figure 4.

Figure 4: Model configuration 1) Inference-only .

Figure

(a) Prompt Tuning

(b) G-Retriever

Figure 5: Model configuration 2) Frozen LLM w/ prompt tuning .

Figure

(a) LoRA

Main Results

In our experiments, we consider three model configurations: 1) Inference-only : Using a frozen LLM for direct question answering; 2) Frozen LLM w/ prompt tuning (PT) : Keeping the parameters of the LLM frozen and adapting only the prompt; 3) Tuned LLM : Fine-tuning the LLM with LoRA [10]. We provide more details in Appendix B.2.

Table 3: Performance comparison across ExplaGraphs , SceneGraphs , and WebQSP datasets for different configurations, including Inference-only, Frozen LLM with prompt tuning (PT), and Tuned LLM settings. Mean scores and standard deviations (mean ± std) are presented. The first best result for each task is highlighted in bold and the second best result is highlighted with an underline.

Table 3 demonstrates the effectiveness of our method across three datasets in various configurations. In the inference-only setting, G-Retriever surpasses all baselines. Notably, LLM can perform even better when no graph knowledge is provided ( i.e., question only), which might be attributed to the complexity and potential noise in the knowledge. For frozen LLM with prompt tuning, G-Retriever outperforms traditional prompt tuning and GraphToken [31], a graph prompt tuning-based method, with average performance increases of 40.6% and 30.8% respectively. Furthermore, when tuned with LoRA, G-Retriever achieves the best performance.

Table 4: Retrieval on graphs significantly improves efficiency.

Table 5: Quantitative comparison of hallucination on the SceneGraphs dataset.

Efficiency Evaluation

The efficiency of our approach is highlighted by the data in Table 4. Implementing our graph-based retrieval significantly decreases the number of tokens required to describe the graphs in text, reduces the number of nodes in graphs, and speeds up the training process. Specifically, for the SceneGraphs dataset, tokens decreased by 83%, nodes by 74%, and training time by 29%. For the WebQSP dataset, tokens decreased by 99%, nodes by 99%, and training time by 67%. These substantial reductions demonstrate the method's efficiency and potential in managing large-scale graph data.

Mitigation of Hallucination

To evaluate hallucination, we instructed the models to answer graph-related questions, specifically identifying supporting nodes or edges from the graph. We manually reviewed 100 responses from both our method and the baseline ( i.e., LLM with graph prompt tuning), verifying the existence of the nodes and edges referenced in the model's output within the actual graph. Table 5 shows that G-Retriever significantly reduces hallucinations by 54% compared to the baseline, as our graph retrieval ensures that the data is sourced directly from the actual graph, leading to less hallucination. See Appendix F for details.

Ablation Study

In this ablation study, we assess the individual impact of key components within our pipeline. As shown in Table 6, there are performance drops when any of these components are removed, with the graph encoder and textualized graph showing declines of 22.51% and 19.19%, respectively. This demonstrates their complementary effects in representing the graph in both textual and embedded formats. Additionally, the retrieval on graphs is also important to the overall performance. Further details are available in Appendix B.3. We also present additional studies on our framework: it is robust to the choice of graph encoders (see Appendix B.4) and benefits from the increased scale of LLMs (see Appendix B.5).

Additionally, we include a detailed comparison with existing retrieval methods (see Appendix D), a discussion on the complexity (see Appendix E), and demonstrations on how to use G-Retriever to 'chat with your graph' (see Appendix G).

Conclusion

In this work, we introduce a new GraphQA benchmark for real-world graph question answering and present G-Retriever , an architecture adept at complex and creative queries. Experimental results show that G-Retriever surpasses baselines in textual graph tasks across multiple domains, scales effectively with larger graph sizes, and demonstrates resistance to hallucination.

Limitations and Future Work: Currently, G-Retriever employs a static retrieval component. Future developments could investigate more sophisticated RAG where the retrieval is trainable.

Acknowledgment

XB is supported by NUS Grant ID R-252-000-B97-133.

Submission of papers to NeurIPS 2024

Style

Retrieval of style files

For retrieval, we employ the same encoding strategy to the query x q , to ensure consistent treatment of textual information:

$$

$$

Next, to identify the most relevant nodes and edges for the current query, we use a k-nearest neighbors retrieval approach. This method yields a set of 'relevant nodes/edges' based on the similarity between the query and each node or edge. The retrieval operation is defined as:

$$

$$

where z n and z e are the embeddings of node n and edge e , respectively. We use the cosine similarity function, cos ( · , · ) , to measure the similarity between the query representation and the node/edge embeddings. The argtopk operation retrieves the top-k elements based on this similarity, providing a set of nodes V k and edges E k considered most relevant to the query. See Step 2 of Figure 3.

Preprint option

Graphs and Large Language Models (LLMs). The advent of LLMs has significantly shaped the artificial intelligence landscape. As these models are applied to increasingly diverse tasks, their ability to process complex structured data will be increasingly vital. In particular, in our interconnected world, a significant portion of real-world data inherently possesses a graph structure, such as the Web, e-commerce, recommendation systems, knowledge graphs, and many others. Moreover, many of these involve graphs with textual attributes ( i.e., textual graphs ), making them well-suited for LLM-centric methods. This has spurred interest in combining graph-based technologies, particularly graph neural networks (GNNs), with LLMs to enhance their reasoning on graphs [44, 15, 24].

The Present Work: Enabling 'Chat With Your Graph'. While existing works integrate LLMs and GNNs in various ways, they mostly focus on conventional graph tasks such as node, edge and graph

1 Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

classification [8], or answering simple questions on small or synthetic graphs [44, 31]. In contrast, we develop a flexible question-answering framework targeting complex and real-world graphs. This framework enables users to 'chat with their graph' via a unified conversational interface, representing a leap towards intuitive interaction with graph data, as demonstrated in Figure 1.

The Need for a Comprehensive GraphQA Benchmark. Question answering (QA) is a fundamentally important task in natural language processing, serving as a key benchmark for assessing LLMs and providing a unified interface for various capabilities. Despite extensive research in QA, a comprehensive benchmark specifically tailored for the graph modality is lacking. In contrast to existing benchmarks that focus on basic graph-based reasoning tasks such as node degree, edge existence, and shortest path [6, 44], our benchmark addresses complex and real-world graph applications including common sense reasoning, scene understanding, and knowledge graph reasoning (refer to Figure 2). This is vital for measuring progress toward a model capable of answering a wide range of questions about graphs from diverse applications.

New Architecture for GraphQA. To enable effective and efficient graph QA, even on large graphs, we propose G-Retriever , a new framework combining the strengths of GNNs, LLMs, and RAG (Figure 3). Next, we will discuss the motivation, strengths, and details of our model.

Tackling Hallucination in Graph LLMs. LLMs are prone to hallucination, a phenomenon where the generated content is factually inaccurate or nonsensical [12]. We validate the presence of this issue in graph settings. In particular, we employ a baseline method that adapts MiniGPT-4 [57] to graphs, where a frozen LLM interacts with a trainable GNN that encodes graph data as a soft prompt, as in GraphToken [31]. Our findings, shown in Table 1, indicate that hallucination, an important problem in text-based LLMs, is also prevalent in Graph LLMs. This may be attributed to the baseline's inability to recall the entire graph structure from a single graph embedding, leading to the generation of incorrect nodes or edges during the QA task. In contrast, by employing RAG for direct information retrieval from the actual graph, our G-Retriever mitigates this issue, as substantiated by Table 1.

Enhancing Scalability and Efficiency in Graph LLMs. Recent research endeavors have explored translating graphs into natural language, such as by flattening nodes and edges into a text sequence, enabling their processing by LLMs for graph-based tasks [56, 6]. However, this method faces critical scalability issues. Converting a graph with thousands of nodes and edges into a text sequence results in an excessive number of tokens, surpassing the input capacity of many LLMs. An alternative of truncating the graph text sequence to fit the LLM's input token limit leads to loss of information and response quality. G-Retriever overcomes these issues with its RAG component, which allows for effective scaling to larger graphs by selectively retrieving only relevant parts of the graph.

Tailoring the RAG Approach to Graphs. Existing RAG methodologies are primarily designed for simpler data types or knowledge graphs, where information is retrieved in a manner isolated from the graph structure [7, 1, 36, 16]. Hence, we introduce a new retrieval approach for general textual graphs. Notably, we formulate subgraph retrieval as a Prize-Collecting Steiner Tree (PCST)

Table 1: Observation and mitigation of hallucination in graph LLMs.

Table 1: Observation and mitigation of hallucination in graph LLMs.

optimization problem, which takes the neighborhood information into account during retrieval. This also allows the return of a subgraph most relevant to a query, thereby improving explainability.

The contributions of this paper are outlined as follows:

· Pioneering the integration of Graph RAG. We present the first retrieval approach for general textual graph tasks, which greatly enhances scalability and efficiency. · Enabling 'Chat with Your Graph'. We develop a flexible question answering framework to handle complex and real-world textual graphs through a unified conversational interface. · Introducing A Novel GraphQA Benchmark. We introduce a diverse benchmark targeted at real-world graph question answering, filling a crucial research gap. · Empirical Findings. We demonstrate the efficiency and effectiveness of G-Retriever in multiple domains and present the significant finding of hallucination in graph LLMs.

General formatting instructions

This step aims to construct a subgraph that encompasses as many relevant nodes and edges as possible, while keeping the graph size manageable. This approach offers two key benefits: Firstly, it helps to filter out nodes and edges that are not pertinent to the query. This is crucial because irrelevant information can overshadow the useful data, potentially diverting the focus of the subsequent LLM from the information of interest. Secondly, it enhances efficiency; by keeping the graph size manageable, it becomes feasible to translate the graph into natural language and then input it into the LLM for processing. The Prize-Collecting Steiner Tree algorithm [2] serves as our primary method for identifying such optimally sized and relevant subgraphs. See Step 3 in Figure 3.

Prize-Collecting Steiner Tree (PCST). The PCST problem aims to find a connected subgraph that maximizes the total prize values of its nodes while minimizing the total costs of its edges. Our approach assigns higher prize values to nodes and edges more relevant to the query, as measured by cosine similarity. Specifically, the top k nodes/edges are assigned descending prize values from k down to 1, with the rest assigned zero. The node prize assignment is as follows:

$$

$$

Edge prizes are assigned similarly. The objective is to identify a subgraph, S ∗ = ( V ∗ , E ∗ ) , that optimizes the total prize of nodes and edges, minus the costs associated with the size of the subgraph:

$$

$$

where

$$

$$

and C e denotes a predefined cost per edge, which is adjustable to control the subgraph size.

The original PCST algorithm is designed for node prizes only. However, given the significance of edge semantics in certain scenarios, we adapt the algorithm to accommodate edge prizes as follows: Consider an edge e with a cost C e and a prize P e . If C e > P e , it can be treated as a reduced edge cost of C e -P e . However, if P e > C e , negative edge costs are not allowed in the original algorithm. Our solution involves replacing edge e with a 'virtual node' v e , connected to both endpoints of e . This virtual node is assigned a prize of P e -C e , and the cost of the two new edges leading to the virtual node is set to zero. This modification effectively mirrors the original problem, as including edge e in the original graph is analogous to including the virtual node in the modified graph. Finally, we optimize the PCST problem using a near-linear time approach [9].

Headings: first level

Headings: second level

Headings: third level

Paragraphs

Citations, figures, tables, references

Utilizing two A100 GPUs, each with 80GB of memory, we conducted tests on Llama2-7b and WebQSP datasets. Our experiments had a training batch size of 16 and an evaluation batch size of 32, yielding the following results.

Table 12: Performance and Efficiency of Various Methods on the WebQSP dataset.

These results highlight efficiency improvements via graph RAG, which significantly reduces graph size (e.g., eliminating 99% of nodes in the WebQSP dataset) and speeds up running time.

Citations within the text

Footnotes

Figures

Tables

Math

Final instructions

Graphs and Large Language Models (LLMs). The advent of LLMs has significantly shaped the artificial intelligence landscape. As these models are applied to increasingly diverse tasks, their ability to process complex structured data will be increasingly vital. In particular, in our interconnected world, a significant portion of real-world data inherently possesses a graph structure, such as the Web, e-commerce, recommendation systems, knowledge graphs, and many others. Moreover, many of these involve graphs with textual attributes ( i.e., textual graphs ), making them well-suited for LLM-centric methods. This has spurred interest in combining graph-based technologies, particularly graph neural networks (GNNs), with LLMs to enhance their reasoning on graphs [44, 15, 24].

The Present Work: Enabling 'Chat With Your Graph'. While existing works integrate LLMs and GNNs in various ways, they mostly focus on conventional graph tasks such as node, edge and graph

1 Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

Figure 1: We develop a flexible question-answering framework targeting real-world textual graph applications via a unified conversational interface. Presented here are examples showcasing the model's adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

classification [8], or answering simple questions on small or synthetic graphs [44, 31]. In contrast, we develop a flexible question-answering framework targeting complex and real-world graphs. This framework enables users to 'chat with their graph' via a unified conversational interface, representing a leap towards intuitive interaction with graph data, as demonstrated in Figure 1.

The Need for a Comprehensive GraphQA Benchmark. Question answering (QA) is a fundamentally important task in natural language processing, serving as a key benchmark for assessing LLMs and providing a unified interface for various capabilities. Despite extensive research in QA, a comprehensive benchmark specifically tailored for the graph modality is lacking. In contrast to existing benchmarks that focus on basic graph-based reasoning tasks such as node degree, edge existence, and shortest path [6, 44], our benchmark addresses complex and real-world graph applications including common sense reasoning, scene understanding, and knowledge graph reasoning (refer to Figure 2). This is vital for measuring progress toward a model capable of answering a wide range of questions about graphs from diverse applications.

New Architecture for GraphQA. To enable effective and efficient graph QA, even on large graphs, we propose G-Retriever , a new framework combining the strengths of GNNs, LLMs, and RAG (Figure 3). Next, we will discuss the motivation, strengths, and details of our model.

Tackling Hallucination in Graph LLMs. LLMs are prone to hallucination, a phenomenon where the generated content is factually inaccurate or nonsensical [12]. We validate the presence of this issue in graph settings. In particular, we employ a baseline method that adapts MiniGPT-4 [57] to graphs, where a frozen LLM interacts with a trainable GNN that encodes graph data as a soft prompt, as in GraphToken [31]. Our findings, shown in Table 1, indicate that hallucination, an important problem in text-based LLMs, is also prevalent in Graph LLMs. This may be attributed to the baseline's inability to recall the entire graph structure from a single graph embedding, leading to the generation of incorrect nodes or edges during the QA task. In contrast, by employing RAG for direct information retrieval from the actual graph, our G-Retriever mitigates this issue, as substantiated by Table 1.

Enhancing Scalability and Efficiency in Graph LLMs. Recent research endeavors have explored translating graphs into natural language, such as by flattening nodes and edges into a text sequence, enabling their processing by LLMs for graph-based tasks [56, 6]. However, this method faces critical scalability issues. Converting a graph with thousands of nodes and edges into a text sequence results in an excessive number of tokens, surpassing the input capacity of many LLMs. An alternative of truncating the graph text sequence to fit the LLM's input token limit leads to loss of information and response quality. G-Retriever overcomes these issues with its RAG component, which allows for effective scaling to larger graphs by selectively retrieving only relevant parts of the graph.

Tailoring the RAG Approach to Graphs. Existing RAG methodologies are primarily designed for simpler data types or knowledge graphs, where information is retrieved in a manner isolated from the graph structure [7, 1, 36, 16]. Hence, we introduce a new retrieval approach for general textual graphs. Notably, we formulate subgraph retrieval as a Prize-Collecting Steiner Tree (PCST)

Table 1: Observation and mitigation of hallucination in graph LLMs.

Table 1: Observation and mitigation of hallucination in graph LLMs.

optimization problem, which takes the neighborhood information into account during retrieval. This also allows the return of a subgraph most relevant to a query, thereby improving explainability.

The contributions of this paper are outlined as follows:

· Pioneering the integration of Graph RAG. We present the first retrieval approach for general textual graph tasks, which greatly enhances scalability and efficiency. · Enabling 'Chat with Your Graph'. We develop a flexible question answering framework to handle complex and real-world textual graphs through a unified conversational interface. · Introducing A Novel GraphQA Benchmark. We introduce a diverse benchmark targeted at real-world graph question answering, filling a crucial research gap. · Empirical Findings. We demonstrate the efficiency and effectiveness of G-Retriever in multiple domains and present the significant finding of hallucination in graph LLMs.

Preparing PDF files

Margins in LaTeX{

In our experiments, we consider three model configurations: 1) Inference-only : Using a frozen LLM for direct question answering; 2) Frozen LLM w/ prompt tuning (PT) : Keeping the parameters of the LLM frozen and adapting only the prompt; 3) Tuned LLM : Fine-tuning the LLM with LoRA [10]. We provide more details in Appendix B.2.

Table 3: Performance comparison across ExplaGraphs , SceneGraphs , and WebQSP datasets for different configurations, including Inference-only, Frozen LLM with prompt tuning (PT), and Tuned LLM settings. Mean scores and standard deviations (mean ± std) are presented. The first best result for each task is highlighted in bold and the second best result is highlighted with an underline.

Table 3 demonstrates the effectiveness of our method across three datasets in various configurations. In the inference-only setting, G-Retriever surpasses all baselines. Notably, LLM can perform even better when no graph knowledge is provided ( i.e., question only), which might be attributed to the complexity and potential noise in the knowledge. For frozen LLM with prompt tuning, G-Retriever outperforms traditional prompt tuning and GraphToken [31], a graph prompt tuning-based method, with average performance increases of 40.6% and 30.8% respectively. Furthermore, when tuned with LoRA, G-Retriever achieves the best performance.

Table 4: Retrieval on graphs significantly improves efficiency.

Table 5: Quantitative comparison of hallucination on the SceneGraphs dataset.

References

Appendix / supplemental material

Impact Statements

As LLMs are applied to increasingly diverse tasks, their ability to process complex structured data will be increasingly vital. Our work aims to enhance LLMs' ability to interact with graph-structured data, while resisting hallucination, thus improving model reliability. We also enhance explainability, both by returning the retrieved subgraph, and through the use of conversational interfaces for 'chatting with a graph', which allows for better human-AI interaction and for models to behave in a way that is more well-aligned with human expectations.

Experiment

Implementation Settings

Experiments are conducted using 2 NVIDIA A100-80G GPUs. Each experiment is replicated four times, utilizing different seeds for each run to ensure robustness and reproducibility.

Graph Encoder. We use Graph Transformer [37] as the GNN backbone. Our configuration employs 4 layers, each with 4 attention heads, and a hidden dimension size of 1024.

LLM. We use the open-sourced Llama2-7b [42] as the LLM backbone. In fine-tuning the LLM with LoRA [10], the lora_r parameter (dimension for LoRA update matrices) is set to 8, and lora_alpha (scaling factor) is set to 16. The dropout rate is set to 0.05. In prompt tuning, the LLM is configured with 10 virtual tokens. The number of max text length is 512, the number of max new tokens, i.e., the maximum numbers of tokens to generate, is 32.

PCST . For retrieval over graphs via PCST, for the SceneGraphs dataset, we select the top k nodes and edges, setting k to 3. Here, the cost of edges, denoted as C e , is set to 1. Regarding the WebQSP dataset, we set k = 3 for nodes and k = 5 for edges, with the edge cost, C e , adjusted to 0.5. For the ExplaGraphs dataset, which is characterized by a small graph size averaging 5.17 nodes and 4.25 edges (as detailed in Table 2), the entire graph can fit in the LLM's context window size. Consequently, we aim to retrieve the whole graph by setting k to 0, effectively returning the original graph unaltered.

Optimization. We use the AdamW [27] optimizer. We set the initial learning rate at 1e-5, with a weight decay of 0.05. The learning rate decays with a half-cycle cosine decay after the warm-up period. The batch size is 4, and the number of epochs is 10. To prevent overfitting and ensure training efficiency, an early stopping mechanism is implemented with a patience setting of 2 epochs.

Details of Model Configurations

In our experiments, we consider three model configurations:

  1. Inference-only : Using a frozen LLM for direct question answering with textual graph and question, see Figure 4.

Figure 4: Model configuration 1) Inference-only .

Figure

(a) Prompt Tuning

(b) G-Retriever

Figure 5: Model configuration 2) Frozen LLM w/ prompt tuning .

Figure

(a) LoRA

Details of Ablation Study

This section illustrates the modifications made to the original architecture in the ablation study, as presented in Figure 7.

Without Graph Encoder (w/o GraphEncoder): In this setting, we replaced the graph encoder with trainable soft tokens, setting the number of these virtual tokens to 10.

Without Projection Layer (w/o Projection Layer): Here, we removed the projection layer following the graph encoder. We configured the output dimension of the graph encoder to be 4,096, matching the hidden dimension of Llama2-7b. This allows the output graph token (the yellow token in Figure 7b) to be concatenated directly with the LLM tokens (blue tokens).

Without Textualized Graph (w/o Textualized Graph): In this configuration, we modified the textual input to the LLM. Rather than using a combination of the question and the textualized graph, we solely used the question.

(a) w/o GraphEncoder

(a) w/o GraphEncoder

Figure 7: Ablation study configurations.

The Choice of Graph Encoder

In addition to the Graph Transformer [37], we explore other GNNs as the graph encoder, such as GCN [17] and the GAT [43]. The comparative results of these models on the WebQSP and ExplaGraphs datasets are presented in Table 7.

Table 7: Performance of different graph encoders on the WebQSP and ExplaGraphs datasets.

The results demonstrate that our proposed method exhibits consistent robustness across different graph encoders. Notably, all three encoders - GCN, GAT, and GraphTransformer - demonstrate competitive and closely aligned performance on the WebQSP dataset, with Hit@1 scores of 70.70, 70.27, and 70.49, respectively. However, the performance differentiation becomes more pronounced on the ExplaGraphs dataset, where GraphTransformer exhibits a superior Hit@1 score of 0.8516, followed by GAT and GCN with scores of 0.8430 and 0.8394, respectively. This variation in performance across the datasets highlights the importance of encoder selection based on the specific characteristics and requirements of the dataset.

The Choice of LLM

As for the choice of LLM, we considered both Llama2-7b and Llama2-13b. Our experiments demonstrate that stronger LLMs enhance the effectiveness of our method, as shown in Table 8, indicating that it benefits from the increased scale of the LLMs.

Table 8: Performance of different LLMs on the WebQSP dataset.

GraphQA Benchmark

In this section, we detail how our GraphQA benchmark differs from the original datasets, including the specific processing steps we employed. For concrete examples that illustrate the differences between the raw text in the original dataset and in our GraphQA benchmark, please refer to Table 9.

ExplaGraphs . The original dataset 2 [35] represents relationships using triplets. We have standardized this format by converting the triplets into a graph representation. Specifically, each head and tail in a triplet is transformed into a node, and the relation is transformed into an edge. Since the test dataset labels are not available, we have utilized only the training and validation (val) datasets from the original collection. We further divided these into training, val, and test subsets, using a 6:2:2 ratio.

2 https://explagraphs.github.io/

(c) w/o Textualized Graph

Table 9: Comparison of text formats in original datasets and our GraphQA benchmark.

SceneGraphs . The original GQA dataset is designed for real-world visual reasoning and compositional question answering, aiming to address key shortcomings of previous VQA datasets [13]. It comprises 108k images, each associated with a Scene Graph. In our study, we focus differently on graph question answering; hence, we did not utilize the image counterparts, leveraging only the scene graphs from the original dataset. Additionally, the original dataset describes images using JSON files. We simplified the object IDs to suit our research needs. We randomly sampled 100k samples from the original dataset and divided them into training, validation, and test subsets, following a 6:2:2 ratio.

WebQSP . We follow the preprocessing steps from RoG 3 [28]. The original dataset uses a list of triplets format, which we have transformed into our unified graph format. Furthermore, to avoid

3 https://huggingface.co/datasets/rmanluo/RoG-webqsp

discrimination between capital and lowercase words, we have converted all words to lowercase. We used the same dataset split as in the original dataset.

Graph Retrieval-Augmented Generation (GraphRAG)

Comparison with Existing GraphRAG Methods

Most existing GraphRAG methods are designed specifically for knowledge graphs and focus on node, edge, or triples-level retrieval [1, 36, 16, 38]. Our method is different in two main ways: 1) It focuses on more general textual graphs, not just knowledge graphs. 2) It enables the return of a subgraph most closely related to a query, rather than a list of top-k triples. The triples in other methods are chosen in isolation from the graph, failing to capture neighborhood information effectively. In contrast, our method takes the context into account during retrieval.

The Impact of K for Retrieval

We identify the most relevant nodes and edges and use a k-nearest neighbors retrieval approach (see Equation 6). Small k values may omit crucial knowledge or information relevant to the query, while large k values could introduce excessive information, distracting the model from the essential details. To evaluate the impact of the number of k, we have conducted additional experiments by varying the choice of k to 3, 5, 10, and 20.

Table 10: The impact of k on the webqsp dataset.

As shown in Table 10, the Hit@1 metric initially rises for small k values, peaks at a certain point, and then declines for large k values. Determining the optimal k value can be achieved through techniques like cross-validation using a validation set.

The Choice of Similarity Function

The choice of similarity function is also important. In this work, we use cosine similarity, a widely adopted metric for measuring vector similarity in models that process vision and language. For instance, CLIP also employs cosine similarity to assess the similarity between text and image features. Although it might not be the optimal choice, we believe that cosine similarity is a general, representative, and valid choice for facilitating fast retrieval tasks.

The Quality of Retrieval

We quantify the quality of our retrieval method as follows: We examine the retrieval subgraph, and if the label is contained within it, we consider it a successful retrieval. We calculate the retrieval success rate of our method and the retrieval method proposed in KAPING [1] on the WebQSP dataset.

Table 11: The quality of retrieval methods on the WebQSP dataset.

As shown in Table 11, these results validate the effectiveness of our retrieval method. In contrast to the triple-based retrieval in KAPING, which relies on the similarity between triples and the query, our PCST-based subgraph retrieval is more accurate as it takes the graph structure into account. By design, it retrieves connected subgraphs, capturing not just nodes or edges with high relevance scores, but also those that act as "bridges" connecting other highly relevant nodes and edges.

Discussion on the Complexity

The integration of GNNs, LLMs and GraphRAG

G-Retriever is framework integrate the strengths of GNNs, LLMs and GraphRAG. The LLM+X framework, which involves enriching LLMs with multi-modal capabilities by integrating an LLM with an encoder from another modality, is a widely adopted approach. Notable examples include Llava, MiniGPT-4, and Flamingo, among others. They are not complex in terms of understanding or implementation. Regarding the integration of GraphRAG, it does not require training and can be implemented during the preprocessing stage or on the fly. This approach does not significantly increase time complexity or computational complexity. On the contrary, it can substantially reduce the size of the graph ( e.g., eliminating 99% of nodes in the WebQSP dataset), which in turn speeds up the overall running time ( e.g., reducing it from 18.7 min/epoch to 6.2 min/epoch on the WebQSP dataset).

Computational Resources

Utilizing two A100 GPUs, each with 80GB of memory, we conducted tests on Llama2-7b and WebQSP datasets. Our experiments had a training batch size of 16 and an evaluation batch size of 32, yielding the following results.

Table 12: Performance and Efficiency of Various Methods on the WebQSP dataset.

These results highlight efficiency improvements via graph RAG, which significantly reduces graph size (e.g., eliminating 99% of nodes in the WebQSP dataset) and speeds up running time.

Hallucination in Graph LLMs

In this section, we present quantitative results regarding hallucinations in the SceneGraphs dataset.

Baseline. For our baseline, we adapted MiniGPT-4 [57] to graph contexts. This approach involves a frozen LLM interacting with a trainable GNN that encodes graph data as a soft prompt, denoted as LLM+Graph Prompt Tuning. We focus on graph prompt tuning as the baseline, instead of converting the graph into text, since the textual representation of the graph is large and consistently exceeds the input token limits of LLMs.

Experiment Design. We instructed the LLM to answer graph-related questions and to list nodes or edges in the explanation graph that support its answers. Since standard answers for these questions do not exist, allowing the LLM to respond flexibly, it becomes challenging to evaluate its responses. To address this, we manually examined 100 responses generated by our method and the LLM with graph prompt tuning, verifying whether the nodes and edges referenced in the LLM's output actually exist in the graph.

Evaluation Metrics. We assessed the model's faithfulness using three metrics: the fraction of valid nodes (denoted as Valid Nodes), the fraction of valid edges (denoted as Valid Edges), and the fraction of times the entire set of nodes and edges cited was valid (denoted as Fully Valid Graphs).

Results. The results, as depicted in Table 5, illustrate the comparative effectiveness of the G-Retriever over the baseline LLM+Graph Prompt Tuning method in reducing hallucinations. The LLM+Graph Prompt Tuning approach demonstrated a significantly lower accuracy in referencing graph elements,

with only 31% of nodes and 12% of edges being valid, and the entire set of nodes and edges being valid only 8% of the time. In contrast, G-Retriever showed substantial improvements: 77% validity in nodes, 76% in edges, and 62% in the overall validity of referenced node-edge sets. These results underscore the significant reduction in hallucinations with G-Retriever , particularly in the challenging task of accurately citing both nodes and edges in graph-based contexts.

Demonstrations

$$ z_n = \text{LM}(x_n)\in\mathbb{R}^d, $$

$$ p_\theta(Y|[P;X])=\prod_{i=1}^r p_\theta(y_i|y_{<i}, [P;X]). $$

$$ \begin{split} V_k &= \text{argtopk}{n\in V}\text{ cos}(z_q, z_n)\ E_k &= \text{argtopk}{e\in E}\text{ cos}(z_q, z_e), \end{split} $$

$$ \text{prize}(n) = \begin{cases} k - i, & \text{if } n \in V_k \text{ and } n \text{ is the top } i\text{ node}, \ 0, & \text{otherwise}. \end{cases} \label{eq: topk} $$ \tag{eq: topk}

$$ S^*=\underset{\substack{S\subseteq G,\ S \text{ is connected}}}{\text{argmax}}\sum_{n\in V_S} \text{prize}(n)+\sum_{e\in E_S}\text{prize}(e)-\text{cost}(S), , $$

$$ \text{cost}(S)=|E_S| \times C_e, $$

$$ h_t = \text{TextEmbedder}([\text{textualize}(S^*);x_q])\in \mathbb{R}^{L\times{d_l}}, $$

We demonstrate the interaction capabilities of G-Retriever with creative questions on different datasets: ExplaGraphs , SceneGraphs , and WebQSP . These examples are showcased in Tables 13, 14, and 15, respectively. Additionally, the examples are visualized in Figure 1.

Table 13: Example of creative chat on ExplaGraphs .

Table 14: Example of creative chat on SceneGraphs . We utilize GPT-4 to generate an image from the graph, assisting readers in visualizing it

Userelephants grass eating bushes in river on the side of standing by elephant What kind of animal is in the bushes? Answer the question and list the nodes or edges from the graph that support your answer.
LLM w/ Graph Prompt Tuning[Incorrect response with hallucinated nodes and edges] The animal in the bushes is a deer. Nodes: * Deer (node 1) * Bushes (node 2) Edges: * Deer → Bushes (edge 1) * Deer → Grass (edge 2) * Bushes → Grass (edge 3)
G-Retriever (Ours)[Correct response with accurate node and edge references] Based on the information provided in the graph, the animal in the bushes is an elephant. The nodes that support this answer are: * Node 1: The elephant is mentioned in the first line of the graph. The edges that support this answer are: * The edge connects the elephant to the bushes, indicating that the elephant is in the bushes. Therefore, the answer to the question is: The animal in the bushes is an elephant.
DatasetExplaGraphsSceneGraphsWebQSP
#Graphs Avg. #Nodes Avg. #Edges Node Attribute Edge Attribute Task Evaluation Matrix2,766 5.17 4.25 Commonsense concepts Commonsense relations Common sense reasoning Accuracy100,000 19.13 68.44 Object attributes ( e.g., color, shape) Relations ( e.g., actions, spatial relations) Scene graph question answering Accuracy4,737 1370.89 4252.37 Entities in Freebase Relations in Freebase Knowledge based question answering Hit@1
SettingMethodExplaGraphsSceneGraphsWebQSP
Inference-onlyZero-shot Zero-CoT [18] CoT-BAG [44] KAPING [1]0.5650 0.5704 0.5794 0.62270.3974 0.5260 0.5680 0.437541.06 51.30 39.60 52.64
Frozen LLM w/ PTPrompt tuning GraphToken [31] G-Retriever ∆ Prompt tuning0.5763 ± 0.0243 0.8508 ± 0.0551 0.8516 ± 0.0092 ↑ 47.77%0.6341 ± 0.0024 0.4903 ± 0.0105 0.8131 ± 0.0162 ↑ 28.23%48.34 ± 0.64 57.05 ± 0.74 70.49 ± 1.21 ↑ 45.81%
Tuned LLMLoRA G-Retriever w/ LoRA ∆ LoRA0.8538 ± 0.0353 0.8705 ± 0.0329 ↑ 1.95%0.7862 ± 0.0031 0.8683 ± 0.0072 ↑ 11.74%66.03 ± 0.47 73.79 ± 0.70 ↑ 10.44%
DatasetBefore Retrieval (Avg.)Before Retrieval (Avg.)Before Retrieval (Avg.)After Retrieval (Avg.)After Retrieval (Avg.)After Retrieval (Avg.)
# Tokens# NodesMin/Epoch# Tokens# NodesMin/Epoch
SceneGraphs1,39619123.1235 ( ↓ 83%)5 ( ↓ 74%)86.8 ( ↓ 29%)
WebQSP100,6271,37118.7610 ( ↓ 99%)18 ( ↓ 99%)6.2( ↓ 67%)
BaselineG-RetrieverMethod w/o Graph EncoderHit@1∆ G-Retriever
Valid Nodes31%77%w/o Projection Layer69.70 ± 0.68↓ 1.11%
Valid Edges12%76%w/o Textualized Graph56.96 ± 1.83↓ 19.19%
Fully Valid Graphs8%62%w/o Retrieval63.84 ± 0.41↓ 9.43%
Graph EncoderWebQSPExplaGraphs
GCN [17]70.70.8394
GAT [43]70.270.843
Graph Transformer [37]70.490.8516
LLMLlama2-7bLlama2-13b
Hit@170.4975.58
DatasetOriginal datasetGraphQA Benmark
ExplaGraphs(entrapment; capable of; being abused) (being abused; created by; police) (police; capable of; harm) (harm; used for; people) (people; part of; citizens)node_id,node_attr\n 0,entrapment\n 1,being abused\n 2,police\n 3,harm\n 4,people\n 5,citizens\n src,edge_attr,dst\n 0,capable of,1\n 1,created by,2\n 2,capable of,3\n 3,used for,4\n 4,part of,5\n
SceneGraphs"width": 500, "objects": "681267": "name": "banana", "h": 34, "relations": ["object": "681262", "name": "to the left of"], "w": 64, "attributes": ["small", "yellow"], "y": 55, "x": 248, "681265": "name": "spots", "h": 16, "relations": [], "w": 26, "attributes": [], "y": 92, "x": 245, "681264": "name": "bananas", "h": 50, "relations": ["object": "681259", "name": "to the left of"], "w": 49, "attributes": ["small", "yellow"], "y": 32, "x": 268, "681263": "name": "picnic", "h": 374, "relations": [], "w": 499, "attributes": ["delicious"], "y": 0, "x": 0, "681262": "name": "straw", "h": 95, "relations": ["object": "681268", "name": "to the right of", "object": "681267", "name": "to the right of", "object": "681253", "name": "to the right of"], "w": 15, "attributes": ["white", "plastic"], "y": 55, "x": 402, "681261": "name": "meat", "h": 27, "relations": ["object": "681255", "name": "on", "object": "681255", "name": "inside"], "w": 24, "attributes": ["small", "brown", "delicious"], "y": 123, "x": 68, "681260": "name": "rice", "h": 57, "relations": ["object": "681255", "name": "on", "object": "681258", "name": "to the left of"], "w": 93, "attributes": ["piled", "white"], "y": 162, "x": 57, "681269": "name": "onions", "h": 16, "relations": [], "w": 24, "attributes": ["green"], "y": 147, "x": 90, "681268": "name": "tablecloth", "h": 374, "relations": ["object": "681262", "name": "to the left of"], "w": 396, "attributes": ["white"], "y": 0, "x": 0, "681258": "name": "bowl", "h": 99, "relations": ["object": "681255", "name": "next to", "object": "681257", "name": "of", "object": "681255", "name": "near", "object": "681256", "name": "to the right of", "object": "681260", "name": "to the right of", "object": "681255", "name": "to the right of"], "w": 115, "attributes": ["full"], "y": 184, "x": 178, "681259": "name": "plantains", "h": 70, "relations": ["object": "681264", "name": "to the right of"], "w": 45, "attributes": ["red"], "y": 0, "x": 346, "681256": "name": "spoon", "h": 65, "relations": ["object": "681255", "name": "on", "object": "681257", "name": "to the left of", "object": "681255", "name": "in", "object": "681258", "name": "to the left of"], "w": 140, "attributes": ["large", "metal", "silver"], "y": 196, "x": 0, "681257": "name": "dish", "h": 81, "relations": ["object": "681258", "name": "inside", "object": "681256", "name": "to the right of", "object": "681258", "name": "in", "object": "681255", "name": "to the right of"], "w": 108, "attributes": ["cream colored"], "y": 199, "x": 187, "681254": "name": "meal", "h": 111, "relations": [], "w": 130, "attributes": [], "y": 121, "x": 58, "681255": "name": "plate", "h": 138, "relations": ["object": "681257", "name": "to the left of", "object": "681254", "name": "of", "object": "681254", "name": "with", "object": "681258", "name": "near", "object": "681258", "name": "to the left of"], "w": 176, "attributes": ["white", "full"], "y": 111, "x": 30, "681253": "name": "banana", "h": 30, "relations": ["object": "681262", "name": "tonode_id,node_attr 0,"name: banana; attribute: small, yellow; (x,y,w,h): (248, 55, 64, 34)" 1,"name: spots; (x,y,w,h): (245, 92, 26, 16)" 2,"name: bananas; attribute: small, yellow; (x,y,w,h): (268, 32, 49, 50)" 3,"name: picnic; attribute: delicious; (x,y,w,h): (0, 0, 499, 374)" 4,"name: straw; attribute: white, plastic; (x,y,w,h): (402, 55, 15, 95)" 5,"name: meat; attribute: small, brown, delicious; (x,y,w,h): (68, 123, 24, 27)" 6,"name: rice; attribute: piled, white; (x,y,w,h): (57, 162, 93, 57)" 7,"name: onions; attribute: green; (x,y,w,h): (90, 147, 24, 16)" 8,"name: tablecloth; attribute: white; (x,y,w,h): (0, 0, 396, 374)" 9,"name: bowl; attribute: full; (x,y,w,h): (178, 184, 115, 99)" 10,"name: plantains; attribute: red; (x,y,w,h): (346, 0, 45, 70)" 11,"name: spoon; attribute: large, metal, silver; (x,y,w,h): (0, 196, 140, 65)" 12,"name: dish; attribute: cream colored; (x,y,w,h): (187, 199, 108, 81)" 13,"name: meal; (x,y,w,h): (58, 121, 130, 111)" 14,"name: plate; attribute: white, full; (x,y,w,h): (30, 111, 176, 138)" 15,"name: banana; attribute: small, yellow; (x,y,w,h): (237, 87, 73, 30)" src,edge_attr,dst 0,to the left of,4\n 2,to the left of,10\n 4,to the right of,8\n 4,to the right of,0\n 4,to the right of,15\n 5,on,14\n 5,inside,14\n 6,on,14\n 6,to the left of,9\n 8,to the left of,4\n 9,next to,14\n 9,of,12\n 9,near,14\n 9,to the right of,11\n 9,to the right of,6\n 9,to the right of,14\n 10,to the right of,2\n 11,on,14\n 11,to the left of,12\n 11,in,14\n 11,to the left of,9\n 12,inside,9\n 12,to the right of,11\n 12,in,9\n 12,to the right of,14\n 14,to the left of,12\n 14,of,13\n 14,with,13\n 14,near,9\n 14,to the left of,9\n 15,to the left of,4\n
WebQSP[['FedEx Cup', 'sports.sports_award_type.winners', 'm.0n1v8cy'], ['Brandt Snedeker', 'sports.sports_award_winner.awards', 'm.0n1v8cy'], ['FedEx Cup', 'common.topic.article', 'm.08q5wy'], ['FedEx Cup', 'common.topic.notable_for', 'g.12559n8g_'], ['Sports League Award Type', 'freebase.type_profile.published', 'Published'], ['FedEx Cup', 'common.topic.notable_types', 'Sports League Award Type'], ['m.0n1v8cy', 'sports.sports_award.award_winner', 'Brandt Snedeker'], ['Sports League Award Type', 'type.type.expected_by', 'Award'], ['Sports League Award Type', 'common.topic.article', 'm.06zxtxj'], ['2012 PGA Tour', 'sports.sports_league_season.awards', 'm.0n1v8cy'], ['Sports League Award Type', 'freebase.type_hints.included_types', 'Topic'], ['Sports League Award Type', 'type.type.domain', 'Sports'], ['m.0n1v8cy', 'sports.sports_award.award', 'FedEx Cup'], ['Sports League Award Type', 'freebase.type_profile.strict_included_types', 'Topic'], ['Sports League Award Type', 'freebase.type_profile.kind', 'Classification'],node_id,node_attr\n 0,fedex cup\n 1,m.0n1v8cy\n 2,brandt snedeker\n 3,m.08q5wy\n 4,g.12559n8g_\n 5,sports league award type\n 6,published\n 7,award\n 8,m.06zxtxj\n 9,2012 pga tour\n 10,topic\n 11,sports\n 12,classification\n 13,winners\n src,edge_attr,dst 0,sports.sports_award_type.winners,1 2,sports.sports_award_winner.awards,1 0,common.topic.article,3 0,common.topic.notable_for,4 5,freebase.type_profile.published,6 0,common.topic.notable_types,5 1,sports.sports_award.award_winner,2 5,type.type.expected_by,7 5,common.topic.article,8 9,sports.sports_league_season.awards,1 5,freebase.type_hints.included_types,10 5,type.type.domain,11 1,sports.sports_award.award,0
k351020
Hit@10.69770.70630.72480.7039
MethodRetrieval Accuracy
KAPING [1] (top-k triple retrieval)60.81
G-Retriever70.49
SetttingMethodHit@1Time
Inference-onlyQuestion only Textual graph and question61.16 41.0631 min 40 min
Prompt Tuning48.3418.7 min/epoch 6.2 min/epoch
Frozen LLM w/ PTG-Retriever70.49
Tuned LLMLoRA66.0319 min/epoch
G-Retriever w/
LoRA73.796.9 min/epoch
Graphnode_id,node_attr 0,human 1,cosmetic surgery 2,women 3,change appearance 4,addiction 5,searching for perfection 6,stop 7,positive src,edge_attr,dst 0,receives action,1 2,part of,0 1,used for,3 3,capable of,4 4,created by,5
UserWrite a 500-word argument essay based on the information provided in the explanation graph.
ModelThe explanation graph provides a detailed analysis of the reasons why women undergo cosmetic surgery. According to the graph, the main reasons include the desire to change one's appearance, the influence of societal beauty standards, and the addiction to searching for perfection. These reasons are further supported by the various nodes and edges in the graph, which provide additional insights into the motivations and consequences of cosmetic surgery. Firstly, the node labeled "human" highlights the fundamental human desire to feel attractive and confident in one's appearance. This desire is deeply ingrained in our culture, where beauty is often associated with happiness, success, and social acceptance. As a result, many women feel pressure to conform to societal beauty standards, which can lead to feelings of insecurity and low self-esteem. In response to these feelings, some women may turn to cosmetic surgery as a means of improving their appearance and boosting their self-confidence. Secondly, the node labeled "women" highlights the gendered nature of cosmetic surgery. While men also undergo cosmetic surgery, the majority of cosmetic surgery patients are women. This is likely due to the fact that women are more likely to be exposed to societal beauty standards and are more likely to internalize these standards. As a result, women may be more likely to feel insecure about their appearance and more likely to seek cosmetic surgery as a means of improving their looks. Thirdly, the node labeled "addiction" highlights the potential negative consequences of cosmetic surgery. While some women may initially feel improved confidence and self-esteem after un- dergoing cosmetic surgery, the graph suggests that this initial boost may not last. In fact, some women may become addicted to the constant pursuit of perfection, leading to a cycle of surgery and dissatisfaction. This can have serious psychological and emotional consequences, including depression, anxiety, and a loss of self-esteem. Fourthly, the node labeled "searching for perfection" highlights the unrealistic expectations that many women have about the results of cosmetic surgery. The graph suggests that women may have unrealistic expectations about the ability of cosmetic surgery to improve their appearance and boost their self-esteem. This can lead to disappointment and dissatisfaction with the results of surgery, as well as a continued pursuit of perfection through additional surgery. Finally, the node labeled "stop" highlights the importance of critical thinking and self-reflection when it comes to cosmetic surgery. The graph suggests that women should be aware of the potential negative consequences of cosmetic surgery and should carefully consider the reasons why they are undergoing surgery. They should also be aware of the unrealistic expectations that they may have about the results of surgery and should be prepared to address any negative emotions or feelings of dissatisfaction that may arise. In conclusion, the explanation graph provides a detailed analysis of the reasons why women undergo cosmetic surgery. The graph highlights the fundamental human desire to feel attractive and confident in one's appearance, the gendered nature of cosmetic surgery, the potential negative consequences of cosmetic surgery, and the importance of critical thinking and self-reflection when it comes to surgery. By understanding these reasons and the potential consequences of cosmetic surgery, women can make informed decisions about whether or not to undergo surgery and can work to address any negative emotions or feelings of dissatisfaction that may arise.
Graphnode_id,node_attr 0,name: head; attribute: black; (x,y,w,h): (386, 185, 8, 4) 1,name: swimming pool; attribute: small, water; (x,y,w,h): (465, 92, 35, 3) 2,name: shore; (x,y,w,h): (49, 85, 436, 108) 3,name: fence; attribute: wired; (x,y,w,h): (11, 158, 167, 46) 4,name: mountains; attribute: covered; (x,y,w,h): (393, 53, 107, 12) 5,name: water; attribute: blue; (x,y,w,h): (0, 50, 499, 154) 6,name: sheep; attribute: walking, grazing; (x,y,w,h): (243, 191, 20, 13) 7,name: sky; attribute: blue, clear, cloudy; (x,y,w,h): (0, 0, 501, 60) 8,name: sheep; attribute: white, fluffy; (x,y,w,h): (409, 176, 13, 10) 9,name: field; attribute: large, green, grassy; (x,y,w,h): (0, 138, 503, 112) 10,name: grass; (x,y,w,h): (0, 84, 500, 166) 11,name: sheep; attribute: white, fluffy; (x,y,w,h): (398, 178, 13, 11) 12,name: sheep; attribute: white, fluffy; (x,y,w,h): (385, 179, 17, 11) 13,name: sheep; attribute: white, fluffy; (x,y,w,h): (364, 181, 14, 10) 14,name: shore; attribute: sandy; (x,y,w,h): (190, 123, 171, 12) src, edge_attr, dst 0,to the left of,11\n 0,to the right of,13\n 3,to the left of,6\n 6,to the right of,3\n 6,grazing on,10\n 6,in,9\n 6,in,10\n 7,above,5\n 8,eating,10\n 8,to the right of,11\n 8,to the right of,12\n 11,to the left of,8\n 11,to the right of,0\n 11,to the right of,13\n 11,to the right of,12\n 11,eating,10\n 12,eating,10\n 12,to the left of,8\n 12,to the right of,13\n 12,to the left of,11\n 13,to the left of,0\n
Image
UserWrite an 500-word advertisement for this place in the scene graph that would make people want to visit it.
ModelSure! Here's an example of a 500-word advertisement for the place based on the scene graph: "Experience the breathtaking beauty of our place, where the rolling hills meet the sparkling water of the swimming pool. Take a leisurely stroll through the lush green fields, and let the gentle breeze caress your face as you gaze out at the picturesque view. The sheep graze peacefully in the distance, their white coats glistening in the sunlight. The sky is a brilliant blue, with fluffy white clouds drifting lazily across. Come and bask in the serene ambiance of our place, where the natural beauty will leave you in awe."
node_id, node_attr 8, benjamin franklin 63, glass harmonica 98, lightning rod
UserQuestion: what else did ben franklin invent? Surrounding this question, develop an educational tour through the knowledge graph, highlighting interesting facts and connections in an engaging way.
Great! Based on the provided knowledge graph, here's an educational tour that highlights interesting facts and connections related to Benjamin Franklin's inventions and other notable achievements: 1. Franklin's Lightning Rod: Start by exploring Franklin's most famous invention - the lightning rod. Discover how Franklin's invention protected buildings from lightning strikes and how it changed the way people understood electricity. Connection: Franklin worked with Thomas Jefferson on the lightning rod, and their collaboration led to the development of the concept of positive and negative charges. 2. Bifocals: Move on to Franklin's invention of bifocal glasses. Learn how Franklin's vision problems inspired him to create a new type of glasses that allowed people to see both near and far objects clearly. Connection: Franklin's bifocal glasses were a groundbreaking invention that improved the lives of millions of people around the world. 3. Franklin Stove: Explore Franklin's design of the Franklin stove, a cast-iron stove that provided more heat with less fuel. Discover how Franklin's stove revolutionized heating systems and how it helped to reduce fuel costs for households and businesses. Connection: Franklin's stove was an early example of energy efficiency and sustainability, and it paved the way for modern heating systems. 4. Prokop Diviš: Learn about Franklin's connection to Prokop Diviš, a Czech inventor who built on Franklin's work and developed the first electric generator. Discover how Diviš's invention helped to power the Industrial Revolution and how it changed the way people lived and worked. Connection: Franklin's work on the lightning rod inspired Diviš to explore the properties of electricity, and their collaborative efforts helped to advance the field of electrical engineering. 5. Innovator and Original Ideas: Explore Franklin's reputation as an innovator and his ability to come up with original ideas. Discover how Franklin's creativity and curiosity led to many of his inventions and how he was able to see connections between seemingly unrelated things. Connection: Franklin's innovative spirit and ability to think outside the box inspired many other inventors and scientists, including Thomas Edison and Nikola Tesla. 6. Collaboration and Networking: Highlight Franklin's collaborative approach to invention and his ability to network with other inventors and scientists. Discover how Franklin's connections with other innovators helped to advance the field of science and technology. Connection: Franklin's collaborations with other inventors and scientists, such as Thomas Jefferson and Prokop Diviš, helped to accelerate the pace of innovation and pave the way for future breakthroughs. 7. Legacy and Impact: Reflect on Franklin's lasting legacy and the impact of his inventions on modern society. Discover how Franklin's inventions continue to inspire new generations of inventors and scientists and how they have helped to shape the world we live in today. Connection: Franklin's inventions and ideas have had a lasting impact on many fields, including science, technology, and society as a whole. His legacy continues to inspire and motivate people around the world. By following this educational tour through the knowledge graph, students will gain a deeper understanding of Benjamin Franklin's inventions and their impact on modern society. They will also learn about Franklin's collaborative approach to invention, his innovative spirit, and his ability to think outside the box.
Model

Given a graph with textual attributes, we enable users to ‘chat with their graph’: that is, to ask questions about the graph using a conversational interface. In response to a user’s questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM’s context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. 111Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.

Graphs and Large Language Models (LLMs). The advent of LLMs has significantly shaped the artificial intelligence landscape. As these models are applied to increasingly diverse tasks, their ability to process complex structured data will be increasingly vital. In particular, in our interconnected world, a significant portion of real-world data inherently possesses a graph structure, such as the Web, e-commerce, recommendation systems, knowledge graphs, and many others. Moreover, many of these involve graphs with textual attributes (i.e., textual graphs), making them well-suited for LLM-centric methods. This has spurred interest in combining graph-based technologies, particularly graph neural networks (GNNs), with LLMs to enhance their capabilities in understanding and modeling graphs [37, 13, 20].

The Present Work: Enabling ‘Chat With Your Graph’. While existing works integrate LLMs and GNNs in various ways, they mostly focus on conventional graph tasks such as node, edge and graph classification [6], or answering simple questions on small or synthetic graphs [37]. In contrast, we develop a flexible question-answering framework targeting real-world textual graph applications. This framework enables users to ‘chat with their graph’ through a conversational interface, suitable even for complex and creative questions about graphs. As shown in Figure 1, our G-Retriever shows effective graph understanding abilities, while also incorporating the human-friendliness of conversational interfaces.

The Need for a Comprehensive GraphQA Benchmark. Question answering (QA) is a fundamentally important task in natural language processing, serving as a key benchmark for assessing LLMs and providing a unified interface for various capabilities. Despite extensive research in QA, there is a lack of a comprehensive and diverse benchmark specifically for the graph modality. Our study addresses this by introducing our GraphQA benchmark, targeting real-world graph applications: common sense reasoning, scene understanding, and knowledge graph reasoning (refer to Figure 2). This benchmark is vital for measuring the capability of a model in answering a wide range of questions about graphs across diverse applications.

New Architecture for GraphQA. To enable effective and efficient graph QA, even on large graphs, we propose G-Retriever, a new framework combining the strengths of GNNs, LLMs, and RAG (Figure 3). Next, we will discuss the motivation, strengths, and details of our model.

Tackling Hallucination in Graph LLMs. LLMs are prone to hallucination, a phenomenon where the generated content is factually inaccurate or nonsensical [10]. We validate the presence of this issue in graph settings. In particular, we employ a baseline method that adapts MiniGPT-4 [49] to graphs, where a frozen LLM interacts with a trainable GNN that encodes graph data as a soft prompt. Our findings, shown in Table 1, indicate that hallucination, an important problem in text-based LLMs, is also prevalent in Graph LLMs. This may be attributed to the baseline’s inability to recall the entire graph structure from a single graph embedding, leading to the generation of incorrect nodes or edges during the QA task. In contrast, by employing RAG for direct information retrieval from the actual graph, our G-Retriever mitigates this issue, as substantiated by Table 1.

Enhancing Scalability and Efficiency in Graph LLMs. Recent research endeavors have explored translating graphs into natural language, such as by flattening nodes and edges into a text sequence, enabling their processing by LLMs for graph-based tasks [48]. However, this method faces critical scalability issues. Converting a graph with thousands of nodes and edges into a text sequence results in an excessive number of tokens, surpassing the input capacity of many LLMs. An alternative of truncating the graph text sequence to fit the LLM’s input token limit leads to loss of information and response quality. In contrast, G-Retriever overcomes these issues with its RAG component, which allows for effective scaling to larger graphs by selectively retrieving only relevant parts of the graph.

Tailoring the RAG Approach to Graphs. Existing RAG methodologies, primarily designed for simpler data types, fall short in handling the complexities of graphs [5]. Hence, we introduce a new retrieval method tailored for graphs. Notably, we formulate subgraph retrieval as a Prize-Collecting Steiner Tree (PCST) optimization problem, enabling a principled and effective solution. This also allows us to return a subgraph relevant to the generated text as an output of our approach, improving explainability.

The contributions of this paper are outlined as follows:

New Benchmark for GraphQA. We introduce a diverse benchmark targeted at real-world graph QA applications, filling a crucial research gap.

Enabling ‘Chat with a Graph’. G-Retriever is a flexible graph QA framework suitable even for complex and creative questions. It can be efficiently fine-tuned to enhance graph understanding via soft prompting.

Advanced Graph Retrieval Techniques. We pioneer the integration of RAG with graph LLMs to address the issues of hallucination and lack of scalability, and propose a new graph retrieval method based on PCST.

Empirical Findings. We demonstrate the efficiency and effectiveness of G-Retriever in multiple domains, and also present the significant finding of hallucination in graph LLMs.

Graphs and Large Language Models. A significant body of research has emerged at the intersection of graph-based techniques and Large Language Models [26, 20, 13, 37, 46]. This exploration spans diverse aspects, ranging from the design of general graph models [39, 21, 43, 15, 33], and multi-modal architectures [19, 41] to practical applications. Noteworthy applications include fundamental graph reasoning [44, 2, 48], node classification [6, 9, 32, 4, 42, 3, 28], graph classification/regression [27, 47], and leveraging LLMs for knowledge graph-related tasks [34, 12, 24].

Retrieval-Augmented Generation (RAG). The concept of Retrieval-Augmented Generation, initially proposed by [17], has gained increased attention for its ability to mitigate the issue of hallucination within LLMs and enhance trustworthiness and explainability [5]. Despite its success in language-related tasks, the application of retrieval-based approaches to general graph tasks remains largely unexplored. Our research is the first to apply a retrieval-based approach to general graph tasks, marking a novel advancement in the field and demonstrating the versatility of RAG beyond language processing.

Parameter-Efficient Fine-Tuning (PEFT). The field of LLMs has witnessed significant advancements through various parameter-efficient fine-tuning techniques. These methodologies have played a crucial role in refining LLMs, boosting their performance while minimizing the need for extensive parameter training. Notable among these techniques are prompt tuning, as introduced by [16], and prefix tuning, proposed by [18]. Furthermore, methods like LoRA [8], and the LLaMA-adapter [45], have been influential. These advancements in PEFT have laid the foundation for the development of sophisticated multimodal models. Prominent examples in this domain include MiniGPT-4 [49], LLaVA [22], and NExT-Chat [38]. There are also emerging efforts in applying PEFT to graph LLMs, such as GraphLLM [2] for basic graph reasoing tasks and GNP [34] for multi-option QA on knowledge graphs.

This section establishes the notation and formalizes key concepts related to textual graphs, language models for text encoding, and large language models and prompt tuning.

Textual Graphs. A textual graph is a graph where nodes and edges possess textual attributes. Formally, it can be defined as G=(V,E,{xn}n∈V,{xe}e∈E)𝐺𝑉𝐸subscriptsubscript𝑥𝑛𝑛𝑉subscriptsubscript𝑥𝑒𝑒𝐸G=(V,E,{x_{n}}{n\in V},{x{e}}{e\in E}), where V𝑉V and E𝐸E represent the sets of nodes and edges, respectively. Additionally, xn∈DLnsubscript𝑥𝑛superscript𝐷subscript𝐿𝑛x{n}\in D^{L_{n}} and xe∈DLesubscript𝑥𝑒superscript𝐷subscript𝐿𝑒x_{e}\in D^{L_{e}} denote sequential text associate with a node n∈V𝑛𝑉n\in V or an edge e∈E𝑒𝐸e\in E, where D𝐷D represents the vocabulary, and Lnsubscript𝐿𝑛L_{n} and Lesubscript𝐿𝑒L_{e} signify the length of the text associated with the respective node or edge.

Language Models for Text Encoding. In the context of textual graphs, language models (LMs) are essential for encoding the text attributes associated with nodes and edges, thereby learning representations that capture their semantic meaning. For a node n𝑛n with text attributes xn∈DLnsubscript𝑥𝑛superscript𝐷subscript𝐿𝑛x_{n}\in D^{L_{n}}, an LM encodes these attributes as:

where znsubscript𝑧𝑛z_{n} is the output of the LM, and d𝑑d is the dimension of the output vector.

Large Language Models and Prompt Tuning. LLMs have introduced a new paradigm for task-adaptation known as “pre-train, prompt, and predict”, replacing the traditional “pre-train, fine-tune” paradigm. In this paradigm, the LLM is first pre-trained on a large corpus of text data to learn general language representations. Then, rather than fine-tuning the model on task-specific labeled data, the model is prompted with a textual prompt that specifies the task and context. Subsequently, the model generates the output directly based on the prompt and the input.

The LLM, parameterized by weights θ𝜃\theta, takes a sequence of tokens X𝑋X, and a prompt P𝑃P as input, and generate a sequence of tokens Y={y1,y2,…,yr}𝑌subscript𝑦1subscript𝑦2…subscript𝑦𝑟Y={y_{1},y_{2},\ldots,y_{r}} as output. Formally, the probability distribution of the output sequence given the concatenated input sequence and prompt, i.e., [P;X]𝑃𝑋[P;X], is defined as follows:

Here, y<isubscript𝑦absent𝑖y_{<i} represents the prefix of sequence y𝑦y up to position i−1𝑖1i-1, and p​(yi|y<i,[P;X])𝑝conditionalsubscript𝑦𝑖subscript𝑦absent𝑖𝑃𝑋p(y_{i}|y_{<i},[P;X]) represents the probability of generating token yisubscript𝑦𝑖y_{i} given y<isubscript𝑦absent𝑖y_{<i} and [P;X]𝑃𝑋[P;X].

Soft prompt tuning eliminates the need for manual prompt design. Given a series of p𝑝p tokens X={x1,x2,…,xp}𝑋subscript𝑥1subscript𝑥2…subscript𝑥𝑝X={x_{1},x_{2},\ldots,x_{p}}, after being processed by the text embedder, it forms a matrix Xe∈ℝp×dlsubscript𝑋𝑒superscriptℝ𝑝subscript𝑑𝑙X_{e}\in\mathbb{R}^{p\times d_{l}}, where dlsubscript𝑑𝑙d_{l} is the dimension of the embedding space. Soft prompts can be represented as parameters Pe∈ℝq×dlsubscript𝑃𝑒superscriptℝ𝑞subscript𝑑𝑙P_{e}\in\mathbb{R}^{q\times d_{l}}, where q𝑞q is the length of the prompt. The prompt is then concatenated with the embedded input, forming a single matrix [Pe;Xe]∈ℝ(q+p)×dlsubscript𝑃𝑒subscript𝑋𝑒superscriptℝ𝑞𝑝subscript𝑑𝑙[P_{e};X_{e}]\in\mathbb{R}^{(q+p)\times d_{l}}. This combined matrix is processed by the self-attention layers in LLM as usual. Training involves maximizing the likelihood of Y𝑌Y through backpropagation, with gradient updates applied solely to Pesubscript𝑃𝑒P_{e}, while θ𝜃\theta remains fixed.

Our GraphQA represents a comprehensive and diverse benchmark for graph question-answering. It is tailored to assess the capabilities of models in answering a wide range of questions about graphs across diverse domains.

Each entry in the GraphQA benchmark consists of a textual graph, a question related to the graph, and one or more corresponding answers, as illustrated in Figure 2.

Textual Graphs. The textual graph is converted into a natural language format, resulting in a list of nodes and edges, akin to a CSV file format. It is important to note that while multiple methods exist for textualizing a graph, our focus is not on identifying the optimal solution. Instead, we prioritize a straightforward yet empirically effective approach for representing graphs in natural language, facilitating the benchmark’s use in diverse GraphQA scenarios.

Questions and Answers. Questions are designed to explore specific elements or relationships within the graph. Answers, residing within the attributes of nodes or edges, often require multi-hop reasoning for accurate identification.

The GraphQA benchmark integrates three existing datasets: ExplaGraphs, SceneGraphs, and WebQSP. Table 2 presents the summary statistics of these datasets. It is important to note that these datasets were not originally developed for this work. However, a significant contribution of our research is the standardization and processing of these diverse datasets into a uniform data format suitable for the GraphQA benchmark. These datasets, previously utilized in different contexts, are reintroduced with a new focus tailored for GraphQA. For a detailed comparison with the original datasets, see the Appendix B.

ExplaGraphs is a dataset for generative commonsense reasoning, focusing on creating explanation graphs for stance prediction in debates. It offers detailed, unambiguous commonsense-augmented graphs to evaluate arguments supporting or refuting a belief. The primary task is to assess whether arguments are supportive or contradictory, using accuracy as the metric. We have converted the triplet-form provided in [30] into a standard graph format.

SceneGraphs, a visual question answering dataset, includes 10,000 scene graphs. Each graph details objects, attributes, and relations within an image. This dataset challenges users with tasks requiring spatial understanding and multi-step inference. The task is to answer open-ended questions based on a textual description of a scene graph, evaluated on accuracy. We have sampled from the GQA dataset [11] and constructed standard graphs from the provided JSON files.

WebQSP is a large-scale multi-hop knowledge graph QA dataset consisting of 4,737 questions. It was proposed by [40] and, following [25], utilizes a subset of Freebase, encompassing facts within 2-hops of entities mentioned in the questions. The task involves answering questions that require multi-hop reasoning. Given the possibility of multiple answers for the same question, the hit@1 metric is used to assess the precision of the top returned answer.

In this section, we introduce G-Retriever, a new architecture tailored for GraphQA, which integrates the strengths of GNNs, LLMs, and RAG. To allow efficient fine-tuning while preserving the LLM’s pretrained language capabilities, we freeze the LLM and use a soft prompting approach on the output of the GNN. Our RAG-based design mitigates hallucinations through direct retrieval of the graph, while allowing our approach to scale to graphs exceeding the LLM’s context window size. To adapt RAG to graphs, we formulate subgraph retrieval as a PCST optimization problem. This approach also allows us to enhance explainability by returning the retrieved subgraph.

G-Retriever comprises four main steps: indexing, retrieval, subgraph construction and generation, as depicted in Figure 3. The implementation details of each step are elaborated in the following sections.

We initiate the RAG approach by generating node and graph embeddings using a pre-trained LM. These embeddings are then stored in a nearest neighbor data structure.

To elaborate, consider xn∈DLnsubscript𝑥𝑛superscript𝐷subscript𝐿𝑛x_{n}\in D^{L_{n}} as the text attributes of node n𝑛n. Utilizing a pre-trained LM, such as SentenceBert [29], we apply the LM to xnsubscript𝑥𝑛x_{n}, yielding the representation znsubscript𝑧𝑛z_{n}:

where d𝑑d denotes the dimension of the output vector. Similar preprocessing steps are applied to edges. Refer to Figure 3, Step 1 for an illustrative representation.

For retrieval, we employ the same encoding strategy to the query to ensure consistent treatment of textual information:

Next, to identify the most relevant nodes and edges for the current query, we use a k-nearest neighbors retrieval approach. This method yields a set of ‘relevant nodes/edges’ based on the similarity between the query and each node or edge. The retrieval operation is defined as:

where znsubscript𝑧𝑛z_{n} and zesubscript𝑧𝑒z_{e} are the embeddings of node n𝑛n and edge e𝑒e, respectively. We use the cosine similarity function, cos​(⋅,⋅)cos⋅⋅\text{cos}(\cdot,\cdot), to measure the similarity between the query representation and the node/edge embeddings. The argtopk operation retrieves the top-k elements based on this similarity, providing a set of nodes Vksubscript𝑉𝑘V_{k} and edges Eksubscript𝐸𝑘E_{k} considered most relevant to the query. See Step 2 of Figure 3.

This step aims to construct a subgraph that encompasses as many relevant nodes and edges as possible, while keeping the graph size manageable. This approach offers two key benefits: Firstly, it helps to filter out nodes and edges that are not pertinent to the query. This is crucial because irrelevant information can overshadow the useful data, potentially diverting the focus of the subsequent LLM from the information of interest. Secondly, it enhances efficiency; by keeping the graph size manageable, it becomes feasible to translate the graph into natural language and then input it into the LLM for processing. The Prize-Collecting Steiner Tree algorithm [1] serves as our primary method for identifying such optimally sized and relevant subgraphs. See Step 3 in Figure 3.

Prize-Collecting Steiner Tree (PCST). The PCST problem aims to find a connected subgraph that maximizes the total prize values of its nodes while minimizing the total costs of its edges. Our approach assigns higher prize values to nodes and edges more relevant to the query, as measured by cosine similarity. Specifically, the top k𝑘k nodes/edges are assigned descending prize values from k𝑘k down to 1, with the rest assigned zero. The node prize assignment is as follows:

Edge prizes are assigned similarly.

The objective is to identify a subgraph, S∗=(V∗,E∗)superscript𝑆superscript𝑉superscript𝐸S^{}=(V^{},E^{*}), that optimizes the total prize of nodes and edges, minus the costs associated with the size of the subgraph:

and Cesubscript𝐶𝑒C_{e} denotes a predefined cost per edge, which is adjustable to control the subgraph size.

The original PCST algorithm is designed for node prizes only. However, given the significance of edge semantics in certain scenarios, we adapt the algorithm to accommodate edge prizes as follows: Consider an edge e with a cost Cesubscript𝐶𝑒C_{e} and a prize Pesubscript𝑃𝑒P_{e}. If Ce>Pesubscript𝐶𝑒subscript𝑃𝑒C_{e}>P_{e}, it can be treated as a reduced edge cost of Ce−Pesubscript𝐶𝑒subscript𝑃𝑒C_{e}-P_{e}. However, if Pe>Cesubscript𝑃𝑒subscript𝐶𝑒P_{e}>C_{e}, negative edge costs are not allowed in the original algorithm. Our solution involves replacing edge e𝑒e with a ‘virtual node’ vesubscript𝑣𝑒v_{e}, connected to both endpoints of e𝑒e. This virtual node is assigned a prize of Pe−Cesubscript𝑃𝑒subscript𝐶𝑒P_{e}-C_{e}, and the cost of the two new edges leading to the virtual node is set to zero. This modification effectively mirrors the original problem, as including edge e𝑒e in the original graph is analogous to including the virtual node in the modified graph. Finally, we optimize the PCST problem using a near-linear time approach [7].

Graph Encoder. Let S∗=(V∗,E∗)superscript𝑆superscript𝑉superscript𝐸S^{}=(V^{},E^{*}) represent the retrieved subgraph. We use a graph encoder to model the structure of this graph, specifically using a standard Graph Attention Network (GAT) [36]. Our approach for encoding the retrieved subgraph is defined as follows:

Here, POOL denotes the mean pooling operation, and dgsubscript𝑑𝑔d_{g} is the dimension of the graph encoder.

Projection Layer. We incorporate a multilayer perceptron (MLP) to align the graph token with the vector space of the LLM:

Text Embedder. To leverage the text-reasoning capabilities of LLMs, we transform the retrieved subgraph S∗superscript𝑆S^{} into a textual format. This transformation involves flattening the textual attributes of the nodes and edges, as illustrated in the green box in Figure 2. We refer to this operation as textualize​(⋅)textualize⋅\text{textualize}(\cdot). Subsequently, we combine the textualized graph with the query to generate a response. Let xqsubscript𝑥𝑞x_{q} denote the query; we concatenate it with the textualized graph textualize​(S∗)textualizesuperscript𝑆\text{textualize}(S^{}). We then map the result to an embedding htsubscriptℎ𝑡h_{t} using a text embedder, which is the first layer of a pretrained and frozen LLM:

LLM Generation with Graph Prompt Tuning. The final stage involves generating the answer Y𝑌Y given the graph token h^gsubscript^ℎ𝑔\hat{h}{g}, acting as a soft prompt, and the text embedder output htsubscriptℎ𝑡h{t}. These inputs are fed through the self-attention layers of a pretrained frozen LLM, with parameter θ𝜃\theta. The generation process is represented as follows:

where [h^g;ht]subscript^ℎ𝑔subscriptℎ𝑡[\hat{h}{g};h{t}] concatenates the graph token h^gsubscript^ℎ𝑔\hat{h}{g} and the text embedder output htsubscriptℎ𝑡h{t}. While θ𝜃\theta is frozen, the graph token h^gsubscript^ℎ𝑔\hat{h}{g} receives gradients, enabling the optimization of the parameters of the graph encoder ϕ1subscriptitalic-ϕ1\phi{1} and the projection layer ϕ2subscriptitalic-ϕ2\phi_{2} through standard backpropagation.

In the indexing step, we use SentenceBert [29] as the LM to encode all node and edge attributes. In the generation step, we use the open-source Llama2-7b [35] as the LLM for all experiments. We use GAT [36] as the graph encoder. Additional details are provided in Appendix A.1.

In our experiments, we consider three model configurations: 1) Inference-Only: Using a frozen LLM for direct question answering; 2) Frozen LLM + Prompt Tuning (PT): Keeping the parameters of the LLM frozen and adapting only the prompt; 3) Tuned LLM: Fine-tuning the LLM with LoRA.

Inference-Only Performance: G-Retriever surpasses all inference-only baselines. Notably, the integration of textual graphs did not consistently yield positive results, as observed in WebQSP. This inconsistency might be attributed to the complexity and potential noise of the textualized graph, which could hinder the LLM’s direct reasoning capabilities.

Frozen LLM + Prompt Tuning Performance: G-Retriever outperforms traditional prompt tuning across all datasets, with an average performance increase of 35%, underscoring the effectiveness of our approach in leveraging frozen LLMs for improved performance.

Tuned LLM Performance: The combination of our method with LoRA (i.e., G-Retriever + LoRA) achieves the best performance, outperforming standard LoRA fine-tuning by 5.62% on SceneGraphs and 13.56% on WebQSP.

The efficiency of our approach is highlighted by the data presented in Table 4. Implementing our graph-based retrieval led to a significant reduction in the average number of tokens and nodes across datasets. Specifically, for SceneGraphs, an 83% reduction in the average number of tokens was observed. In WebQSP, which features a larger graph size, there was a remarkable decrease of 99% in both tokens and nodes. These substantial reductions demonstrate the efficiency of our method, underscoring its potential in managing large-scale graph data.

To evaluate hallucination, we instructed the models to answer graph-related questions, specifically identifying supporting nodes or edges from the graph. We manually reviewed 100 responses from both our method and the baseline (i.e., LLM with graph prompt tuning), verifying the existence of the nodes and edges referenced in the model’s output within the actual graph. Table 6 shows that G-Retriever significantly reduces hallucinations by 54% compared to the baseline (see Appendix C for details).

In this ablation study, we assess the individual impact of key components within our pipeline, structured into two parts:

Architecture Design: We evaluate the roles of the graph encoder, projection head, and textualized graph. The study indicates performance drops when any of these components are removed, with the graph encoder and textualized graph showing declines of 11.42% and 19.58%, respectively. This demonstrates their complementary effects in representing the graph in both textual and embedded formats.

Retrieval Mechanism: We also assess the impact of various retrieval strategies, including scenarios where either node retrieval or edge retrieval is omitted. Notably, the exclusion of node retrieval leads to a minor decrease in performance by 1.10%, while omitting edge retrieval results in a more substantial decline of 13.29%.

Further details are available in Appendix A.3. Additionally, we present further studies on our framework in the appendix: it is robust to the choice of graph encoders (see Appendix A.4) and benefits from the increased scale of LLMs (see Appendix A.5).

In this work, we introduce a new GraphQA benchmark for real-world graph question answering and present G-Retriever, an architecture adept at complex and creative queries. Experimental results show that G-Retriever surpasses baselines in textual graph tasks across multiple domains, scales effectively with larger graph sizes, and demonstrates resistance to hallucination.

Limitations and Future Work: Currently, G-Retriever employs a static retrieval component. Future developments could investigate more sophisticated RAG where the retrieval is trainable.

XB is supported by NUS Grant ID R-252-000-B97-133.

Experiments are conducted using 2 NVIDIA A100-80G GPUs. Each experiment is replicated four times, utilizing different seeds for each run to ensure robustness and reproducibility.

Graph Encoder. We use GAT [36] as the GNN backbone. Our configuration employs 4 layers, each with 4 attention heads, and a hidden dimension size of 1024.

LLM. We use the open-sourced Llama2-7b [35] as the LLM backbone. In fine-tuning the LLM with LoRA [8], the lora_r parameter (dimension for LoRA update matrices) is set to 8, and lora_alpha (scaling factor) is set to 16. The dropout rate is set to 0.05. In prompt tuning, the LLM is configured with 10 virtual tokens. The number of max text length is 512, the number of max new tokens, i.e., the maximum numbers of tokens to generate, is 32.

PCST. For retrieval over graphs via PCST, for the SceneGraphs dataset, we select the top k𝑘k nodes and edges, setting k𝑘k to 3. Here, the cost of edges, denoted as Cesubscript𝐶𝑒C_{e}, is set to 1. Regarding the WebQSP dataset, we set k=3𝑘3k=3 for nodes and k=5𝑘5k=5 for edges, with the edge cost, Cesubscript𝐶𝑒C_{e}, adjusted to 0.5. For the ExplaGraphs dataset, which is characterized by a small graph size averaging 5.17 nodes and 4.25 edges (as detailed in Table 2), the entire graph can fit in the LLM’s context window size. Consequently, we aim to retrieve the whole graph by setting k𝑘k to 0, effectively returning the original graph unaltered.

Optimization. We use the AdamW [23] optimizer. We set the initial learning rate at 1e-5, with a weight decay of 0.05. The learning rate decays with a half-cycle cosine decay after the warm-up period. The batch size is 4, and the number of epochs is 10. To prevent overfitting and ensure training efficiency, an early stopping mechanism is implemented with a patience setting of 2 epochs.

In our experiments, we consider three model configurations:

  1. Inference-Only: Using a frozen LLM for direct question answering, with query (see Figure 4(a)), or with textual graph and query (see Figure 4(b)).

  2. Frozen LLM + Prompt Tuning (PT): Keeping the parameters of the LLM frozen and adapting only the prompt. This includes soft prompt tuning (see Figure 5(a)) and our G-Retriever method (see Figure5(b)).

  3. Tuned LLM: Fine-tuning the LLM with LoRA. Refer to Figure 6(a) for the standard fine-tuning of an LLM for downstream tasks, and see Figure 6(b) for the combined approach of fine-tuning the LLM with G-Retriever.

This section illustrates the modifications made to the original architecture in the ablation study, as presented in Figure 7.

Without Graph Encoder (w/o GraphEncoder): In this setting, we replaced the graph encoder with trainable soft tokens, setting the number of these virtual tokens to 10.

Without Projection Layer (w/o Projection Layer): Here, we removed the projection layer following the graph encoder. We configured the output dimension of the graph encoder to be 4096, matching the hidden dimension of Llama2-7b. This allows the output graph token (the yellow token in Figure 7(b)) to be concatenated directly with the LLM tokens (blue tokens).

Without Textualized Graph (w/o Textualized Graph): In this configuration, we modified the textual input to the LLM. Rather than using a combination of the query and the textualized graph, we solely used the query.

In addition to the GAT [36], we explore other GNNs as the graph encoder, such as GCN [14] and the GraphTransformer [31]. The comparative results of these models on the WebQSP and ExplaGraphs datasets are presented in Table 7.

The results demonstrate that our proposed method exhibits consistent robustness across different graph encoders. Notably, all three encoders – GCN, GAT, and GraphTransformer – demonstrate competitive and closely aligned performance on the WebQSP dataset, with Hit@1 scores of 0.6853, 0.6732, and 0.6910, respectively. However, the performance differentiation becomes more pronounced on the ExplaGraphs dataset, where GraphTransformer exhibits a superior Hit@1 score of 0.9016, followed by GAT and GCN with scores of 0.8696 and 0.8290, respectively. This variation in performance across the datasets highlights the importance of encoder selection based on the specific characteristics and requirements of the dataset.

As for the choice of LLM, we considered both Llama2-7b and Llama2-13b. Our experiments demonstrate that stronger LLMs enhance the effectiveness of our method, as shown in Table 8, indicating that it benefits from the increased scale of the LLMs.

In this section, we detail how our GraphQA benchmark differs from the original datasets, including the specific processing steps we employed. For concrete examples that illustrate the differences between the raw text in the original dataset and in our GraphQA benchmark, please refer to Table 9.

ExplaGraphs. The original dataset222https://explagraphs.github.io/ [30] represents relationships using triplets. We have standardized this format by converting the triplets into a graph representation. Specifically, each head and tail in a triplet is transformed into a node, and the relation is transformed into an edge. Since the test dataset labels are not available, we have utilized only the training and validation (val) datasets from the original collection. We further divided these into training, val, and test subsets, using a 6:2:2 ratio.

SceneGraphs. The original GQA dataset is designed for real-world visual reasoning and compositional question answering, aiming to address key shortcomings of previous VQA datasets [11]. It comprises 108k images, each associated with a Scene Graph. In our study, we focus differently on graph question answering; hence, we did not utilize the image counterparts, leveraging only the scene graphs from the original dataset. Additionally, the original dataset describes images using JSON files. We simplified the object IDs to suit our research needs. We randomly sampled 100k samples from the original dataset and divided them into training, validation, and test subsets, following a 6:2:2 ratio.

WebQSP. We follow the preprocessing steps from RoG333https://huggingface.co/datasets/rmanluo/RoG-webqsp [25]. The original dataset uses a list of triplets format, which we have transformed into our unified graph format. Furthermore, to avoid discrimination between capital and lowercase words, we have converted all words to lowercase. We used the same dataset split as in the original dataset.

In this section, we present quantitative results regarding hallucinations in the SceneGraphs dataset.

Baseline. For our baseline, we adapted MiniGPT-4 [49] to graph contexts. This approach involves a frozen LLM interacting with a trainable GNN that encodes graph data as a soft prompt, denoted as LLM+Graph Prompt Tuning. We focus on graph prompt tuning as the baseline, instead of converting the graph into text, since the textual representation of the graph is large and consistently exceeds the input token limits of LLMs.

Experiment Design. We instructed the LLM to answer graph-related questions and to list nodes or edges in the explanation graph that support its answers. Since standard answers for these questions do not exist, allowing the LLM to respond flexibly, it becomes challenging to evaluate its responses. To address this, we manually examined 100 responses generated by our method and the LLM with graph prompt tuning, verifying whether the nodes and edges referenced in the LLM’s output actually exist in the graph.

Evaluation Metrics. We assessed the model’s faithfulness using three metrics: the fraction of valid nodes (denoted as Valid Nodes), the fraction of valid edges (denoted as Valid Edges), and the fraction of times the entire set of nodes and edges cited was valid (denoted as Fully Valid Graphs).

Results. The results, as depicted in Table 6, illustrate the comparative effectiveness of the G-Retriever over the baseline LLM+Graph Prompt Tuning method in reducing hallucinations. The LLM+Graph Prompt Tuning approach demonstrated a significantly lower accuracy in referencing graph elements, with only 31% of nodes and 12% of edges being valid, and the entire set of nodes and edges being valid only 8% of the time. In contrast, G-Retriever showed substantial improvements: 77% validity in nodes, 76% in edges, and 62% in the overall validity of referenced node-edge sets. These results underscore the significant reduction in hallucinations with G-Retriever, particularly in the challenging task of accurately citing both nodes and edges in graph-based contexts.

Table: S1.T1: Observation and mitigation of hallucination: comparative analysis of baseline graph LLMs and our G-Retriever.

User
What kind of animal is in the bushes? Answer the question and list the nodes or edges from the graph that support your answer.
LLM + Graph Prompt Tuning[Incorrect response with hallucinated nodes and edges] The animal in the bushes is a deer. Nodes: * Deer (node 1) * Bushes (node 2) Edges: * Deer →→\rightarrow Bushes (edge 1) * Deer →→\rightarrow Grass (edge 2) * Bushes →→\rightarrow Grass (edge 3)
G-Retriever (Ours)[Correct response with accurate node and edge references] Based on the information provided in the graph, the animal in the bushes is an elephant. The nodes that support this answer are: * Node 1: The elephant is mentioned in the first line of the graph. The edges that support this answer are: * The edge connects the elephant to the bushes, indicating that the elephant is in the bushes. Therefore, the answer to the question is: The animal in the bushes is an elephant.

Table: S4.T2: Summary of datasets used in GraphQA benchmark.

DatasetExplaGraphsSceneGraphsWebQSP
#Graphs2,76610,0004,737
Avg. #Nodes5.1719.131370.89
Avg. #Edges4.2568.444252.37
Node AttributeCommonsense conceptsObject attributes (e.g., color, shape, material, activity)Entities in Freebase
Edge AttributeCommonsense relationsRelations (e.g., actions, spatial relations, comparatives)Relations in Freebase
TaskCommon sense reasoningScene graph question answeringKnowledge based question answering
Evaluation MatrixAccuracyAccuracyHit@1

Table: S6.T3: Performance comparison across ExplaGraphs, SceneGraphs, and WebQSP datasets. This table presents mean scores and standard deviations (mean ± std) for different configurations, including Inference-Only, Frozen LLM , and Tuned LLM settings, across the three datasets. A dashed entry indicates the non-applicability of the ‘Question-Only’ method for SceneGraphs, where questions cannot be answered without the textual graph.

SettingMethodExplaGraphsSceneGraphsWebQSP
Inference-OnlyQuestion-Only0.57040.6136
Textual Graph + Question0.58120.35480.4195
Frozen LLM + Prompt TuningPrompt Tuning0.5876 ± 0.00320.6851 ± 0.00110.4975 ± 0.0049
G-Retriever0.8696 ± 0.01290.8614 ± 0.00800.6732 ± 0.0076
Δ Prompt TuningsubscriptΔ Prompt Tuning\Delta_{\text{ Prompt Tuning}}↑↑\uparrow 47.99%↑↑\uparrow 22.97%↑↑\uparrow 35.32%
Tuned LLMLoRA0.8741 ± 0.01860.8594 ± 0.00620.6174 ± 0.0012
G-Retriever + LoRA0.8768 ± 0.02740.9077 ± 0.01040.7011 ± 0.005
Δ LoRAsubscriptΔ LoRA\Delta_{\text{ LoRA}}↑↑\uparrow 0.31%↑↑\uparrow 5.62%↑↑\uparrow 13.56%

Table: S6.T4: Average number of tokens and nodes before and after implementing retrieval.

DatasetBefore Retrieval (Avg.)After Retrieval (Avg.)
# Tokens# Nodes# Tokens# Nodes
SceneGraphs1,39619235 (↓↓\downarrow83%)5 (↓↓\downarrow74%)
WebQSP100,6271,371610 (↓↓\downarrow99%)18 (↓↓\downarrow99%)

Table: A1.T7: Performance of different graph encoders on the WebQSP and ExplaGraphs datasets.

Graph EncoderWebQSPExplaGraphs
GCN [14]0.68530.8290
GAT [36]0.67320.8696
GraphTransformer [31]0.69100.9016

Table: A4.T11: Example of creative chat on SceneGraphs. We utilize GPT-4 to generate an image from the graph, assisting readers in visualizing it

Graphnode_id,node_attr 0,name: head; attribute: black; (x,y,w,h): (386, 185, 8, 4) 1,name: swimming pool; attribute: small, water; (x,y,w,h): (465, 92, 35, 3) 2,name: shore; (x,y,w,h): (49, 85, 436, 108) 3,name: fence; attribute: wired; (x,y,w,h): (11, 158, 167, 46) 4,name: mountains; attribute: covered; (x,y,w,h): (393, 53, 107, 12) 5,name: water; attribute: blue; (x,y,w,h): (0, 50, 499, 154) 6,name: sheep; attribute: walking, grazing; (x,y,w,h): (243, 191, 20, 13) 7,name: sky; attribute: blue, clear, cloudy; (x,y,w,h): (0, 0, 501, 60) 8,name: sheep; attribute: white, fluffy; (x,y,w,h): (409, 176, 13, 10) 9,name: field; attribute: large, green, grassy; (x,y,w,h): (0, 138, 503, 112) 10,name: grass; (x,y,w,h): (0, 84, 500, 166) 11,name: sheep; attribute: white, fluffy; (x,y,w,h): (398, 178, 13, 11) 12,name: sheep; attribute: white, fluffy; (x,y,w,h): (385, 179, 17, 11) 13,name: sheep; attribute: white, fluffy; (x,y,w,h): (364, 181, 14, 10) 14,name: shore; attribute: sandy; (x,y,w,h): (190, 123, 171, 12) src, edge_attr, dst 0,to the left of,11\n 0,to the right of,13\n 3,to the left of,6\n 6,to the right of,3\n 6,grazing on,10\n 6,in,9\n 6,in,10\n 7,above,5\n 8,eating,10\n 8,to the right of,11\n 8,to the right of,12\n 11,to the left of,8\n 11,to the right of,0\n 11,to the right of,13\n 11,to the right of,12\n 11,eating,10\n 12,eating,10\n 12,to the left of,8\n 12,to the right of,13\n 12,to the left of,11\n 13,to the left of,0\n 13,to the left of,11\n 13,to the left of,12\n 13,eating,10\n
Image
UserWrite an 500-word advertisement for this place in the scene graph that would make people want to visit it.
ModelSure! Here’s an example of a 500-word advertisement for the place based on the scene graph: "Experience the breathtaking beauty of our place, where the rolling hills meet the sparkling water of the swimming pool. Take a leisurely stroll through the lush green fields, and let the gentle breeze caress your face as you gaze out at the picturesque view. The sheep graze peacefully in the distance, their white coats glistening in the sunlight. The sky is a brilliant blue, with fluffy white clouds drifting lazily across. Come and bask in the serene ambiance of our place, where the natural beauty will leave you in awe."

Refer to caption We develop a flexible question-answering framework targeting real-world textual graph applications. Presented here are examples showcasing the model’s adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively.

Refer to caption Illustrative examples from the GraphQA benchmark datasets.

Refer to caption Overview of the proposed G-Retriever: 1) Indexing: Graphs are indexed for efficient query processing; 2) Retrieval: The most semantically relevant nodes and edges are retrieved, conditioned on the query; 3) Subgraph Construction: A connected subgraph is extracted, covering as many relevant nodes and edges as possible while maintaining a manageable graph size; 4) Generation: An answer is generated using a ‘graph prompt’, a textualized graph, and the query.

Refer to caption (a) Inference-Only w/ query

Refer to caption (a) Prompt Tuning

Refer to caption (b) G-Retriever

Refer to caption (a) LoRA

Refer to caption (c) w/o Textualized Graph

$$ h_{g}=\text{POOL}(\text{GNN}{\phi{1}}(S^{*}))\in\mathbb{R}^{d_{g}}, $$ \tag{S5.E9}

Userelephants grass eating bushes in river on the side of standing by elephant What kind of animal is in the bushes? Answer the question and list the nodes or edges from the graph that support your answer.
LLM w/ Graph Prompt Tuning[Incorrect response with hallucinated nodes and edges] The animal in the bushes is a deer. Nodes: * Deer (node 1) * Bushes (node 2) Edges: * Deer → Bushes (edge 1) * Deer → Grass (edge 2) * Bushes → Grass (edge 3)
G-Retriever (Ours)[Correct response with accurate node and edge references] Based on the information provided in the graph, the animal in the bushes is an elephant. The nodes that support this answer are: * Node 1: The elephant is mentioned in the first line of the graph. The edges that support this answer are: * The edge connects the elephant to the bushes, indicating that the elephant is in the bushes. Therefore, the answer to the question is: The animal in the bushes is an elephant.
DatasetExplaGraphsSceneGraphsWebQSP
#Graphs Avg. #Nodes Avg. #Edges Node Attribute Edge Attribute Task Evaluation Matrix2,766 5.17 4.25 Commonsense concepts Commonsense relations Common sense reasoning Accuracy100,000 19.13 68.44 Object attributes ( e.g., color, shape) Relations ( e.g., actions, spatial relations) Scene graph question answering Accuracy4,737 1370.89 4252.37 Entities in Freebase Relations in Freebase Knowledge based question answering Hit@1
SettingMethodExplaGraphsSceneGraphsWebQSP
Inference-onlyZero-shot Zero-CoT [18] CoT-BAG [44] KAPING [1]0.5650 0.5704 0.5794 0.62270.3974 0.5260 0.5680 0.437541.06 51.30 39.60 52.64
Frozen LLM w/ PTPrompt tuning GraphToken [31] G-Retriever ∆ Prompt tuning0.5763 ± 0.0243 0.8508 ± 0.0551 0.8516 ± 0.0092 ↑ 47.77%0.6341 ± 0.0024 0.4903 ± 0.0105 0.8131 ± 0.0162 ↑ 28.23%48.34 ± 0.64 57.05 ± 0.74 70.49 ± 1.21 ↑ 45.81%
Tuned LLMLoRA G-Retriever w/ LoRA ∆ LoRA0.8538 ± 0.0353 0.8705 ± 0.0329 ↑ 1.95%0.7862 ± 0.0031 0.8683 ± 0.0072 ↑ 11.74%66.03 ± 0.47 73.79 ± 0.70 ↑ 10.44%
DatasetBefore Retrieval (Avg.)Before Retrieval (Avg.)Before Retrieval (Avg.)After Retrieval (Avg.)After Retrieval (Avg.)After Retrieval (Avg.)
# Tokens# NodesMin/Epoch# Tokens# NodesMin/Epoch
SceneGraphs1,39619123.1235 ( ↓ 83%)5 ( ↓ 74%)86.8 ( ↓ 29%)
WebQSP100,6271,37118.7610 ( ↓ 99%)18 ( ↓ 99%)6.2( ↓ 67%)
BaselineG-RetrieverMethod w/o Graph EncoderHit@1∆ G-Retriever
Valid Nodes31%77%w/o Projection Layer69.70 ± 0.68↓ 1.11%
Valid Edges12%76%w/o Textualized Graph56.96 ± 1.83↓ 19.19%
Fully Valid Graphs8%62%w/o Retrieval63.84 ± 0.41↓ 9.43%
Graph EncoderWebQSPExplaGraphs
GCN [17]70.70.8394
GAT [43]70.270.843
Graph Transformer [37]70.490.8516
LLMLlama2-7bLlama2-13b
Hit@170.4975.58
DatasetOriginal datasetGraphQA Benmark
ExplaGraphs(entrapment; capable of; being abused) (being abused; created by; police) (police; capable of; harm) (harm; used for; people) (people; part of; citizens)node_id,node_attr\n 0,entrapment\n 1,being abused\n 2,police\n 3,harm\n 4,people\n 5,citizens\n src,edge_attr,dst\n 0,capable of,1\n 1,created by,2\n 2,capable of,3\n 3,used for,4\n 4,part of,5\n
SceneGraphs"width": 500, "objects": "681267": "name": "banana", "h": 34, "relations": ["object": "681262", "name": "to the left of"], "w": 64, "attributes": ["small", "yellow"], "y": 55, "x": 248, "681265": "name": "spots", "h": 16, "relations": [], "w": 26, "attributes": [], "y": 92, "x": 245, "681264": "name": "bananas", "h": 50, "relations": ["object": "681259", "name": "to the left of"], "w": 49, "attributes": ["small", "yellow"], "y": 32, "x": 268, "681263": "name": "picnic", "h": 374, "relations": [], "w": 499, "attributes": ["delicious"], "y": 0, "x": 0, "681262": "name": "straw", "h": 95, "relations": ["object": "681268", "name": "to the right of", "object": "681267", "name": "to the right of", "object": "681253", "name": "to the right of"], "w": 15, "attributes": ["white", "plastic"], "y": 55, "x": 402, "681261": "name": "meat", "h": 27, "relations": ["object": "681255", "name": "on", "object": "681255", "name": "inside"], "w": 24, "attributes": ["small", "brown", "delicious"], "y": 123, "x": 68, "681260": "name": "rice", "h": 57, "relations": ["object": "681255", "name": "on", "object": "681258", "name": "to the left of"], "w": 93, "attributes": ["piled", "white"], "y": 162, "x": 57, "681269": "name": "onions", "h": 16, "relations": [], "w": 24, "attributes": ["green"], "y": 147, "x": 90, "681268": "name": "tablecloth", "h": 374, "relations": ["object": "681262", "name": "to the left of"], "w": 396, "attributes": ["white"], "y": 0, "x": 0, "681258": "name": "bowl", "h": 99, "relations": ["object": "681255", "name": "next to", "object": "681257", "name": "of", "object": "681255", "name": "near", "object": "681256", "name": "to the right of", "object": "681260", "name": "to the right of", "object": "681255", "name": "to the right of"], "w": 115, "attributes": ["full"], "y": 184, "x": 178, "681259": "name": "plantains", "h": 70, "relations": ["object": "681264", "name": "to the right of"], "w": 45, "attributes": ["red"], "y": 0, "x": 346, "681256": "name": "spoon", "h": 65, "relations": ["object": "681255", "name": "on", "object": "681257", "name": "to the left of", "object": "681255", "name": "in", "object": "681258", "name": "to the left of"], "w": 140, "attributes": ["large", "metal", "silver"], "y": 196, "x": 0, "681257": "name": "dish", "h": 81, "relations": ["object": "681258", "name": "inside", "object": "681256", "name": "to the right of", "object": "681258", "name": "in", "object": "681255", "name": "to the right of"], "w": 108, "attributes": ["cream colored"], "y": 199, "x": 187, "681254": "name": "meal", "h": 111, "relations": [], "w": 130, "attributes": [], "y": 121, "x": 58, "681255": "name": "plate", "h": 138, "relations": ["object": "681257", "name": "to the left of", "object": "681254", "name": "of", "object": "681254", "name": "with", "object": "681258", "name": "near", "object": "681258", "name": "to the left of"], "w": 176, "attributes": ["white", "full"], "y": 111, "x": 30, "681253": "name": "banana", "h": 30, "relations": ["object": "681262", "name": "tonode_id,node_attr 0,"name: banana; attribute: small, yellow; (x,y,w,h): (248, 55, 64, 34)" 1,"name: spots; (x,y,w,h): (245, 92, 26, 16)" 2,"name: bananas; attribute: small, yellow; (x,y,w,h): (268, 32, 49, 50)" 3,"name: picnic; attribute: delicious; (x,y,w,h): (0, 0, 499, 374)" 4,"name: straw; attribute: white, plastic; (x,y,w,h): (402, 55, 15, 95)" 5,"name: meat; attribute: small, brown, delicious; (x,y,w,h): (68, 123, 24, 27)" 6,"name: rice; attribute: piled, white; (x,y,w,h): (57, 162, 93, 57)" 7,"name: onions; attribute: green; (x,y,w,h): (90, 147, 24, 16)" 8,"name: tablecloth; attribute: white; (x,y,w,h): (0, 0, 396, 374)" 9,"name: bowl; attribute: full; (x,y,w,h): (178, 184, 115, 99)" 10,"name: plantains; attribute: red; (x,y,w,h): (346, 0, 45, 70)" 11,"name: spoon; attribute: large, metal, silver; (x,y,w,h): (0, 196, 140, 65)" 12,"name: dish; attribute: cream colored; (x,y,w,h): (187, 199, 108, 81)" 13,"name: meal; (x,y,w,h): (58, 121, 130, 111)" 14,"name: plate; attribute: white, full; (x,y,w,h): (30, 111, 176, 138)" 15,"name: banana; attribute: small, yellow; (x,y,w,h): (237, 87, 73, 30)" src,edge_attr,dst 0,to the left of,4\n 2,to the left of,10\n 4,to the right of,8\n 4,to the right of,0\n 4,to the right of,15\n 5,on,14\n 5,inside,14\n 6,on,14\n 6,to the left of,9\n 8,to the left of,4\n 9,next to,14\n 9,of,12\n 9,near,14\n 9,to the right of,11\n 9,to the right of,6\n 9,to the right of,14\n 10,to the right of,2\n 11,on,14\n 11,to the left of,12\n 11,in,14\n 11,to the left of,9\n 12,inside,9\n 12,to the right of,11\n 12,in,9\n 12,to the right of,14\n 14,to the left of,12\n 14,of,13\n 14,with,13\n 14,near,9\n 14,to the left of,9\n 15,to the left of,4\n
WebQSP[['FedEx Cup', 'sports.sports_award_type.winners', 'm.0n1v8cy'], ['Brandt Snedeker', 'sports.sports_award_winner.awards', 'm.0n1v8cy'], ['FedEx Cup', 'common.topic.article', 'm.08q5wy'], ['FedEx Cup', 'common.topic.notable_for', 'g.12559n8g_'], ['Sports League Award Type', 'freebase.type_profile.published', 'Published'], ['FedEx Cup', 'common.topic.notable_types', 'Sports League Award Type'], ['m.0n1v8cy', 'sports.sports_award.award_winner', 'Brandt Snedeker'], ['Sports League Award Type', 'type.type.expected_by', 'Award'], ['Sports League Award Type', 'common.topic.article', 'm.06zxtxj'], ['2012 PGA Tour', 'sports.sports_league_season.awards', 'm.0n1v8cy'], ['Sports League Award Type', 'freebase.type_hints.included_types', 'Topic'], ['Sports League Award Type', 'type.type.domain', 'Sports'], ['m.0n1v8cy', 'sports.sports_award.award', 'FedEx Cup'], ['Sports League Award Type', 'freebase.type_profile.strict_included_types', 'Topic'], ['Sports League Award Type', 'freebase.type_profile.kind', 'Classification'],node_id,node_attr\n 0,fedex cup\n 1,m.0n1v8cy\n 2,brandt snedeker\n 3,m.08q5wy\n 4,g.12559n8g_\n 5,sports league award type\n 6,published\n 7,award\n 8,m.06zxtxj\n 9,2012 pga tour\n 10,topic\n 11,sports\n 12,classification\n 13,winners\n src,edge_attr,dst 0,sports.sports_award_type.winners,1 2,sports.sports_award_winner.awards,1 0,common.topic.article,3 0,common.topic.notable_for,4 5,freebase.type_profile.published,6 0,common.topic.notable_types,5 1,sports.sports_award.award_winner,2 5,type.type.expected_by,7 5,common.topic.article,8 9,sports.sports_league_season.awards,1 5,freebase.type_hints.included_types,10 5,type.type.domain,11 1,sports.sports_award.award,0
k351020
Hit@10.69770.70630.72480.7039
MethodRetrieval Accuracy
KAPING [1] (top-k triple retrieval)60.81
G-Retriever70.49
SetttingMethodHit@1Time
Inference-onlyQuestion only Textual graph and question61.16 41.0631 min 40 min
Prompt Tuning48.3418.7 min/epoch 6.2 min/epoch
Frozen LLM w/ PTG-Retriever70.49
Tuned LLMLoRA66.0319 min/epoch
G-Retriever w/
LoRA73.796.9 min/epoch
Graphnode_id,node_attr 0,human 1,cosmetic surgery 2,women 3,change appearance 4,addiction 5,searching for perfection 6,stop 7,positive src,edge_attr,dst 0,receives action,1 2,part of,0 1,used for,3 3,capable of,4 4,created by,5
UserWrite a 500-word argument essay based on the information provided in the explanation graph.
ModelThe explanation graph provides a detailed analysis of the reasons why women undergo cosmetic surgery. According to the graph, the main reasons include the desire to change one's appearance, the influence of societal beauty standards, and the addiction to searching for perfection. These reasons are further supported by the various nodes and edges in the graph, which provide additional insights into the motivations and consequences of cosmetic surgery. Firstly, the node labeled "human" highlights the fundamental human desire to feel attractive and confident in one's appearance. This desire is deeply ingrained in our culture, where beauty is often associated with happiness, success, and social acceptance. As a result, many women feel pressure to conform to societal beauty standards, which can lead to feelings of insecurity and low self-esteem. In response to these feelings, some women may turn to cosmetic surgery as a means of improving their appearance and boosting their self-confidence. Secondly, the node labeled "women" highlights the gendered nature of cosmetic surgery. While men also undergo cosmetic surgery, the majority of cosmetic surgery patients are women. This is likely due to the fact that women are more likely to be exposed to societal beauty standards and are more likely to internalize these standards. As a result, women may be more likely to feel insecure about their appearance and more likely to seek cosmetic surgery as a means of improving their looks. Thirdly, the node labeled "addiction" highlights the potential negative consequences of cosmetic surgery. While some women may initially feel improved confidence and self-esteem after un- dergoing cosmetic surgery, the graph suggests that this initial boost may not last. In fact, some women may become addicted to the constant pursuit of perfection, leading to a cycle of surgery and dissatisfaction. This can have serious psychological and emotional consequences, including depression, anxiety, and a loss of self-esteem. Fourthly, the node labeled "searching for perfection" highlights the unrealistic expectations that many women have about the results of cosmetic surgery. The graph suggests that women may have unrealistic expectations about the ability of cosmetic surgery to improve their appearance and boost their self-esteem. This can lead to disappointment and dissatisfaction with the results of surgery, as well as a continued pursuit of perfection through additional surgery. Finally, the node labeled "stop" highlights the importance of critical thinking and self-reflection when it comes to cosmetic surgery. The graph suggests that women should be aware of the potential negative consequences of cosmetic surgery and should carefully consider the reasons why they are undergoing surgery. They should also be aware of the unrealistic expectations that they may have about the results of surgery and should be prepared to address any negative emotions or feelings of dissatisfaction that may arise. In conclusion, the explanation graph provides a detailed analysis of the reasons why women undergo cosmetic surgery. The graph highlights the fundamental human desire to feel attractive and confident in one's appearance, the gendered nature of cosmetic surgery, the potential negative consequences of cosmetic surgery, and the importance of critical thinking and self-reflection when it comes to surgery. By understanding these reasons and the potential consequences of cosmetic surgery, women can make informed decisions about whether or not to undergo surgery and can work to address any negative emotions or feelings of dissatisfaction that may arise.
Graphnode_id,node_attr 0,name: head; attribute: black; (x,y,w,h): (386, 185, 8, 4) 1,name: swimming pool; attribute: small, water; (x,y,w,h): (465, 92, 35, 3) 2,name: shore; (x,y,w,h): (49, 85, 436, 108) 3,name: fence; attribute: wired; (x,y,w,h): (11, 158, 167, 46) 4,name: mountains; attribute: covered; (x,y,w,h): (393, 53, 107, 12) 5,name: water; attribute: blue; (x,y,w,h): (0, 50, 499, 154) 6,name: sheep; attribute: walking, grazing; (x,y,w,h): (243, 191, 20, 13) 7,name: sky; attribute: blue, clear, cloudy; (x,y,w,h): (0, 0, 501, 60) 8,name: sheep; attribute: white, fluffy; (x,y,w,h): (409, 176, 13, 10) 9,name: field; attribute: large, green, grassy; (x,y,w,h): (0, 138, 503, 112) 10,name: grass; (x,y,w,h): (0, 84, 500, 166) 11,name: sheep; attribute: white, fluffy; (x,y,w,h): (398, 178, 13, 11) 12,name: sheep; attribute: white, fluffy; (x,y,w,h): (385, 179, 17, 11) 13,name: sheep; attribute: white, fluffy; (x,y,w,h): (364, 181, 14, 10) 14,name: shore; attribute: sandy; (x,y,w,h): (190, 123, 171, 12) src, edge_attr, dst 0,to the left of,11\n 0,to the right of,13\n 3,to the left of,6\n 6,to the right of,3\n 6,grazing on,10\n 6,in,9\n 6,in,10\n 7,above,5\n 8,eating,10\n 8,to the right of,11\n 8,to the right of,12\n 11,to the left of,8\n 11,to the right of,0\n 11,to the right of,13\n 11,to the right of,12\n 11,eating,10\n 12,eating,10\n 12,to the left of,8\n 12,to the right of,13\n 12,to the left of,11\n 13,to the left of,0\n
Image
UserWrite an 500-word advertisement for this place in the scene graph that would make people want to visit it.
ModelSure! Here's an example of a 500-word advertisement for the place based on the scene graph: "Experience the breathtaking beauty of our place, where the rolling hills meet the sparkling water of the swimming pool. Take a leisurely stroll through the lush green fields, and let the gentle breeze caress your face as you gaze out at the picturesque view. The sheep graze peacefully in the distance, their white coats glistening in the sunlight. The sky is a brilliant blue, with fluffy white clouds drifting lazily across. Come and bask in the serene ambiance of our place, where the natural beauty will leave you in awe."
node_id, node_attr 8, benjamin franklin 63, glass harmonica 98, lightning rod
UserQuestion: what else did ben franklin invent? Surrounding this question, develop an educational tour through the knowledge graph, highlighting interesting facts and connections in an engaging way.
Great! Based on the provided knowledge graph, here's an educational tour that highlights interesting facts and connections related to Benjamin Franklin's inventions and other notable achievements: 1. Franklin's Lightning Rod: Start by exploring Franklin's most famous invention - the lightning rod. Discover how Franklin's invention protected buildings from lightning strikes and how it changed the way people understood electricity. Connection: Franklin worked with Thomas Jefferson on the lightning rod, and their collaboration led to the development of the concept of positive and negative charges. 2. Bifocals: Move on to Franklin's invention of bifocal glasses. Learn how Franklin's vision problems inspired him to create a new type of glasses that allowed people to see both near and far objects clearly. Connection: Franklin's bifocal glasses were a groundbreaking invention that improved the lives of millions of people around the world. 3. Franklin Stove: Explore Franklin's design of the Franklin stove, a cast-iron stove that provided more heat with less fuel. Discover how Franklin's stove revolutionized heating systems and how it helped to reduce fuel costs for households and businesses. Connection: Franklin's stove was an early example of energy efficiency and sustainability, and it paved the way for modern heating systems. 4. Prokop Diviš: Learn about Franklin's connection to Prokop Diviš, a Czech inventor who built on Franklin's work and developed the first electric generator. Discover how Diviš's invention helped to power the Industrial Revolution and how it changed the way people lived and worked. Connection: Franklin's work on the lightning rod inspired Diviš to explore the properties of electricity, and their collaborative efforts helped to advance the field of electrical engineering. 5. Innovator and Original Ideas: Explore Franklin's reputation as an innovator and his ability to come up with original ideas. Discover how Franklin's creativity and curiosity led to many of his inventions and how he was able to see connections between seemingly unrelated things. Connection: Franklin's innovative spirit and ability to think outside the box inspired many other inventors and scientists, including Thomas Edison and Nikola Tesla. 6. Collaboration and Networking: Highlight Franklin's collaborative approach to invention and his ability to network with other inventors and scientists. Discover how Franklin's connections with other innovators helped to advance the field of science and technology. Connection: Franklin's collaborations with other inventors and scientists, such as Thomas Jefferson and Prokop Diviš, helped to accelerate the pace of innovation and pave the way for future breakthroughs. 7. Legacy and Impact: Reflect on Franklin's lasting legacy and the impact of his inventions on modern society. Discover how Franklin's inventions continue to inspire new generations of inventors and scientists and how they have helped to shape the world we live in today. Connection: Franklin's inventions and ideas have had a lasting impact on many fields, including science, technology, and society as a whole. His legacy continues to inspire and motivate people around the world. By following this educational tour through the knowledge graph, students will gain a deeper understanding of Benjamin Franklin's inventions and their impact on modern society. They will also learn about Franklin's collaborative approach to invention, his innovative spirit, and his ability to think outside the box.
Model

Figure

Figure

References

[hegde2015nearly] Hegde, Chinmay, Indyk, Piotr, Schmidt, Ludwig. (2015). A nearly-linear time framework for graph-structured sparsity. International Conference on Machine Learning.

[gao2023retrieval_rag_survey] Gao, Yunfan, Xiong, Yun, Gao, Xinyu, Jia, Kangxiang, Pan, Jinliu, Bi, Yuxi, Dai, Yi, Sun, Jiawei, Wang, Haofen. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.

[wang2023can_graphllm_survey] Wang, Heng, Feng, Shangbin, He, Tianxing, Tan, Zhaoxuan, Han, Xiaochuang, Tsvetkov, Yulia. (2023). Can Language Models Solve Graph Problems in Natural Language?. arXiv preprint arXiv:2305.10037.

[pan2023integrating_graphllm_survey] Pan, Shirui, Zheng, Yizhen, Liu, Yixin. (2023). Integrating Graphs with Large Language Models: Methods and Prospects. arXiv preprint arXiv:2310.05499.

[zhang2023graph_graphllm_survey] Ziwei Zhang, Haoyang Li, Zeyang Zhang, Yijian Qin, Xin Wang, Wenwu Zhu. (2023). Graph Meets LLMs: Towards Large Graph Models.

[li2023survey_graphllm_survey] Li, Yuhan, Li, Zhixun, Wang, Peisong, Li, Jia, Sun, Xiangguo, Cheng, Hong, Yu, Jeffrey Xu. (2023). A survey of graph meets large language model: Progress and future directions. arXiv preprint arXiv:2311.12399.

[jin2023large_graphllm_survey] Jin, Bowen, Liu, Gang, Han, Chi, Jiang, Meng, Ji, Heng, Han, Jiawei. (2023). Large Language Models on Graphs: A Comprehensive Survey. arXiv preprint arXiv:2312.02783.

[ye2023natural_general_graph_model] Ye, Ruosong, Zhang, Caiqi, Wang, Runhui, Xu, Shuyuan, Zhang, Yongfeng. (2023). Natural language is all a graph needs. arXiv preprint arXiv:2308.07134.

[liu2023one_general_graph_model] Liu, Hao, Feng, Jiarui, Kong, Lecheng, Liang, Ningyue, Tao, Dacheng, Chen, Yixin, Zhang, Muhan. (2023). One for All: Towards Training One Graph Model for All Classification Tasks. arXiv preprint arXiv:2310.00149.

[tang2023graphgpt_general_graph_model] Tang, Jiabin, Yang, Yuhao, Wei, Wei, Shi, Lei, Su, Lixin, Cheng, Suqi, Yin, Dawei, Huang, Chao. (2023). Graphgpt: Graph instruction tuning for large language models. arXiv preprint arXiv:2310.13023.

[lei2023boosting_general_graph_model] Lei, Bin, Liao, Chunhua, Ding, Caiwen, others. (2023). Boosting logical reasoning in large language models through a new framework: The graph of thought. arXiv preprint arXiv:2308.08614.

[yu2023thought_general_graph_model] Yu, Junchi, He, Ran, Ying, Rex. (2023). Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models. arXiv preprint arXiv:2310.03965.

[li2023graphadapter_multimodal_model] Li, Xin, Lian, Dongze, Lu, Zhihe, Bai, Jiawang, Chen, Zhibo, Wang, Xinchao. (2023). Graphadapter: Tuning vision-language models with dual knowledge graph. arXiv preprint arXiv:2309.13625.

[yoon2023multimodal_multimodal_model] Yoon, Minji, Koh, Jing Yu, Hooi, Bryan, Salakhutdinov, Ruslan. (2023). Multimodal Graph Learning for Generative Tasks. arXiv preprint arXiv:2310.07478.

[zhang2023graph_basic_graph_reasoning] Zhang, Jiawei. (2023). Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT. arXiv preprint arXiv:2304.11116.

[zhao2023graphtext_basic_graph_reasoning] Zhao, Jianan, Zhuo, Le, Shen, Yikang, Qu, Meng, Liu, Kai, Bronstein, Michael, Zhu, Zhaocheng, Tang, Jian. (2023). Graphtext: Graph reasoning in text space. arXiv preprint arXiv:2310.01089.

[chai2023graphllm_basic_graph_reasoning] Chai, Ziwei, Zhang, Tianjie, Wu, Liang, Han, Kaiqiao, Hu, Xiaohai, Huang, Xuanwen, Yang, Yang. (2023). Graphllm: Boosting graph reasoning ability of large language model. arXiv preprint arXiv:2310.05845.

[he2023harnessing_nc] He, Xiaoxin, Bresson, Xavier, Laurent, Thomas, Perold, Adam, LeCun, Yann, Hooi, Bryan. (2023). Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning. arXiv preprint arXiv:2305.19523.

[chen2023exploring_nc] Chen, Zhikai, Mao, Haitao, Li, Hang, Jin, Wei, Wen, Hongzhi, Wei, Xiaochi, Wang, Shuaiqiang, Yin, Dawei, Fan, Wenqi, Liu, Hui, others. (2023). Exploring the potential of large language models (llms) in learning on graphs. arXiv preprint arXiv:2307.03393.

[huang2023can_nc] Huang, Jin, Zhang, Xingjian, Mei, Qiaozhu, Ma, Jiaqi. (2023). Can llms effectively leverage graph structural information: when and why. arXiv preprint arXiv:2309.16595.

[chen2023label_nc] Chen, Zhikai, Mao, Haitao, Wen, Hongzhi, Han, Haoyu, Jin, Wei, Zhang, Haiyang, Liu, Hui, Tang, Jiliang. (2023). Label-free node classification on graphs with large language models (llms). arXiv preprint arXiv:2310.04668.

[yu2023empower_nc] Yu, Jianxiang, Ren, Yuxiang, Gong, Chenghua, Tan, Jiaqi, Li, Xiang, Zhang, Xuecang. (2023). Empower text-attributed graphs learning with large language models (llms). arXiv preprint arXiv:2310.09872.

[qin2023disentangled_nc] Qin, Yijian, Wang, Xin, Zhang, Ziwei, Zhu, Wenwu. (2023). Disentangled representation learning with large language models for text-attributed graphs. arXiv preprint arXiv:2310.18152.

[sun2023large_nc] Sun, Shengyin, Ren, Yuxiang, Ma, Chen, Zhang, Xuecang. (2023). Large Language Models as Topological Structure Enhancers for Text-Attributed Graphs. arXiv preprint arXiv:2311.14324.

[zhao2023gimlet_gc] Zhao, Haiteng, Liu, Shengchao, Ma, Chang, Xu, Hannan, Fu, Jie, Deng, Zhi-Hong, Kong, Lingpeng, Liu, Qi. (2023). GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning. bioRxiv.

[qian2023can_gc] Qian, Chen, Tang, Huayi, Yang, Zhirui, Liang, Hong, Liu, Yong. (2023). Can large language models empower molecular property prediction?. arXiv preprint arXiv:2307.07443.

[tian2023graph_kg] Tian, Yijun, Song, Huan, Wang, Zichen, Wang, Haozhu, Hu, Ziqing, Wang, Fang, Chawla, Nitesh V, Xu, Panpan. (2023). Graph neural prompting with large language models. arXiv preprint arXiv:2309.15427.

[luo2023reasoning_rog_kg] Luo, Linhao, Li, Yuan-Fang, Haffari, Gholamreza, Pan, Shirui. (2023). Reasoning on graphs: Faithful and interpretable large language model reasoning. arXiv preprint arXiv:2310.01061.

[jiang2023structgpt_kg] Jiang, Jinhao, Zhou, Kun, Dong, Zican, Ye, Keming, Zhao, Wayne Xin, Wen, Ji-Rong. (2023). Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645.

[lewis2020retrieval_rag] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, K{. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.

[reimers2019sentence_bert] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

[bienstock1993note_pcst] Bienstock, Daniel, Goemans, Michel X, Simchi-Levi, David, Williamson, David. (1993). A note on the prize collecting traveling salesman problem. Mathematical programming.

[lester2021power_prompt_tuning] Lester, Brian, Al-Rfou, Rami, Constant, Noah. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

[li2021prefix_tuning] Li, Xiang Lisa, Liang, Percy. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

[zhu2023minigpt] Zhu, Deyao, Chen, Jun, Shen, Xiaoqian, Li, Xiang, Elhoseiny, Mohamed. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

[liu2023visual_llava] Liu, Haotian, Li, Chunyuan, Wu, Qingyang, Lee, Yong Jae. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.

[wu2023next_chat] Wu, Shengqiong, Fei, Hao, Qu, Leigang, Ji, Wei, Chua, Tat-Seng. (2023). Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519.

[hu2021lora] Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[zhang2023llama_adapter] Zhang, Renrui, Han, Jiaming, Zhou, Aojun, Hu, Xiangfei, Yan, Shilin, Lu, Pan, Li, Hongsheng, Gao, Peng, Qiao, Yu. (2023). Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.

[velivckovic2017graph_gat] Veli{\v{c. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.

[saha2021explagraphs] Saha, Swarnadeep, Yadav, Prateek, Bauer, Lisa, Bansal, Mohit. (2021). ExplaGraphs: An explanation graph generation task for structured commonsense reasoning. arXiv preprint arXiv:2104.07644.

[hudson2019gqa] Hudson, Drew A, Manning, Christopher D. (2019). Gqa: A new dataset for real-world visual reasoning and compositional question answering. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[yih2016value_webqsp] Yih, Wen-tau, Richardson, Matthew, Meek, Christopher, Chang, Ming-Wei, Suh, Jina. (2016). The value of semantic parse labeling for knowledge base question answering. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

[luo2023reasoning_rog] Luo, Linhao, Li, Yuan-Fang, Haffari, Gholamreza, Pan, Shirui. (2023). Reasoning on graphs: Faithful and interpretable large language model reasoning. arXiv preprint arXiv:2310.01061.

[huang2023survey_hallucination] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.

[yin2023survey_mmlm] Yin, Shukang, Fu, Chaoyou, Zhao, Sirui, Li, Ke, Sun, Xing, Xu, Tong, Chen, Enhong. (2023). A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549.

[team2023gemini] Team, Gemini, Anil, Rohan, Borgeaud, Sebastian, Wu, Yonghui, Alayrac, Jean-Baptiste, Yu, Jiahui, Soricut, Radu, Schalkwyk, Johan, Dai, Andrew M, Hauth, Anja, others. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[fuyu_8b] Bavishi, Rohan, Elsen, Erich, Hawthorne, Curtis, Nye, Maxwell, Odena, Augustus, Somani, Arushi, Ta\c{s. (2023). Introducing our Multimodal Models.

[2023GPT4VisionSC] . (2023). GPT-4V(ision) System Card.

[touvron2023llama_llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[loshchilov2017decoupled_adamw] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[kipf2016semi_gcn] Kipf, Thomas N, Welling, Max. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

[shi2020masked_graph_trans] Shi, Yunsheng, Huang, Zhengjie, Feng, Shikun, Zhong, Hui, Wang, Wenjin, Sun, Yu. (2020). Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509.

[perozzi2024let_graphtoken] Perozzi, Bryan, Fatemi, Bahare, Zelle, Dustin, Tsitsulin, Anton, Kazemi, Mehran, Al-Rfou, Rami, Halcrow, Jonathan. (2024). Let Your Graph Do the Talking: Encoding Structured Data for LLMs. arXiv preprint arXiv:2402.05862.

[fatemi2023talk_graph_qa] Fatemi, Bahare, Halcrow, Jonathan, Perozzi, Bryan. (2023). Talk like a graph: Encoding graphs for large language models. arXiv preprint arXiv:2310.04560.

[sun2024thinkongraph] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, Jian Guo. (2024). Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. The Twelfth International Conference on Learning Representations.

[baek-etal-2023-knowledge_kaping] Minki Kang, Jin Myung Kwak, Jinheon Baek, Sung Ju Hwang. (2023). Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation. Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE). doi:10.18653/v1/2023.nlrse-1.1.

[wei2022chain_cot] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems.

[kojima2022large_zero_cot] Kojima, Takeshi, Gu, Shixiang Shane, Reid, Machel, Matsuo, Yutaka, Iwasawa, Yusuke. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems.

[bibx1] Daniel Bienstock, Michel X Goemans, David Simchi-Levi and David Williamson “A note on the prize collecting traveling salesman problem” In Mathematical programming 59.1-3 Springer, 1993, pp. 413–420

[bibx2] Ziwei Chai et al. “Graphllm: Boosting graph reasoning ability of large language model” In arXiv preprint arXiv:2310.05845, 2023

[bibx3] Zhikai Chen et al. “Exploring the potential of large language models (llms) in learning on graphs” In arXiv preprint arXiv:2307.03393, 2023

[bibx4] Zhikai Chen et al. “Label-free node classification on graphs with large language models (llms)” In arXiv preprint arXiv:2310.04668, 2023

[bibx5] Yunfan Gao et al. “Retrieval-Augmented Generation for Large Language Models: A Survey” In arXiv preprint arXiv:2312.10997, 2023

[bibx6] Xiaoxin He et al. “Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning” In arXiv preprint arXiv:2305.19523, 2023

[bibx7] Chinmay Hegde, Piotr Indyk and Ludwig Schmidt “A nearly-linear time framework for graph-structured sparsity” In International Conference on Machine Learning, 2015, pp. 928–937 PMLR

[bibx8] Edward J Hu et al. “Lora: Low-rank adaptation of large language models” In arXiv preprint arXiv:2106.09685, 2021

[bibx9] Jin Huang, Xingjian Zhang, Qiaozhu Mei and Jiaqi Ma “Can llms effectively leverage graph structural information: when and why” In arXiv preprint arXiv:2309.16595, 2023

[bibx10] Lei Huang et al. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions”, 2023 arXiv:2311.05232 [cs.CL]

[bibx11] Drew A Hudson and Christopher D Manning “Gqa: A new dataset for real-world visual reasoning and compositional question answering” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709

[bibx12] Jinhao Jiang et al. “Structgpt: A general framework for large language model to reason over structured data” In arXiv preprint arXiv:2305.09645, 2023

[bibx13] Bowen Jin et al. “Large Language Models on Graphs: A Comprehensive Survey” In arXiv preprint arXiv:2312.02783, 2023

[bibx14] Thomas N Kipf and Max Welling “Semi-supervised classification with graph convolutional networks” In arXiv preprint arXiv:1609.02907, 2016

[bibx15] Bin Lei, Chunhua Liao and Caiwen Ding “Boosting logical reasoning in large language models through a new framework: The graph of thought” In arXiv preprint arXiv:2308.08614, 2023

[bibx16] Brian Lester, Rami Al-Rfou and Noah Constant “The power of scale for parameter-efficient prompt tuning” In arXiv preprint arXiv:2104.08691, 2021

[bibx17] Patrick Lewis et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks” In Advances in Neural Information Processing Systems 33, 2020, pp. 9459–9474

[bibx18] Xiang Lisa Li and Percy Liang “Prefix-tuning: Optimizing continuous prompts for generation” In arXiv preprint arXiv:2101.00190, 2021

[bibx19] Xin Li et al. “Graphadapter: Tuning vision-language models with dual knowledge graph” In arXiv preprint arXiv:2309.13625, 2023

[bibx20] Yuhan Li et al. “A survey of graph meets large language model: Progress and future directions” In arXiv preprint arXiv:2311.12399, 2023

[bibx21] Hao Liu et al. “One for All: Towards Training One Graph Model for All Classification Tasks” In arXiv preprint arXiv:2310.00149, 2023

[bibx22] Haotian Liu, Chunyuan Li, Qingyang Wu and Yong Jae Lee “Visual instruction tuning” In arXiv preprint arXiv:2304.08485, 2023

[bibx23] Ilya Loshchilov and Frank Hutter “Decoupled weight decay regularization” In arXiv preprint arXiv:1711.05101, 2017

[bibx24] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari and Shirui Pan “Reasoning on graphs: Faithful and interpretable large language model reasoning” In arXiv preprint arXiv:2310.01061, 2023

[bibx25] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari and Shirui Pan “Reasoning on graphs: Faithful and interpretable large language model reasoning” In arXiv preprint arXiv:2310.01061, 2023

[bibx26] Shirui Pan, Yizhen Zheng and Yixin Liu “Integrating Graphs with Large Language Models: Methods and Prospects” In arXiv preprint arXiv:2310.05499, 2023

[bibx27] Chen Qian et al. “Can large language models empower molecular property prediction?” In arXiv preprint arXiv:2307.07443, 2023

[bibx28] Yijian Qin, Xin Wang, Ziwei Zhang and Wenwu Zhu “Disentangled representation learning with large language models for text-attributed graphs” In arXiv preprint arXiv:2310.18152, 2023

[bibx29] Nils Reimers and Iryna Gurevych “Sentence-bert: Sentence embeddings using siamese bert-networks” In arXiv preprint arXiv:1908.10084, 2019

[bibx30] Swarnadeep Saha, Prateek Yadav, Lisa Bauer and Mohit Bansal “ExplaGraphs: An explanation graph generation task for structured commonsense reasoning” In arXiv preprint arXiv:2104.07644, 2021

[bibx31] Yunsheng Shi et al. “Masked label prediction: Unified message passing model for semi-supervised classification” In arXiv preprint arXiv:2009.03509, 2020

[bibx32] Shengyin Sun, Yuxiang Ren, Chen Ma and Xuecang Zhang “Large Language Models as Topological Structure Enhancers for Text-Attributed Graphs” In arXiv preprint arXiv:2311.14324, 2023

[bibx33] Jiabin Tang et al. “Graphgpt: Graph instruction tuning for large language models” In arXiv preprint arXiv:2310.13023, 2023

[bibx34] Yijun Tian et al. “Graph neural prompting with large language models” In arXiv preprint arXiv:2309.15427, 2023

[bibx35] Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models” In arXiv preprint arXiv:2307.09288, 2023

[bibx36] Petar Veličković et al. “Graph attention networks” In arXiv preprint arXiv:1710.10903, 2017

[bibx37] Heng Wang et al. “Can Language Models Solve Graph Problems in Natural Language?” In arXiv preprint arXiv:2305.10037, 2023

[bibx38] Shengqiong Wu et al. “Next-gpt: Any-to-any multimodal llm” In arXiv preprint arXiv:2309.05519, 2023

[bibx39] Ruosong Ye et al. “Natural language is all a graph needs” In arXiv preprint arXiv:2308.07134, 2023

[bibx40] Wen-tau Yih et al. “The value of semantic parse labeling for knowledge base question answering” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 201–206

[bibx41] Minji Yoon, Jing Yu Koh, Bryan Hooi and Ruslan Salakhutdinov “Multimodal Graph Learning for Generative Tasks” In arXiv preprint arXiv:2310.07478, 2023

[bibx42] Jianxiang Yu et al. “Empower text-attributed graphs learning with large language models (llms)” In arXiv preprint arXiv:2310.09872, 2023

[bibx43] Junchi Yu, Ran He and Rex Ying “Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models” In arXiv preprint arXiv:2310.03965, 2023

[bibx44] Jiawei Zhang “Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT” In arXiv preprint arXiv:2304.11116, 2023

[bibx45] Renrui Zhang et al. “Llama-adapter: Efficient fine-tuning of language models with zero-init attention” In arXiv preprint arXiv:2303.16199, 2023

[bibx46] Ziwei Zhang et al. “Graph Meets LLMs: Towards Large Graph Models”, 2023 arXiv:2308.14522 [cs.LG]

[bibx47] Haiteng Zhao et al. “GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning” In bioRxiv Cold Spring Harbor Laboratory, 2023, pp. 2023–05

[bibx48] Jianan Zhao et al. “Graphtext: Graph reasoning in text space” In arXiv preprint arXiv:2310.01089, 2023

[bibx49] Deyao Zhu et al. “Minigpt-4: Enhancing vision-language understanding with advanced large language models” In arXiv preprint arXiv:2304.10592, 2023