Graph MLP-Mixer
Ziyue Qi, Zixuan Lu, Yuchen Lu
School of Computing and Information, University of Pittsburgh, Pittsburgh, PA
Abstract
Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, that can be solved using global attention but significantly increases the computational cost to quadratic complexity. In this work, we propose an alternative approach to overcome these structural limitations by leveraging the ViT/MLP-Mixer architectures introduced in computer vision. We introduce a new class of GNNs, called Graph ViT/MLP-Mixer, that holds three key properties. First, they capture long-range dependency and mitigate the issue of over-squashing as demonstrated on the Long Range Graph Benchmark and TreeNeighbourMatch datasets. Second, they offer better speed and memory efficiency with a complexity linear in the number of nodes and edges, surpassing the related Graph Transformer and expressive GNN models. Third, they show high expressivity in terms of graph isomorphism as they can distinguish at least 3-WL non-isomorphic graphs. We test our architecture on 4 simulated datasets and 7 real-world benchmarks, and show highly competitive results on all of them. The source code is available for reproducibility at: https://github.com/XiaoxinHe/Graph-ViT-MLPMixer.
Generalizing ViT/MLP-Mixer to Graphs
Xiaoxin He 1 Bryan Hooi 1 Thomas Laurent 2 Adam Perold 3 Yann LeCun 4,5 Xavier Bresson 1
{xiaoxin, bhooi, xaviercs}@comp.nus.edu.sg, tlaurent@lmu.edu ap@elementresearch.com, yann@cs.nyu.edu
1 National University of Singapore 2 Loyola Marymount University 3 Element, Inc. 4 New York University 5 Meta AI
Message-Passing GNNs and the Limitations
In this first section, we present the background of the project by introducing the standard Message-Passing (MP) GNNs and their two major limitations: limited expressive power and poor long-range dependency. We also present the current techniques that address these issues, i.e. Weisfeiler-Leman GNNs, graph positional encoding and Graph Transformers, as well as their shortcomings.
Message-Passing GNNs (MP-GNNs). GNNs have become the standard learning architectures for graphs based on their flexibility to work with complex data domains s.a. recommendation [1, 2], chemistry [3, 4], physics [5, 6], transportation [7], vision [8], natural language processing (NLP) [9], knowledge graphs [10], drug design [11, 12] and the medical domain [13, 14]. Most GNNs are designed to have two core components. First, a structural message-passing mechanism s.a. Defferrard et al. [15], Kipf and Welling [16], Hamilton et al. [17], Monti et al. [1], Bresson and Laurent [18], Gilmer et al. [4], Veličković et al. [19] that computes node representations by aggregating the local 1-hop neighborhood information. Second, a stack of L layers that aggregates L-hop neighborhood nodes to increase the expressivity of the network and transmit information between nodes that are L hops apart.
Weisfeiler-Leman GNNs (WL-GNNs). One of the major limitations of MP-GNNs is their inability to distinguish (simple) non-isomorphic graphs. This limited expressivity can be formally analyzed with the Weisfeiler-Leman graph isomorphism test [20], as first proposed in Xu et al. [21], Morris et al. [22]. Later on, Maron et al. [23] introduced a general class of k-order WL-GNNs that can be proved to universally represent any class of k-WL graphs [24, 25]. But to achieve such expressivity, this class of GNNs requires using k-tuples of nodes with memory and speed complexities of O(N^k), with N being the number of nodes and k ≥ 3. Although the complexity can be reduced to O(N^2) and O(N^3) respectively [24, 25, 26], it is still computationally costly compared to the linear complexity O(E) of MP-GNNs, with E being the number of edges, which often reduces to O(N) for real-world graphs that exhibit sparse structures. In order to reduce the memory and speed complexities of WL-GNNs while keeping high expressivity, several works have focused on designing graph networks from their sub-structures s.a. sub-graph isomorphism [27], sub-graph routing mechanisms [28], cellular WL sub-graphs [29], and k-hop egonet sub-graphs [21, 30, 25, 31, 32].
Graph Positional Encoding (PE). Another aspect of the limited expressivity of GNNs is their inability to recognize simple graph structures s.a. cycles or cliques, which are often present in molecules and social graphs [33]. One could use k-order WL-GNNs with k equal to the cycle/clique length, but at high complexity O(N^k). An alternative approach is to add positional encoding to the graph nodes. It was proved in Murphy et al. [34], Loukas [35] that unique and equivariant PE increases the representation power of any MP-GNN while keeping the linear complexity. This theoretical result was applied with great empirical success with index PE [34], Laplacian eigenvectors [36, 37, 38, 39] and k-step random walks [40, 41]. All these graph PEs lead to GNNs strictly more powerful than the 1-WL test, which seems to be enough expressivity in practice [42]. However, none of the PEs proposed for graphs can provide a global position of the nodes that is unique, equivariant and distance-sensitive. This is due to the fact that a canonical positioning of nodes does not exist for arbitrary graphs, as there is no notion of up, down, left and right on graphs. For example, any embedding coordinate system like graph Laplacian eigenvectors [43] can flip up-down directions or left-right directions and still be a valid PE. This introduces ambiguities that the GNN is required to (learn to) be invariant to, with respect to the graph or PE symmetries. A well-known example is given by the eigenvectors: there exist 2^k possible sign flips for k eigenvectors, which the network must learn to handle.
Issue of long-range dependencies. Another major limitation of MP-GNNs is the well-known issue of long-range dependencies. Standard MP-GNNs require L layers to propagate information from one node to its L-hop neighborhood. This implies that the receptive field size for GNNs can grow exponentially, for example as O(2^L) for binary tree graphs. This causes over-squashing: exponentially growing information is compressed into a fixed-length vector by the aggregation mechanism [44, 45]. It is worth noting that the poor long-range modeling ability of deep GNNs can be caused by the combined effect of multiple factors, such as over-squashing, vanishing gradients, poor isomorphism expressivity, etc., but, in this work, we focus our effort on alleviating over-squashing, s.a. Deac et al. [46], Arnaiz-Rodríguez et al. [47]. Over-squashing has been well known since recurrent neural networks [48], and led to the development of the (self- and cross-)attention mechanisms, first for the translation task [49, 50] and then for more general NLP tasks [51, 52]. Transformer architectures are the most elaborate networks that leverage attention, and have gained great success in NLP and computer vision (CV). Several works have generalized the Transformer architecture to graphs, alleviating the issue of long-range dependencies and achieving competitive or superior performance against standard MP-GNNs. We highlight the most recent research works in the next paragraph.
Graph Transformers. Dwivedi and Bresson [37] generalize Transformers to graphs, with graph Laplacian eigenvectors as node PE, and incorporate graph structure into the permutation-invariant attention function. SAN and LSPE [38, 41] further improve with PE learned from Laplacian and random-walk operators. GraphiT [53] encodes relative PE derived from diffusion kernels into the attention mechanism. GraphTrans [54] adds Transformers on top of standard GNN layers. SAT [55] proposes a novel self-attention mechanism that incorporates structural information into the standard self-attention module by using a GNN to compute subgraph representations. Graphormer [56] introduces three structural encodings, with great success on large molecular benchmarks. GPS [57] categorizes the different types of PE and puts forward a hybrid MPNN+Transformer architecture. We refer to Min et al. [58] for an overview of graph-structured Transformers. Generally, most Graph Transformer architectures address the problems of over-squashing and limited long-range dependencies in GNNs, but they also significantly increase the complexity from O(E) to O(N^2), resulting in a computational bottleneck. A detailed description of related literature can be found in Appendix A.
Generalizing ViT/MLP-Mixer to Graphs
In the following, we explain the importance of generalizing the ViT/MLP-Mixer architectures from CV to graphs.
ViT and MLP-Mixer in computer vision. Transformers have gained remarkable success in CV and NLP, most notably with architectures like ViT [59] and BERT [51]. The success of Transformers has long been attributed to the attention mechanism [50], which is able to model long-range dependency by making "everything connected to everything". But recently, this prominent line of networks has been challenged by more cost-efficient alternatives. A novel family of models based on the MLP-Mixer introduced by Tolstikhin et al. [60] has emerged and gained recognition for its simplicity and efficient implementation. Overall, MLP-Mixer replaces the attention module with multi-layer perceptrons (MLPs), which are likewise not affected by over-squashing and poor long-range interaction. The original architecture is simple [60]; it takes image patches (or tokens) as inputs, encodes them with a linear layer (equivalent to a convolutional layer over the image patches), and updates their representations with a series of feed-forward layers applied alternately to image patches (or tokens) and features. These plain networks [60, 61, 62, 63] can perform competitively with state-of-the-art (SOTA) vision Transformers, which tends to indicate that attention is not the only important inductive bias; other elements, like the general architecture of Transformers with patch embedding, residual connection and layer normalization, and carefully-curated data augmentation techniques, seem to play essential roles as well [64].

Table 1: Differences between ViT/MLP-Mixer components for images and graphs.
The benefits of generalizing ViT/MLP-Mixer from grids to graphs. Standard MP-GNNs have linear learning/inference complexities but low representation power and poor long-range dependency. Graph Transformers address these two problems but lose the computational efficiency at the price of quadratic complexity. A generalization of ViT/MLP-Mixer to graphs overcomes the computational bottleneck of Graph Transformers and solves the issue of long-distance dependency, hence going beyond standard MP-GNNs. However, achieving such a successful generalization is challenging given the irregular and variable nature of graphs. See Section 3 for a detailed presentation of these challenges.
Contribution. Our contributions are listed as follows.
· We identify the key challenges to generalize ViT/MLP-Mixer from images to graphs and design a new efficient class of GNNs, namely Graph ViT/MLP-Mixer, that simultaneously captures long-range dependency, keeps linear speed/memory complexity, and achieves high graph isomorphic expressivity.
· We show competitive results on multiple benchmarks from Benchmarking GNNs [36] and the Open Graph Benchmark (OGB) [65]; specifically, with 0.073 MAE on ZINC and 0.7997 ROCAUC on MolHIV.
· We demonstrate the capacity of the proposed model to capture long-range dependencies with SOTA performance on the Long Range Graph Benchmark (LRGB) [66] while keeping low complexity, to mitigate the over-squashing issue on the TreeNeighbourMatch dataset [44], and to reach the 3-WL expressive power on the SR25 dataset [67].
· Our approach forms a bridge between CV, NLP and graphs under a unified architecture, which can potentially benefit cross-domain collaborations to design better networks.
Generalization Challenges
We list the main questions when adapting MLP-Mixer from images to graphs in the following and in Table 1.
(1) How to define and extract graph patches/tokens? One notable geometrical property that distinguishes graph-structured data from regularly structured data, such as images and sequences, is that there does not in general exist a canonical grid to embed graphs. As shown in Table 1, images are supported by a regular lattice, which can be easily split into multiple grid-like patches (also referred to as tokens) of the same size via fast pixel reordering. However, graph data is irregular: the number of nodes and edges typically differs between graphs. Hence, graphs cannot be uniformly divided into similar patches across all examples in the dataset. Finally, the extraction process for graph patches cannot be uniquely defined given the lack of a canonical graph embedding. This raises the question of how to identify meaningful graph tokens and extract them quickly.
(2) How to encode graph patches into a vectorial representation? Since images can be reshaped into patches of the same size, they can be linearly encoded with an MLP, or equivalently with a convolutional layer with kernel size and stride values equal to the patch size. However, graph patches are not all the same size: they have variable topology with different numbers of nodes, edges and connectivity. Another important difference is the absence of a unique node ordering for graphs, which constrains the process to be invariant to node re-indexing for generalization purposes. In summary, we need a process that can transform graph patches into a fixed-length vectorial representation for arbitrary subgraph structures while being permutation invariant.
(3) How to preserve positional information for nodes and graph patches? As shown in Table 1, image patches in the sequence have implicit positions since image data is always ordered the same way due to its unique embedding in the Euclidean space. For instance, the image patch at the upper-left corner is always the first one in the sequence and the image patch at the bottom-right corner is the last one. On this basis, the token mixing operation of the MLP-Mixer is able to fuse the same patch information. However, graphs are naturally unaligned, and the set of graph patches is therefore unordered. We face a similar issue when we consider the positions of nodes within each graph patch. In images, the pixels in each patch are always ordered the same way; in contrast, nodes in graph tokens are naturally unordered. Thus, how do we preserve local and global positional consistency for nodes and graph patches?
(4) How to reduce over-fitting for Graph ViT/MLP-Mixer? ViT/MLP-Mixer architectures are known to be strong over-fitters [62]. Most MLP-variants [60, 61, 63] first pre-train on large-scale datasets, and then fine-tune on downstream tasks, coupled with a rich set of data augmentation and regularization techniques, e.g. cropping, random horizontal flipping, RandAugment [68], mixup [69], etc. While data augmentation has drawn much attention in CV and NLP, graph data augmentation methods are not yet as effective, despite interest and work on this topic [70]. Variable numbers of nodes, edges and connectivity make graph augmentation challenging. Thus, how do we augment graph-structured data given this nature of graphs?
Proposed Architecture
Overview
The basic architecture of Graph MLP-Mixer is illustrated in Figure 1. The goal of this section is to detail the choices we made to implement each component of the architecture. On the whole, these choices lead to a simple framework that delivers both speed and quality.
Notation. Let G = (V, E) be a graph with V the set of nodes and E the set of edges. The graph has N = |V| nodes and E = |E| edges. The connectivity of the graph is represented by the adjacency matrix A ∈ R^{N×N}. The node features of node i are denoted by h_i, while the features of an edge between nodes i and j are denoted by e_ij. Let {V_1, ..., V_P} be the node partition, P the pre-defined number of patches, and G_i = (V_i, E_i) the induced subgraph of G with all the nodes in V_i and all the edges whose endpoints belong to V_i. Let h_G be the graph-level representation and y_G the graph-level target.
Patch Extraction
When generalizing MLP-Mixer to graphs, the first step is to extract patches. This extraction is straightforward for images. Indeed, all image data x ∈ R^{H×W×C} are defined on a regular grid with the same fixed resolution (H, W), where H and W are respectively the height and the width, and C is the number of channels. Hence, all images can be easily reshaped into a sequence of flattened patches x_p ∈ R^{P×(R^2 C)}, where (R, R) is the resolution of each image patch, and P = HW/R^2 is the resulting number of patches.
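As a sanity check of this arithmetic, the reshape can be written out directly (a toy numpy sketch, not from the paper's code; the shapes and values are purely illustrative):

```python
# An H x W x C image split into non-overlapping (R, R) patches yields
# P = H*W / R**2 flattened patches of length R*R*C each.
import numpy as np

H, W, C, R = 8, 8, 3, 4
img = np.arange(H * W * C, dtype=float).reshape(H, W, C)

# Reorder pixels into R x R blocks, then flatten each block.
patches = (img.reshape(H // R, R, W // R, R, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, R * R * C))

P = H * W // R**2
assert patches.shape == (P, R * R * C)  # (4, 48)
```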
Unlike images with fixed resolution, extracting graph patches is more challenging, see Table 1. Generally, graphs have different sizes, i.e. numbers of nodes, and therefore cannot be uniformly divided like image data. Additionally, meaningful sub-graphs must be identified, in the sense that the nodes and edges composing a patch must share similar semantics or information, s.a. a community of friends sharing a biking interest in a social network. As such, a graph patch extraction process must satisfy the following conditions: 1) the same extraction algorithm can be applied to any arbitrary graph, 2) the nodes in a sub-graph patch must be more closely connected to each other than to those outside the patch, and 3) the extraction must be fast, with complexity at most linear w.r.t. the number of edges, i.e. O(E).

Figure 1: The basic architecture of the proposed Graph ViT/MLP-Mixer. It consists of a patch extraction module, a patch embedding module, a sequence of mixer layers, a global average pooling, and a classifier head. The patch extraction module partitions graphs into overlapping patches. The patch embedding module transforms these graph patches into corresponding token representations, which are fed into a sequence of mixer layers to generate the output tokens. A global average pooling layer followed by a fully-connected layer is finally used for prediction. Each Mixer Layer, MLP or graph-based multi-head attention (gMHA), is a residual network that alternates between a Token Mixer applied to all patches, and a Channel Mixer applied to each patch independently (see right side).
METIS [71] is a graph clustering algorithm with one of the best trade-offs between accuracy and speed. It partitions a graph into a pre-defined number of clusters such that the number of within-cluster links is much higher than the number of between-cluster links, in order to capture good community structure. For these fine properties, we select it as our patch extraction algorithm. However, METIS is limited to finding non-overlapping clusters, as visualized in Figure 1. In this example, METIS partitions the graph into four non-overlapping parts, i.e. {1, 2, 3}, {4, 5, 6}, {7, 8, 9} and {10, 11, 12}, resulting in 5 edge cuts. Unlike images, extracting non-overlapping patches could imply losing important edge information, i.e. the cut edges, and thus decreasing the predictive performance, as we will observe experimentally. To overcome this issue and to retain all original edges, we allow graph patches to overlap with each other. For example in Figure 1, if the source and destination nodes of an edge are not in the same patch, we assign both nodes to the patches they belong to. As such, node 3 and node 4 are in two different patches, here the blue and red one, but are connected with each other. After our overlapping adjustment, these two nodes belong to both the blue and red patches. This practice is equivalent to expanding the graph patches to the one-hop neighbourhood of all nodes in that patch. Formally, METIS is first applied to partition a graph into P non-overlapping patches {V_1, ..., V_P} such that V = V_1 ∪ ... ∪ V_P and V_i ∩ V_j = ∅, ∀ i ≠ j. Then, patches are expanded to their one-hop neighbourhood in order to preserve the information of between-patch links and make use of all graph edges: V_i ← V_i ∪ {N_1(j) | j ∈ V_i}, where N_k(j) denotes the k-hop neighbourhood of node j.
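The overlap expansion can be sketched as follows, assuming the non-overlapping partition is already given (a pure-Python illustration; METIS itself is not reimplemented here, and `expand_patches` is a hypothetical helper name):

```python
# Grow each patch V_i to the 1-hop neighbourhood of its nodes, so that
# every cut edge ends up with both endpoints inside at least one patch.
def expand_patches(edges, partition):
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    expanded = []
    for part in partition:
        grown = set(part)
        for j in part:
            grown |= nbrs.get(j, set())  # V_i <- V_i ∪ {N_1(j) | j ∈ V_i}
        expanded.append(grown)
    return expanded

# Path graph 1-2-3-4-5-6 cut into two non-overlapping halves:
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
patches = expand_patches(edges, [{1, 2, 3}, {4, 5, 6}])
assert patches == [{1, 2, 3, 4}, {3, 4, 5, 6}]  # cut edge (3,4) is now kept
```

Note that after expansion both endpoints of the cut edge (3, 4) appear together in each patch, so no edge information is lost.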
Patch Encoder
For images, patch encoding can be done with a simple linear transformation given the fixed resolution of all image patches. This operation is fast and well-defined. For graphs, the patch encoder network must be able to handle a complex data structure, i.e. be invariant to index permutation, cope with heterogeneous neighborhoods and variable patch sizes, perform convolution on graphs, and be expressive enough to differentiate graph isomorphisms. As a result, the graph patch encoder is a GNN, whose architecture is designed to best transform a graph token G_p into a fixed-size representation x_{G_p} in 3 steps.
Step 1. Raw node and edge linear embedding. The input node features α_i ∈ R^{d_n×1} and edge features β_ij ∈ R^{d_e×1} are linearly projected into d-dimensional hidden features:
$$
h_i^{\ell=0} = U^0 \alpha_i + u^0, \qquad e_{ij}^{\ell=0} = V^0 \beta_{ij} + v^0, \qquad (1)
$$
where U^0 ∈ R^{d×d_n}, V^0 ∈ R^{d×d_e} and u^0, v^0 ∈ R^d are learnable parameters.
Step 2. Graph convolutional layers with MP-GNN. We apply a series of L convolution layers, where the node and edge representations are updated with an MP-GNN applied to each graph patch G_p = (V_p, E_p) as follows:
$$
h_{i,p}^{\ell+1} = f_{\text{node}}\big(h_{i,p}^{\ell}, \{h_{j,p}^{\ell}\}_{j \in \mathcal{N}(i)}, e_{ij,p}^{\ell}\big) + g_{\text{patch-node}}\big(h_p^{\ell}\big), \qquad
e_{ij,p}^{\ell+1} = f_{\text{edge}}\big(h_{i,p}^{\ell}, h_{j,p}^{\ell}, e_{ij,p}^{\ell}\big) + g_{\text{patch-edge}}\big(e_p^{\ell}\big), \qquad (2)
$$
where ℓ is the layer index, p is the patch index, i, j denote the nodes, N(i) is the neighborhood of node i, the functions f_node and f_edge (with learnable parameters) define any arbitrary MP-GNN architecture s.a. [16, 18, 72, 37], h_p^ℓ = (1/|V_p|) Σ_{i∈V_p} h_{i,p}^ℓ ∈ R^d and e_p^ℓ = (1/|E_p|) Σ_{ij∈E_p} e_{ij,p}^ℓ ∈ R^d are respectively the mean representations of the patch nodes and patch edges, and g_patch-node, g_patch-edge are MLP-based functions that act on h_p^ℓ and e_p^ℓ. For each node and edge covered by more than one patch, due to the patch overlapping used to include all edges cut by METIS, we update the node/edge representation by averaging over the overlapping patches:
$$
h_i^{\ell} = \frac{1}{|\{k \mid i \in \mathcal{V}_k\}|} \sum_{k \,\mid\, i \in \mathcal{V}_k} h_{i,k}^{\ell}, \qquad (3)
$$
$$
e_{ij}^{\ell} = \frac{1}{|\{k \mid ij \in \mathcal{E}_k\}|} \sum_{k \,\mid\, ij \in \mathcal{E}_k} e_{ij,k}^{\ell}, \qquad (4)
$$
where {k | i ∈ V_k} and {k | ij ∈ E_k} are the sets of all patches that cover node i and edge ij, respectively.
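The overlap averaging can be sketched as below (an illustrative helper, not from the released code): a node covered by several patches gets the mean of its per-patch representations.

```python
import numpy as np

def average_over_patches(h, patch_nodes):
    """h[p][i] is the d-dim rep of node i inside patch p; returns one rep per node."""
    sums, counts = {}, {}
    for p, nodes in enumerate(patch_nodes):
        for i in nodes:
            sums[i] = sums.get(i, 0.0) + h[p][i]
            counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}

# Node 3 appears in both patches with reps [1, 1] and [3, 3] -> mean [2, 2].
h = [{1: np.array([0., 0.]), 3: np.array([1., 1.])},
     {3: np.array([3., 3.]), 4: np.array([5., 5.])}]
out = average_over_patches(h, [{1, 3}, {3, 4}])
assert np.allclose(out[3], [2., 2.])
```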
Step 3. Pooling and readout. The final step is mean pooling all node vectors in G_p, such that h_p = (1/|V_p|) Σ_{i∈V_p} h_{i,p}^{ℓ=L} ∈ R^d, and applying a small MLP to get the fixed-size patch embedding x_{G_p} ∈ R^d.
Observe that the patch encoder is an MP-GNN, and thus inherits the limitation of poor long-range dependency. Does this prevent the Graph MLP-Mixer from capturing long-range interactions? No: the problem only arises for large graphs, and for small patch graphs it does not really exist (or is negligible). To give a few numbers, the mean number of nodes and the mean diameter of graph patches are around 3.15 and 1.82 respectively for molecular datasets, and around 17.20 and 3.07 for image datasets, see Table 7.
Positional Information
Regular grids offer a natural implicit arrangement for the sequence of image patches and for the pixels inside the image patches. However, such an ordering of nodes and patches does not exist for general graphs. This lack of positional information reduces the expressivity of the network. Hence, we use two explicit positional encodings (PE): one absolute PE for the patch nodes and one relative PE for the graph patches.
Node PE. Input node features in Eq (1) are augmented with p_i ∈ R^K, with a learnable matrix T^0 ∈ R^{d×K}:
$$
h_i^{\ell=0} = U^0 \alpha_i + u^0 + T^0 p_i. \qquad (5)
$$
The benefits of different PEs are dataset dependent. We follow the strategy in [57] that uses random-walk structural encoding (RWSE) [41] for molecular data and Laplacian eigenvectors encodings [36] for image superpixels. Since Laplacian eigenvectors are defined up to sign flips, the sign of the eigenvectors is randomly flipped during training.
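A minimal sketch of the k-step random-walk encoding, assuming the common definition RW = D^{-1}A (the RWSE literature writes AD^{-1}, but the diagonals of the two matrix powers coincide, and node i's encoding is the vector of return probabilities (RW^k)_ii for k = 1..K; the function name is illustrative):

```python
import numpy as np

def rwse(A, K):
    """Random-walk structural encoding: (N, K) matrix of k-step return probs."""
    deg = A.sum(axis=1, keepdims=True)
    RW = A / deg                        # row-stochastic random-walk matrix
    diags, M = [], np.eye(len(A))
    for _ in range(K):
        M = M @ RW
        diags.append(np.diag(M).copy()) # (RW^k)_ii for every node i
    return np.stack(diags, axis=1)

# Triangle graph: a 1-step walk never returns (prob 0); a 2-step walk
# returns with probability 1/2, identically for all nodes by symmetry.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
pe = rwse(A, 2)
assert np.allclose(pe[:, 0], 0.0)
assert np.allclose(pe[:, 1], 0.5)
```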
Patch PE. Relative positional information between the graph patches can be computed from the original graph adjacency matrix A ∈ R^{N×N} and the clusters {V_1, ..., V_P} extracted by METIS in Section 4.2. Specifically, we capture relative positional information via the coarsened adjacency matrix A_P ∈ R^{P×P} over the patch graphs:
$$
(A_P)_{ij} = \text{Cut}(\mathcal{V}_i, \mathcal{V}_j), \qquad i, j = 1, \dots, P, \qquad (6)
$$
where Cut(V_i, V_j) = Σ_{k∈V_i} Σ_{l∈V_j} A_kl is the standard graph cut operator, which counts the number of connecting edges between cluster V_i and cluster V_j.
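This coarsening can be computed directly from the definition (an illustrative numpy sketch; `coarsen` is a hypothetical helper name):

```python
import numpy as np

def coarsen(A, partition):
    """(A_P)_{ij} = Cut(V_i, V_j) = sum of A_{kl} over k in V_i, l in V_j."""
    P = len(partition)
    A_P = np.zeros((P, P))
    for i, Vi in enumerate(partition):
        for j, Vj in enumerate(partition):
            A_P[i, j] = A[np.ix_(sorted(Vi), sorted(Vj))].sum()
    return A_P

# Path 0-1-2-3 split into {0,1} and {2,3}: one cut edge (1,2).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_P = coarsen(A, [{0, 1}, {2, 3}])
assert A_P[0, 1] == 1  # one edge between the two patches
```

Note that diagonal entries count each within-patch edge twice, since A is symmetric.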
We extract the positional encoding p̂_i ∈ R^{K̂} at the patch level, similarly to the node level, and inject it (after a linear transformation) into the first layer of the mixer block:
$$
x_i \leftarrow x_i + \hat{T}^0 \hat{p}_i, \qquad \hat{T}^0 \in \mathbb{R}^{d \times \hat{K}}, \qquad (7)
$$
where x i is the patch embedding.
Mixer Layer
For images, the original mixer layer [60] is a simple network that alternates channel and token mixing steps. The token mixing step is performed over the token dimension, while the channel mixing step is carried out over the channel
dimension. These two interleaved steps enable information fusion among tokens and channels. The simplicity of the mixer layer has led to a significant reduction in computational cost with little or no sacrifice in performance. Indeed, the self-attention mechanism in ViT requires O(P^2) memory and O(P^2) computation, while the mixer layer in MLP-Mixer needs only O(P) memory and O(P) computation.
Let X ∈ R^{P×d} be the patch embeddings {x_{G_1}, ..., x_{G_P}}; the graph mixer layer can then be expressed as
$$
U_{\cdot,i} = X_{\cdot,i} + W_2\, \sigma\big(W_1\, \text{LayerNorm}(X)_{\cdot,i}\big), \quad i = 1, \dots, d, \qquad
Y_{j,\cdot} = U_{j,\cdot} + W_4\, \sigma\big(W_3\, \text{LayerNorm}(U)_{j,\cdot}\big), \quad j = 1, \dots, P, \qquad (8)
$$
where σ is a GELU nonlinearity [73], LayerNorm(·) is layer normalization [74], and W_1 ∈ R^{d_s×P}, W_2 ∈ R^{P×d_s}, W_3 ∈ R^{d_c×d}, W_4 ∈ R^{d×d_c} are learnable matrices, with d_s and d_c the tunable hidden widths of the token-mixing and channel-mixing MLPs.
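The two mixing steps can be sketched in numpy with the shapes given above (inline tanh-approximation GELU and LayerNorm; weights and names are illustrative, not the trained model):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layernorm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def mixer_layer(X, W1, W2, W3, W4):
    U = X + W2 @ gelu(W1 @ layernorm(X))      # token mixing, over the P patches
    Y = U + gelu(layernorm(U) @ W3.T) @ W4.T  # channel mixing, over the d features
    return Y

rng = rng_src = np.random.default_rng(0)
P, d, ds, dc = 4, 8, 6, 16
X = rng.normal(size=(P, d))
Y = mixer_layer(X,
                rng.normal(size=(ds, P)), rng.normal(size=(P, ds)),
                rng.normal(size=(dc, d)), rng.normal(size=(d, dc)))
assert Y.shape == (P, d)  # residual structure preserves the (P, d) shape
```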
Alternatively, we can formulate a graph transformer layer to incorporate the self-attention mechanism as in ViT:
$$
U = X + \text{gMHA}\big(\text{LayerNorm}(X)\big), \qquad
Y_{j,\cdot} = U_{j,\cdot} + W_4\, \sigma\big(W_3\, \text{LayerNorm}(U)_{j,\cdot}\big), \quad j = 1, \dots, P, \qquad (9)
$$
where gMHA (graph-based multi-head attention) is designed to capture token dependencies based on the given graph topology. In Eq. (9), gMHA is defined as ( A_P ⊙ softmax( QK^T/√d ) ) V, but other options are possible to characterize the gMHA mechanism, as studied in Appendix F.
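A single-head numpy sketch of this gMHA variant (the multi-head splitting and output projection of the actual model are omitted; names are illustrative):

```python
import numpy as np

def gmha(X, A_P, Wq, Wk, Wv):
    """Attention weighted elementwise by the coarsened patch adjacency A_P."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    S = np.exp(S - S.max(axis=-1, keepdims=True))      # stable softmax
    attn = S / S.sum(axis=-1, keepdims=True)           # softmax(QK^T / sqrt(d))
    return (A_P * attn) @ V                            # A_P ⊙ softmax(...) V

rng = np.random.default_rng(1)
P, d = 3, 4
X = rng.normal(size=(P, d))
A_P = np.array([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])  # toy patch adjacency
out = gmha(X, A_P, *(rng.normal(size=(d, d)) for _ in range(3)))
assert out.shape == (P, d)
```

Zero entries of A_P suppress the corresponding token interactions, so attention only flows between adjacent patches.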
Then we generate the final graph-level representation by mean pooling all the non-empty patches:
$$
h_G = \frac{\sum_{p=1}^{P} m_p\, x_{G_p}}{\sum_{p=1}^{P} m_p}, \qquad (10)
$$
where m p is a binary variable with value 1 for non-empty patches and value 0 for empty patches (since graphs have variable sizes, small graphs can produce empty patches). Finally, we apply a small MLP to get the graph-level target:
$$
y_G = \text{MLP}(h_G). \qquad (11)
$$
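The masked readout and head can be written out as follows (a single linear layer stands in for the small MLP; purely illustrative):

```python
import numpy as np

def readout(X, mask, W, b):
    """Mean over non-empty patches only, then a linear head."""
    m = mask.astype(float)
    h_G = (m[:, None] * X).sum(0) / m.sum()  # masked mean over patches
    return W @ h_G + b

X = np.array([[2., 0.], [4., 2.], [7., 7.]])  # 3 patch embeddings
mask = np.array([1, 1, 0])                    # third patch is empty
y = readout(X, mask, np.eye(2), np.zeros(2))
assert np.allclose(y, [3., 1.])               # mean of the first two rows
```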
Data augmentation
MLP-Mixer architectures are known to be strong over-fitters [62]. In order to reduce this effect, we propose to introduce some perturbations in METIS as follows. Let G = (V, E) be the original graph and G′ = (V, E′) be the graph after randomly dropping a small set of edges. At each epoch, we apply the METIS graph partition algorithm on G′ to get slightly different node partitions {V_1, ..., V_P}. Then, we extract the graph patches {G_1, ..., G_P}, where G_i = (V_i, E_i) is the induced subgraph of the original graph G, and not of the modified G′. This way, we produce distinct graph patches at each epoch while retaining all the nodes and edges of the original graph.
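This augmentation loop can be sketched as follows, with a trivial stand-in for METIS (the `halves` partitioner is purely illustrative); the key point is that patches are induced from the original edge set E, not the perturbed E′:

```python
import random

def augmented_patches(nodes, edges, partition_fn, drop_ratio=0.1, seed=0):
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() > drop_ratio]  # E' ⊆ E
    parts = partition_fn(nodes, kept)                       # partition G' = (V, E')
    # Induce the final subgraphs from the ORIGINAL graph G, not from G'.
    return [(p, [(u, v) for u, v in edges if u in p and v in p]) for p in parts]

# Toy partitioner standing in for METIS: split the node list in half.
def halves(nodes, _edges):
    mid = len(nodes) // 2
    return [set(nodes[:mid]), set(nodes[mid:])]

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
patches = augmented_patches(nodes, edges, halves)
# Every original edge whose endpoints fall in a patch is retained.
assert patches[0][1] == [(0, 1)]
assert patches[1][1] == [(2, 3)]
```

Only the patch boundaries move between epochs; no node or edge of G is ever lost.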
Experiments
Graph Benchmark Datasets. We evaluate our Graph ViT/MLP-Mixer on a wide range of graph benchmarks; 1) Simulated datasets: CSL, EXP, SR25 and TreeNeighbourMatch dataset, 2) Small real-world datasets: ZINC, MNIST and CIFAR10 from Benchmarking GNNs [36], and MolTOX21 and MolHIV from OGB [65] and 3) Large real-world datasets: Peptides-func and Peptides-struct from LRGB [66]. Details are provided in Appendix B and Appendix C.
Comparison with MP-GNNs
We show in Table 2 that Graph ViT/MLP-Mixer lifts the performance of all base MP-GNNs across various datasets, including GCN [16], GatedGCN [18], GINE [72] and Graph Transformer [37]. We augmented all the base models with the same type of PE as Graph MLP-Mixer to ensure a fair comparison. These promising results demonstrate the generic nature of our proposed architecture, which can be applied to any MP-GNN in practice. Remarkably, Graph ViT/MLP-Mixer outperforms the base MP-GNNs by large margins on two LRGB [66] datasets; we observe an average 0.056 Average Precision improvement on Peptides-func and an average 0.028 MAE decrease on Peptides-struct, which verifies its superiority over MP-GNNs in capturing long-range interactions.
Table 2: Comparison with base MP-GNNs. Results are averaged over 4 runs with 4 different seeds.
Comparison with SOTAs
Next, we compare Graph ViT/MLP-Mixer against popular GNN models with SOTA results, including Graph Transformers (GraphiT, GPS, SAN, etc.) and expressive GNNs (GNN-AK+ and SUN), as shown in Table 3 and Table 17. For small molecular graphs, our model achieves 0.073 MAE on ZINC and 0.7997 ROCAUC on MolHIV. For larger molecular graphs, our model sets a new SOTA with the best scores of 0.6970 on Peptides-func and 0.2449 on Peptides-struct.
Besides, Graph ViT/MLP-Mixer offers better space-time complexity and scalability. Theoretically, most Graph Transformer models and expressive GNNs might be computationally infeasible for large graphs, as they need to compute full attention or to run an inner GNN on every node of the graph, respectively. Experimentally, we observed that, when training on datasets with hundreds of nodes, SAN+LapPE [55] and SUN [32] require 9.4× and 43.8× the training time per epoch, and 12.4× and 18.8× the memory, respectively, compared to our model.
Graph ViT/MLP-Mixer can mitigate over-squashing
TreeNeighbourMatch is a synthetic dataset proposed by Alon and Yahav [44] to provide an intuition into over-squashing. Each example is a binary tree of depth r. The goal is to predict an alphabetical label for the target node, where the correct answer is the label of the leaf node that has the same degree as the target node. Figure 2 shows that standard MP-GNNs (i.e., GCN, GGCN, GAT and GIN) fail to generalize on the dataset starting from r = 4, whereas our model mitigates over-squashing and generalizes well up to r = 7. As for why this happens, Alon and Yahav [44] show that GNNs fail to solve larger TreeNeighbourMatch cases because they 'squash' information about the graph into the target node's embedding, which can hold only a limited amount of information. In contrast, Graph ViT/MLP-Mixer avoids this problem as it transmits long-range information directly without 'squashing.' Concretely, Appendix I shows a simple construction illustrating that our model can solve TreeNeighbourMatch cases while avoiding the inherent limitations of MP-GNNs discussed by Alon and Yahav [44].

1 For SUN, we run the official code and obtain 0.7886 ± 0.0081 on MolHIV with our 4 seeds.
2 For CIN, the reported score is not obtained with the budget of 500k parameters but with 1.7M parameters (3× more) when running their official code.

Table 4: Empirical evaluation of the expressive power on three simulation datasets, averaging over 4 runs with 4 different seeds.
Figure 2: Test Accuracy across problem radius (tree depth) in the TreeNeighbourMatch problem.
Graph ViT/MLP-Mixer can achieve empirical high expressivity
We experimentally evaluate the expressive power of Graph ViT/MLP-Mixer on three simulated datasets. CSL [34] contains 150 4-regular graphs that cannot be distinguished with a 1-WL isomorphism test. EXP [75] contains 600 pairs of non-isomorphic graphs: both 1-WL and 2-WL tests fail at differentiating these graphs. Finally, SR25 [67] has 15 strongly regular graphs with 25 nodes each that cannot be discerned with a 3-WL test. Numerical experiments show that our model achieves perfect accuracy on all 3 datasets while MP-GNNs fail, see Table 4. Our results are only empirical. Due to the nonlocal way in which information is passed from one layer to the other, a direct analytical comparison between the proposed neural network and the Weisfeiler-Lehman test is challenging.
Ablation Studies
In our ablation studies, we evaluated various choices made during the implementation of each component of the architecture. The details of these studies can be found in the appendix. Appendix D focuses on the design of the patch extraction process, including the effects of the graph partition algorithm (Table 9), patch size (Figure 4), patch overlapping (Figure 5), and other related aspects. Appendix E presents the effects of two types of positional encoding, i.e., node PE and patch PE. Appendix G investigates the effect of data augmentation and explores the trade-off between performance and efficiency. In Appendix F, we delve into different designs of the gMHA mechanism in the Graph ViT. Additionally, we provide a complexity analysis in Appendix J and discuss the limitations in Appendix K.
Conclusion
In this work, we proposed a novel GNN architecture that addresses key limitations of standard MP-GNNs, particularly their low expressive power and poor long-range dependency, and presented promising results on several benchmark graph datasets. Future work will focus on further exploring graph network architectures with the inductive biases of graph tokens and vision Transformer-like architectures in order to solve fundamental node and link prediction tasks, possibly without the need for specialized GNN libraries such as PyG [76] or DGL [77], by replacing sparse linear algebra operations on graph tokens with dense operations.
Acknowledgments
XB is supported by NUS Grant ID R-252-000-B97-133 and BH was supported in part by NUS ODPRT Grant ID A-0008067-00-00. The authors would like to express their gratitude to the reviewers for their feedback, which has improved the clarity and contribution of the paper.
Related Work
Table 5: Comparison of different hierarchical graph models.
We briefly review the hierarchical graph models [78, 79, 80, 81] and highlight the main differences among them.
Coarformer [78] combines MPNNs and Transformers, using a GNN-based module for local information and a Transformer-based module for global information. Exphormer [79] also employs MPNN+Transformer, applying GNN and Transformer modules to the original graph and an expander graph, respectively. ANS-GT [80] introduces a node-sampling-based GT with hierarchical attention and graph coarsening. NAGphormer [81] treats each node as a token sequence and aggregates multi-hop information using attention-based readout.
Main differences: 1) GNN/Transformer module. Coarformer, Exphormer, SAT and ours use a hybrid MPNN+Transformer architecture, while ANS-GT and NAGphormer rely solely on Transformers. However, there are notable differences between these approaches, as we do not use any Transformer but rather an MLP as our backbone. Besides, our MPNN operates on small graph patches instead of the entire graph as in Coarformer and Exphormer. Furthermore, SAT and our architecture are sequential, while Coarformer and Exphormer combine MP-GNNs and GT in parallel.
2) Graph coarsening module. Coarformer, ANS-GT and SAT use a graph coarsening mechanism. These methods perform graph coarsening as a pre-processing step and generate static and non-overlapping graph patches. In contrast, we perform graph coarsening with a stochastic version of METIS on-the-fly, generating dynamic and overlapping graph patches.
3) Graph embedding module. The ways of capturing local and global information differ, as stated in the reviews above and summarized in Table 5.
In summary, although the aforementioned hierarchical graph models share similarities with ours, the major difference is that we do not use a Graph Transformer (GT) as the backbone architecture but an alternative architecture based on ViT/MLP-Mixer and graphs. We believe moving from GT to Graph ViT/MLP-Mixer as a new backbone/high-level architecture has the potential to open a new line of work for GNN design (by enhancing the building blocks of the proposed architecture such as graph clustering, graph embedding, mixer layers, positional encoding, (pre-)training, etc.).
Datasets Description
We evaluate our Graph MLP-Mixer on a wide range of graph benchmarks. See summary statistics of datasets in Table 6.
CSL [34] is a synthetic dataset to test the expressivity of GNNs. CSL has 150 graphs divided into 10 isomorphism classes. Each CSL graph is a 4-regular graph with edges connected to form a cycle and containing skip-links between nodes. The goal of the task is to classify them into corresponding isomorphism classes.
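A CSL graph is a circulant graph: an n-cycle plus skip links of a fixed length s. The sketch below is a hypothetical re-implementation of this construction (the function name and the skip length 9 are illustrative, not taken from the dataset's official generator), and checks that the result is 4-regular.

```python
def csl_graph(n: int, s: int):
    """Undirected edges of the circulant graph C_n({1, s}):
    an n-cycle with an extra skip link of length s at every node."""
    edges = set()
    for i in range(n):
        edges.add(frozenset((i, (i + 1) % n)))  # cycle edge
        edges.add(frozenset((i, (i + s) % n)))  # skip link
    return edges

edges = csl_graph(41, 9)          # 41 nodes, as in the CSL statistics
deg = {i: 0 for i in range(41)}
for e in edges:
    a, b = tuple(e)
    deg[a] += 1
    deg[b] += 1
assert all(d == 4 for d in deg.values())  # every CSL graph is 4-regular
assert len(edges) == 2 * 41               # 82 undirected edges
```

Different skip lengths s yield the different isomorphism classes that the classifier must separate; 1-WL colors all such graphs identically, which is why plain MP-GNNs score at chance.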
EXP [75] contains 600 pairs of graphs that both the 1-WL and 2-WL tests fail to distinguish. The goal is to map these graphs into two classes.
SR25 [67] has 15 strongly regular graphs (which the 3-WL test fails to distinguish) with 25 nodes each. SR25 is translated into a 15-way classification problem with the goal of mapping each graph into a different class.
ZINC [36] is a subset (12K) of molecular graphs (250K) from a free database of commercially-available compounds [82]. These molecular graphs are between 9 and 37 nodes large. Each node represents a heavy atom (28 possible atom types) and each edge represents a bond (3 possible types). The task is to regress a molecular property known as the constrained solubility. The dataset comes with a predefined 10K/1K/1K train/validation/test split.
MNIST and CIFAR10 [36] are derived from classical image classification datasets by constructing an 8-nearest-neighbor graph of SLIC superpixels for each image. The resultant graphs are of sizes 40-75 nodes for MNIST and 85-150 nodes for CIFAR10. The 10-class classification tasks and standard dataset splits follow the original image classification datasets, i.e., 55K/5K/10K train/validation/test graphs for MNIST and 45K/5K/10K for CIFAR10. These datasets are sanity checks, as we expect most GNNs to perform close to 100% on MNIST and well enough on CIFAR10.
MolTOX21 and MolHIV [65] are molecular property prediction datasets adopted from MoleculeNet [83]. All the molecules are pre-processed using RDKit [84]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The datasets come with predefined scaffold splits based on their two-dimensional structural frameworks, i.e., 6K/0.78K/0.78K train/validation/test for MolTOX21 and 32K/4K/4K for MolHIV.
Peptides-func and Peptides-struct [66] are derived from 15,535 peptides with a total of 2.3 million nodes retrieved from SAT-Pdb [85]. Both datasets use the same set of graphs but differ in their prediction tasks. These graphs are constructed in such a way that long-range interaction (LRI) reasoning is required to achieve strong performance on a given task. In concrete terms, they are larger graphs: 150.94 nodes per graph on average, with an average graph diameter of 56.99. Thus, they are better suited to benchmarking graph Transformers and other expressive GNNs intended to capture LRI.
TreeNeighbourMatch is a synthetic dataset proposed by Alon and Yahav [44] to highlight the inherent problem of over-squashing in GNNs. It is designed to simulate an exponentially-growing receptive field while allowing us to control the problem radius r, and thus the intensity of over-squashing. Specifically, each graph is a binary tree of depth r (a.k.a. the problem radius). The goal is to predict a label for the target node, where the correct answer lies in one of the leaf nodes. Therefore, the TreeNeighbourMatch problem requires information to be propagated from all leaf nodes to the target node before predicting the label, causing over-squashing at the target node.
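The tree topology behind TreeNeighbourMatch can be sketched in a few lines. This is an illustrative reconstruction (heap-order node numbering is our own convention, not necessarily the dataset's); it reproduces the node/edge counts listed in the dataset statistics table.

```python
def binary_tree_edges(r: int):
    """Parent-child edges of a complete binary tree of depth r,
    with nodes numbered 1..2^(r+1)-1 in heap order (illustrative)."""
    n = 2 ** (r + 1) - 1
    edges = [(v, 2 * v) for v in range(1, n + 1) if 2 * v <= n]
    edges += [(v, 2 * v + 1) for v in range(1, n + 1) if 2 * v + 1 <= n]
    return n, edges

n, edges = binary_tree_edges(4)
assert n == 31           # matches the r = 4 row of the statistics table
assert len(edges) == 30  # a tree on n nodes has n - 1 edges
```

The number of leaves (and hence the amount of information funneled toward the target node) grows as 2^r, while the target's embedding stays fixed-size, which is exactly the over-squashing bottleneck.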
Distributions of the graph sizes. We plot the distributions of the graph sizes (i.e., the number of nodes in each data sample) of these datasets in Figure 3.
Experiment Details
We implement our model using PyTorch [86] and PyG [76]. We ran our experiments on NVIDIA RTX A5000 GPUs. We run each experiment with 4 different seeds, reporting the averaged results at the epoch achieving the best validation metric. For optimization, we use the Adam [87] optimizer with the default settings of β1 = 0.9, β2 = 0.999, and ϵ = 1e-8. We observe large fluctuations in the validation metric with the common Adam optimizer on the OGB datasets (i.e., MolHIV and MolTOX21), as also observed in [32, 30, 25]. We follow the practice of SUN [32] by

Figure 3: Distributions of the graph sizes.
employing the ASAM optimizer [88] to reduce such fluctuations. We use the same hyperparameters, with a batch size of 32 and a learning rate of 0.01, without further tuning.
Simulation Datasets. For CSL and EXP, we run 5-fold cross validation with stratified sampling to ensure the class distribution remains the same across splits [36, 30]. For the SR25 dataset, we follow the evaluation process in [31, 89], which directly trains and validates the model on the whole dataset and reports the best performance.
Real-World Datasets. For the benchmarking datasets from Dwivedi et al. [36], we followed the most commonly used parameter budgets: up to 500k parameters for ZINC. For MolTOX21 and MolHIV from OGB [65], there is no upper limit on the number of parameters. For Peptides-func and Peptides-struct from LRGB [66], we followed the ∼500k parameter budget. All evaluated real-world benchmarks define a standard train/validation/test dataset split.
Baselines. We use GCN [16], GatedGCN [18], GINE [72] and Graph Transformer [37] as our baseline models, which also serve as the base patch encoders of Graph MLP-Mixer. The hidden size is set to 128 and the number of layers is set to 4 by default. For the TreeNeighbourMatch datasets, we follow the experimental protocol introduced in [44]: for the TreeNeighbourMatch dataset with problem radius r = depth, we implemented a network with r+1 graph layers to allow an additional nonlinearity after the information from the leaves reaches the target node.
Graph MLP-Mixer. The hidden size is set to 128, and the number of GNN layers and Mixer layers is set to 4, except for the LRGB datasets, where we reduce the number of Mixer layers to 2 to meet the ∼500k parameter budget.
SOTA models. In Table 3 and Table 17, results are referenced directly from the literature when available; otherwise they are reproduced using the authors' official code. To enable a fair comparison of speed/memory complexity (Table 17), we set the batch size to 128 for all the SOTA models and ours, halving it upon OOM until the model and batch data fit into memory. Besides, all experiments are run on the same machine.
Positional Encodings. As the most appropriate choice of node positional encoding (NodePE) is dataset and task dependent, we follow the practice of Rampášek et al. [57], Dwivedi et al. [66], see Table 8. We have already augmented all the base models (GCN, GatedGCN, GINE and GraphTrans) in Table 2 with the same type of NodePE as Graph MLP-Mixer to ensure a fair comparison.
Table 8: Summary statistics of positional encoding (PE).
Studies on Patch Extraction Module
Effect of Patch Extraction
Table 9: Effect of patch extraction: ✗ means no patch extraction and ✓ means uses patch extraction.
We conducted an experiment where we ran the Graph MLP-Mixer without the patch extraction process, treating each individual node as a patch. The results, presented in Table 9, show that the patch extraction process is critical. We believe that the patch extraction process, which includes the METIS partition and the 1-hop extension, helps capture important local information about the graph structure.
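The two-stage patch extraction (partition, then k-hop overlap extension) can be sketched as follows. This is a minimal stand-in: the `partition` dictionary here is supplied by hand, whereas the actual model obtains it from a (stochastic) METIS run; the function name is ours.

```python
from collections import defaultdict

def extract_patches(adj, partition, k=1):
    """Expand each partition part by its k-hop neighbourhood, producing
    overlapping patches. `partition` maps node -> part id; in the real
    pipeline it would come from METIS."""
    parts = defaultdict(set)
    for node, pid in partition.items():
        parts[pid].add(node)
    patches = []
    for pid, nodes in sorted(parts.items()):
        frontier = set(nodes)
        for _ in range(k):  # k-hop extension (k = 1 by default, as in the paper)
            frontier = frontier | {nb for v in frontier for nb in adj[v]}
        patches.append(frontier)
    return patches

# Toy 6-node path graph 0-1-2-3-4-5, split into two halves.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
patches = extract_patches(adj, partition, k=1)
assert patches[0] == {0, 1, 2, 3}  # 1-hop extension pulls node 3 in
assert patches[1] == {2, 3, 4, 5}  # patches now overlap across the cut edge
```

Without the 1-hop extension the cut edge (2, 3) would belong to no patch at all, which is the edge-information loss the overlap is designed to avoid.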
Effect of Graph Partition Algorithm
Graph partitioning algorithms have been studied for decades [90] given their importance in identifying meaningful clusters. Mathematically, graph partitioning is known to be NP-hard [91], so approximations are required. A graph clustering algorithm with one of the best trade-offs between accuracy and speed is METIS [71], which partitions a graph into a pre-defined number of clusters/patches such that the number of within-cluster links is much higher than the number of between-cluster links, in order to capture good community structure. For these properties, we select METIS as our graph patch extraction algorithm.
We provide an ablation study of how much benefit METIS provides over random graph partitioning. For random graph partitioning, nodes are randomly assigned to a pre-defined number of patches. We apply data augmentation as described in Section 4.6 to both algorithms. Table 10 shows that using METIS as the graph partition algorithm
Table 10: Comparison of METIS and random graph partition algorithm.
consistently gives better performance than random node partitioning, especially on graphs with more nodes and edges (such as Peptides-func), which matches our intuition that the nodes and edges composing a patch should share similar semantics. Nevertheless, it is interesting to see that random graph partitioning still achieves reasonable results, which shows that the performance of the model is not solely supported by the quality of the patches.
Effect of Number of Patches

Figure 4: Effect of the number of patches.
We observe in Figure 4 that, when increasing the number of graph patches (#Patch), performance increases first and then flattens out (with small fluctuations) around #Patch=24. We set the number of patches to 32 by default.
Effect of Patch Overlapping

Figure 5: Effect of the patch overlapping with k -hop extension.
In Figure 5, we observe a clear performance increase when graph patches overlap with each other (0-hop vs 1-hop), which is consistent with our intuition that extracting non-overlapping patches implies losing important edge information. We further expand graph patches to their k-hop neighbourhood. Performance increases first and then flattens out or begins to decrease at k = 2 for ZINC and k = 3 for Peptides-func. We set k = 1 by default.
Patch Size and WL Expressivity
Table 11: Accuracy on EXP with different patch sizes P , averaging over 4 runs with 4 different seeds
We evaluated the 2-WL and 3-WL expressivity on the benchmark datasets available to us, which indeed contain small graphs. As we do not have access to 2-WL/3-WL datasets with larger graph sizes, we studied the impact of a smaller number of patches on performance in Table 11. As expected, expressivity increases as the number of patches increases. Given these experimental results, we also expect that for larger graphs, we would need to increase the number of patches to maintain expressivity.
Studies on Positional Encoding
Effect of Positional Encoding
Table 12: Effect of positional encoding. We study the effects of node PE and patch PE by removing one of them in turn from our model while keeping the other components unchanged.
It was proved in [34, 35] that unique and permutation-invariant positional encoding (PE) increases the representation power of any MP-GNN, i.e. PE leads to GNNs strictly more powerful than the 1-WL test. PE is thus important from a theoretical point of view but, unfortunately, theory does not provide any guidance on the choice of PE for a given graph dataset and task. Consequently, the choice of PE is so far arbitrary and is selected by trial-and-error experiments such as [57, 39] to cite the most recent PE-based GNNs.
Our experiments show that PE may or may not be useful, see Table 12. Thus, PE increases the expressive power of GNNs but not necessarily their generalization performance. In other words, it improves the ability to fit the training data but not necessarily generalization. In conclusion, PE is certainly useful to improve the quality of GNN predictions, given the theory and the increasing number of published works on this topic, but more mathematical progress is needed to identify more relevant choices and provide consistent improvements.
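As a concrete example of one node PE used throughout the experiments (RWSE-K in Table 8), the random-walk structural encoding collects the return probabilities diag(P^1), ..., diag(P^K) with P = D^{-1}A. The pure-Python sketch below is an illustration under that standard definition, not the paper's implementation.

```python
def rwse(adj, K):
    """Random-walk structural encoding: per-node return probabilities
    diag(P^1), ..., diag(P^K) with P = D^{-1} A (illustrative sketch)."""
    n = len(adj)
    P = [[(1.0 / len(adj[i]) if j in adj[i] else 0.0) for j in range(n)]
         for i in range(n)]
    Pk = [row[:] for row in P]
    enc = [[] for _ in range(n)]
    for _ in range(K):
        for i in range(n):
            enc[i].append(Pk[i][i])          # diag of the current power
        Pk = [[sum(Pk[i][m] * P[m][j] for m in range(n)) for j in range(n)]
              for i in range(n)]             # Pk <- Pk @ P
    return enc

# 4-cycle: every node is structurally identical, so encodings coincide.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
enc = rwse(adj, K=2)
assert enc[0] == enc[1] == enc[2] == enc[3]
assert enc[0][0] == 0.0                  # no 1-step return on a cycle
assert abs(enc[0][1] - 0.5) < 1e-9       # 2-step return probability is 1/2
```

The example also illustrates the limitation discussed above: on highly symmetric graphs the encoding assigns identical vectors to all nodes, so the added expressivity depends on the graph family at hand.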
Positional Encoding and Patch Size
Table 13: Ablation with combining effects of PE and patch size on ZINC.
Overall, increasing the number of patches improves the results independently of using or not the PEs for ZINC and Peptide-func. Node PE clearly helps more than patch PE for both datasets and using both PEs is generally more helpful for a larger number of patches.
Study on Different Designs of Graph-Based MHA
Table 15: Different designs of graph-based multi-head attention (gMHA) in transformer layer.
We conducted experiments on the ZINC and Peptides-func datasets to explore five different versions of Graph ViT. The versions primarily differ in the attention function used. The attention functions we considered are as follows: (1) Standard/Full Attention: based on the original attention mechanism introduced by Vaswani et al. [50]. (2) Graph Attention: derived from the Graph Transformer (GT) model proposed by Dwivedi and Bresson [37]. (3) Kernel Attention: based on the kernel attention mechanism proposed by Mialon et al. [53] in the GraphiT model. (4) Additive Attention: derived from the Graphormer model proposed by Ying et al. [56]. (5) Hadamard Attention: the default attention function in our Graph ViT model. Results are presented in Table 15.
Experiments clearly demonstrate that the choice of the self-attention function is important. The Hadamard attention provides the best performance on ZINC (0.0849) and Peptides-func (0.6919) among all attention functions.
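To give a flavor of how graph structure can enter a gMHA variant, the sketch below restricts a query node's softmax to its graph neighbours, in the spirit of variant (2), GT-style graph attention. It is a minimal assumption-laden illustration (hypothetical function name, hand-made scores), not the authors' implementation of any of the five variants.

```python
import math

def masked_attention_weights(scores, adj, i):
    """Softmax of attention scores for query node i, restricted to its
    graph neighbours (plus itself). Minimal structure-aware attention
    sketch; real gMHA variants differ in how scores are produced."""
    allowed = set(adj[i]) | {i}
    exp = {j: math.exp(scores[j]) for j in allowed}
    Z = sum(exp.values())
    return {j: v / Z for j, v in exp.items()}

# Star graph centred at 0; node 2 cannot attend to node 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
w = masked_attention_weights({0: 1.0, 1: 2.0, 2: 0.0, 3: 5.0}, adj, i=2)
assert 3 not in w                           # non-neighbour is masked out
assert abs(sum(w.values()) - 1.0) < 1e-9    # valid probability distribution
```

The other variants keep the same softmax skeleton but inject the graph differently (kernel features, additive structural biases, or elementwise/Hadamard modulation of the scores).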
Effect of Data Augmentation
Table 16: Effect of data augmentation (DA): ✗ means no DA and ✓ uses DA.
The proposed data augmentation (DA) corresponds to generating new graph patches with METIS at each epoch, while no DA means patches are only generated at the initial epoch and then reused during training. Table 16 presents the results. First, it is clear that DA brings an increase in performance. Second, re-generating graph patches adds only a small amount of training time.
It is worth noting that the patch re-generation technique we use here is different from standard data augmentation techniques such as DropEdge [92] and G-Mixup [93], which either add slightly modified copies of existing data or generate synthetic data based on existing data. Our approach is different and actually specific to the Graph MLP-Mixer model.
Long Range Graph Benchmark
Table 17: Comparison of our best results from Table 2 with the state-of-the-art Models on large real world datasets [66].
We have provided additional experiments with the recent Long Range Graph Benchmark (LRGB) [66] to demonstrate that Graph MLP-Mixer is able to capture long-range interactions. In LRGB, Peptides-func and Peptides-struct are two graph-level prediction datasets, consisting of 15,535 graphs with a total of 2.3 million nodes. The graphs are one order of magnitude larger than ZINC, MolTOX21 and MolHIV with 151 nodes per graph on average and a mean graph diameter of 57. As such, they are better suited to evaluate models enabled with long-range dependencies, as they contain larger graphs and more data points. The performance is reported in Table 2, Table 3 and in Table 17.
We summarize the main results as follows: 1) Graph MLP-Mixer sets new SOTA performance with the best scores of 0.6970 on Peptides-func and 0.2449 on Peptides-struct (Table 3), demonstrating the ability of the model to better capture long-range relationships. 2) Compared with MP-GNNs (Table 2), Graph MLP-Mixer significantly outperforms the base MP-GNNs; we observe an average 0.056 Average Precision improvement on Peptides-func and an average 0.028 MAE decrease on Peptides-struct, which verifies its superiority over MP-GNNs in capturing long-range interactions. 3) Graph MLP-Mixer provides significantly better speed/memory complexity compared to Graph Transformer and expressive GNN models, especially when training on large graphs, e.g., SAN+LapPE [55] and SUN [32]. For example, SUN gives similar performance to Graph MLP-Mixer, 0.6730 on Peptides-func and 0.2498 on Peptides-struct, but requires 44× memory and 19× training time (Table 17).
Mitigating Oversquashing in TreeNeighbourMatch
As discussed in section 5.3, each example of TreeNeighbourMatch is a binary tree of depth r . The goal is to predict an alphabetical label for the target node, where the correct answer is the label of the leaf node that has the same degree as the target node. Figure 2 shows that standard MP-GNNs (i.e., GCN, GGCN, GAT and GIN) fail to generalize on the dataset from r = 4 , whereas our model mitigates over-squashing and generalizes well until r = 7 .
To better understand these empirical observations, we first note that, as shown by Alon and Yahav [44], MP-GNNs are fundamentally limited in their ability to solve larger TreeNeighbourMatch cases as they 'squash' information about the graph into the target node's embedding, which can hold a limited amount of information in its floating point representation. Next, we consider Graph MLP-Mixer from an expressiveness point of view, and provide a simple construction to illustrate that it avoids this problem by transmitting long-range information directly without over-squashing. Concretely, consider each node as one patch. Then, Graph MLP-Mixer's Patch Encoder extracts each node's degree and alphabetical label, storing them in the resulting Patch Embeddings. The next Token Mixer layer then compares each node's degree to the target node's, and outputs an indicator variable for whether these degrees are equal, which is transmitted to the next layer. Finally, by combining each node's alphabetical label and this indicator variable, the Fully Connected layer can output the alphabetical label of the node whose degree matches the target node's. In summary, Graph MLP-Mixer can solve TreeNeighbourMatch instances while only requiring each node embedding to capture information about that patch, not the entire graph, thus avoiding the inherent limitations of MP-GNNs discussed in [44].
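The construction above boils down to one global comparison over constant-size per-patch summaries. The toy sketch below makes that explicit (hypothetical function name; the tuples stand in for the learned Patch Embeddings): each entry stores only a node's local (degree, label), yet the matching label is recovered without routing all leaf information through the target node.

```python
def solve_tree_match(patch_embeddings, target_degree):
    """Toy version of the construction: each patch embedding stores
    (degree, label) for one node; a single global comparison step
    reads off the label whose degree matches the target's."""
    for degree, label in patch_embeddings:
        if degree == target_degree:
            return label
    return None

# Leaves with degrees 1..4 and labels A..D; the target node has degree 3.
patches = [(1, "A"), (2, "B"), (3, "C"), (4, "D")]
assert solve_tree_match(patches, target_degree=3) == "C"
```

The key point is that each summary is constant-size regardless of tree depth r, so nothing is squashed as r grows.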
Complexity Analysis
For each graph G = (V, E), with N = |V| the number of nodes and E = |E| the number of edges, the METIS patch extraction takes O(E) runtime and outputs graph patches {G_1, ..., G_P}, with P the pre-defined number of patches. Accordingly, we denote each graph patch as G_p = (V_p, E_p), with N_p = |V_p| nodes and E_p = |E_p| edges. After our one-hop overlapping adjustment, the total numbers of nodes and edges over all patches are N′ = Σ_p N_p ≤ PN and E′ = Σ_p E_p ≤ PE, respectively. Assuming the base GNN has O(N + E) runtime and memory complexity, our patch embedding module has O(N′ + E′) runtime and memory complexity, introducing a constant-factor overhead over the base GNN model. For the mixer layers, the complexity is O(P), as discussed in Section 4.5.
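The N′ ≤ PN and E′ ≤ PE accounting can be checked on a toy instance. The sketch below (function name ours) counts total nodes and intra-patch edges over overlapping patches of a 6-node path graph split into P = 2 patches.

```python
def patch_totals(patches, edges):
    """Total node/edge counts N' and E' over (possibly overlapping)
    patches; each node or edge can appear in at most P patches,
    so N' <= P*N and E' <= P*E."""
    n_prime = sum(len(p) for p in patches)
    e_prime = sum(1 for p in patches for (u, v) in edges if u in p and v in p)
    return n_prime, e_prime

# Two overlapping patches of the 6-node path graph (P = 2, N = 6, E = 5).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
patches = [{0, 1, 2, 3}, {2, 3, 4, 5}]
n_prime, e_prime = patch_totals(patches, edges)
assert n_prime == 8 and n_prime <= 2 * 6   # N' <= P*N
assert e_prime == 6 and e_prime <= 2 * 5   # E' <= P*E
```

In practice the overlap is far from the worst case: with 1-hop extension only boundary nodes are duplicated, so N′ and E′ stay close to N and E.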
Limitations
The current limitations of the model are as follows.
| | ViT/MLP-Mixer (images) | Graph ViT/MLP-Mixer (graphs) |
|---|---|---|
| Input | Regular grid Same data resolution (Height, Width) | Irregular domain Variable data structure (# Nodes and # Edges) |
| Patch Extraction | Via pixel reordering Non-overlapping patches Same patches at each epoch | Via graph clustering algorithm Overlapping patches Different patches at each epoch |
| Patch Encoder | Same patch resolution (Patch Height, Patch Width) MLP (equivalently CNN) | Variable patch structure (# Nodes and # Edges) GNN (e.g. GCN, GAT, GT) |
| Positional Information | Implicitly ordered (No need for explicit PE) | No universal ordering Node PE for patch encoder Patch PE for token mixer |
| ViT / MLP-Mixer | MLP / Channel mixer, MHA / Token mixer | MLP / Channel mixer, gMHA / Token mixer |
| Model | ZINC MAE ↓ | MNIST Accuracy ↑ | CIFAR10 Accuracy ↑ | MolTOX21 ROCAUC ↑ | MolHIV ROCAUC ↑ | Peptide-func AP ↑ | Peptide-struct MAE ↓ |
|---|---|---|---|---|---|---|---|
| GCN | 0.1952 ± 0.0057 | 0.9269 ± 0.0023 | 0.5423 ± 0.0056 | 0.7525 ± 0.0031 | 0.7813 ± 0.0081 | 0.6328 ± 0.0086 | 0.2758 ± 0.0012 |
| GCN-MLP-Mixer | 0.1347 ± 0.0020 | 0.9516 ± 0.0027 | 0.6111 ± 0.0017 | 0.7816 ± 0.0075 | 0.7929 ± 0.0111 | 0.6832 ± 0.0061 | 0.2486 ± 0.0041 |
| GCN-ViT | 0.1688 ± 0.0095 | 0.9600 ± 0.0015 | 0.6367 ± 0.0027 | 0.7820 ± 0.0096 | 0.7780 ± 0.0120 | 0.6855 ± 0.0049 | 0.2468 ± 0.0015 |
| GatedGCN | 0.1577 ± 0.0046 | 0.9776 ± 0.0017 | 0.6628 ± 0.0017 | 0.7641 ± 0.0057 | 0.7874 ± 0.0119 | 0.6300 ± 0.0029 | 0.2778 ± 0.0017 |
| GatedGCN-MLP-Mixer | 0.1244 ± 0.0053 | 0.9832 ± 0.0004 | 0.7060 ± 0.0022 | 0.7910 ± 0.0040 | 0.7976 ± 0.0136 | 0.6932 ± 0.0017 | 0.2508 ± 0.0007 |
| GatedGCN-ViT | 0.1421 ± 0.0031 | 0.9846 ± 0.0009 | 0.7158 ± 0.0009 | 0.7857 ± 0.0028 | 0.7734 ± 0.0114 | 0.6942 ± 0.0075 | 0.2465 ± 0.0015 |
| GINE | 0.1072 ± 0.0037 | 0.9705 ± 0.0023 | 0.6131 ± 0.0035 | 0.7730 ± 0.0064 | 0.7885 ± 0.0034 | 0.6405 ± 0.0077 | 0.2780 ± 0.0021 |
| GINE-MLP-Mixer | 0.0733 ± 0.0014 | 0.9809 ± 0.0004 | 0.6833 ± 0.0022 | 0.7868 ± 0.0043 | 0.7997 ± 0.0102 | 0.6970 ± 0.0080 | 0.2494 ± 0.0007 |
| GINE-ViT | 0.0849 ± 0.0047 | 0.9820 ± 0.0005 | 0.6967 ± 0.0040 | 0.7851 ± 0.0077 | 0.7792 ± 0.0149 | 0.6919 ± 0.0085 | 0.2449 ± 0.0016 |
| GraphTrans | 0.1230 ± 0.0018 | 0.9782 ± 0.0012 | 0.6809 ± 0.0020 | 0.7646 ± 0.0055 | 0.7884 ± 0.0104 | 0.6313 ± 0.0039 | 0.2777 ± 0.0025 |
| GraphTrans-MLP-Mixer | 0.0773 ± 0.0030 | 0.9742 ± 0.0011 | 0.7396 ± 0.0033 | 0.7817 ± 0.0040 | 0.7969 ± 0.0061 | 0.6858 ± 0.0062 | 0.2480 ± 0.0013 |
| GraphTrans-ViT | 0.0960 ± 0.0073 | 0.9725 ± 0.0023 | 0.7211 ± 0.0055 | 0.7835 ± 0.0032 | 0.7755 ± 0.0208 | 0.6876 ± 0.0059 | 0.2455 ± 0.0027 |
| Model | ZINC MAE ↓ | MolHIV ROCAUC ↑ | Peptides-func AP ↑ | Time | Memory | Peptides-struct MAE ↓ | Time | Memory |
|---|---|---|---|---|---|---|---|---|
| GT [36] | 0.226 ± 0.014 | - | - | - | - | - | - | - |
| GraphiT [53] | 0.202 ± 0.011 | - | - | - | - | - | - | - |
| Graphormer [56] | 0.122 ± 0.006 | - | - | - | - | - | - | - |
| GPS [57] | 0.070 ± 0.004 | 0.7880 ± 0.0101 | 0.6562 ± 0.0115 | 1.4× | 6.8× | 0.2515 ± 0.0012 | 1.3× | 8.3× |
| SAN+LapPE [38] | 0.139 ± 0.006 | 0.7775 ± 0.0061 | 0.6384 ± 0.0121 | 9.4× | 12.4× | 0.2683 ± 0.0043 | 8.8× | 14.7× |
| SAN+RWSE [38] | - | - | 0.6439 ± 0.0075 | 8.0× | 19.5× | 0.2545 ± 0.0012 | 7.9× | 14.5× |
| GNN-AK+ [31] | 0.080 ± 0.001 | 0.7961 ± 0.0119 | 0.6480 ± 0.0089 | 2.6× | 7.8× | 0.2736 ± 0.0007 | 2.5× | 9.2× |
| SUN [32] | 0.084 ± 0.002 | 0.8003 ± 0.0055 1 | 0.6730 ± 0.0078 | 43.8× | 18.8× | 0.2498 ± 0.0008 | 42.7× | 20.7× |
| CIN [29] | 0.079 ± 0.006 2 | 0.8094 ± 0.0057 | - | - | - | - | - | - |
| Graph MLP-Mixer | 0.073 ± 0.001 | 0.7997 ± 0.0102 | 0.6970 ± 0.0080 | 1.0× | 1.0× | 0.2475 ± 0.0015 | 1.0× | 1.2× |
| Graph ViT | 0.085 ± 0.005 | 0.7792 ± 0.0149 | 0.6942 ± 0.0075 | 1.1× | 0.8× | 0.2449 ± 0.0016 | 1.0× | 1.0× |
| Model | CSL (ACC) | EXP (ACC) | SR25 (ACC) |
|---|---|---|---|
| GCN | 10.00 ± 0.00 | 51.90 ± 1.96 | 6.67 ± 0.00 |
| GatedGCN | 10.00 ± 0.00 | 51.73 ± 1.65 | 6.67 ± 0.00 |
| GINE | 10.00 ± 0.00 | 50.69 ± 1.39 | 6.67 ± 0.00 |
| GraphTrans | 10.00 ± 0.00 | 52.35 ± 2.32 | 6.67 ± 0.00 |
| GCN-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GatedGCN-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GINE-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GraphTrans-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| Model | GNN | Transformer | Graph Coarsening | Local Info. | Global Info. |
|---|---|---|---|---|---|
| Coarformer [78] | ✓ | ✓ | ✓ (non-overlap, static) | GNN on original graph | MHA on coarsened graph |
| Exphormer [79] | ✓ | ✓ | ✗ | GNN on original graph | MHA on expander graph |
| ANS-GT [80] | ✗ | ✓ | ✓ (non-overlap, static) | adaptive node sampling strategy | sampled nodes from the coarsened graph |
| NAGphormer [81] | ✗ | ✓ | ✗ | MHA on multi-hop neighbourhood | - |
| Graph MLP-Mixer (Ours) | ✓ | ✓ | ✓ (overlap, dynamic) | GNN on graph patches | token mixer across patches |
| Dataset | #Graphs | #Nodes | Avg. #Nodes | Avg. #Edges | Task | Metric |
|---|---|---|---|---|---|---|
| CSL | 150 | 41 | 41 | 164 | 10-class classif. | Accuracy |
| EXP | 1,200 | 32-64 | 44.4 | 110.2 | 2-class classif. | Accuracy |
| SR25 | 15 | 25 | 25 | 300 | 15-class classif. | Accuracy |
| ZINC | 12,000 | 9-37 | 23.2 | 24.9 | regression | MAE |
| MNIST | 70,000 | 40-75 | 70.6 | 684.4 | 10-class classif. | Accuracy |
| CIFAR10 | 60,000 | 85-150 | 117.6 | 1129.7 | 10-class classif. | Accuracy |
| MolTOX21 | 7,831 | 1-132 | 18.57 | 38.6 | 12-task classif. | ROCAUC |
| MolHIV | 41,127 | 2-222 | 25.5 | 54.9 | binary classif. | ROCAUC |
| Peptides-func | 15,535 | 8-444 | 150.9 | 307.3 | 10-class classif. | Avg. Precision (AP) |
| Peptides-struct | 15,535 | 8-444 | 150.9 | 307.3 | regression | MAE |
| TreeNeighbourMatch (r=2) | 96 | 7 | 7 | 6 | 4-class classif. | Accuracy |
| TreeNeighbourMatch (r=3) | 32,000 | 15 | 15 | 14 | 8-class classif. | Accuracy |
| TreeNeighbourMatch (r=4) | 64,000 | 31 | 31 | 30 | 16-class classif. | Accuracy |
| TreeNeighbourMatch (r=5) | 128,000 | 63 | 63 | 62 | 32-class classif. | Accuracy |
| TreeNeighbourMatch (r=6) | 256,000 | 127 | 127 | 126 | 64-class classif. | Accuracy |
| TreeNeighbourMatch (r=7) | 512,000 | 255 | 255 | 254 | 128-class classif. | Accuracy |
| TreeNeighbourMatch (r=8) | 640,000 | 511 | 511 | 510 | 256-class classif. | Accuracy |
| Dataset | # Patch | # Node Mean | # Node Min | # Node Max | Diameter Mean | Diameter Min | Diameter Max |
|---|---|---|---|---|---|---|---|
| CSL | 32 | 5.80 | 5 | 8 | 2.28 | 2 | 3 |
| EXP | 32 | 4.07 | 2 | 11 | 2.31 | 1 | 5 |
| SR25 | 32 | 13.00 | 13 | 13 | 2.00 | 2 | 2 |
| ZINC | 32 | 3.15 | 2 | 7 | 1.82 | 1 | 3 |
| MNIST | 32 | 14.36 | 9 | 28 | 2.85 | 2 | 5 |
| CIFAR10 | 32 | 17.20 | 10 | 35 | 3.07 | 2 | 7 |
| MolTOX21 | 32 | 3.15 | 1 | 10 | 1.80 | 0 | 6 |
| MolHIV | 32 | 3.27 | 1 | 13 | 1.87 | 0 | 8 |
| Peptides-func | 32 | 7.08 | 1 | 20 | 4.15 | 0 | 14 |
| Peptides-struct | 32 | 7.08 | 1 | 20 | 4.15 | 0 | 14 |
| TreeNeighbourMatch(r=2) | 8 | 1.86 | 1 | 3 | 0.86 | 0 | 2 |
| TreeNeighbourMatch(r=3) | 32 | 1.93 | 1 | 3 | 0.93 | 0 | 2 |
| TreeNeighbourMatch(r=4) | 32 | 1.97 | 1 | 3 | 0.97 | 0 | 2 |
| TreeNeighbourMatch(r=5) | 32 | 3.28 | 1 | 5 | 2.25 | 0 | 3 |
| TreeNeighbourMatch(r=6) | 32 | 5.34 | 3 | 8 | 3.31 | 2 | 5 |
| TreeNeighbourMatch(r=7) | 32 | 9.19 | 7 | 14 | 4.33 | 4 | 5 |
| TreeNeighbourMatch(r=8) | 32 | 17.03 | 15 | 23 | 6.17 | 6 | 8 |
| | CSL | EXP | SR25 | ZINC | MNIST | CIFAR10 | MolTOX21 | MolHIV | Peptides-func | Peptides-struct |
|---|---|---|---|---|---|---|---|---|---|---|
| NodePE | RWSE-8 | RWSE-8 | LapPE-8 | RWSE-20 | LapPE-8 | LapPE-8 | - | - | RWSE-16 | RWSE-16 |
| PatchPE | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | - | - | RWSE-8 | RWSE-8 |
| Model | Patch Extraction | ZINC (MAE ↓) | Peptides-func (AP ↑) |
|---|---|---|---|
| GCN-MLP-Mixer | ✗ | 0.2495 ± 0.0040 | 0.6341 ± 0.0139 |
| GCN-MLP-Mixer | ✓ | 0.1347 ± 0.0020 | 0.6832 ± 0.0061 |
| GatedGCN-MLP-Mixer | ✗ | 0.2521 ± 0.0084 | 0.6230 ± 0.0110 |
| GatedGCN-MLP-Mixer | ✓ | 0.1244 ± 0.0053 | 0.6932 ± 0.0017 |
| GINE-MLP-Mixer | ✗ | 0.2558 ± 0.0059 | 0.6350 ± 0.0038 |
| GINE-MLP-Mixer | ✓ | 0.0733 ± 0.0014 | 0.6970 ± 0.0080 |
| GraphTrans-MLP-Mixer | ✗ | 0.2538 ± 0.0067 | 0.6224 ± 0.0112 |
| GraphTrans-MLP-Mixer | ✓ | 0.0773 ± 0.0030 | 0.6858 ± 0.0062 |
| Model | ZINC METIS (MAE ↓) | ZINC Random (MAE ↓) | Peptides-struct METIS (MAE ↓) | Peptides-struct Random (MAE ↓) |
|---|---|---|---|---|
| GCN-MLP-Mixer | 0.1347 ± 0.0020 | 0.1435 ± 0.0122 | 0.2486 ± 0.0041 | 0.2565 ± 0.0031 |
| GatedGCN-MLP-Mixer | 0.1244 ± 0.0053 | 0.1284 ± 0.0074 | 0.2508 ± 0.0007 | 0.2539 ± 0.0012 |
| GINE-MLP-Mixer | 0.0733 ± 0.0014 | 0.0708 ± 0.0020 | 0.2494 ± 0.0007 | 0.2559 ± 0.0012 |
| GraphTrans-MLP-Mixer | 0.0773 ± 0.0030 | 0.0767 ± 0.0019 | 0.2480 ± 0.0013 | 0.2574 ± 0.0025 |
| Model | P=2 | P=4 | P=8 | P=16 | P=32 |
|---|---|---|---|---|---|
| GCN-MLP-Mixer | 57.54 ± 3.87 | 99.44 ± 0.59 | 99.69 ± 0.98 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GatedGCN-MLP-Mixer | 67.65 ± 2.01 | 99.77 ± 0.37 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GINE-MLP-Mixer | 57.75 ± 3.80 | 99.58 ± 0.45 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GraphTrans-MLP-Mixer | 73.79 ± 1.52 | 96.77 ± 8.43 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| Dataset | Method | GCN-MLP-Mixer | GatedGCN-MLP-Mixer | GINE-MLP-Mixer | GraphTrans-MLP-Mixer |
|---|---|---|---|---|---|
| ZINC | Full | 0.1347 ± 0.0020 | 0.1244 ± 0.0053 | 0.0733 ± 0.0014 | 0.0773 ± 0.0030 |
| ZINC | - NodePE | 0.1944 ± 0.0061 | 0.1775 ± 0.0031 | 0.1225 ± 0.0070 | 0.1393 ± 0.0122 |
| ZINC | - PatchPE | 0.1414 ± 0.0058 | 0.1250 ± 0.0026 | 0.0746 ± 0.0010 | 0.0778 ± 0.0029 |
| ZINC | - Both | 0.2207 ± 0.0072 | 0.1883 ± 0.0096 | 0.1160 ± 0.0023 | 0.1700 ± 0.0064 |
| Peptides-func | Full | 0.6832 ± 0.0061 | 0.6932 ± 0.0017 | 0.6970 ± 0.0080 | 0.6858 ± 0.0062 |
| Peptides-func | - NodePE | 0.6688 ± 0.0039 | 0.6864 ± 0.0080 | 0.6868 ± 0.0034 | 0.6763 ± 0.0030 |
| Peptides-func | - PatchPE | 0.6871 ± 0.0055 | 0.6934 ± 0.0055 | 0.6933 ± 0.0104 | 0.6882 ± 0.0076 |
| Peptides-func | - Both | 0.6760 ± 0.0078 | 0.6847 ± 0.0034 | 0.6756 ± 0.0070 | 0.6783 ± 0.0088 |
| Patch Size | 2 | 4 | 16 | 32 |
|---|---|---|---|---|
| Full | 0.0983 ± 0.0042 | 0.1011 ± 0.0103 | 0.0799 ± 0.0037 | 0.0743 ± 0.0049 |
| - Node PE | 0.1589 ± 0.0056 | 0.1414 ± 0.0061 | 0.1307 ± 0.0107 | 0.1154 ± 0.0032 |
| - Patch PE | 0.1081 ± 0.0007 | 0.1076 ± 0.0110 | 0.0840 ± 0.0035 | 0.0744 ± 0.0037 |
| - Both | 0.1677 ± 0.0045 | 0.1532 ± 0.0051 | 0.1284 ± 0.0018 | 0.1187 ± 0.0050 |
| Patch Size | 2 | 4 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| Full | 0.6578 ± 0.0063 | 0.6675 ± 0.0037 | 0.6855 ± 0.0039 | 0.6939 ± 0.0034 | 0.6944 ± 0.0074 |
| - Node PE | 0.6613 ± 0.0063 | 0.6708 ± 0.0065 | 0.6864 ± 0.0069 | 0.6873 ± 0.0033 | 0.6789 ± 0.0047 |
| - Patch PE | 0.6594 ± 0.0059 | 0.6724 ± 0.0051 | 0.6937 ± 0.0068 | 0.6939 ± 0.0062 | 0.6865 ± 0.0061 |
| - Both | 0.6562 ± 0.0057 | 0.6739 ± 0.0038 | 0.6879 ± 0.0052 | 0.6825 ± 0.0074 | 0.6746 ± 0.0056 |
| gMHA | Equation | ZINC (MAE ↓) | Peptides-func (AP ↑) |
|---|---|---|---|
| Standard/Full attention [50] | $\mathrm{softmax}\big(\frac{QK^{T}}{\sqrt{d}}\big)V$ | 0.1784 ± 0.0238 | 0.6778 ± 0.0039 |
| Graph Attention [37] | $\mathrm{softmax}\big(A^{P}\odot\frac{QK^{T}}{\sqrt{d}}\big)V$ | 0.1527 ± 0.0067 | 0.6795 ± 0.0070 |
| Kernel Attention [53] | $\mathrm{softmax}\big(\mathrm{RW}(A^{P})\odot\frac{QK^{T}}{\sqrt{d}}\big)V$ | 0.1010 ± 0.0031 | 0.6844 ± 0.0102 |
| Additive Attention [56] | $\mathrm{softmax}\big(\frac{QK^{T}}{\sqrt{d}}\big)V+\mathrm{LL}(A^{P})V$ | 0.1632 ± 0.0063 | 0.6842 ± 0.0057 |
| Hadamard Attention | $\big(A^{P}\odot\mathrm{softmax}\big(\frac{QK^{T}}{\sqrt{d}}\big)\big)V$ | 0.0849 ± 0.0047 | 0.6919 ± 0.0085 |
| Model | DA | ZINC MAE ↓ | ZINC Time (s/epoch) | Peptides-struct MAE ↓ | Peptides-struct Time (s/epoch) |
|---|---|---|---|---|---|
| GCN-MLP-Mixer | ✗ | 0.2537 ± 0.0139 | 5.3603 | 0.2761 ± 0.0041 | 6.8297 |
| GCN-MLP-Mixer | ✓ | 0.1347 ± 0.0020 | 5.6728 | 0.2486 ± 0.0041 | 9.2561 |
| GatedGCN-MLP-Mixer | ✗ | 0.2121 ± 0.0172 | 5.3816 | 0.2776 ± 0.0020 | 7.8609 |
| GatedGCN-MLP-Mixer | ✓ | 0.1244 ± 0.0053 | 5.7786 | 0.2508 ± 0.0007 | 9.5830 |
| GINE-MLP-Mixer | ✗ | 0.1389 ± 0.0171 | 5.3905 | 0.2792 ± 0.0043 | 7.8849 |
| GINE-MLP-Mixer | ✓ | 0.0733 ± 0.0014 | 5.6704 | 0.2494 ± 0.0007 | 8.8136 |
| GraphTrans-MLP-Mixer | ✗ | 0.1665 ± 0.0145 | 6.0039 | 0.2802 ± 0.0030 | 9.0999 |
| GraphTrans-MLP-Mixer | ✓ | 0.0773 ± 0.0030 | 6.1616 | 0.2480 ± 0.0013 | 9.7730 |
| Model | # Params | Peptides-func AP ↑ | Peptides-func Time (s/epoch) | Peptides-func Memory (MB) | Peptides-struct MAE ↓ | Peptides-struct Time (s/epoch) | Peptides-struct Memory (MB) |
|---|---|---|---|---|---|---|---|
| GCN | 508k | 0.5930 ± 0.0023 | 4.59 | 696 | 0.3496 ± 0.0013 | 4.51 | 686 |
| GINE | 476k | 0.5498 ± 0.0079 | 3.94 | 659 | 0.3547 ± 0.0045 | 3.84 | 658 |
| GatedGCN | 509k | 0.5864 ± 0.0077 | 5.48 | 1,038 | 0.3420 ± 0.0013 | 5.31 | 1,029 |
| GatedGCN + RWSE | 506k | 0.6069 ± 0.0035 | 5.75 | 1,035 | 0.3357± 0.0006 | 5.61 | 1,038 |
| Transformer + LapPE | 488k | 0.6326 ± 0.0126 | 9.74 ( 1 . 1 × ) | 6,661 ( 6 . 6 × ) | 0.2529 ± 0.0016 | 9.61 ( 1 . 1 × ) | 6,646 ( 8 . 0 × ) |
| SAN + LapPE [55] | 493k | 0.6384 ± 0.0121 | 80.47 ( 9 . 4 × ) | 12,493 ( 12 . 4 × ) | 0.2683 ± 0.0043 | 79.41 ( 8 . 8 × ) | 12,226 ( 14 . 7 × ) |
| SAN + RWSE [55] | 500k | 0.6439 ± 0.0075 | 68.44 ( 8 . 0 × ) | 19,691 ( 19 . 5 × ) | 0.2545 ± 0.0012 | 70.39 ( 7 . 8 × ) | 12,111 ( 14 . 5 × ) |
| GPS [57] | 504k | 0.6562 ± 0.0115 | 11.83 ( 1 . 4 × ) | 6,904 ( 6 . 8 × ) | 0.2515 ± 0.0012 | 11.74 ( 1 . 3 × ) | 6,878 ( 8 . 3 × ) |
| GNN-AK+ [31] | 631k | 0.6480 ± 0.0089 | 22.52 ( 2 . 6 × ) | 7,855 ( 7 . 8 × ) | 0.2736 ± 0.0007 | 22.11 ( 2 . 5 × ) | 7,634 ( 9 . 2 × ) |
| SUN [32] | 508k | 0.6730 ± 0.0078 | 376.66 ( 43 . 8 × ) | 18,941 ( 18 . 8 × ) | 0.2498 ± 0.0008 | 384.26 ( 42 . 7 × ) | 17,215 ( 20 . 7 × ) |
| GCN-MLP-Mixer | 329k | 0.6832 ± 0.0061 | 8.48 | 716 | 0.2486 ± 0.0041 | 8.12 | 679 |
| GatedGCN-MLP-Mixer | 527k | 0.6932 ± 0.0017 | 8.96 | 969 | 0.2508 ± 0.0007 | 8.44 | 887 |
| GINE-MLP-Mixer | 397k | 0.6970 ± 0.0080 | 8.59 ( 1 . 0 × ) | 1,010 ( 1 . 0 × ) | 0.2494 ± 0.0007 | 8.51 | 974 |
| GraphTrans-MLP-Mixer | 593k | 0.6858 ± 0.0062 | 9.94 | 975 | 0.2480 ± 0.0013 | 9.00 | 1,048 |
| GCN-ViT | 493k | 0.6855 ± 0.0049 | 8.90 | 628 | 0.2468 ± 0.0015 | 8.55 | 609 |
| GatedGCN-ViT | 692k | 0.6942 ± 0.0075 | 9.07 | 848 | 0.2465 ± 0.0015 | 9.00 ( 1 . 0 × ) | 833 ( 1 . 0 × ) |
| GINE-ViT | 561k | 0.6919 ± 0.0085 | 8.98 | 920 | 0.2449 ± 0.0016 | 8.77 | 902 |
| GraphTrans-ViT | 757k | 0.6876 ± 0.0059 | 9.94 | 975 | 0.2455 ± 0.0027 | 9.58 | 981 |
In this first section, we present the background of the project by introducing the standard Message-Passing (MP) GNNs and their two major limitations: low representation expressivity and poor long-range dependency. We also present the current techniques that address these issues, i.e. Weisfeiler-Leman GNNs, graph positional encodings and Graph Transformers, as well as their shortcomings.
Message-Passing GNNs (MP-GNNs). GNNs have become the standard learning architectures for graphs based on their flexibility to work with complex data domains s.a. recommendation [1, 2], chemistry [3, 4], physics [5, 6], transportation [7], vision [8], natural language processing (NLP) [9], knowledge graphs [10], drug design [11, 12] and the medical domain [13, 14]. Most GNNs are designed with two core components. First, a structural message-passing mechanism s.a. Defferrard et al. [15], Kipf and Welling [16], Hamilton et al. [17], Monti et al. [1], Bresson and Laurent [18], Gilmer et al. [4], Veličković et al. [19] that computes node representations by aggregating the local 1-hop neighborhood information. Second, a stack of $L$ layers that aggregates $L$-hop neighborhood nodes to increase the expressivity of the network and transmit information between nodes that are $L$ hops apart.
Weisfeiler-Leman GNNs (WL-GNNs). One of the major limitations of MP-GNNs is their inability to distinguish (simple) non-isomorphic graphs. This limited expressivity can be formally analyzed with the Weisfeiler-Leman graph isomorphism test [20], as first proposed in Xu et al. [21], Morris et al. [22]. Later on, Maron et al. [23] introduced a general class of $k$-order WL-GNNs that can be proved to universally represent any class of $k$-WL graphs [24, 25]. But to achieve such expressivity, this class of GNNs requires using $k$-tuples of nodes with memory and speed complexities of $O(N^k)$, with $N$ being the number of nodes and $k\geq 3$. Although the complexity can be reduced to $O(N^2)$ and $O(N^3)$ respectively [24, 25, 26], it is still computationally costly compared to the linear complexity $O(E)$ of MP-GNNs, with $E$ being the number of edges, which often reduces to $O(N)$ for real-world graphs that exhibit sparse structures. In order to reduce the memory and speed complexities of WL-GNNs while keeping high expressivity, several works have focused on designing graph networks from their sub-structures s.a. sub-graph isomorphism [27], sub-graph routing mechanisms [28], cellular WL sub-graphs [29], and k-hop egonet sub-graphs [21, 30, 25, 31, 32].
Graph Positional Encoding (PE). Another aspect of the limited expressivity of GNNs is their inability to recognize simple graph structures s.a. cycles or cliques, which are often present in molecules and social graphs [33]. We can consider $k$-order WL-GNNs with $k$ equal to the length of the cycle/clique, but with high complexity $O(N^k)$. An alternative approach is to add positional encoding to the graph nodes. It was proved in Murphy et al. [34], Loukas [35] that unique and equivariant PE increases the representation power of any MP-GNN while keeping the linear complexity. This theoretical result was applied with great empirical success with index PE [34], Laplacian eigenvectors [36, 37, 38, 39] and k-step random walks [40, 41]. All these graph PEs lead to GNNs strictly more powerful than the 1-WL test, which seems to be enough expressivity in practice [42]. However, none of the PEs proposed for graphs can provide a global position of the nodes that is unique, equivariant and distance-sensitive. This is due to the fact that a canonical positioning of nodes does not exist for arbitrary graphs, as there is no notion of up, down, left and right on graphs. For example, any embedding coordinate system like graph Laplacian eigenvectors [43] can flip up-down directions or left-right directions and would still be a valid PE. This introduces ambiguities, as the GNNs are required to (learn to) be invariant with respect to the graph or PE symmetries. A well-known example is given by the eigenvectors: there exist $2^k$ possible sign flips for $k$ eigenvectors that must be learned by the network.
Issue of long-range dependencies. Another major limitation of MP-GNNs is the well-known issue of long-range dependencies. Standard MP-GNNs require $L$ layers to propagate the information from one node to its $L$-hop neighborhood. This implies that the receptive field size of GNNs can grow exponentially, for example as $O(2^L)$ for binary tree graphs. This causes over-squashing: exponentially growing information is compressed into a fixed-length vector by the aggregation mechanism [44, 45]. It is worth noting that the poor long-range modeling ability of deep GNNs can be caused by the combined effect of multiple factors, such as over-squashing, vanishing gradients, poor isomorphism expressivity, etc., but, in this work, we focus our effort on alleviating over-squashing s.a. Deac et al. [46], Arnaiz-Rodríguez et al. [47]. Over-squashing has been well known since recurrent neural networks [48], and led to the development of the (self- and cross-)attention mechanisms, first for the translation task [49, 50] and then for more general NLP tasks [51, 52]. Transformer architectures are the most elaborate networks that leverage attention, and have gained great success in NLP and computer vision (CV). Several works have generalized the transformer architecture to graphs, alleviating the issue of long-range dependencies and achieving competitive or superior performance against standard MP-GNNs. We highlight the most recent research works in the next paragraph.
Graph Transformers. Dwivedi and Bresson [37] generalize Transformers to graphs, with graph Laplacian eigenvectors as node PE, and incorporate graph structure into the permutation-invariant attention function. SAN and LSPE [38, 41] further improve with PE learned from Laplacian and random-walk operators. GraphiT [53] encodes relative PE derived from diffusion kernels into the attention mechanism. GraphTrans [54] adds Transformers on top of standard GNN layers. SAT [55] proposes a novel self-attention mechanism that incorporates structural information into the standard self-attention module by using a GNN to compute subgraph representations. Graphormer [56] introduces three structural encodings, with great success on large molecular benchmarks. GPS [57] categorizes the different types of PE and puts forward a hybrid MPNN+Transformer architecture. We refer to Min et al. [58] for an overview of graph-structured Transformers. Generally, most Graph Transformer architectures address the problems of over-squashing and limited long-range dependencies in GNNs, but they also significantly increase the complexity from $O(E)$ to $O(N^2)$, resulting in a computational bottleneck. A detailed description of related literature can be found in Appendix A.
In the following, we explain the importance of generalizing the ViT/MLP-Mixer architectures from CV to graphs.
ViT and MLP-Mixer in computer vision. Transformers have gained remarkable success in CV and NLP, most notably with architectures like ViT [59] and BERT [51]. The success of transformers has been long attributed to the attention mechanism [50], which is able to model long-range dependency by making "everything connected to everything". But recently, this prominent line of networks has been challenged by more cost efficient alternatives. A novel family of models based on the MLP-Mixer introduced by Tolstikhin et al. [60] has emerged and gained recognition for its simplicity and its efficient implementation. Overall, MLP-Mixer replaces the attention module with multi-layer perceptrons (MLPs) which are also not affected by over-squashing and poor long-range interaction. The original architecture is simple [60]; it takes image patches (or tokens) as inputs, encodes them with a linear layer (equivalent to a convolutional layer over the image patches), and updates their representations with a series of feed-forward layers applied alternatively to image patches (or tokens) and features. These plain networks [60, 61, 62, 63] can perform competitively with state-of-the-art (SOTA) vision Transformers, which tends to indicate that attention is not the only important inductive bias, but other elements like the general architecture of Transformers with patch embedding, residual connection and layer normalization, and carefully-curated data augmentation techniques seem to play essential roles as well [64].
The benefits of generalizing ViT/MLP-Mixer from grids to graphs. Standard MP-GNNs have linear learning/inference complexity but low representation power and poor long-range dependency. Graph Transformers address these two problems but lose the computational efficiency at a quadratic complexity price. A generalization of ViT/MLP-Mixer to graphs overcomes the computational bottleneck of Graph Transformers and solves the issue of long-distance dependency, hence going beyond standard MP-GNNs. However, achieving such a successful generalization is challenging given the irregular and variable nature of graphs. See Section 3 for a detailed presentation of these challenges.
We identify the key challenges to generalize ViT/MLP-Mixer from images to graphs and design a new efficient class of GNNs, namely Graph ViT/MLP-Mixer, that simultaneously captures long-range dependency, keeps linear speed/memory complexity, and achieves high graph isomorphic expressivity.
We show competitive results on multiple benchmarks from Benchmarking GNNs [36] and the Open Graph Benchmark (OGB) [65]; specifically, with 0.073 MAE on ZINC and 0.7997 ROCAUC on MolHIV.
We demonstrate the capacity of the proposed model to capture long-range dependencies with SOTA performance on Long Range Graph Benchmark (LRGB) [66] while keeping low complexity, to mitigate the over-squashing issue on the TreeNeighbourMatch dataset [44], and to reach the 3-WL expressive power on the SR25 dataset [67].
Our approach forms a bridge between CV, NLP and graphs under a unified architecture, that can potentially benefit cross-over domain collaborations to design better networks.
We list the main questions when adapting MLP-Mixer from images to graphs in the following and in Table 1.
(1) How to define and extract graph patches/tokens? One notable geometrical property that distinguishes graph-structured data from regular structured data, such as images and sequences, is that there does not exist in general a canonical grid to embed graphs. As shown in Table 1, images are supported by a regular lattice, which can be easily split into multiple grid-like patches (also referred to as tokens) of the same size via fast pixel reordering. However, graph data is irregular: the number of nodes and edges in different graphs is typically different. Hence, graphs cannot be uniformly divided into similar patches across all examples in the dataset. Finally, the extraction process for graph patches cannot be uniquely defined given the lack of canonical graph embedding. This raises the questions of how we identify meaningful graph tokens, and quickly extract them.
(2) How to encode graph patches into a vectorial representation? Since images can be reshaped into patches of the same size, they can be linearly encoded with an MLP, or equivalently with a convolutional layer with kernel size and stride values equal to the patch size. However, graph patches are not all the same size: they have variable topology with different number of nodes, edges and connectivity. Another important difference is the absence of a unique node ordering for graphs, which constrains the process to be invariant to node re-indexing for generalization purposes. In summary, we need a process that can transform graph patches into a fixed-length vectorial representation for arbitrary subgraph structures while being permutation invariant.
(3) How to preserve positional information for nodes and graph patches? As shown in Table 1, image patches in the sequence have implicit positions since image data is always ordered the same way due to its unique embedding in the Euclidean space. For instance, the image patch at the upper-left corner is always the first one in the sequence and the image patch at the bottom-right corner is the last one. On this basis, the token mixing operation of the MLP-Mixer is able to fuse the same patch information. However, graphs are naturally not aligned and the set of graph patches is therefore unordered. We face a similar issue when we consider the positions of nodes within each graph patch. In images, the pixels in each patch are always ordered the same way; in contrast, nodes in graph tokens are naturally unordered. Thus, how do we preserve local and global positional consistency for nodes and graph patches?
(4) How to reduce over-fitting for Graph ViT/MLP-Mixer? ViT/MLP-Mixer architectures are known to be strong over-fitters [62]. Most MLP-variants [60, 61, 63] first pre-train on large-scale datasets, and then fine-tune on downstream tasks, coupled with a rich set of data augmentation and regularization techniques, e.g. cropping, random horizontal flipping, RandAugment [68], mixup [69], etc. While data augmentation has drawn much attention in CV and NLP, graph data augmentation methods are not yet as effective, albeit interest and works on this topic [70]. Variable number of nodes, edges and connectivity make graph augmentation challenging. Thus, how do we augment graph-structured data given this nature of graphs?
The basic architecture of Graph MLP-Mixer is illustrated in Figure 1. The goal of this section is to detail the choices we made to implement each component of the architecture. On the whole, these choices lead to a simple framework that provides speed and quality results.
Notation. Let $G=(\mathcal{V},\mathcal{E})$ be a graph with $\mathcal{V}$ being the set of nodes and $\mathcal{E}$ the set of edges. The graph has $N=|\mathcal{V}|$ nodes and $E=|\mathcal{E}|$ edges. The connectivity of the graph is represented by the adjacency matrix $A\in\mathbb{R}^{N\times N}$. The node features of node $i$ are denoted by $h_i$, while the features for an edge between nodes $i$ and $j$ are indicated by $e_{ij}$. Let $\{\mathcal{V}_1,\dots,\mathcal{V}_P\}$ be the node partition, $P$ be the pre-defined number of patches, and $G_i=(\mathcal{V}_i,\mathcal{E}_i)$ be the induced subgraph of $G$ with all the nodes in $\mathcal{V}_i$ and all the edges whose endpoints belong to $\mathcal{V}_i$. Let $h_G$ be the graph-level representation and $y_G$ be the graph-level target.
When generalizing MLP-Mixer to graphs, the first step is to extract patches. This extraction is straightforward for images. Indeed, all image data $x\in\mathbb{R}^{H\times W\times C}$ are defined on a regular grid with the same fixed resolution $(H,W)$, where $H$ and $W$ are respectively the height and the width, and $C$ is the number of channels. Hence, all images can be easily reshaped into a sequence of flattened patches $x_p\in\mathbb{R}^{P\times(R^2C)}$, where $(R,R)$ is the resolution of each image patch, and $P=HW/R^2$ is the resulting number of patches.
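This grid patching is a pure reshape; as a sanity check, here is a minimal NumPy sketch (the 4×4×3 toy image and R=2 are illustrative values, not from the paper's pipeline):

```python
import numpy as np

def image_to_patches(x: np.ndarray, R: int) -> np.ndarray:
    """Split an image x of shape (H, W, C) into P = HW/R^2 flattened patches
    of length R*R*C, mirroring the ViT/MLP-Mixer tokenization."""
    H, W, C = x.shape
    assert H % R == 0 and W % R == 0, "patch size must divide the image"
    # (H/R, R, W/R, R, C) -> (H/R, W/R, R, R, C) -> (P, R*R*C)
    x = x.reshape(H // R, R, W // R, R, C).swapaxes(1, 2)
    return x.reshape(-1, R * R * C)

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
patches = image_to_patches(x, R=2)  # P = 4*4 / 2^2 = 4 patches of length 12
```

The first patch gathers the four upper-left pixels, exactly as a convolution with kernel size and stride R would.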
Unlike images with fixed resolution, extracting graph patches is more challenging, see Table 1. Generally, graphs have different sizes, i.e. numbers of nodes, and therefore cannot be uniformly divided like image data. Additionally, meaningful sub-graphs must be identified, in the sense that the nodes and edges composing a patch must share similar semantics or information, s.a. a community of friends sharing a biking interest in a social network. As such, a graph patch extraction process must satisfy the following conditions: 1) the same extraction algorithm can be applied to any arbitrary graph, 2) the nodes in a sub-graph patch must be more closely connected to each other than to those outside the patch, and 3) the extraction must be fast, with complexity at most linear w.r.t. the number of edges, i.e. $O(E)$.
METIS [71] is a graph clustering algorithm with one of the best trade-offs between accuracy and speed. It partitions a graph into a pre-defined number of clusters such that the number of within-cluster links is much higher than the number of between-cluster links, in order to capture good community structure. For these properties, we select it as our patch extraction algorithm. However, METIS is limited to finding non-overlapping clusters, as visualized in Figure 1. In this example, METIS partitions the graph into four non-overlapping parts, i.e. $\{1,2,3\}$, $\{4,5,6\}$, $\{7,8,9\}$ and $\{10,11,12\}$, resulting in 5 edge cuts. Unlike images, extracting non-overlapping patches could imply losing important edge information, i.e. the cut edges, and thus decreasing the predictive performance, as we will observe experimentally. To overcome this issue and to retain all original edges, we allow graph patches to overlap with each other. For example in Figure 1, if the source and destination nodes of an edge are not in the same patch, we assign both nodes to the patches they belong to. As such, node 3 and node 4 are in two different patches, here the blue and red ones, but are connected with each other. After our overlapping adjustment, these two nodes belong to both the blue and red patches. This practice is equivalent to expanding the graph patches to the one-hop neighbourhood of all nodes in that patch. Formally, METIS is first applied to partition a graph into $P$ non-overlapping patches: $\{\mathcal{V}_1,\dots,\mathcal{V}_P\}$ such that $\mathcal{V}=\mathcal{V}_1\cup\dots\cup\mathcal{V}_P$ and $\mathcal{V}_i\cap\mathcal{V}_j=\emptyset,\ \forall i\neq j$.
Then, patches are expanded to their one-hop neighbourhood in order to preserve the information of between-patch links and make use of all graph edges: $\mathcal{V}_i\leftarrow\mathcal{V}_i\ \cup\ \{\mathcal{N}_1(j)\ |\ j\in\mathcal{V}_i\}$, where $\mathcal{N}_k(j)$ denotes the $k$-hop neighbourhood of node $j$.
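The one-hop expansion step can be sketched in pure Python; the edge list and base partition below are illustrative stand-ins for a METIS output (in practice the partition would come from a METIS binding such as pymetis):

```python
from collections import defaultdict

def expand_patches(edges, partition):
    """Expand each non-overlapping patch V_i to its one-hop neighbourhood,
    V_i <- V_i ∪ { N_1(j) | j in V_i }, so that every edge cut by the
    partition ends up inside at least one patch."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    expanded = []
    for part in partition:
        patch = set(part)
        for j in part:
            patch |= nbrs[j]          # add all 1-hop neighbours of j
        expanded.append(patch)
    return expanded

# Illustrative 6-node path graph split into two patches; the cut edge (2, 3)
# makes nodes 2 and 3 appear in both expanded patches.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
patches = expand_patches(edges, [{0, 1, 2}, {3, 4, 5}])
```

After expansion the patches overlap on the endpoints of the cut edge, which is precisely what allows the patch encoder to still see that edge.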
For images, patch encoding can be done with a simple linear transformation given the fixed resolution of all image patches. This operation is fast and well-defined. For graphs, the patch encoder network must be able to handle complex data structures, i.e. be invariant to index permutation, cope with heterogeneous neighborhoods, variable patch sizes and convolution on graphs, and be expressive enough to differentiate graph isomorphisms. As a result, the graph patch encoder is a GNN, whose architecture is designed to best transform a graph token $G_p$ into a fixed-size representation $x_{G_p}$ in three steps.
Step 1. Raw node and edge linear embedding. The input node features $\alpha_i\in\mathbb{R}^{d_n\times 1}$ and edge features $\beta_{ij}\in\mathbb{R}^{d_e\times 1}$ are linearly projected into $d$-dimensional hidden features: $h_i^{\ell=0}=U^0\alpha_i+u^0$ and $e_{ij}^{\ell=0}=V^0\beta_{ij}+v^0$,
where $U^0\in\mathbb{R}^{d\times d_n}$, $V^0\in\mathbb{R}^{d\times d_e}$ and $u^0,v^0\in\mathbb{R}^d$ are learnable parameters.
Step 2. Graph convolutional layers with MP-GNN. We apply a series of $L$ convolution layers, where the node and edge representations are updated with a MP-GNN applied to each graph patch $G_p=(\mathcal{V}_p,\mathcal{E}_p)$ as follows:
where $\ell$ is the layer index, $p$ is the patch index, $i,j$ denote the nodes, $\mathcal{N}(i)$ is the neighborhood of node $i$, and the functions $f_{\textrm{node}}$ and $f_{\textrm{edge}}$ (with learnable parameters) define any arbitrary MP-GNN architecture s.a. [16, 18, 72, 37]. $h_p^{\ell}=\frac{1}{|\mathcal{V}_p|}\sum_{i\in\mathcal{V}_p}h_{i,p}^{\ell}\in\mathbb{R}^d$ and $e_p^{\ell}=\frac{1}{|\mathcal{E}_p|}\sum_{ij\in\mathcal{E}_p}e_{ij,p}^{\ell}\in\mathbb{R}^d$ are respectively the mean representations of the patch nodes and patch edges, and $g_{\textrm{patch-node}}$, $g_{\textrm{patch-edge}}$ are MLP-based functions that act on $h_p^{\ell}$ and $e_p^{\ell}$. For each node and edge covered by more than one patch, due to the patch overlap that includes all edges cut by METIS, we update the node/edge representation by averaging over the overlapping patches:
where $\{k\ |\ i\in\mathcal{V}_k\}$ and $\{k\ |\ ij\in\mathcal{E}_k\}$ are the sets of all patches that cover node $i$ and edge $ij$, respectively.
Step 3. Pooling and readout. The final step is mean pooling all node vectors in $G_p$ such that $h_p=\frac{1}{|\mathcal{V}_p|}\sum_{i\in\mathcal{V}_p}h_{i,p}^{\ell=L}\in\mathbb{R}^d$, and applying a small MLP to get the fixed-size patch embedding $x_{G_p}\in\mathbb{R}^d$.
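The overlap averaging of Step 2 and the mean-pooling readout of Step 3 reduce to simple bookkeeping. A toy sketch with scalar node features follows; the feature values, and the omission of the small final MLP, are illustrative simplifications:

```python
def patch_readout(node_feats):
    """node_feats: {patch p: {node i: h_{i,p}}} holding per-patch node states.
    Returns (h_bar, x): h_bar[i] averages node i over all patches covering it
    (the overlap averaging of Step 2), and x[p] is the mean-pooled patch
    embedding h_p of Step 3 (the small final MLP is omitted here)."""
    sums, counts = {}, {}
    for feats in node_feats.values():
        for i, h in feats.items():
            sums[i] = sums.get(i, 0.0) + h
            counts[i] = counts.get(i, 0) + 1
    h_bar = {i: sums[i] / counts[i] for i in sums}
    x = {p: sum(feats.values()) / len(feats) for p, feats in node_feats.items()}
    return h_bar, x

# toy scalar features; node 2 sits in the overlap of both patches
node_feats = {0: {0: 1.0, 1: 3.0, 2: 2.0},
              1: {2: 4.0, 3: 6.0}}
h_bar, x = patch_readout(node_feats)
```

Node 2 gets the average of its two per-patch states, while each patch embedding is the mean of its own node states.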
Observe that the patch encoder is a MP-GNN, and thus has the inherent limitation of poor long-range dependency. Does this prevent the Graph MLP-Mixer from capturing long-range interactions? The answer is no, because this problem is limited to large graphs; for small patch graphs, the issue does not really exist (or is negligible). To give a few numbers, the mean number of nodes and the mean diameter of graph patches are around 3.15 and 1.82 respectively for molecular datasets, and around 17.20 and 3.07 for image datasets, see Table 7.
Regular grids offer a natural implicit arrangement for the sequence of image patches and for the pixels inside the image patches. However, such an ordering of nodes and patches does not exist for general graphs. This lack of positional information reduces the expressivity of the network. Hence, we use two explicit positional encodings (PE): one absolute PE for the patch nodes and one relative PE for the graph patches.
Node PE. Input node features in Eq (1) are augmented with $p_i\in\mathbb{R}^K$, with a learnable matrix $T^0\in\mathbb{R}^{d\times K}$: $h_i^{\ell=0}=U^0\alpha_i+u^0+T^0p_i$,
The benefits of different PEs are dataset dependent. We follow the strategy in [57] that uses random-walk structural encoding (RWSE) [41] for molecular data and Laplacian eigenvectors encodings [36] for image superpixels. Since Laplacian eigenvectors are defined up to sign flips, the sign of the eigenvectors is randomly flipped during training.
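RWSE has a direct dense implementation: the $k$-th component of node $i$'s encoding is the return probability $(M^k)_{ii}$ of a $k$-step random walk, with $M=D^{-1}A$ the random-walk transition matrix. A NumPy sketch on a 4-cycle (the toy graph is an assumption chosen for illustration):

```python
import numpy as np

def rwse(A: np.ndarray, K: int) -> np.ndarray:
    """K-step random-walk structural encoding:
    p_i = [M_ii, (M^2)_ii, ..., (M^K)_ii] with M = D^{-1} A."""
    deg = A.sum(axis=1)
    M = A / deg[:, None]              # row-normalized adjacency
    pe, Mk = [], np.eye(len(A))
    for _ in range(K):
        Mk = Mk @ M
        pe.append(np.diag(Mk))        # k-step return probabilities
    return np.stack(pe, axis=1)       # shape (N, K)

# 4-cycle: odd-length walks never return; 2-step walks return with prob. 1/2
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
pe = rwse(A, K=3)
```

Dense matrix powers are fine at this scale; for large sparse graphs one would use sparse matrix products instead.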
Patch PE. Relative positional information between the graph patches can be computed from the original graph adjacency matrix $A\in\mathbb{R}^{N\times N}$ and the clusters $\{\mathcal{V}_1,\dots,\mathcal{V}_P\}$ extracted by METIS in Section 4.2. Specifically, we capture relative positional information via the coarsened adjacency matrix $A^P\in\mathbb{R}^{P\times P}$ over the patch graphs:
where $\textrm{Cut}(\mathcal{V}_i,\mathcal{V}_j)=\sum_{k\in\mathcal{V}_i}\sum_{l\in\mathcal{V}_j}A_{kl}$ is the standard graph cut operator, which counts the number of edges connecting cluster $\mathcal{V}_i$ and cluster $\mathcal{V}_j$.
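A direct (if naive) NumPy sketch of this coarsening, with `coarsen_adjacency` as a hypothetical helper name:

```python
import numpy as np

def coarsen_adjacency(A, clusters):
    """Patch-level adjacency: A^P[i, j] = Cut(V_i, V_j), summing the
    entries of A between cluster V_i and cluster V_j."""
    P = len(clusters)
    AP = np.zeros((P, P))
    for i, Vi in enumerate(clusters):
        for j, Vj in enumerate(clusters):
            AP[i, j] = A[np.ix_(Vi, Vj)].sum()
    return AP

# path graph 0-1-2-3 split into patches {0,1} and {2,3}:
# exactly one edge (1-2) crosses between the two patches
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
AP = coarsen_adjacency(A, [[0, 1], [2, 3]])
```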
We extract the positional encoding $\hat{p}_i\in\mathbb{R}^{\hat{K}}$ at the patch level, similarly to the node level; it is injected (after a linear transformation) into the first layer of the mixer block:
where $x_i$ is the patch embedding.
For images, the original mixer layer [60] is a simple network that alternates channel- and token-mixing steps. The token-mixing step operates over the token dimension, while the channel-mixing step operates over the channel dimension. These two interleaved steps enable information fusion among tokens and channels. The simplicity of the mixer layer leads to a significant reduction in computational cost with little or no sacrifice in performance. Indeed, the self-attention mechanism in ViT requires $O(P^2)$ memory and $O(P^2)$ computation, while the mixer layer in MLP-Mixer needs only $O(P)$ memory and $O(P)$ computation.
Let $X\in\mathbb{R}^{P\times d}$ be the patch embeddings $\{x_{G_1},\dots,x_{G_P}\}$; the graph mixer layer can be expressed as
where $\sigma$ is a GELU nonlinearity [73], $\textrm{LayerNorm}(\cdot)$ is layer normalization [74], and $W_1\in\mathbb{R}^{d_s\times P}$, $W_2\in\mathbb{R}^{P\times d_s}$, $W_3\in\mathbb{R}^{d_c\times d}$, $W_4\in\mathbb{R}^{d\times d_c}$ are weight matrices, where $d_s$ and $d_c$ are the tunable hidden widths of the token-mixing and channel-mixing MLPs.
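A minimal NumPy sketch of one such mixer layer, using the tanh approximation of GELU and the matrix shapes defined above; this is an illustration under those assumptions, not the released implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mixer_layer(X, W1, W2, W3, W4):
    """One graph mixer layer on X in R^{P x d}: token mixing over the P
    patches, then channel mixing over the d features, each with a residual."""
    U = X + W2 @ gelu(W1 @ layer_norm(X))       # token mixing: mixes rows
    Y = U + gelu(layer_norm(U) @ W3.T) @ W4.T   # channel mixing: mixes columns
    return Y

rng = np.random.default_rng(0)
P, d, ds, dc = 4, 3, 8, 8
X = rng.standard_normal((P, d))
Y = mixer_layer(X,
                rng.standard_normal((ds, P)), rng.standard_normal((P, ds)),
                rng.standard_normal((dc, d)), rng.standard_normal((d, dc)))
```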
Alternatively, we can formulate a graph transformer layer to incorporate the self-attention mechanism as in ViT:
where gMHA (graph-based multi-head attention) is designed to capture token dependencies based on the given graph topology. In Eq. (9), gMHA is defined as $\big(A^P\odot\textrm{softmax}(\frac{QK^T}{\sqrt{d}})\big)V$, but other options are possible to characterize the gMHA mechanism, as studied in Appendix F.
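A single-head NumPy sketch of this Hadamard variant of gMHA; illustrative only, since the actual model uses multiple heads and batching.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gmha_hadamard(Q, K, V, AP):
    """Single-head Hadamard gMHA: the dense attention map is masked
    elementwise by the coarsened adjacency A^P, so tokens only attend
    along patch-graph edges."""
    d = Q.shape[-1]
    attn = AP * softmax(Q @ K.T / np.sqrt(d))   # (P, P), Hadamard product
    return attn @ V

rng = np.random.default_rng(1)
Q = rng.standard_normal((3, 2))
K = rng.standard_normal((3, 2))
V = rng.standard_normal((3, 2))
out = gmha_hadamard(Q, K, V, np.ones((3, 3)))   # all-ones A^P recovers dense attention
```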
Then we generate the final graph-level representation by mean pooling all the non-empty patches:
where $m_p$ is a binary variable equal to 1 for non-empty patches and 0 for empty patches (since graphs have variable sizes, small graphs can produce empty patches). Finally, we apply a small MLP to get the graph-level target:
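The masked mean pooling over non-empty patches can be sketched as follows (`graph_readout` is our illustrative name; the final MLP head is omitted):

```python
import numpy as np

def graph_readout(X, m):
    """Mean-pool patch embeddings, ignoring empty patches.
    X: (P, d) patch embeddings; m: (P,) binary mask, 1 = non-empty."""
    return (m[:, None] * X).sum(axis=0) / m.sum()

X = np.array([[2.0, 4.0], [0.0, 0.0], [4.0, 8.0]])
m = np.array([1.0, 0.0, 1.0])   # middle patch is empty
hG = graph_readout(X, m)        # averages only the two non-empty patches: [3., 6.]
```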
MLP-Mixer architectures are known to be strong over-fitters [62]. To reduce this effect, we introduce perturbations into METIS as follows. Let $G=(\mathcal{V},\mathcal{E})$ be the original graph and $G'=(\mathcal{V},\mathcal{E}')$ the graph after randomly dropping a small set of edges. At each epoch, we apply the METIS graph partitioning algorithm to $G'$ to obtain slightly different node partitions $\{\mathcal{V}_1,\dots,\mathcal{V}_P\}$. Then, we extract the graph patches $\{G_1,\dots,G_P\}$, where $G_i=(\mathcal{V}_i,\mathcal{E}_i)$ is the induced subgraph of the original graph $G$, not the modified $G'$. This way, we produce distinct graph patches at each epoch while retaining all the nodes and edges of the original graph.
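The augmentation can be sketched as follows: edges are randomly dropped before partitioning, but the patches are induced from the original edge list. The METIS call itself is elided here since it requires an external library, and the helper names are ours, not from the released code.

```python
import numpy as np

def drop_edges(edges, p, rng):
    """Randomly drop a fraction p of edges. The partitioner (METIS in the
    paper) would run on this perturbed edge list, giving slightly
    different partitions at each epoch."""
    keep = rng.random(len(edges)) >= p
    return [e for e, k in zip(edges, keep) if k]

def induced_patches(edges, clusters):
    """Induce each patch's edge set from the ORIGINAL edges, so no edge
    information is lost even though the partition saw a perturbed graph."""
    return [[(u, v) for (u, v) in edges if u in V and v in V]
            for V in map(set, clusters)]

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
edges_pert = drop_edges(edges, p=0.25, rng=rng)      # partitioner sees this
patches = induced_patches(edges, [[0, 1], [2, 3]])   # patches keep original edges
```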
Graph Benchmark Datasets. We evaluate our Graph ViT/MLP-Mixer on a wide range of graph benchmarks: 1) simulated datasets: CSL, EXP, SR25 and the TreeNeighbourMatch dataset; 2) small real-world datasets: ZINC, MNIST and CIFAR10 from Benchmarking GNNs [36], and MolTOX21 and MolHIV from OGB [65]; and 3) large real-world datasets: Peptides-func and Peptides-struct from LRGB [66]. Details are provided in Appendix B and Appendix C.
We show in Table 2 that Graph ViT/MLP-Mixer lifts the performance of all base MP-GNNs across various datasets, including GCN [16], GatedGCN [18], GINE [72] and Graph Transformer [37]. We augmented all the base models with the same type of PE as Graph MLP-Mixer to ensure a fair comparison. These promising results demonstrate the generic nature of our proposed architecture, which can be applied to any MP-GNN in practice. Remarkably, Graph ViT/MLP-Mixer outperforms the base MP-GNNs by large margins on the two LRGB [66] datasets: we observe an average 0.056 Average Precision improvement on Peptides-func and an average 0.028 MAE decrease on Peptides-struct, which verifies its superiority over MP-GNNs in capturing long-range interactions.
Next, we compare Graph ViT/MLP-Mixer against popular GNN models with SOTA results, including Graph Transformers (GraphiT, GPS, SAN, etc.) and expressive GNNs (GNN-AK+ and SUN), as shown in Table 3 and Table 17. For small molecular graphs, our model achieves 0.073 MAE on ZINC and 0.7997 ROCAUC on MolHIV. For larger molecular graphs, our model sets new SOTA performance with the best scores of 0.6970 on Peptides-func and 0.2449 on Peptides-struct.
Moreover, Graph ViT/MLP-Mixer offers better space-time complexity and scalability. Theoretically, most Graph Transformer models and expressive GNNs may be computationally infeasible for large graphs, as they need to compute full attention or run an inner GNN on every node of the graph, respectively. Experimentally, we observed that, when training on datasets with hundreds of nodes, SAN+LapPE [55] and SUN [32] require 9.4× and 43.8× training time per epoch, and 12.4× and 18.8× memory, respectively, compared to our model.
TreeNeighbourMatch is a synthetic dataset proposed by Alon and Yahav [44] to provide an intuition into over-squashing. Each example is a binary tree of depth $r$. The goal is to predict an alphabetical label for the target node, where the correct answer is the label of the leaf node that has the same degree as the target node. Figure 2 shows that standard MP-GNNs (i.e., GCN, GGCN, GAT and GIN) fail to generalize on the dataset from $r=4$, whereas our model mitigates over-squashing and generalizes well until $r=7$. As for why this happens, Alon and Yahav [44] show that GNNs fail to solve larger TreeNeighbourMatch cases as they ‘squash’ information about the graph into the target node’s embedding, which can hold a limited amount of information. In contrast, Graph ViT/MLP-Mixer avoids this problem as it transmits long-range information directly without ‘squashing.’ Concretely, Appendix I shows a simple construction illustrating that our model can solve TreeNeighbourMatch cases while avoiding the inherent limitations of MP-GNNs discussed by Alon and Yahav [44].
We experimentally evaluate the expressive power of Graph ViT/MLP-Mixer on three simulated datasets. CSL [34] contains 150 4-regular graphs that cannot be distinguished with a 1-WL isomorphism test. EXP [75] contains 600 pairs of non-isomorphic graphs: both 1-WL and 2-WL tests fail at differentiating these graphs. Finally, SR25 [67] has 15 strongly regular graphs with 25 nodes each that cannot be discerned with a 3-WL test. Numerical experiments show that our model achieves perfect accuracy on all 3 datasets while MP-GNNs fail, see Table 4. Our results are only empirical. Due to the nonlocal way in which information is passed from one layer to the other, a direct analytical comparison between the proposed neural network and the Weisfeiler-Lehman test is challenging.
In our ablation studies, we evaluated various choices made during the implementation of each component of the architecture. The details of these studies can be found in the appendix. Appendix D focuses on the design of the patch extraction process, including the effects of the graph partition algorithm (Table 9), patch size (Figure 4), patch overlapping (Figure 5), and other related aspects. Appendix E presents the effects of two types of positional encoding, i.e., node PE and patch PE. Appendix G investigates the effect of data augmentation and explores the trade-off between performance and efficiency. In Appendix F, we delve into different designs of the gMHA mechanism in the Graph ViT. Additionally, we provide a complexity analysis in Appendix J and discuss the limitations in Appendix K.
In this work, we proposed a novel GNN architecture to address the limitations of standard MP-GNNs, particularly their low expressive power and poor long-range dependency, and presented promising results on several benchmark graph datasets. Future work will focus on further exploring graph network architectures with the inductive biases of graph tokens and vision-Transformer-like architectures in order to solve fundamental node and link prediction tasks, possibly without the need for specialized GNN libraries like PyG [76] or DGL [77], by replacing sparse linear algebra operations on graph tokens with dense operations.
XB is supported by NUS Grant ID R-252-000-B97-133 and BH was supported in part by NUS ODPRT Grant ID A-0008067-00-00. The authors would like to express their gratitude to the reviewers for their feedback, which has improved the clarity and contribution of the paper.
We briefly review the hierarchical graph models [78, 79, 80, 81] and highlight the main differences among them.
Coarformer [78] combines MPNNs and Transformers, using a GNN-based module for local information and a Transformer-based module for global information. Exphormer [79] also employs MPNN+Transformer, using GNN and Transformer modules on the original graph and expander graph, respectively. ANS-GT [80] introduces a node-sampling-based GT with hierarchical attention and graph coarsening. NAGphormer [81] treats each node as a token sequence and aggregates multi-hop information using attention-based readout.
Main differences: 1) GNN/Transformer module. Coarformer, Exphormer, SAT and our model use a hybrid MPNN+Transformer architecture, while ANS-GT and NAGphormer rely solely on Transformers. However, there are notable differences between these approaches, as we do not use any Transformer but rather an MLP as our backbone. Besides, our MPNN operates on small graph patches instead of the entire graph, unlike Coarformer and Exphormer. Furthermore, SAT and our architecture are sequential, while Coarformer and Exphormer combine MP-GNNs and GT in parallel.
2) Graph coarsening module. Coarformer, ANS-GT and SAT use a graph coarsening mechanism. These methods perform graph coarsening as a pre-processing step and generate static, non-overlapping graph patches. In contrast, we perform graph coarsening with a stochastic version of METIS on-the-fly, generating dynamic and overlapping graph patches.
3) Graph embedding module. The ways to capture local and global information differ, as stated in the reviews above and summarized in Table 5.
In summary, although these aforementioned hierarchical graph models share similarities with our model, the major difference between these models and our work is that we do not use Graph Transformer (GT) as the backbone architecture but an alternative architecture based on ViT/MLP-Mixer and graphs. We believe moving from GT to Graph ViT/MLP-Mixer as a new backbone/high-level architecture has the potential to open a new line of work for GNN design (by enhancing the building blocks of the proposed architecture such as graph clustering, graph embedding, mixer layer, positional encoding, (pre-)training, etc).
We evaluate our Graph MLP-Mixer on a wide range of graph benchmarks. See summary statistics of datasets in Table 6.
CSL [34] is a synthetic dataset to test the expressivity of GNNs. CSL has 150 graphs divided into 10 isomorphism classes. Each CSL graph is a 4-regular graph with edges connected to form a cycle and containing skip-links between nodes. The goal of the task is to classify them into corresponding isomorphism classes.
EXP [75] contains 600 pairs of 1&2-WL failed graphs. The goal is to map these graphs into two classes.
SR25 [67] has 15 strongly regular graphs (on which the 3-WL test fails) with 25 nodes each. SR25 is translated into a 15-way classification problem with the goal of mapping each graph into a different class.
ZINC [36] is a subset (12K) of molecular graphs (250K) from a free database of commercially available compounds [82]. These molecular graphs have between 9 and 37 nodes. Each node represents a heavy atom (28 possible atom types) and each edge represents a bond (3 possible types). The task is to regress a molecular property known as the constrained solubility. The dataset comes with a predefined 10K/1K/1K train/validation/test split.
MNIST and CIFAR10 [36] are derived from classical image classification datasets by constructing an 8 nearest-neighbor graph of SLIC superpixels for each image. The resultant graphs are of sizes 40-75 nodes for MNIST and 85-150 nodes for CIFAR10. The 10-class classification tasks and standard dataset splits follow the original image classification datasets, i.e., for MNIST 55K/5K/10K and for CIFAR10 45K/5K/10K train/validation/test graphs. These datasets are sanity-checks, as we expect most GNNs to perform close to 100% for MNIST and well enough for CIFAR10.
MolTOX21 and MolHIV [65] are molecular property prediction datasets adopted from MoleculeNet [83]. All molecules are pre-processed using RDKit [84]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing the atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The datasets come with predefined scaffold splits based on the molecules' two-dimensional structural frameworks, i.e., 6K/0.78K/0.78K train/validation/test for MolTOX21 and 32K/4K/4K for MolHIV.
Peptides-func and Peptides-struct [66] are derived from 15,535 peptides with a total of 2.3 million nodes retrieved from SAT-Pdb [85]. Both datasets use the same set of graphs but differ in their prediction tasks. The graphs are constructed such that long-range interaction (LRI) reasoning is required to achieve strong performance on a given task. Concretely, they are larger graphs: 150.94 nodes per graph and a graph diameter of 56.99 on average. Thus, they are better suited to benchmarking graph Transformers or other expressive GNNs intended to capture LRI.
TreeNeighbourMatch is a synthetic dataset proposed by Alon and Yahav [44] to highlight the inherent problem of over-squashing in GNNs. It is designed to simulate an exponentially growing receptive field while allowing control of the problem radius $r$, and thus of the intensity of over-squashing. Specifically, each graph is a binary tree of depth $r$ (the problem radius). The goal is to predict a label for the target node, where the correct answer lies in one of the leaf nodes. The TreeNeighbourMatch problem therefore requires information to be propagated from all leaf nodes to the target node before predicting the label, causing over-squashing at the target node.
Distributions of the graph sizes. We plot the distributions of the graph sizes (i.e., the number of nodes in each data sample) of these datasets in Figure 3.
We implement our model using PyTorch [86] and PyG [76]. We run our experiments on NVIDIA RTX A5000 GPUs. We run each experiment with 4 different seeds and report the averaged results at the epoch achieving the best validation metric. For optimization, we use the Adam [87] optimizer with the default settings of $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=10^{-8}$. We observe large fluctuations in the validation metric with the common Adam optimizer on the OGB datasets (i.e., MolHIV and MolTOX21), as also observed in [32, 30, 25]. We therefore follow the practice of SUN [32] and employ the ASAM optimizer [88] to reduce such fluctuations. We use the same hyperparameters, with a batch size of 32 and a learning rate of 0.01, without further tuning.
Simulation Datasets. For CSL and EXP, we run 5-fold cross-validation with stratified sampling to ensure the class distribution remains the same across splits [36, 30]. For the SR25 dataset, we follow the evaluation process in [31, 89], directly training and validating the model on the whole dataset and reporting the best performance.
Real-World Datasets. For the benchmarking datasets from Dwivedi et al. [36], we followed the most commonly used parameter budget: up to 500k parameters for ZINC. For MolTOX21 and MolHIV from OGB [65], there is no upper limit on the number of parameters. For Peptides-func and Peptides-struct from LRGB [66], we followed the ~500k parameter budget. All evaluated real-world benchmarks define a standard train/validation/test split.
Baselines. We use GCN [16], GatedGCN [18], GINE [72] and Graph Transformer [37] as our baseline models, which also serve as the base patch encoders of Graph MLP-Mixer. The hidden size is set to 128 and the number of layers to 4 by default. For the TreeNeighbourMatch datasets, we follow the experimental protocol introduced in [44]: for a TreeNeighbourMatch dataset with problem radius $r=\textit{depth}$, we implement a network with $r+1$ graph layers to allow an additional nonlinearity after the information from the leaves reaches the target node.
Graph MLP-Mixer. The hidden size is set to 128, and the number of GNN layers and Mixer layers is set to 4, except for the LRGB datasets, where we reduce the number of Mixer layers to 2 to meet the ~500k parameter budget.
SOTA models. In Table 3 and Table 17, results are referenced directly from the literature when available, and otherwise reproduced using the authors’ official code. To enable a fair comparison of speed/memory complexity (Table 17), we set the batch size to 128 for all SOTA models and ours, halving it upon OOM until the model and batch data fit in memory. Besides, all experiments are run on the same machine.
Positional Encodings. As the most appropriate choice of node positional encoding (NodePE) is dataset and task dependent, we follow the practice of Rampášek et al. [57] and Dwivedi et al. [66]; see Table 8. We have already augmented all the base models (GCN, GatedGCN, GINE and GraphTrans) in Table 2 with the same type of NodePE as Graph MLP-Mixer to ensure a fair comparison.
We conducted an experiment where we ran the Graph MLP-Mixer without the patch extraction process, treating each individual node as a patch. The results of this experiment are presented in Table 9: the patch extraction process is critical. We believe that the patch extraction process, which includes the METIS partition and the 1-hop extension, helps capture important local information about the graph structure.
Graph partitioning algorithms have been studied for decades [90] given their importance in identifying meaningful clusters. Mathematically, graph partitioning is NP-hard [91], so approximations are required. A graph clustering algorithm with one of the best accuracy-speed trade-offs is METIS [71], which partitions a graph into a pre-defined number of clusters/patches such that the number of within-cluster links is much higher than the number of between-cluster links, in order to capture good community structure. For these properties, we select METIS as our graph patch extraction algorithm.
We provide an ablation study of the benefit METIS provides over random graph partitioning. For random graph partitioning, nodes are randomly assigned to a pre-defined number of patches. We apply data augmentation as described in Section 4.6 to both algorithms. Table 10 shows that using METIS as the graph partitioning algorithm consistently gives better performance than random node partitioning, especially on graphs with more nodes and edges (such as Peptides-func), which matches our intuition that the nodes and edges composing a patch should share similar semantics. Nevertheless, it is interesting that random graph partitioning still achieves reasonable results, which shows that the performance of the model is not solely supported by the quality of the patches.
We observe in Figure 4 that, when increasing the number of graph patches (#Patch), performance first increases and then flattens out (with small fluctuations) from #Patch=24. We set the number of patches to 32 by default.
In Figure 5, we observe a clear performance increase when graph patches overlap with each other (0-hop vs 1-hop), which is consistent with our intuition that extracting non-overlapping patches implies losing important edge information. We further expand graph patches to their $k$-hop neighbourhood. Performance increases first and then flattens out or begins to decrease at $k=2$ for ZINC and $k=3$ for Peptides-func. We set $k=1$ by default.
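The $k$-hop expansion of a patch can be sketched with a simple BFS from the patch's node set (an illustrative helper, not the released code):

```python
def k_hop_expand(adj_list, patch_nodes, k):
    """Expand a patch to its k-hop neighbourhood by BFS from the patch's
    node set (k = 1 is the default overlap used in the paper)."""
    frontier, nodes = set(patch_nodes), set(patch_nodes)
    for _ in range(k):
        # nodes one hop away from the current frontier, not yet collected
        frontier = {v for u in frontier for v in adj_list[u]} - nodes
        nodes |= frontier
    return sorted(nodes)

# path graph 0-1-2-3-4; patch {2} expanded by 1 and 2 hops
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_hop = k_hop_expand(adj, [2], 1)   # [1, 2, 3]
two_hop = k_hop_expand(adj, [2], 2)   # [0, 1, 2, 3, 4]
```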
We evaluated the 2-WL and 3-WL expressivity on the benchmark datasets available to us, which indeed contain small graphs. As we do not have access to 2-WL/3-WL datasets with larger graphs, we studied the impact of a smaller number of patches on performance in Table 11. As expected, expressivity increases as the number of patches increases. Given these experimental results, we expect that larger graphs would require more patches to maintain expressivity.
It was proved in [34, 35] that a unique and permutation-invariant positional encoding (PE) increases the representation power of any MP-GNN, i.e., PE leads to GNNs strictly more powerful than the 1-WL test. PE is thus important from a theoretical point of view but, unfortunately, theory does not provide guidance on the choice of PE for a given graph dataset and task. Consequently, the choice of PE is so far arbitrary and is selected by trial-and-error experiments, as in [57, 39] to cite the most recent PE-based GNNs.
Our experiments show that PE may or may not be useful; see Table 12. Thus, PE increases the expressive power of GNNs but not necessarily their generalization performance. In other words, PE improves fitting but not necessarily generalization. In conclusion, PE is certainly useful to improve the quality of GNN predictions, given the theory and the increasing number of published works on this topic, but more mathematical progress is needed to identify more relevant choices that provide consistent improvements.
We run ablation experiments to study the combined effects of patch size vs. model with and without node and patch PE, see Table 13 and Table 14.
Overall, increasing the number of patches improves the results for ZINC and Peptides-func independently of whether the PEs are used. Node PE clearly helps more than patch PE for both datasets, and using both PEs is generally more helpful with a larger number of patches.
We conducted experiments on the ZINC and Peptides-func datasets to explore five different versions of Graph ViT, which differ primarily in the attention function used. The attention functions we considered are: (1) Standard/Full Attention, based on the original attention mechanism introduced by Vaswani et al. [50]; (2) Graph Attention, derived from the Graph Transformer (GT) model proposed by Dwivedi and Bresson [37]; (3) Kernel Attention, based on the kernel attention mechanism proposed by Mialon et al. [53] in the GraphiT model; (4) Additive Attention, derived from the Graphormer model proposed by Ying et al. [56]; and (5) Hadamard Attention, the default attention function in our Graph ViT model. Results are presented in Table 15.
Experiments clearly demonstrate that the choice of the self-attention function is important. The Hadamard attention provides the best performance for ZINC (0.0849) and for peptides-func (0.6919) among all attention functions.
The proposed data augmentation (DA) corresponds to generating new graph patches with METIS at each epoch, while no DA means patches are generated only at the first epoch and then reused during training. Table 16 presents the results. First, it is clear that DA brings an increase in performance. Second, re-generating graph patches adds only a small amount of training time.
It is worth noting that the edge-dropping technique we use here is different from standard data augmentation techniques such as DropEdge [92] and G-Mixup [93], which either add slightly modified copies of existing data or generate synthetic data based on existing data. Our approach is specific to the Graph MLP-Mixer model.
We have provided additional experiments with the recent Long Range Graph Benchmark (LRGB) [66] to demonstrate that Graph MLP-Mixer is able to capture long-range interactions. In LRGB, Peptides-func and Peptides-struct are two graph-level prediction datasets, consisting of 15,535 graphs with a total of 2.3 million nodes. The graphs are one order of magnitude larger than ZINC, MolTOX21 and MolHIV with 151 nodes per graph on average and a mean graph diameter of 57. As such, they are better suited to evaluate models enabled with long-range dependencies, as they contain larger graphs and more data points. The performance is reported in Table 2, Table 3 and in Table 17.
We summarize the main results as follows: 1) Graph MLP-Mixer sets new SOTA performance with the best scores of 0.6970 on Peptides-func and 0.2449 on Peptides-struct (Table 3), demonstrating the ability of the model to better capture long-range relationships. 2) Compared with MP-GNNs (Table 2), Graph MLP-Mixer significantly outperforms the base MP-GNNs; we observe an average 0.056 Average Precision improvement on Peptides-func and an average 0.028 MAE decrease on Peptides-struct, which verifies its superiority over MP-GNNs in capturing long-range interactions. 3) Graph MLP-Mixer provides significantly better speed/memory complexity compared to Graph Transformer and expressive GNN models, especially when training on large graphs, such as SAN+LSPE [55] and SUN [32]. For example, SUN gives similar performance to Graph MLP-Mixer, 0.6730 on Peptides-func and 0.2498 on Peptides-struct, but requires 44× the memory and 19× the training time (Table 17).
As discussed in Section 5.3, each example of TreeNeighbourMatch is a binary tree of depth $r$. The goal is to predict an alphabetical label for the target node, where the correct answer is the label of the leaf node that has the same degree as the target node. Figure 2 shows that standard MP-GNNs (i.e., GCN, GGCN, GAT and GIN) fail to generalize on the dataset from $r=4$, whereas our model mitigates over-squashing and generalizes well until $r=7$.
To better understand these empirical observations, we first note that, as shown by Alon and Yahav [44], MP-GNNs are fundamentally limited in their ability to solve larger TreeNeighbourMatch cases as they ‘squash’ information about the graph into the target node’s embedding, which can hold a limited amount of information in its floating-point representation. Next, we consider Graph MLP-Mixer from an expressiveness point of view, and provide a simple construction to illustrate that it avoids this problem by transmitting long-range information directly without over-squashing. Concretely, consider each node as one patch. Then, Graph MLP-Mixer’s Patch Encoder extracts each node’s degree and alphabetical label, storing them in the resulting Patch Embeddings. The next Token Mixer layer then compares each node’s degree to the target node’s and outputs an indicator variable for whether these degrees are equal, which is transmitted to the next layer. Finally, by combining each node’s alphabetical label with this indicator variable, the Fully Connected layer can output the alphabetical label of the node whose degree matches the target node’s. In summary, Graph MLP-Mixer can solve TreeNeighbourMatch instances while only requiring each node embedding to capture information about its own patch, not the entire graph, thus avoiding the inherent limitations of MP-GNNs discussed in [44].
For each graph $G=(\mathcal{V},\mathcal{E})$, with $N=|\mathcal{V}|$ nodes and $E=|\mathcal{E}|$ edges, the METIS patch extraction takes $O(E)$ runtime and outputs graph patches $\{G_1,\dots,G_P\}$, with $P$ the pre-defined number of patches. Accordingly, we denote each graph patch as $G_p=(\mathcal{V}_p,\mathcal{E}_p)$, with $N_p=|\mathcal{V}_p|$ nodes and $E_p=|\mathcal{E}_p|$ edges. After our one-hop overlapping adjustment, the total numbers of nodes and edges over all patches are $N'=\sum_p N_p\leq PN$ and $E'=\sum_p E_p\leq PE$, respectively. Assuming the base GNN has $O(N+E)$ runtime and memory complexity, our patch embedding module has $O(N'+E')$ runtime and memory complexity, introducing a constant-factor overhead over the base GNN model. For the mixer layers, the complexity is $O(P)$, as discussed in Section 4.5.
- Arbitrary choice of the number of clusters in METIS. The number of patches needs to be selected, and it differs across datasets. Besides, using the same number of patches for graphs of variable sizes makes the network operate at different levels of graph resolution, which may affect the overall performance.
- Empirical experiments on WL-expressivity. Our results on the expressivity of Graph MLP-Mixer are empirical. A theoretical analysis of the model's expressivity on graphs with higher WL degrees would be valuable, but such an analysis is non-trivial.
- Training and pre-training on large-scale datasets of small and large graphs. More experiments on MalNet [94], PascalVOC-SP, COCO-SP [66] and PCQM4Mv2 [65] would further test the supervised ability of the model. Besides, the pre-training capability of Graph MLP-Mixer on large-scale datasets with small and large graphs was not studied. We leave these tasks as future work.
Table: S3.T1: Differences between ViT/MLP-Mixer components for images and graphs.
| | Images | Graphs |
|---|---|---|
| Input | Regular grid; same data resolution (Height, Width) | Irregular domain; variable data structure (# Nodes and # Edges) |
| Patch Extraction | Via pixel reordering; non-overlapping patches; same patches at each epoch | Via graph clustering algorithm; overlapping patches; different patches at each epoch |
| Patch Encoder | Same patch resolution (Patch Height, Patch Width); MLP (equivalently CNN) | Variable patch structure (# Nodes and # Edges); GNN (e.g. GCN, GAT, GT) |
| Positional Information | Implicitly ordered (no need for explicit PE) | No universal ordering; Node PE for patch encoder; Patch PE for token mixer |
| ViT / MLP-Mixer | MLP / Channel mixer; MHA / Token mixer | MLP / Channel mixer; gMHA / Token mixer |
Table: S5.T2: Comparison with base MP-GNNs. Results are averaged over 4 runs with 4 different seeds.
| Model | ZINC (MAE ↓) | MNIST (Accuracy ↑) | CIFAR10 (Accuracy ↑) | MolTOX21 (ROCAUC ↑) | MolHIV (ROCAUC ↑) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) |
|---|---|---|---|---|---|---|---|
| GCN | 0.1952 ± 0.0057 | 0.9269 ± 0.0023 | 0.5423 ± 0.0056 | 0.7525 ± 0.0031 | 0.7813 ± 0.0081 | 0.6328 ± 0.0086 | 0.2758 ± 0.0012 |
| GCN-MLP-Mixer | 0.1347 ± 0.0020 | 0.9516 ± 0.0027 | 0.6111 ± 0.0017 | 0.7816 ± 0.0075 | 0.7929 ± 0.0111 | 0.6832 ± 0.0061 | 0.2486 ± 0.0041 |
| GCN-ViT | 0.1688 ± 0.0095 | 0.9600 ± 0.0015 | 0.6367 ± 0.0027 | 0.7820 ± 0.0096 | 0.7780 ± 0.0120 | 0.6855 ± 0.0049 | 0.2468 ± 0.0015 |
| GatedGCN | 0.1577 ± 0.0046 | 0.9776 ± 0.0017 | 0.6628 ± 0.0017 | 0.7641 ± 0.0057 | 0.7874 ± 0.0119 | 0.6300 ± 0.0029 | 0.2778 ± 0.0017 |
| GatedGCN-MLP-Mixer | 0.1244 ± 0.0053 | 0.9832 ± 0.0004 | 0.7060 ± 0.0022 | 0.7910 ± 0.0040 | 0.7976 ± 0.0136 | 0.6932 ± 0.0017 | 0.2508 ± 0.0007 |
| GatedGCN-ViT | 0.1421 ± 0.0031 | 0.9846 ± 0.0009 | 0.7158 ± 0.0009 | 0.7857 ± 0.0028 | 0.7734 ± 0.0114 | 0.6942 ± 0.0075 | 0.2465 ± 0.0015 |
| GINE | 0.1072 ± 0.0037 | 0.9705 ± 0.0023 | 0.6131 ± 0.0035 | 0.7730 ± 0.0064 | 0.7885 ± 0.0034 | 0.6405 ± 0.0077 | 0.2780 ± 0.0021 |
| GINE-MLP-Mixer | 0.0733 ± 0.0014 | 0.9809 ± 0.0004 | 0.6833 ± 0.0022 | 0.7868 ± 0.0043 | 0.7997 ± 0.0102 | 0.6970 ± 0.0080 | 0.2494 ± 0.0007 |
| GINE-ViT | 0.0849 ± 0.0047 | 0.9820 ± 0.0005 | 0.6967 ± 0.0040 | 0.7851 ± 0.0077 | 0.7792 ± 0.0149 | 0.6919 ± 0.0085 | 0.2449 ± 0.0016 |
| GraphTrans | 0.1230 ± 0.0018 | 0.9782 ± 0.0012 | 0.6809 ± 0.0020 | 0.7646 ± 0.0055 | 0.7884 ± 0.0104 | 0.6313 ± 0.0039 | 0.2777 ± 0.0025 |
| GraphTrans-MLP-Mixer | 0.0773 ± 0.0030 | 0.9742 ± 0.0011 | 0.7396 ± 0.0033 | 0.7817 ± 0.0040 | 0.7969 ± 0.0061 | 0.6858 ± 0.0062 | 0.2480 ± 0.0013 |
| GraphTrans-ViT | 0.0960 ± 0.0073 | 0.9725 ± 0.0023 | 0.7211 ± 0.0055 | 0.7835 ± 0.0032 | 0.7755 ± 0.0208 | 0.6876 ± 0.0059 | 0.2455 ± 0.0027 |
Table: A1.T5: Comparison of different hierarchical graph models.
| Model | GNN | Transformer | Graph Coarsening | Local Info. | Global Info. |
|---|---|---|---|---|---|
| Coarformer [78] | ✓ | ✓ | ✓ (non-overlap, static) | GNN on original graph | MHA on coarsened graph |
| Exphormer [79] | ✓ | ✓ | ✗ | GNN on original graph | MHA on expander graph |
| ANS-GT [80] | ✗ | ✓ | ✓ (non-overlap, static) | adaptive node sampling strategy | sampled nodes from the coarsened graph |
| NAGphormer [81] | ✗ | ✓ | ✗ | MHA on multi-hop neighbourhood | – |
| Graph MLP-Mixer (Ours) | ✓ | ✓ | ✓ (overlap, dynamic) | GNN on graph patches | token mixer across patches |
Table: A2.T6: Summary statistics of datasets used in this study
| Dataset | #Graphs | #Nodes | Avg. #Nodes | Avg. #Edges | Task | Metric |
|---|---|---|---|---|---|---|
| CSL | 150 | 41 | 41 | 164 | 10-class classif. | Accuracy |
| EXP | 1,200 | 32-64 | 44.4 | 110.2 | 2-class classif. | Accuracy |
| SR25 | 15 | 25 | 25 | 300 | 15-class classif. | Accuracy |
| ZINC | 12,000 | 9-37 | 23.2 | 24.9 | regression | MAE |
| MNIST | 70,000 | 40-75 | 70.6 | 684.4 | 10-class classif. | Accuracy |
| CIFAR10 | 60,000 | 85-150 | 117.6 | 1129.7 | 10-class classif. | Accuracy |
| MolTOX21 | 7,831 | 1-132 | 18.57 | 38.6 | 12-task classif. | ROCAUC |
| MolHIV | 41,127 | 2-222 | 25.5 | 54.9 | binary classif. | ROCAUC |
| Peptides-func | 15,535 | 8-444 | 150.9 | 307.3 | 10-class classif. | Avg. Precision (AP) |
| Peptides-struct | 15,535 | 8-444 | 150.9 | 307.3 | regression | MAE |
| TreeNeighbourMatch (r=2) | 96 | 7 | 7 | 6 | 4-class classif. | Accuracy |
| TreeNeighbourMatch (r=3) | 32,000 | 15 | 15 | 14 | 8-class classif. | Accuracy |
| TreeNeighbourMatch (r=4) | 64,000 | 31 | 31 | 30 | 16-class classif. | Accuracy |
| TreeNeighbourMatch (r=5) | 128,000 | 63 | 63 | 62 | 32-class classif. | Accuracy |
| TreeNeighbourMatch (r=6) | 256,000 | 127 | 127 | 126 | 64-class classif. | Accuracy |
| TreeNeighbourMatch (r=7) | 512,000 | 255 | 255 | 254 | 128-class classif. | Accuracy |
| TreeNeighbourMatch (r=8) | 640,000 | 511 | 511 | 510 | 256-class classif. | Accuracy |
Table: A4.T9: Effect of patch extraction: ✗ means no patch extraction and ✓ means patch extraction is used.
| Model | Patch Extraction | ZINC (MAE ↓) | Peptides-func (AP ↑) |
|---|---|---|---|
| GCN-MLP-Mixer | ✗ | 0.2495 ± 0.0040 | 0.6341 ± 0.0139 |
| GCN-MLP-Mixer | ✓ | 0.1347 ± 0.0020 | 0.6832 ± 0.0061 |
| GatedGCN-MLP-Mixer | ✗ | 0.2521 ± 0.0084 | 0.6230 ± 0.0110 |
| GatedGCN-MLP-Mixer | ✓ | 0.1244 ± 0.0053 | 0.6932 ± 0.0017 |
| GINE-MLP-Mixer | ✗ | 0.2558 ± 0.0059 | 0.6350 ± 0.0038 |
| GINE-MLP-Mixer | ✓ | 0.0733 ± 0.0014 | 0.6970 ± 0.0080 |
| GraphTrans-MLP-Mixer | ✗ | 0.2538 ± 0.0067 | 0.6224 ± 0.0112 |
| GraphTrans-MLP-Mixer | ✓ | 0.0773 ± 0.0030 | 0.6858 ± 0.0062 |
Table: A5.T12: Effect of positional encoding. We study the effects of node PE and patch PE by removing one of them in turn from our model while keeping the other components unchanged.
| Dataset | Method | GCN-MLP-Mixer | GatedGCN-MLP-Mixer | GINE-MLP-Mixer | GraphTrans-MLP-Mixer |
|---|---|---|---|---|---|
| ZINC | Full | 0.1347 ± 0.0020 | 0.1244 ± 0.0053 | 0.0733 ± 0.0014 | 0.0773 ± 0.0030 |
| ZINC | - NodePE | 0.1944 ± 0.0061 | 0.1775 ± 0.0031 | 0.1225 ± 0.0070 | 0.1393 ± 0.0122 |
| ZINC | - PatchPE | 0.1414 ± 0.0058 | 0.1250 ± 0.0026 | 0.0746 ± 0.0010 | 0.0778 ± 0.0029 |
| ZINC | - Both | 0.2207 ± 0.0072 | 0.1883 ± 0.0096 | 0.1160 ± 0.0023 | 0.1700 ± 0.0064 |
| Peptides-func | Full | 0.6832 ± 0.0061 | 0.6932 ± 0.0017 | 0.6970 ± 0.0080 | 0.6858 ± 0.0062 |
| Peptides-func | - NodePE | 0.6688 ± 0.0039 | 0.6864 ± 0.0080 | 0.6868 ± 0.0034 | 0.6763 ± 0.0030 |
| Peptides-func | - PatchPE | 0.6871 ± 0.0055 | 0.6934 ± 0.0055 | 0.6933 ± 0.0104 | 0.6882 ± 0.0076 |
| Peptides-func | - Both | 0.6760 ± 0.0078 | 0.6847 ± 0.0034 | 0.6756 ± 0.0070 | 0.6783 ± 0.0088 |
Table: A5.T13: Ablation combining the effects of PE and patch size on ZINC.
| Patch Size | 2 | 4 | 16 | 32 |
|---|---|---|---|---|
| Full | 0.0983 ± 0.0042 | 0.1011 ± 0.0103 | 0.0799 ± 0.0037 | 0.0743 ± 0.0049 |
| - Node PE | 0.1589 ± 0.0056 | 0.1414 ± 0.0061 | 0.1307 ± 0.0107 | 0.1154 ± 0.0032 |
| - Patch PE | 0.1081 ± 0.0007 | 0.1076 ± 0.0110 | 0.0840 ± 0.0035 | 0.0744 ± 0.0037 |
| - Both | 0.1677 ± 0.0045 | 0.1532 ± 0.0051 | 0.1284 ± 0.0018 | 0.1187 ± 0.0050 |
Table: A6.T15: Different designs of graph-based multi-head attention (gMHA) in the transformer layer.
| gMHA | Equation | ZINC (MAE ↓) | Peptides-func (AP ↑) |
|---|---|---|---|
| Standard/Full attention [50] | $\textrm{softmax}\big(\frac{QK^T}{\sqrt{d}}\big)V$ | 0.1784 ± 0.0238 | 0.6778 ± 0.0039 |
| Graph Attention [37] | $\textrm{softmax}\big(A^P\odot\frac{QK^T}{\sqrt{d}}\big)V$ | 0.1527 ± 0.0067 | 0.6795 ± 0.0070 |
| Kernel Attention [53] | $\textrm{softmax}\big(\textrm{RW}(A^P)\odot\frac{QK^T}{\sqrt{d}}\big)V$ | 0.1010 ± 0.0031 | 0.6844 ± 0.0102 |
| Additive Attention [56] | $\textrm{softmax}\big(\frac{QK^T}{\sqrt{d}}\big)V+\textrm{LL}(A^P)$ | 0.1632 ± 0.0063 | 0.6842 ± 0.0057 |
| Hadamard Attention | $\big(A^P\odot\textrm{softmax}(\frac{QK^T}{\sqrt{d}})\big)V$ | 0.0849 ± 0.0047 | 0.6919 ± 0.0085 |
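The Hadamard-attention variant can be sketched as a single-head NumPy function: a standard scaled dot-product attention whose softmax scores are masked elementwise by the coarsened patch adjacency $A^P$ before aggregating values. This is a minimal sketch, not the paper's multi-head implementation; the projection matrices `Wq`, `Wk`, `Wv` and the single-head setup are assumptions for illustration.

```python
import numpy as np

def hadamard_gmha(X, A, Wq, Wk, Wv):
    """Single-head Hadamard attention over patch tokens X (P x d):
    (A ⊙ softmax(QK^T / sqrt(d))) V, with A the patch-level adjacency mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # row-wise softmax
    return (A * attn) @ V                                  # Hadamard mask, then mix

rng = np.random.default_rng(0)
P, d = 4, 8
X = rng.normal(size=(P, d))
A = np.ones((P, P))  # fully-connected patch graph reduces to full attention
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = hadamard_gmha(X, A, Wq, Wk, Wv)
assert out.shape == (P, d)
```

Note that, unlike Graph Attention above, the mask here is applied after the softmax, so masked-out entries simply drop their contribution rather than renormalizing the attention distribution.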
Table: A8.T17: Comparison of our best results from Table 2 with state-of-the-art models on large real-world datasets [66].
| Model | # Params | Peptides-func (AP ↑) | Time (s/epoch) | Memory (MB) | Peptides-struct (MAE ↓) | Time (s/epoch) | Memory (MB) |
|---|---|---|---|---|---|---|---|
| GCN | 508k | 0.5930 ± 0.0023 | 4.59 | 696 | 0.3496 ± 0.0013 | 4.51 | 686 |
| GINE | 476k | 0.5498 ± 0.0079 | 3.94 | 659 | 0.3547 ± 0.0045 | 3.84 | 658 |
| GatedGCN | 509k | 0.5864 ± 0.0077 | 5.48 | 1,038 | 0.3420 ± 0.0013 | 5.31 | 1,029 |
| GatedGCN + RWSE | 506k | 0.6069 ± 0.0035 | 5.75 | 1,035 | 0.3357 ± 0.0006 | 5.61 | 1,038 |
| Transformer + LapPE | 488k | 0.6326 ± 0.0126 | 9.74 (1.1×) | 6,661 (6.6×) | 0.2529 ± 0.0016 | 9.61 (1.1×) | 6,646 (8.0×) |
| SAN + LapPE [55] | 493k | 0.6384 ± 0.0121 | 80.47 (9.4×) | 12,493 (12.4×) | 0.2683 ± 0.0043 | 79.41 (8.8×) | 12,226 (14.7×) |
| SAN + RWSE [55] | 500k | 0.6439 ± 0.0075 | 68.44 (8.0×) | 19,691 (19.5×) | 0.2545 ± 0.0012 | 70.39 (7.8×) | 12,111 (14.5×) |
| GPS [57] | 504k | 0.6562 ± 0.0115 | 11.83 (1.4×) | 6,904 (6.8×) | 0.2515 ± 0.0012 | 11.74 (1.3×) | 6,878 (8.3×) |
| GNN-AK+ [31] | 631k | 0.6480 ± 0.0089 | 22.52 (2.6×) | 7,855 (7.8×) | 0.2736 ± 0.0007 | 22.11 (2.5×) | 7,634 (9.2×) |
| SUN [32] | 508k | 0.6730 ± 0.0078 | 376.66 (43.8×) | 18,941 (18.8×) | 0.2498 ± 0.0008 | 384.26 (42.7×) | 17,215 (20.7×) |
| GCN-MLP-Mixer | 329k | 0.6832 ± 0.0061 | 8.48 | 716 | 0.2486 ± 0.0041 | 8.12 | 679 |
| GatedGCN-MLP-Mixer | 527k | 0.6932 ± 0.0017 | 8.96 | 969 | 0.2508 ± 0.0007 | 8.44 | 887 |
| GINE-MLP-Mixer | 397k | 0.6970 ± 0.0080 | 8.59 (1.0×) | 1,010 (1.0×) | 0.2494 ± 0.0007 | 8.51 | 974 |
| GraphTrans-MLP-Mixer | 593k | 0.6858 ± 0.0062 | 9.94 | 975 | 0.2480 ± 0.0013 | 9.00 | 1,048 |
| GCN-ViT | 493k | 0.6855 ± 0.0049 | 8.90 | 628 | 0.2468 ± 0.0015 | 8.55 | 609 |
| GatedGCN-ViT | 692k | 0.6942 ± 0.0075 | 9.07 | 848 | 0.2465 ± 0.0015 | 9.00 (1.0×) | 833 (1.0×) |
| GINE-ViT | 561k | 0.6919 ± 0.0085 | 8.98 | 920 | 0.2449 ± 0.0016 | 8.77 | 902 |
| GraphTrans-ViT | 757k | 0.6876 ± 0.0059 | 9.94 | 975 | 0.2455 ± 0.0027 | 9.58 | 981 |
Figure: The basic architecture of the proposed Graph ViT/MLP-Mixer. It consists of a patch extraction module, a patch embedding module, a sequence of mixer layers, a global average pooling layer, and a classifier head. The patch extraction module partitions graphs into overlapping patches. The patch embedding module transforms these graph patches into token representations, which are fed into a sequence of mixer layers to generate the output tokens. A global average pooling layer followed by a fully connected layer is finally used for prediction. Each mixer layer, whether MLP-based or graph-based multi-head attention (gMHA), is a residual network that alternates between a token mixer applied across all patches and a channel mixer applied to each patch independently (see right side).
Figure: Effect of the number of patches on ZINC, CIFAR10, MolTOX21, and Peptides-func/struct.


$$ h_i^0=U^0\alpha_i + u^0 \in\mathbb{R}^{d} ; \quad e_{ij}^0=V^0\beta_{ij}+v^0 \in\mathbb{R}^{d} \label{eq: input features} $$ \tag{eq: input features}
$$ \begin{split}&h_{i,p}^{\ell+1}=f_{\textrm{node}}(h_{i,p}^{\ell},\{h_{j,p}^{\ell}\,|\,j\in\mathcal{N}(i)\},e_{ij,p}^{\ell})+g_{\textrm{patch-node}}(h_{p}^{\ell}),\quad h_{i,p}^{\ell+1},h_{i,p}^{\ell},h_{p}^{\ell}\in\mathbb{R}^{d},\\ &e_{ij,p}^{\ell+1}=f_{\textrm{edge}}(h_{i,p}^{\ell},h_{j,p}^{\ell},e_{ij,p}^{\ell})+g_{\textrm{patch-edge}}(e_{p}^{\ell}),\quad e_{ij,p}^{\ell+1},e_{ij,p}^{\ell},e_{p}^{\ell}\in\mathbb{R}^{d},\end{split} $$ \tag{S4.E2}
$$ A^{P}_{ij}=|\mathcal{V}_i \cap \mathcal{V}_j | =\textrm{Cut}(\mathcal{V}_i, \mathcal{V}_j), \label{eq: coarsen_adj} $$ \tag{eq: coarsen_adj}
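The coarsened patch adjacency $A^P_{ij}=|\mathcal{V}_i\cap\mathcal{V}_j|$ is straightforward to compute from the overlapping node sets. A minimal stdlib sketch (the zeroed diagonal is a design choice here, not prescribed by the definition):

```python
def coarsened_adjacency(patches):
    """A_P[i][j] = |V_i ∩ V_j|: patch-level edge weights from node overlap."""
    P = len(patches)
    sets = [set(p) for p in patches]
    return [[len(sets[i] & sets[j]) if i != j else 0 for j in range(P)]
            for i in range(P)]

# Three overlapping patches on a 6-node graph.
patches = [[0, 1, 2, 3], [2, 3, 4], [4, 5, 0]]
A_P = coarsened_adjacency(patches)
assert A_P[0][1] == 2  # patches 0 and 1 share nodes {2, 3}
assert A_P[1][2] == 1  # patches 1 and 2 share node {4}
```

Because patches overlap by construction (the one-hop adjustment), $A^P$ is generally dense enough to connect neighbouring patches, which is what the token mixer exploits.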
$$ x_i^0 = \hat T^0 \hat p_i + \hat U^0 x_i + \hat u^0 \in \mathbb{R}^d, \label{eq: patch pe} $$ \tag{eq: patch pe}
$$ \begin{split} &U = X + W_2\,\sigma(W_1\,\textrm{LayerNorm}(X)) \in \mathbb{R}^{P\times d} \quad\quad\quad \textrm{Token mixer},\\ &Y = U + (W_4\,\sigma(W_3\,\textrm{LayerNorm}(U)^T))^T \in \mathbb{R}^{P\times d} \quad\quad\quad \textrm{Channel mixer}, \end{split} $$
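The mixer layer can be sketched in a few lines of NumPy. This is a minimal single-layer sketch under stated assumptions: the hidden widths `d_s`, `d_c`, the GELU nonlinearity, and a LayerNorm without learnable affine parameters are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # per-token normalization over the channel dimension (no learned affine)
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def mixer_layer(X, W1, W2, W3, W4):
    """One mixer layer over patch tokens X in R^{P x d}."""
    U = X + W2 @ gelu(W1 @ layer_norm(X))        # token mixer: mixes across the P patches
    Y = U + (W4 @ gelu(W3 @ layer_norm(U).T)).T  # channel mixer: mixes across the d channels
    return Y

rng = np.random.default_rng(0)
P, d, d_s, d_c = 4, 8, 16, 32
X = rng.normal(size=(P, d))
W1, W2 = rng.normal(size=(d_s, P)), rng.normal(size=(P, d_s))
W3, W4 = rng.normal(size=(d_c, d)), rng.normal(size=(d, d_c))
Y = mixer_layer(X, W1, W2, W3, W4)
assert Y.shape == (P, d)
```

Note the shapes: `W1`/`W2` act on the patch dimension $P$ (token mixing), while `W3`/`W4` act on the channel dimension $d$ through the transpose (channel mixing), matching the two residual equations above.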
$$ h_G = \sum_{p} m_p \cdot x_{G_p} / \sum_{p} m_p \ \in\mathbb{R}^{d}, $$
$$ y_{G} = \textrm{MLP}(h_G) \in \mathbb{R} \textrm{ (regression) } \quad \textrm{or} \quad \mathbb{R}^{n_c} \textrm{ (classification)}. $$
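The weighted-average readout $h_G=\sum_p m_p\,x_{G_p}/\sum_p m_p$ is a one-liner in pure Python; the interpretation of the weights $m_p$ (e.g. patch size, or a 0/1 validity mask for empty patches) is an assumption in this sketch.

```python
def readout(patch_tokens, patch_masses):
    """Weighted average pooling over patch tokens: h_G = sum_p m_p * x_p / sum_p m_p."""
    total = sum(patch_masses)
    d = len(patch_tokens[0])
    return [sum(m * x[k] for m, x in zip(patch_masses, patch_tokens)) / total
            for k in range(d)]

tokens = [[1.0, 2.0], [3.0, 4.0]]  # two patch tokens, d = 2
masses = [1.0, 3.0]                # e.g. number of nodes per patch
h_G = readout(tokens, masses)      # -> [2.5, 3.5]
```

The resulting graph embedding `h_G` is then passed to the MLP head in the prediction equation above.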
| Model | ZINC (MAE ↓) | MolHIV (ROCAUC ↑) | Peptides-func (AP ↑) | Time | Memory | Peptides-struct (MAE ↓) | Time | Memory |
|---|---|---|---|---|---|---|---|---|
| GT [36] | 0.226 ± 0.014 | - | - | - | - | - | - | - |
| GraphiT [53] | 0.202 ± 0.011 | - | - | - | - | - | - | - |
| Graphormer [56] | 0.122 ± 0.006 | - | - | - | - | - | - | - |
| GPS [57] | 0.070 ± 0.004 | 0.7880 ± 0.0101 | 0.6562 ± 0.0115 | 1.4× | 6.8× | 0.2515 ± 0.0012 | 1.3× | 8.3× |
| SAN+LapPE [38] | 0.139 ± 0.006 | 0.7775 ± 0.0061 | 0.6384 ± 0.0121 | 9.4× | 12.4× | 0.2683 ± 0.0043 | 8.8× | 14.7× |
| SAN+RWSE [38] | - | - | 0.6439 ± 0.0075 | 8.0× | 19.5× | 0.2545 ± 0.0012 | 7.9× | 14.5× |
| GNN-AK+ [31] | 0.080 ± 0.001 | 0.7961 ± 0.0119 | 0.6480 ± 0.0089 | 2.6× | 7.8× | 0.2736 ± 0.0007 | 2.5× | 9.2× |
| SUN [32] | 0.084 ± 0.002 | 0.8003 ± 0.0055 | 0.6730 ± 0.0078 | 43.8× | 18.8× | 0.2498 ± 0.0008 | 42.7× | 20.7× |
| CIN [29] | 0.079 ± 0.006 | 0.8094 ± 0.0057 | - | - | - | - | - | - |
| Graph MLP-Mixer | 0.073 ± 0.001 | 0.7997 ± 0.0102 | 0.6970 ± 0.0080 | 1.0× | 1.0× | 0.2475 ± 0.0015 | 1.0× | 1.2× |
| Graph ViT | 0.085 ± 0.005 | 0.7792 ± 0.0149 | 0.6942 ± 0.0075 | 1.1× | 0.8× | 0.2449 ± 0.0016 | 1.0× | 1.0× |
| Model | CSL (ACC) | EXP (ACC) | SR25 (ACC) |
|---|---|---|---|
| GCN | 10.00 ± 0.00 | 51.90 ± 1.96 | 6.67 ± 0.00 |
| GatedGCN | 10.00 ± 0.00 | 51.73 ± 1.65 | 6.67 ± 0.00 |
| GINE | 10.00 ± 0.00 | 50.69 ± 1.39 | 6.67 ± 0.00 |
| GraphTrans | 10.00 ± 0.00 | 52.35 ± 2.32 | 6.67 ± 0.00 |
| GCN-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GatedGCN-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GINE-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GraphTrans-MLP-Mixer | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| Dataset | # Patches | #Nodes per Patch (Mean) | (Min) | (Max) | Patch Diameter (Mean) | (Min) | (Max) |
|---|---|---|---|---|---|---|---|
| CSL | 32 | 5.80 | 5 | 8 | 2.28 | 2 | 3 |
| EXP | 32 | 4.07 | 2 | 11 | 2.31 | 1 | 5 |
| SR25 | 32 | 13.00 | 13 | 13 | 2.00 | 2 | 2 |
| ZINC | 32 | 3.15 | 2 | 7 | 1.82 | 1 | 3 |
| MNIST | 32 | 14.36 | 9 | 28 | 2.85 | 2 | 5 |
| CIFAR10 | 32 | 17.20 | 10 | 35 | 3.07 | 2 | 7 |
| MolTOX21 | 32 | 3.15 | 1 | 10 | 1.80 | 0 | 6 |
| MolHIV | 32 | 3.27 | 1 | 13 | 1.87 | 0 | 8 |
| Peptides-func | 32 | 7.08 | 1 | 20 | 4.15 | 0 | 14 |
| Peptides-struct | 32 | 7.08 | 1 | 20 | 4.15 | 0 | 14 |
| TreeNeighbourMatch(r=2) | 8 | 1.86 | 1 | 3 | 0.86 | 0 | 2 |
| TreeNeighbourMatch(r=3) | 32 | 1.93 | 1 | 3 | 0.93 | 0 | 2 |
| TreeNeighbourMatch(r=4) | 32 | 1.97 | 1 | 3 | 0.97 | 0 | 2 |
| TreeNeighbourMatch(r=5) | 32 | 3.28 | 1 | 5 | 2.25 | 0 | 3 |
| TreeNeighbourMatch(r=6) | 32 | 5.34 | 3 | 8 | 3.31 | 2 | 5 |
| TreeNeighbourMatch(r=7) | 32 | 9.19 | 7 | 14 | 4.33 | 4 | 5 |
| TreeNeighbourMatch(r=8) | 32 | 17.03 | 15 | 23 | 6.17 | 6 | 8 |
| | CSL | EXP | SR25 | ZINC | MNIST | CIFAR10 | MolTOX21 | MolHIV | Peptides-func | Peptides-struct |
|---|---|---|---|---|---|---|---|---|---|---|
| NodePE | RWSE-8 | RWSE-8 | LapPE-8 | RWSE-20 | LapPE-8 | LapPE-8 | - | - | RWSE-16 | RWSE-16 |
| PatchPE | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | RWSE-8 | - | - | RWSE-8 | RWSE-8 |
| Model | ZINC (MAE ↓ ) | ZINC (MAE ↓ ) | Peptides-struct (MAE ↓ ) | Peptides-struct (MAE ↓ ) |
|---|---|---|---|---|
| METIS | Random | METIS | Random | |
| GCN-MLP-Mixer | 0.1347 ± 0.0020 | 0.1435 ± 0.0122 | 0.2486 ± 0.0041 | 0.2565 ± 0.0031 |
| GatedGCN-MLP-Mixer | 0.1244 ± 0.0053 | 0.1284 ± 0.0074 | 0.2508 ± 0.0007 | 0.2539 ± 0.0012 |
| GINE-MLP-Mixer | 0.0733 ± 0.0014 | 0.0708 ± 0.0020 | 0.2494 ± 0.0007 | 0.2559 ± 0.0012 |
| GraphTrans-MLP-Mixer | 0.0773 ± 0.0030 | 0.0767 ± 0.0019 | 0.2480 ± 0.0013 | 0.2574 ± 0.0025 |
| Model | P=2 | P=4 | P=8 | P=16 | P=32 |
|---|---|---|---|---|---|
| GCN-MLP-Mixer | 57.54 ± 3.87 | 99.44 ± 0.59 | 99.69 ± 0.98 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GatedGCN-MLP-Mixer | 67.65 ± 2.01 | 99.77 ± 0.37 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GINE-MLP-Mixer | 57.75 ± 3.80 | 99.58 ± 0.45 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| GraphTrans-MLP-Mixer | 73.79 ± 1.52 | 96.77 ± 8.43 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| Patch Size | 2 | 4 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| Full | 0.6578 ± 0.0063 | 0.6675 ± 0.0037 | 0.6855 ± 0.0039 | 0.6939 ± 0.0034 | 0.6944 ± 0.0074 |
| - Node PE | 0.6613 ± 0.0063 | 0.6708 ± 0.0065 | 0.6864 ± 0.0069 | 0.6873 ± 0.0033 | 0.6789 ± 0.0047 |
| - Patch PE | 0.6594 ± 0.0059 | 0.6724 ± 0.0051 | 0.6937 ± 0.0068 | 0.6939 ± 0.0062 | 0.6865 ± 0.0061 |
| - Both | 0.6562 ± 0.0057 | 0.6739 ± 0.0038 | 0.6879 ± 0.0052 | 0.6825 ± 0.0074 | 0.6746 ± 0.0056 |
| Model | DA | ZINC (MAE ↓) | ZINC Time (s/epoch) | Peptides-struct (MAE ↓) | Peptides-struct Time (s/epoch) |
|---|---|---|---|---|---|
| GCN-MLP-Mixer | ✗ | 0.2537 ± 0.0139 | 5.3603 | 0.2761 ± 0.0041 | 6.8297 |
| GCN-MLP-Mixer | ✓ | 0.1347 ± 0.0020 | 5.6728 | 0.2486 ± 0.0041 | 9.2561 |
| GatedGCN-MLP-Mixer | ✗ | 0.2121 ± 0.0172 | 5.3816 | 0.2776 ± 0.0020 | 7.8609 |
| GatedGCN-MLP-Mixer | ✓ | 0.1244 ± 0.0053 | 5.7786 | 0.2508 ± 0.0007 | 9.5830 |
| GINE-MLP-Mixer | ✗ | 0.1389 ± 0.0171 | 5.3905 | 0.2792 ± 0.0043 | 7.8849 |
| GINE-MLP-Mixer | ✓ | 0.0733 ± 0.0014 | 5.6704 | 0.2494 ± 0.0007 | 8.8136 |
| GraphTrans-MLP-Mixer | ✗ | 0.1665 ± 0.0145 | 6.0039 | 0.2802 ± 0.0030 | 9.0999 |
| GraphTrans-MLP-Mixer | ✓ | 0.0773 ± 0.0030 | 6.1616 | 0.2480 ± 0.0013 | 9.7730 |


References
[deac2022expander] Deac, Andreea, Lackenby, Marc, Veličković, Petar. (2022). Expander Graph Propagation. The First Learning on Graphs Conference.
[arnaiz2022diffwire] Arnaiz-Rodríguez, Adrián, Begga, Ahmed, Escolano, Francisco, Oliver, Nuria. (2022). DiffWire: Inductive Graph Rewiring via the Lovász Bound. The First Learning on Graphs Conference.
[belkin2019reconciling] Belkin, Mikhail, Hsu, Daniel, Ma, Siyuan, Mandal, Soumik. (2019). Reconciling modern machine-learning practice and the classical bias--variance trade-off. Proceedings of the National Academy of Sciences.
[fey2019fast] Fey, Matthias, Lenssen, Jan Eric. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
[wang2019deep] Wang, Minjie, Yu, Lingfan, Zheng, Da, Gan, Quan, Gai, Yu, Ye, Zihao, Li, Mufei, Zhou, Jinjing, Huang, Qi, Ma, Chao, others. (2019). Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs..
[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural computation.
[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[page1999pagerank] Page, Lawrence, Brin, Sergey, Motwani, Rajeev, Winograd, Terry. (1999). The PageRank citation ranking: Bringing order to the web..
[mialon2021graphit] Mialon, Grégoire, Chen, Dexiong, Selosse, Margot, Mairal, Julien. (2021). GraphiT: Encoding Graph Structure in Transformers. arXiv preprint arXiv:2106.05667.
[chung1997spectral] Chung, Fan RK. (1997). Spectral graph theory.
[zheng2020dgl] Zheng, Da, Song, Xiang, Ma, Chao, Tan, Zeyuan, Ye, Zihao, Dong, Jin, Xiong, Hao, Zhang, Zheng, Karypis, George. (2020). Dgl-ke: Training knowledge graph embeddings at scale. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[bulucc2016recent] Buluç, Aydın, Meyerhenke, Henning, Safro, Ilya, Sanders, Peter, Schulz, Christian. (2016). Recent advances in graph partitioning. Algorithm engineering.
[zhao2021data] Tong Zhao, Yozen Liu, Leonardo Neves, Oliver J. Woodford, Meng Jiang, Neil Shah. (2020). Data Augmentation for Graph Neural Networks. CoRR.
[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.
[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Loukas2020What] Andreas Loukas. (2020). What graph neural networks cannot learn: depth vs width. International Conference on Learning Representations.
[li2020distance] Li, Pan, Wang, Yanbang, Wang, Hongwei, Leskovec, Jure. (2020). Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning. Advances in Neural Information Processing Systems.
[lim2022sign] Lim, Derek, Robinson, Joshua, Zhao, Lingxiao, Smidt, Tess, Sra, Suvrit, Maron, Haggai, Jegelka, Stefanie. (2022). Sign and Basis Invariant Networks for Spectral Graph Representation Learning. arXiv preprint arXiv:2202.13013.
[dwivedi2021generalization] Dwivedi, Vijay Prakash, Bresson, Xavier. (2021). A Generalization of Transformer Networks to Graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications.
[chen2020can] Chen, Zhengdao, Chen, Lei, Villar, Soledad, Bruna, Joan. (2020). Can graph neural networks count substructures?. Advances in neural information processing systems.
[zhao2021stars] Zhao, Lingxiao, Jin, Wei, Akoglu, Leman, Shah, Neil. (2021). From Stars to Subgraphs: Uplifting Any GNN with Local Structure Awareness. International Conference on Learning Representations.
[zhang2021nested] Zhang, Muhan, Li, Pan. (2021). Nested graph neural networks. Advances in Neural Information Processing Systems.
[sun] Frasca, Fabrizio, Bevilacqua, Beatrice, Bronstein, Michael M, Maron, Haggai. (2022). Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries. arXiv preprint arXiv:2206.11140.
[bevilacqua2021equivariant] Bevilacqua, Beatrice, Frasca, Fabrizio, Lim, Derek, Srinivasan, Balasubramaniam, Cai, Chen, Balamurugan, Gopinath, Bronstein, Michael M, Maron, Haggai. (2021). Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910.
[bodnar2021weisfeiler] Bodnar, Cristian, Frasca, Fabrizio, Otter, Nina, Wang, Yuguang, Lio, Pietro, Montufar, Guido F, Bronstein, Michael. (2021). Weisfeiler and Lehman go cellular: CW networks. Advances in Neural Information Processing Systems.
[alsentzer2020subgraph] Alsentzer, Emily, Finlayson, Samuel, Li, Michelle, Zitnik, Marinka. (2020). Subgraph neural networks. Advances in Neural Information Processing Systems.
[bouritsas2022improving] Bouritsas, Giorgos, Frasca, Fabrizio, Zafeiriou, Stefanos P, Bronstein, Michael. (2022). Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[azizian2020expressive] Azizian, Waïss, Lelarge, Marc. (2020). Expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646.
[chen2019equivalence] Chen, Zhengdao, Chen, Lei, Villar, Soledad, Bruna, Joan. (2019). On the equivalence between graph isomorphism testing and function approximation with GNNs. Advances in neural information processing systems.
[maron2019provably] Maron, Haggai, Ben-Hamu, Heli, Serviansky, Hadar, Lipman, Yaron. (2019). Provably powerful graph networks. arXiv preprint arXiv:1905.11136.
[maron2018invariant] Maron, Haggai, Ben-Hamu, Heli, Shamir, Nadav, Lipman, Yaron. (2018). Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902.
[xu2018how] Xu, Keyulu, Hu, Weihua, Leskovec, Jure, Jegelka, Stefanie. (2019). How Powerful are Graph Neural Networks?. International Conference on Learning Representations.
[morris2019weisfeiler] Morris, Christopher, Ritzert, Martin, Fey, Matthias, Hamilton, William L, Lenssen, Jan Eric, Rattan, Gaurav, Grohe, Martin. (2019). Weisfeiler and leman go neural: Higher-order graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[weisfeiler1968reduction] Weisfeiler, Boris, Leman, Andrei. (1968). The reduction of a graph to canonical form and the algebra which appears therein. NTI Series.
[hamilton2017inductive] Hamilton, William L, Ying, Rex, Leskovec, Jure. (2017). Inductive representation learning on large graphs. Proceedings of the 31st International Conference on Neural Information Processing Systems.
[zopf20221] Zopf, Markus. (2022). 1-WL Expressiveness Is (Almost) All You Need. arXiv preprint arXiv:2202.10156.
[belkin2003laplacian] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation.
[dwivedi2021graph] Dwivedi, Vijay Prakash, Luu, Anh Tuan, Laurent, Thomas, Bengio, Yoshua, Bresson, Xavier. (2021). Graph Neural Networks with Learnable Structural and Positional Representations. International Conference on Learning Representations.
[velivckovic2018graph] Veličković, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Liò, Pietro, Bengio, Yoshua. (2018). Graph Attention Networks. International Conference on Learning Representations.
[defferrard2016convolutional] Defferrard, Michaël, Bresson, Xavier, Vandergheynst, Pierre. (2016). Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS.
[kipf2017semi] Kipf, Thomas N., Welling, Max. (2017). Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR).
[li2020graph] Li, Yang, Qian, Buyue, Zhang, Xianli, Liu, Hui. (2020). Graph Neural Network-Based Diagnosis Prediction. Big Data.
[li2021braingnn] Li, Xiaoxiao, Zhou, Yuan, Dvornek, Nicha, Zhang, Muhan, Gao, Siyuan, Zhuang, Juntang, Scheinost, Dustin, Staib, Lawrence H, Ventola, Pamela, Duncan, James S. (2021). Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis.
[gaudelet2020utilising] Gaudelet, Thomas, Day, Ben, Jamasb, Arian R, Soman, Jyothish, Regep, Cristian, Liu, Gertrude, Hayter, Jeremy BR, Vickers, Richard, Roberts, Charles, Tang, Jian, others. (2020). Utilising Graph Machine Learning within Drug Discovery and Development. arXiv preprint arXiv:2012.05716.
[stokes2020deep] Stokes, Jonathan M, Yang, Kevin, Swanson, Kyle, Jin, Wengong, Cubillos-Ruiz, Andres, Donghia, Nina M, MacNair, Craig R, French, Shawn, Carfrae, Lindsey A, Bloom-Ackermann, Zohar, others. (2020). A deep learning approach to antibiotic discovery. Cell.
[schlichtkrull2018modeling] Schlichtkrull, Michael, Kipf, Thomas N, Bloem, Peter, Berg, Rianne van den, Titov, Ivan, Welling, Max. (2018). Modeling relational data with graph convolutional networks. European semantic web conference.
[wu2021graph] Wu, Lingfei, Chen, Yu, Shen, Kai, Guo, Xiaojie, Gao, Hanning, Li, Shucheng, Pei, Jian, Long, Bo. (2021). Graph neural networks for natural language processing: A survey. arXiv preprint arXiv:2106.06090.
[han2022vision] Han, Kai, Wang, Yunhe, Guo, Jianyuan, Tang, Yehui, Wu, Enhua. (2022). Vision GNN: An Image is Worth Graph of Nodes. arXiv preprint arXiv:2206.00272.
[derrowpinion2021traffic] Derrow-Pinion, Austin, She, Jennifer, Wong, David, Lange, Oliver, Hester, Todd, Perez, Luis, Nunkesser, Marc, Lee, Seongjae, Guo, Xueying, Wiltshire, Brett, others. (2021). Eta prediction with graph neural networks in google maps. Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
[cranmer2019learning] Cranmer, Miles D, Xu, Rui, Battaglia, Peter, Ho, Shirley. (2019). Learning symbolic physics with graph networks. arXiv preprint arXiv:1909.05862.
[bapst2020unveiling] Bapst, Victor, Keck, Thomas, Grabska-Barwińska, Agnieszka, Donner, Craig, Cubuk, Ekin Dogus, Schoenholz, Samuel S, Obika, Annette, Nelson, Alexander WR, Back, Trevor, Hassabis, Demis, others. (2020). Unveiling the predictive power of static structure in glassy systems. Nature Physics.
[monti2017geometric] Monti, Federico, Bronstein, Michael, Bresson, Xavier. (2017). Geometric matrix completion with recurrent multi-graph neural networks. Advances in neural information processing systems.
[van2018graph] Berg, Rianne van den, Kipf, Thomas N, Welling, Max. (2017). Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263.
[duvenaud2015convolutional] Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, Adams, Ryan P. (2015). Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems.
[gilmer2017neural] Gilmer, Justin, Schoenholz, Samuel S, Riley, Patrick F, Vinyals, Oriol, Dahl, George E. (2017). Neural message passing for quantum chemistry. International Conference on Machine Learning.
[dwivedi2020benchmarking] Dwivedi, Vijay Prakash, Joshi, Chaitanya K, Laurent, Thomas, Bengio, Yoshua, Bresson, Xavier. (2020). Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982.
[ZINC] Irwin, John J, Sterling, Teague, Mysinger, Michael M, Bolstad, Erin S, Coleman, Ryan G. (2012). ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling.
[pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
[pyg] Fey, Matthias, Lenssen, Jan Eric. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
[hu2020open] Hu, Weihua, Fey, Matthias, Zitnik, Marinka, Dong, Yuxiao, Ren, Hongyu, Liu, Bowen, Catasta, Michele, Leskovec, Jure. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems.
[landrum2006rdkit] Landrum, Greg, others. (2006). RDKit: Open-source cheminformatics.
[MoleculeNet] Szklarczyk, Damian, Gable, Annika L, Lyon, David, Junge, Alexander, Wyder, Stefan, Huerta-Cepas, Jaime, Simonovic, Milan, Doncheva, Nadezhda T, Morris, John H, Bork, Peer, others. (2019). STRING v11: protein--protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research.
[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in neural information processing systems.
[tolstikhin2021mlp] Tolstikhin, Ilya O, Houlsby, Neil, Kolesnikov, Alexander, Beyer, Lucas, Zhai, Xiaohua, Unterthiner, Thomas, Yung, Jessica, Steiner, Andreas, Keysers, Daniel, Uszkoreit, Jakob, others. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems.
[wang2022dynamixer] Wang, Ziyu, Jiang, Wenhao, Zhu, Yiming M, Yuan, Li, Song, Yibing, Liu, Wei. (2022). Dynamixer: a vision MLP architecture with dynamic mixing. International Conference on Machine Learning.
[touvron2021resmlp] Touvron, Hugo, Bojanowski, Piotr, Caron, Mathilde, Cord, Matthieu, El-Nouby, Alaaeldin, Grave, Edouard, Izacard, Gautier, Joulin, Armand, Synnaeve, Gabriel, Verbeek, Jakob, others. (2021). Resmlp: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404.
[yu2022metaformer] Yu, Weihao, Luo, Mi, Zhou, Pan, Si, Chenyang, Zhou, Yichen, Wang, Xinchao, Feng, Jiashi, Yan, Shuicheng. (2022). Metaformer is actually what you need for vision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[min2022transformer] Min, Erxue, Chen, Runfa, Bian, Yatao, Xu, Tingyang, Zhao, Kangfei, Huang, Wenbing, Zhao, Peilin, Huang, Junzhou, Ananiadou, Sophia, Rong, Yu. (2022). Transformer for Graphs: An Overview from Architecture Perspective. arXiv preprint arXiv:2202.08455.
[zhao2019pairnorm] Zhao, Lingxiao, Akoglu, Leman. (2019). Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223.
[alon2020bottleneck] Alon, Uri, Yahav, Eran. (2020). On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205.
[zhang2020graphbert] Zhang, Jiawei, Zhang, Haopeng, Xia, Congying, Sun, Li. (2020). Graph-bert: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140.
[rong2020self] Rong, Yu, Bian, Yatao, Xu, Tingyang, Xie, Weiyang, Wei, Ying, Huang, Wenbing, Huang, Junzhou. (2020). Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems.
[wu2021GraphTrans] Wu, Zhanghao, Jain, Paras, Wright, Matthew, Mirhoseini, Azalia, Gonzalez, Joseph E, Stoica, Ion. (2021). Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems.
[kreuzer2021rethinking] Kreuzer, Devin, Beaini, Dominique, Hamilton, Will, Létourneau, Vincent, Tossou, Prudencio. (2021). Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems.
[ying2021graphormer] Ying, Chengxuan, Cai, Tianle, Luo, Shengjie, Zheng, Shuxin, Ke, Guolin, He, Di, Shen, Yanming, Liu, Tie-Yan. (2021). Do transformers really perform badly for graph representation?. Advances in Neural Information Processing Systems.
[chen2022structure_SAT] Chen, Dexiong, O’Bray, Leslie, Borgwardt, Karsten. (2022). Structure-aware transformer for graph representation learning. International Conference on Machine Learning.
[rampavsek2022recipe] Rampášek, Ladislav, Galkin, Mikhail, Dwivedi, Vijay Prakash, Luu, Anh Tuan, Wolf, Guy, Beaini, Dominique. (2022). Recipe for a General, Powerful, Scalable Graph Transformer. arXiv preprint arXiv:2205.12454.
[ToppingGC0B22] Topping, Jake, Di Giovanni, Francesco, Chamberlain, Benjamin Paul, Dong, Xiaowen, Bronstein, Michael M. (2022). Understanding over-squashing and bottlenecks on graphs via curvature. The Tenth International Conference on Learning Representations (ICLR 2022).
[dosovitskiy2020ViT] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[karypis1998metis] Karypis, George, Kumar, Vipin. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing.
[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[cubuk2020randaugment] Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, Le, Quoc V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.
[zhang2017mixup] Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, Lopez-Paz, David. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[tay2021synthesizer] Tay, Yi, Bahri, Dara, Metzler, Donald, Juan, Da-Cheng, Zhao, Zhe, Zheng, Che. (2021). Synthesizer: Rethinking self-attention for transformer models. International conference on machine learning.
[liu2021gmlp] Liu, Hanxiao, Dai, Zihang, So, David, Le, Quoc V. (2021). Pay attention to mlps. Advances in Neural Information Processing Systems.
[melas2021you] Melas-Kyriazi, Luke. (2021). Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet. arXiv preprint arXiv:2105.02723.
[tay2020efficient] Tay, Yi, Dehghani, Mostafa, Bahri, Dara, Metzler, Donald. (2020). Efficient transformers: A survey. ACM Computing Surveys (CSUR).
[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
[hu2019gine] Hu, Weihua, Liu, Bowen, Gomes, Joseph, Zitnik, Marinka, Liang, Percy, Pande, Vijay, Leskovec, Jure. (2019). Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265.
[bresson2017gatedgcn] Bresson, Xavier, Laurent, Thomas. (2017). Residual gated graph convnets. arXiv preprint arXiv:1711.07553.
[shi2020masked] Shi, Yunsheng, Huang, Zhengjie, Feng, Shikun, Zhong, Hui, Wang, Wenjin, Sun, Yu. (2020). Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509.
[murphy2019relational] Murphy, Ryan, Srinivasan, Balasubramaniam, Rao, Vinayak, Ribeiro, Bruno. (2019). Relational pooling for graph representations. International Conference on Machine Learning.
[dwivedi2022long] Dwivedi, Vijay Prakash, Rampášek, Ladislav, Galkin, Mikhail, Parviz, Ali, Wolf, Guy, Luu, Anh Tuan, Beaini, Dominique. (2022). Long range graph benchmark. arXiv preprint arXiv:2206.08164.
[singh2016satpdb] Singh, Sandeep, Chaudhary, Kumardeep, Dhanda, Sandeep Kumar, Bhalla, Sherry, Usmani, Salman Sadullah, Gautam, Ankur, Tuknait, Abhishek, Agrawal, Piyush, Mathur, Deepika, Raghava, Gajendra PS. (2016). SATPdb: a database of structurally annotated therapeutic peptides. Nucleic acids research.
[balcilar2021breaking] Balcilar, Muhammet, Héroux, Pierre, Gauzere, Benoit, Vasseur, Pascal, Adam, Sébastien, Honeine, Paul. (2021). Breaking the limits of message passing graph neural networks. International Conference on Machine Learning.
[EXP] Abboud, Ralph, Ceylan, Ismail Ilkan, Grohe, Martin, Lukasiewicz, Thomas. (2020). The surprising power of graph neural networks with random node initialization. arXiv preprint arXiv:2010.01179.
[feng2022powerful] Feng, Jiarui, Chen, Yixin, Li, Fuhai, Sarkar, Anindya, Zhang, Muhan. (2022). How Powerful are K-hop Message Passing Graph Neural Networks. arXiv preprint arXiv:2205.13328.
[rong2019dropedge] Rong, Yu, Huang, Wenbing, Xu, Tingyang, Huang, Junzhou. (2019). Dropedge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903.
[han2022gmixup] Han, Xiaotian, Jiang, Zhimeng, Liu, Ninghao, Hu, Xia. (2022). G-Mixup: Graph Data Augmentation for Graph Classification. arXiv preprint arXiv:2202.07179.
[kwon2021asam] Kwon, Jungmin, Kim, Jeongseop, Park, Hyunseo, Choi, In Kwon. (2021). ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. arXiv preprint arXiv:2102.11600.
[kuang2022coarformer] Kuang, Weirui, Wang, Zhen, Li, Yaliang, Wei, Zhewei, Ding, Bolin. (2022). Coarformer: Transformer for large graph via graph coarsening.
[shirzad2023exphormer] Shirzad, Hamed, Velingker, Ameya, Venkatachalam, Balaji, Sutherland, Danica J, Sinop, Ali Kemal. (2023). Exphormer: Sparse Transformers for Graphs. arXiv preprint arXiv:2303.06147.
[zhang2022hierarchical_ANSGT] Zhang, Zaixi, Liu, Qi, Hu, Qingyong, Lee, Chee-Kong. (2022). Hierarchical Graph Transformer with Adaptive Node Sampling. arXiv preprint arXiv:2210.03930.
[chen2022nagphormer] Chen, Jinsong, Gao, Kaiyuan, Li, Gaichao, He, Kun. (2022). NAGphormer: Neighborhood Aggregation Graph Transformer for Node Classification in Large Graphs. arXiv preprint arXiv:2206.04910.
[freitas2020large_malnet] Freitas, Scott, Dong, Yuxiao, Neil, Joshua, Chau, Duen Horng. (2020). A large-scale database for graph representation learning. arXiv preprint arXiv:2011.07682.
[loukas2020graph] Loukas, Andreas. (2020). What graph neural networks cannot learn: depth vs width. International Conference on Learning Representations.
[deac2022expander] Deac, Andreea, Lackenby, Marc, Veličković, Petar. (2022). Expander graph propagation. The First Learning on Graphs Conference.
[arnaiz2022diffwire] Arnaiz-Rodríguez, Adrián, Begga, Ahmed, Escolano, Francisco, Oliver, Nuria M. (2022). DiffWire: Inductive Graph Rewiring via the Lovász Bound. The First Learning on Graphs Conference.
[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural computation.
[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.
[mialon2021graphit] Mialon, Grégoire, Chen, Dexiong, Selosse, Margot, Mairal, Julien. (2021). GraphiT: Encoding Graph Structure in Transformers. arXiv preprint arXiv:2106.05667.
[bib70] Zhao et al. [2020] Tong Zhao, Yozen Liu, Leonardo Neves, Oliver J. Woodford, Meng Jiang, and Neil Shah. Data augmentation for graph neural networks. CoRR, abs/2006.06830, 2020.
[bib71] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[bib72] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.
[bib73] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[bib74] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[bib75] Ralph Abboud, Ismail Ilkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. arXiv preprint arXiv:2010.01179, 2020.
[bib76] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
[bib77] Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis. DGL-KE: Training knowledge graph embeddings at scale. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 739–748, 2020.
[bib78] Weirui Kuang, Zhen Wang, Yaliang Li, Zhewei Wei, and Bolin Ding. Coarformer: Transformer for large graph via graph coarsening, 2022. URL https://openreview.net/forum?id=fkjO_FKVzw.
[bib79] Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J Sutherland, and Ali Kemal Sinop. Exphormer: Sparse transformers for graphs. arXiv preprint arXiv:2303.06147, 2023.
[bib80] Zaixi Zhang, Qi Liu, Qingyong Hu, and Chee-Kong Lee. Hierarchical graph transformer with adaptive node sampling. arXiv preprint arXiv:2210.03930, 2022.
[bib81] Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. NAGphormer: Neighborhood aggregation graph transformer for node classification in large graphs. arXiv preprint arXiv:2206.04910, 2022.
[bib82] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
[bib83] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: Protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
[bib84] Greg Landrum et al. RDKit: Open-source cheminformatics, 2006.
[bib85] Sandeep Singh, Kumardeep Chaudhary, Sandeep Kumar Dhanda, Sherry Bhalla, Salman Sadullah Usmani, Ankur Gautam, Abhishek Tuknait, Piyush Agrawal, Deepika Mathur, and Gajendra PS Raghava. SATPdb: A database of structurally annotated therapeutic peptides. Nucleic Acids Research, 44(D1):D1119–D1126, 2016.
[bib86] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[bib87] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[bib88] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. arXiv preprint arXiv:2102.11600, 2021.
[bib89] Jiarui Feng, Yixin Chen, Fuhai Li, Anindya Sarkar, and Muhan Zhang. How powerful are k-hop message passing graph neural networks. arXiv preprint arXiv:2205.13328, 2022.
[bib90] Aydın Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. Recent advances in graph partitioning. Algorithm Engineering, pages 117–158, 2016.
[bib91] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
[bib92] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
[bib93] Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. G-Mixup: Graph data augmentation for graph classification. arXiv preprint arXiv:2202.07179, 2022.
[bib94] Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. A large-scale database for graph representation learning. arXiv preprint arXiv:2011.07682, 2020.