Introduction
Graph neural networks (GNNs) have grown in popularity over the last few years alongside the abundance of data and use cases that have underlying graph structures. GNNs are used in a variety of applications including social network analysis, fraud detection, transportation, recommendation systems, weather prediction, drug discovery, and chemical structure analysis. In addition, GNNs are being explored for natural language processing and image segmentation. Generally, GNNs have been shown to outperform other machine learning methods for tasks like node classification and graph classification, and they unlock many new applications that work with data more complex than traditional text, images, video, and audio. Based on the popularity of graph neural networks and these new application areas, the GNN task force recommended adding a graph neural network benchmark to MLPerf® Inference v5.0.
Model selection
There are several types of GNNs, including Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). GCNs rely on convolution operators to capture information about relationships between nodes in a graph, while GATs use the attention mechanism. Traditional GATs focus on undirected, single-relation graphs. For the MLPerf GNN benchmark, we chose a variant of graph attention networks called RGAT, or Relational Graph Attention Network, which extends GATs to multi-relational graph structures, representing a wider range of applications and use cases.
What are RGATs?
Graph Attention Networks (GATs) are a specialized type of GNN that use attention mechanisms to dynamically weigh the importance of neighboring nodes during information propagation in graph-structured data. They’re particularly effective for node classification and social network analysis. Relational GATs (RGATs) extend this concept by incorporating relationship type discrimination between nodes in knowledge graphs (i.e., differentiating by edge type).
The figure below demonstrates a 2-layer RGAT with fan-out: when classifying a target node, we construct a subgraph by randomly sampling neighbors across two hierarchical levels (note that here, we fan in rather than fan out). In the first layer, neighbor embeddings serve directly as attention inputs, while subsequent layers use propagated outputs from GAT computations. Notably, the same graph node (e.g., any single GAT node in the figure) may appear multiple times in the computation subgraph. While this allows reuse of its learned embedding, each occurrence requires recomputation because the input edges carry distinct values in different contexts. This architecture creates a memory-computation tradeoff that depends on the graph's connectivity.
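To make this concrete, here is a minimal, self-contained sketch (our own toy code, not the benchmark implementation) of recursive fixed-fanout sampling over a small adjacency list. Printing the resulting tree shows how the same node id can reappear at multiple positions in the computation subgraph, where its embedding is reused but its attention inputs must be recomputed:

```python
import random

# Toy illustration (not benchmark code): recursively sample a fixed number
# of neighbors per hop to build the computation subgraph for one target
# node. The same node id can appear at several positions: its embedding is
# reused, but its attention inputs are recomputed in each context.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def sample_subgraph(node, fanouts):
    if not fanouts:
        return node                      # leaf of the computation subgraph
    k = min(fanouts[0], len(adj[node]))
    neighbors = random.sample(adj[node], k)
    return (node, [sample_subgraph(n, fanouts[1:]) for n in neighbors])

# Two hops with fanout 2 at each level; node 0 will often reappear as a
# neighbor's neighbor in the printed tree.
print(sample_subgraph(0, [2, 2]))
```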

The figure below illustrates the architecture of a single RGAT layer, detailing its attention mechanism. For each node pair in the attention computation, both the local embedding (central node) and external embeddings (neighbors) pass through a shared MLP to generate separate Query and Key vectors. These projections are concatenated and transformed into an attention score. The scores undergo Softmax normalization to produce attention weights. These weights then compute a weighted sum of the Value vectors, which are derived from projected neighbor embeddings. This aggregated result becomes the hidden embedding passed to the next RGAT layer.
A key implementation detail is that while the same MLP parameters are reused across all computations, each attention head recalculates context-specific weights because input edges (Key/Value) differ for every node-edge interaction.
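As an illustration of the mechanism described above, the following is a minimal single-head sketch in PyTorch (our own simplification, not the MLPerf reference implementation): a shared projection produces the Query and Key vectors, concatenated pairs are scored, and softmax-normalized weights aggregate the projected neighbor embeddings into the hidden embedding for the next layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGATAttention(nn.Module):
    # Minimal single-head GAT-style attention: a sketch, not the MLPerf
    # reference code. One projection ("shared MLP") produces the Query and
    # Key vectors; scores come from their concatenation; softmax weights
    # aggregate the projected neighbors, which double as the Values.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h_center, h_neighbors):
        # h_center: (in_dim,), h_neighbors: (num_neighbors, in_dim)
        q = self.proj(h_center)             # Query from the central node
        k = self.proj(h_neighbors)          # Keys/Values from the neighbors
        pairs = torch.cat([q.expand_as(k), k], dim=-1)
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)    # attention weights
        return alpha @ k                        # weighted sum -> hidden embedding

layer = ToyGATAttention(in_dim=1024, out_dim=512)
out = layer(torch.randn(1024), torch.randn(15, 1024))  # e.g. 15 sampled neighbors
```

A full RGAT layer would additionally keep separate attention parameters per relation type and typically use multiple heads; the shared-MLP reuse and per-context recomputation are the points this sketch is meant to show.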

Dataset and task selection
Some of the most popular applications of graph neural networks are social network analysis and recommender systems. These applications can have massive datasets of user information that require a large amount of storage. One of the biggest challenges for graph neural networks is scaling to these large graphs, which increases both the computational cost of processing all the relations in the graph and the cost of message passing and memory management for node embeddings, which are often stored on disk and sometimes spread across multiple machines.
To mimic this as much as possible, we chose the largest publicly available graph dataset at the time: the Illinois Graph Benchmark Heterogeneous (IGB-H) dataset. The graph in this dataset consists of academic papers (“Paper” nodes), the fields of study of those papers (“FoS” nodes), the authors of those papers (“Author” nodes), and the institutes the authors are affiliated with (“Institute” nodes). In addition to the previously mentioned relations (topic, written by, and affiliated with), there is an additional “citation” relation from “Paper” to “Paper”. The task for this dataset is classification of the “Paper” nodes into a set of 2,983 topics.
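For illustration, this schema can be expressed as a DGL heterograph. The tiny graph below is made up, and the relation names are our own shorthand rather than the dataset's exact keys:

```python
import dgl
import torch

# Toy heterograph mirroring the IGB-H schema. Node counts, edges, and
# relation-type names here are illustrative only.
g = dgl.heterograph({
    ('paper',  'cites',           'paper'):     (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ('paper',  'topic',           'fos'):       (torch.tensor([0, 2]), torch.tensor([0, 1])),
    ('paper',  'written_by',      'author'):    (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ('author', 'affiliated_with', 'institute'): (torch.tensor([0, 1]), torch.tensor([0, 0])),
})
print(g.ntypes)            # four node types: paper, fos, author, institute
print(g.canonical_etypes)  # the four relation types above
```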

The IGB Heterogeneous “Full” dataset variant has 547 million nodes and 5.8 billion edges. With an embedding dimension of 1024, the dataset totals over 2 TB of data. In addition, for use in MLPerf Inference, the dataset is augmented with reverse edges, as well as self-loops for papers, more than doubling the number of edges. The subset used for inference and accuracy checks is the validation set from training, consisting of around 788 thousand “Paper” nodes. During classification, subgraphs are sampled for each input node with a fanout of 15-10-5 (i.e., 15 neighbors, 10 neighbors of each of those neighbors, and 5 neighbors of those neighbors). We acknowledge that some real-world applications of GNNs use “full fanout”, taking every neighbor of the node set at each step of subgraph generation. However, for the MLPerf Inference benchmark, we decided to use a fixed maximum fanout with the same parameters used when the model was trained, in order to reduce the variance of per-sample latencies, as nodes with high neighbor counts can skew inference latency higher.
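A sketch of this sampling setup using DGL's built-in neighbor sampler follows; a small random homogeneous graph stands in for IGB-H, and note that DGL interprets the fanout list per GNN layer, so the ordering convention should be checked against its documentation rather than taken from this sketch:

```python
import dgl
import torch

# Toy stand-in graph; the real benchmark samples from the full IGB-H graph.
g = dgl.rand_graph(1000, 20000)

# Fixed maximum fanout per hop. DGL applies the entries per GNN layer, so
# whether the seeds' hop uses the first or last entry depends on its
# convention; consult the DGL docs before relying on this ordering.
sampler = dgl.dataloading.NeighborSampler([5, 10, 15])
seeds = torch.arange(8)   # stand-ins for the “Paper” nodes being classified
loader = dgl.dataloading.DataLoader(g, seeds, sampler, batch_size=4)

for input_nodes, output_nodes, blocks in loader:
    # Each "block" is the bipartite message-passing structure for one layer.
    print(len(input_nodes), len(output_nodes), len(blocks))
```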
This iteration of the GNN benchmark tests only the ‘Offline’ MLPerf Inference scenario, as many production applications run in this setting. In the future, if there is sufficient interest, we may add the ‘Server’ scenario.
Accuracy metrics
To align with the selected task and dataset, the task force decided to use the ratio of correctly classified “Paper” nodes in the validation set. The validation set comprises 0.5% of the 157 million labelled nodes, or roughly 788,000 nodes, and the baseline accuracy evaluated in float32 is 72.86%. (Model weights are kept in float32, while embeddings are downcast to FP16 to save on storage and memory requirements.)
Naturally, MLCommons® constrains submissions that run at precisions lower than the baseline: the accuracy threshold is set at 99% of the reference. However, with neighborhood sampling embedded in the subgraph computation, some randomness had to be accounted for.¹ The task force therefore decided to allow an additional 0.5% margin.
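In concrete terms, and assuming the extra margin is applied as a further fraction of the reference score (the exact semantics live in the benchmark rules), the accuracy gate works out roughly as follows:

```python
# Hedged sketch of the accuracy gate. Assumption: the additional 0.5%
# margin is taken as a further fraction of the float32 reference score.
REFERENCE_ACC = 72.86                        # float32 baseline, in percent
threshold = REFERENCE_ACC * (0.99 - 0.005)   # 99% constraint minus 0.5% margin
print(f"minimum passing accuracy: {threshold:.2f}%")   # ~71.77%
```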
Performance metrics
In our inaugural release, the benchmark is tested only in the ‘Offline’ scenario, where the performance metric is throughput measured in samples per second, and per-sample latency is not a performance consideration. We acknowledge that for some use cases of GNNs, such as recommenders and transportation applications like map data processing, latency can be an important metric, and we may expand to different scenarios in future rounds of MLPerf Inference.
Conclusion
To address the growing usage of graph neural networks for different applications, a task force investigated and recommended the creation of a GNN benchmark within MLPerf Inference v5.0.
To represent the unique challenges associated with end-to-end deployment of GNNs, the task force selected a representative benchmark model, RGAT, paired with the largest public graph dataset at the highest precision. This benchmark exercises the different stages of the deployment process, such as representation at sufficient scope, distributed storage at scale, and efficient computation, and offers an opportunity for vendors to showcase efficient systems and optimizations for tackling these challenges. In the future, we hope to expand the scope of the benchmark to include datacenter deployment scenarios for real-time use, such as online serving, which is increasingly finding applications in graph as well as recommender systems.
Details of the MLPerf Inference RGAT benchmark and reference implementation can be found here.
¹ The DGL neighborhood sampler has a known bug with seeding when the sampling is parallelized across CPUs.