Graph neural networks (GNNs) are becoming an essential tool for data analytics. This relatively new branch of deep learning seeks to exploit an ever-growing stream of graph-structured data to drive innovation in business, industry, and other organizations. GNNs allow data scientists to predict outcomes from vast amounts of knowledge encoded in graph form. These predictions can be used for everything from social network analysis to groundbreaking immunotherapy research. However, training deep GNNs on large graph datasets poses many challenges that impede their ability to scale.
Intel Labs is advancing the role of GNNs with open-source tools and optimizations that facilitate large graph training on Intel hardware.
Training GNNs on Graphs
Graphs hold vast amounts of information about the world we live in. But at their core, graphs are simply a set of nodes connected by edges. Each node and each edge can have an associated set of features or metadata. For example, in a graph representing a social network, nodes represent individuals with their associated metadata, while edges represent interactions between these individuals. More generally, a graph is a structural representation of the relationships that exist between data elements in non-Euclidean domains. Most importantly, graphs embody the answers to many questions that are of great value to businesses, organizations, and even mankind, from finding the shortest route between two points to life-saving medical research.
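To make the structure concrete, here is a minimal sketch in plain Python using hypothetical toy data: a graph is just a set of nodes and edges, each of which can carry feature metadata.

```python
# Toy social graph: nodes and edges each carry a feature dictionary.
nodes = {
    "alice": {"age": 34},          # node features
    "bob":   {"age": 29},
    "carol": {"age": 41},
}
edges = [
    ("alice", "bob",   {"interactions": 12}),   # edge features
    ("bob",   "carol", {"interactions": 3}),
]

# Adjacency structure derived from the edge list.
adjacency = {n: [] for n in nodes}
for src, dst, _feat in edges:
    adjacency[src].append(dst)
    adjacency[dst].append(src)   # interactions are undirected

print(sorted(adjacency["bob"]))  # bob connects to alice and carol
```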
GNNs can be used to find those answers—or at least answers that have a high likelihood of being true. GNNs take graphs as their input and extract useful, high-level knowledge from them. For example, GNNs can be used to predict the properties of a new chemical, where the chemical is represented as a graph of atoms and their bonds. But while GNNs are very expressive, they do not scale well to graphs with billions of nodes and edges. Full graph training, also known as full-batch training, consumes massive amounts of memory and leads to a phenomenon known as neighbor explosion, where the number of supporting nodes needed to make a prediction at a particular node grows exponentially with the number of GNN layers.
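The neighbor-explosion effect is easy to quantify with a toy calculation. In a graph where every node has roughly d neighbors, an L-layer GNN's receptive field at a single node grows on the order of d^L. The sketch below (illustrative only, using a complete 4-ary tree) counts the nodes an L-layer GNN must touch to produce one output:

```python
# "Neighbor explosion" in miniature: in a complete b-ary tree, a prediction
# at the root needs the features of every node within L hops, and the count
# of nodes at exactly hop h is b**h.
def receptive_field_size(branching: int, layers: int) -> int:
    # Sum of nodes at hops 0..layers (hop 0 is the target node itself).
    return sum(branching ** h for h in range(layers + 1))

for layers in (1, 2, 3, 4):
    print(layers, receptive_field_size(4, layers))
# A 4-layer GNN already touches 341 nodes to score a single node here.
```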
Training on randomly sampled sub-graphs rather than the entire graph, an approach known as sampling or mini-batch training, may address the memory problem but often leads to biased gradient estimates. The sampling operation itself could also significantly slow down training.
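The idea behind mini-batch training can be sketched in a few lines. The following is a hedged, GraphSAGE-style illustration (not any specific library's sampler): the fanout per layer is capped so the sampled computational graph stays small, which is exactly what introduces bias into the gradient estimates.

```python
import random

def sample_blocks(adjacency, seeds, fanouts, rng):
    """Per-layer node sets, keeping at most `fanout` neighbors per node."""
    blocks = []
    frontier = set(seeds)
    for fanout in fanouts:
        sampled = set(frontier)
        for node in frontier:
            neighbors = adjacency.get(node, [])
            # Randomly keep a bounded subset of neighbors; dropping the
            # rest is the source of the biased gradient estimates.
            sampled.update(rng.sample(neighbors, min(fanout, len(neighbors))))
        blocks.append(sampled)
        frontier = sampled
    return blocks

# Toy graph: a ring of 10 nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
blocks = sample_blocks(ring, seeds=[0], fanouts=[2, 2], rng=random.Random(0))
print(blocks[-1])  # the 2-hop sampled neighborhood of node 0
```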
Enabling Full Graph Training
Intel Labs researchers have had marked success building on an approach known as domain-parallel training, in which the input graph is split into many partitions distributed among multiple machines, each handling the computation for a single partition. Even in domain-parallel training, however, each machine may eventually need to store a substantial portion of the full graph, because it must keep all the data used to produce the output for its local partition. This data is called the computational graph (not to be confused with the GNN's input graph), and it can easily span the whole input graph. The computational graph is essential for executing the backward pass during training, in which the backpropagation algorithm adjusts the parameters of the GNN to solve the task at hand.
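A single-process sketch helps show where the memory pressure comes from. All names below are illustrative, not the SAR library's API: nodes are split round-robin into partitions, and each "worker" aggregates neighbor features for the nodes it owns, which requires pulling features across partition boundaries.

```python
# Hedged sketch of domain-parallel data layout with toy 1-D features.
num_nodes, num_parts = 8, 2
features = [float(n) for n in range(num_nodes)]
edges = [(0, 1), (1, 2), (2, 5), (5, 6), (6, 7), (3, 4), (4, 0)]
partition = [n % num_parts for n in range(num_nodes)]  # round-robin split

def local_mean_aggregate(part_id):
    """Mean of neighbor features for every node owned by `part_id`."""
    out = {}
    for n in range(num_nodes):
        if partition[n] != part_id:
            continue
        nbrs = [v for u, v in edges if u == n] + [u for u, v in edges if v == n]
        # Features of neighbors in other partitions live on other workers;
        # caching all of them locally is what makes memory blow up.
        out[n] = sum(features[v] for v in nbrs) / len(nbrs) if nbrs else features[n]
    return out

part0 = local_mean_aggregate(0)
print(part0)  # aggregated features for the even-numbered nodes
```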
To avoid having to build this memory-intensive computational graph during training, our researchers have developed the Sequential Aggregation and Rematerialization (SAR) scheme, which sequentially reconstructs (then frees) pieces of the large GNN computational graph during the backward pass.
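The recompute-instead-of-store trade at the heart of SAR has a familiar single-machine analogue in PyTorch's gradient checkpointing, which frees activations after the forward pass and recomputes them when the backward pass needs them. The sketch below uses that analogy for illustration only; it is not the SAR implementation, which rematerializes remote pieces of the computational graph across machines.

```python
import torch
from torch.utils.checkpoint import checkpoint

x = torch.randn(1024, 64, requires_grad=True)
weight = torch.randn(64, 64, requires_grad=True)

def expensive_layer(inp):
    # Stand-in for building one piece of the GNN computational graph.
    return torch.relu(inp @ weight)

# checkpoint() drops the intermediate activations after the forward pass
# and reruns expensive_layer during backward, cutting peak memory at the
# cost of recomputation -- the same trade SAR makes across partitions.
out = checkpoint(expensive_layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # gradients still flow despite the freed activations
```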
Moreover, we have devised optimization methods to mitigate the extra memory overhead that sometimes persists even with SAR when dealing with Graph Attention Networks (GATs). These optimizations avoid the costly materialization of the attention matrix. Instead, elements of the attention matrix are computed and used on the fly and are never stored in memory. This speeds up the computation by reducing the number of redundant memory accesses.
In our paper, “Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs,” we demonstrate excellent memory scaling behavior using these optimizations and the SAR scheme. SAR is built on top of the popular Deep Graph Library (DGL), which is built on PyTorch and provides continuously updated kernels for efficient graph operations.
The full extent of our optimizations can be read in the paper; however, two primary results bear mentioning:
- Our optimized SAR approach drastically reduces memory consumption for GNN models such as GAT, requiring only 26% to 37% of the memory needed by vanilla DP training.
- The run-time of our open-source GNN training library improves as we add more workers. At 128 machines, the epoch time is 3.8 seconds, which is the fastest reported full-batch epoch time for the ogbn-papers100M dataset.
Furthermore, the methods we describe in the paper can be applied to both full-batch GNN training and domain-parallel training. We released an easy-to-use open-source library for applying SAR to any large-scale GNN training problem: https://github.com/IntelLabs/SAR. The full documentation is here: https://sar.readthedocs.io/en/latest/
Intel Labs continues to build on our GNN successes. We are working with both academia and start-ups to enable a wide range of GNN applications, including recommendation systems, video understanding, molecular discovery, and more. Our now open-source GNN framework is accelerating the use of GNNs in other areas, including drug discovery.