# Boost the Performance of Graph Analytics Workloads

Published: 05/22/2019

Last Updated: 05/22/2019

### Analyze the Graph Benchmarks on Intel® Xeon® Processors

Stijn Eyerman, Wim Heirman, and Kristof Du Bois, research scientists, and Joshua B. Fryman and Ibrahim Hur, principal engineers, Intel Corporation
@IntelDevTools

A graph is an intuitive way of representing a big data set and the relationships between its elements. Each vertex represents an element, and edges connect related elements. Representing data sets as a graph means you can build on the rich history of graph theory. And there are a variety of algorithms to extract useful information from a graph. This article explores the implementation characteristics of basic graph analysis algorithms and how they perform on Intel® Xeon® processors.

## Graphs and Graph Analytics

A graph is a structured representation of a data set with relationships between elements. Each element is represented as a vertex, and relationships between elements are shown as an edge between two vertices. Both vertices and edges can have attributes representing either the characteristics of the element or the relationship. The vertices to which a vertex is connected through edges are called its neighbors. The number of neighbors is called the degree of a vertex.

Graph analysis applications extract characteristics from a graph (or multiple graphs) that provide useful information, such as:

• Vertices with specific features
• The shortest path between vertices
• Clusters of vertices
• Higher-order relationships among vertices

With the growing availability of big data sets and the need to extract useful information from them, graph analytics applications are becoming an important workload, both in the data center and at the edge.

This study uses the GAP Benchmark Suite,1 a set of common graph algorithms implemented in C++ and optimized by researchers at the University of California, Berkeley for performance on shared memory multicore processors (Table 1). GAP performance is evaluated on a server that's based on the Intel Xeon processor and investigate opportunities to further improve performance.

Table 1. GAP Benchmark Suite overview

Algorithm Abbreviation What it Does
PageRank* pr Calculates the popularity of a vertex by aggregating the popularity of its neighbors
Triangle counting tc Counts the number of triangles (three vertices that are fully connected)
Connected components cc Splits the graph into subgraphs with no edges between them
Single-source shortest path sssp Calculates the shortest path from one vertex to all others
Betweenness centrality bc Calculates the centrality of a vertex, determined by how many shortest paths go through it

## Characteristics of Graph Algorithms

Graph algorithms pose challenging behavior for conventional processor architectures. A typical operation is to fetch the attributes of all neighbors of a vertex. The list of neighbors, determined by the topology of the graph, is usually irregular. This leads to sparse memory accesses―accessing individual elements scattered on a large data structure. Sparse memory accesses have no locality, leading to poor cache use.

Fetching the attributes of the neighbors means using indirect accesses. For example, if N is the array of neighbors of a vertex, and A the array containing the attributes, the attribute of neighbor i is accessed by A[N[i]]. This pattern is difficult to predict and to vectorize, which leads to an underuse of the available compute and memory resources.

Alternatively, graph algorithms generally have a lot of parallelism. The set of algorithms in the GAP suite fall into two categories in terms of parallelism. The first category consists of algorithms that operate on all vertices concurrently (pr, tc, and cc). They have abundant parallelism and can be run across many threads. Their parallelism is only limited by the size of the graph.

The second category is front-based algorithms where, at each iteration, a subset of vertices is analyzed (the current front) and a new front is defined to be processed in the next iteration (bfs, sssp, and bc). These algorithms usually start with a front containing a single vertex. The next front consists of its neighbors, and then the neighbors of these neighbors, and so on. In the first iterations, the size of the front (and consequently the parallelism that can be exploited) is limited. Also, each iteration ends with a global barrier, which creates additional synchronization overhead. These algorithms scale worse with increasing thread count, especially on smaller graphs.

## Run Graph Algorithms on Intel Xeon Processors

Despite the challenging behavior of graph algorithms, there are ways to increase the efficiency of running these applications on a multicore server based on the Intel Xeon processor.

### Vectorization

Using vector memory instructions can increase the performance of a graph algorithm by increasing the number of parallel load operations, which hides part of their latency. Specifically, you can use the vector gather instruction (found in Intel® Advanced Vector Extensions 2 [Intel® AVX2] and Intel® Advanced Vector Extensions 512 [Intel® AVX-512]) to perform multiple indirect loads in one instruction. But, the compiler isn’t always able to detect these indirect access patterns, or it does not vectorize based on its heuristics. So, it might be useful to add a #pragma vector always to force the compiler to vectorize and to rewrite the code to make the indirect access pattern more apparent to the compiler.

Figure 1 gives an example for cc. The original code on the left did not generate vector gather instructions, while the altered code on the right did. It makes the indirect pattern clearer and forces the compiler to vectorize. This led to a speedup of 5x for cc on Intel Xeon processors.

 for (NodeID v : g.out_neigh(u)) { NodeID comp_v = comp[v]; if ((comp_u < comp_v) && (comp_v == comp[comp_v])) { change = true; comp[comp_v] = comp_u; } }  NodeID *a = g.out_neigh(u).begin(); #pragma vector always for (int i=0; i

Figure 1. Inner loop of connected components.

It might also be useful to look at other vectorization opportunities, and to rewrite the code so that they can be exploited (for example, using intrinsics). For example, in tc, you need to count the number of matches between the elements of two neighbor lists. Ilya Katsov describes an algorithm to speed up the matching algorithm with SSE instructions in Fast Intersection of Sorted Lists Using Streaming SIMD Extensions (SSE) Instructions. This algorithm is adapted to Intel AVX-512 and is included in the tc benchmark, leading to a performance increase of 2.5x (the code is not included for brevity).2

## Parallelism

The GAP benchmarks are parallelized using OpenMP*. As discussed before, there are two categories of parallelism: vertex- and front-based. For the vertex-parallel algorithms (pr, cc, and tc), it’s important to use dynamic scheduling in the OpenMP parallel for-loops because the processing time of a vertex depends on its neighbor count, which can differ significantly across vertices. With static scheduling, threads that are assigned vertices with many neighbors run longer than other threads, which leads to load imbalance. To reduce the scheduling overhead while still maintaining enough scheduling flexibility, set the chunk size to 16 or 32.

The front-based algorithms (bfs, sssp, and bc) are harder to parallelize, which means they don’t use the full capacity of the processor (fewer threads than cores). The current front contains many fewer vertices than the full graph, and the next front can only be processed when the current front is finished. To fully exploit the increasing core count of servers based on Intel Xeon processors, these algorithms need to be revised to increase their parallelism.

An example of this is already implemented in bfs. Instead of looking for neighbors of the current front and checking whether they’ve already been visited (forward algorithm), all nonvisited vertices are considered, and they are checked for whether they are a neighbor of a vertex in the current front (backward algorithm). Because there are more nonvisited vertices than vertices in the front, there’s more parallelism to exploit. The downside is that the backward algorithm sometimes does unnecessary work when most of the nonvisited vertices are not neighbors of the current front.

At each step of the algorithm, you can choose between the two methods. This choice is currently done using the characteristics of the current front and the remaining vertices, but it should also include the available parallelism (core or thread count) to better exploit the capacity of the processor.

## Caches and Input Graphs

Graph workloads generally don’t generate cache-friendly access patterns. The one exception is when the graph, or the most accessed data structure of the graph (for example, the attributes of vertices), fits in the last-level cache (which is up to 38 MB per socket on high-end Intel Xeon processors). From these experiments, the performance (expressed in giga traversed edges per second [GTEPS]) decreases with increasing graph size, since less and less of the data fits into the cache. Because of the nature of graphs and graph algorithms, methods to improve cache locality either don’t work well or take too much time to reorganize the graph, often more than the algorithm itself.

## Distributed Graph Processing

The GAP benchmarks are designed for only running single nodes (using OpenMP parallelization). However, based on the insights from the study, let's discuss the impact on distributed graph processing (in other words, using multiple nodes). For running high-performance multinodes, it’s crucial to minimize the communication and maximize local data and computation. This is challenging for graph applications because of the irregular and nonlocalized access pattern. Partitioning a graph to minimize the number of edges between partitions is an NP-complete problem in itself, and often leads to more compute time than the algorithm itself. So, when you’re deploying a graph analysis algorithm on multiple nodes, connect the nodes with a high-bandwidth, low-latency network such as Intel® Omni-Path Architecture to deal with the unavoidably high amount of communication between the nodes.

## Boost Performance through Analysis

Graph applications form a challenging workload for current processors because of their memory intensiveness and irregularity. By carefully crafting their implementation and exploiting vector units and thread parallelism, you can increase performance significantly. However, more investigation is needed, including redesigning the algorithms to fully exploit the capabilities of a server based on an Intel Xeon processor, especially when moving to distributed processing.

## You May Also Like

High-Performance Graph Analytics Library from Katana Graph*

Measure Graph Analytics Performance