Boost the Performance of Graph Analytics Workloads
Published: 05/22/2019
Last Updated: 05/22/2019
Analyze the Graph Benchmarks on Intel® Xeon® Processors
Stijn Eyerman, Wim Heirman, and Kristof Du Bois, research scientists, and Joshua B. Fryman and Ibrahim Hur, principal engineers, Intel Corporation
@IntelDevTools
A graph is an intuitive way of representing a big data set and the relationships between its elements. Each vertex represents an element, and edges connect related elements. Representing data sets as a graph means you can build on the rich history of graph theory. And there are a variety of algorithms to extract useful information from a graph. This article explores the implementation characteristics of basic graph analysis algorithms and how they perform on Intel® Xeon® processors.
Graphs and Graph Analytics
A graph is a structured representation of a data set with relationships between elements. Each element is represented as a vertex, and relationships between elements are shown as an edge between two vertices. Both vertices and edges can have attributes representing either the characteristics of the element or the relationship. The vertices to which a vertex is connected through edges are called its neighbors. The number of neighbors is called the degree of a vertex.
Graph analysis applications extract characteristics from a graph (or multiple graphs) that provide useful information, such as:
 Vertices with specific features
 The shortest path between vertices
 Clusters of vertices
 Higherorder relationships among vertices
With the growing availability of big data sets and the need to extract useful information from them, graph analytics applications are becoming an important workload, both in the data center and at the edge.
This study uses the GAP Benchmark Suite,^{1} a set of common graph algorithms implemented in C++ and optimized by researchers at the University of California, Berkeley for performance on shared memory multicore processors (Table 1). GAP performance is evaluated on a server that's based on the Intel Xeon processor and investigate opportunities to further improve performance.
Table 1. GAP Benchmark Suite overview
Algorithm  Abbreviation  What it Does 

PageRank*  pr  Calculates the popularity of a vertex by aggregating the popularity of its neighbors 
Triangle counting  tc  Counts the number of triangles (three vertices that are fully connected) 
Connected components  cc  Splits the graph into subgraphs with no edges between them 
Breadthfirst search  bfs  Walks through the graph in breadthfirst order 
Singlesource shortest path  sssp  Calculates the shortest path from one vertex to all others 
Betweenness centrality  bc  Calculates the centrality of a vertex, determined by how many shortest paths go through it 
Characteristics of Graph Algorithms
Graph algorithms pose challenging behavior for conventional processor architectures. A typical operation is to fetch the attributes of all neighbors of a vertex. The list of neighbors, determined by the topology of the graph, is usually irregular. This leads to sparse memory accesses―accessing individual elements scattered on a large data structure. Sparse memory accesses have no locality, leading to poor cache use.
Fetching the attributes of the neighbors means using indirect accesses. For example, if N is the array of neighbors of a vertex, and A the array containing the attributes, the attribute of neighbor i is accessed by A[N[i]]. This pattern is difficult to predict and to vectorize, which leads to an underuse of the available compute and memory resources.
Alternatively, graph algorithms generally have a lot of parallelism. The set of algorithms in the GAP suite fall into two categories in terms of parallelism. The first category consists of algorithms that operate on all vertices concurrently (pr, tc, and cc). They have abundant parallelism and can be run across many threads. Their parallelism is only limited by the size of the graph.
The second category is frontbased algorithms where, at each iteration, a subset of vertices is analyzed (the current front) and a new front is defined to be processed in the next iteration (bfs, sssp, and bc). These algorithms usually start with a front containing a single vertex. The next front consists of its neighbors, and then the neighbors of these neighbors, and so on. In the first iterations, the size of the front (and consequently the parallelism that can be exploited) is limited. Also, each iteration ends with a global barrier, which creates additional synchronization overhead. These algorithms scale worse with increasing thread count, especially on smaller graphs.
Run Graph Algorithms on Intel Xeon Processors
Despite the challenging behavior of graph algorithms, there are ways to increase the efficiency of running these applications on a multicore server based on the Intel Xeon processor.
Vectorization
Using vector memory instructions can increase the performance of a graph algorithm by increasing the number of parallel load operations, which hides part of their latency. Specifically, you can use the vector gather instruction (found in Intel® Advanced Vector Extensions 2 [Intel® AVX2] and Intel® Advanced Vector Extensions 512 [Intel® AVX512]) to perform multiple indirect loads in one instruction. But, the compiler isn’t always able to detect these indirect access patterns, or it does not vectorize based on its heuristics. So, it might be useful to add a #pragma vector always to force the compiler to vectorize and to rewrite the code to make the indirect access pattern more apparent to the compiler.
Figure 1 gives an example for cc. The original code on the left did not generate vector gather instructions, while the altered code on the right did. It makes the indirect pattern clearer and forces the compiler to vectorize. This led to a speedup of 5x for cc on Intel Xeon processors.


Figure 1. Inner loop of connected components.
It might also be useful to look at other vectorization opportunities, and to rewrite the code so that they can be exploited (for example, using intrinsics). For example, in tc, you need to count the number of matches between the elements of two neighbor lists. Ilya Katsov describes an algorithm to speed up the matching algorithm with SSE instructions in Fast Intersection of Sorted Lists Using Streaming SIMD Extensions (SSE) Instructions. This algorithm is adapted to Intel AVX512 and is included in the tc benchmark, leading to a performance increase of 2.5x (the code is not included for brevity).^{2}
Parallelism
The GAP benchmarks are parallelized using OpenMP*. As discussed before, there are two categories of parallelism: vertex and frontbased. For the vertexparallel algorithms (pr, cc, and tc), it’s important to use dynamic scheduling in the OpenMP parallel forloops because the processing time of a vertex depends on its neighbor count, which can differ significantly across vertices. With static scheduling, threads that are assigned vertices with many neighbors run longer than other threads, which leads to load imbalance. To reduce the scheduling overhead while still maintaining enough scheduling flexibility, set the chunk size to 16 or 32.
The frontbased algorithms (bfs, sssp, and bc) are harder to parallelize, which means they don’t use the full capacity of the processor (fewer threads than cores). The current front contains many fewer vertices than the full graph, and the next front can only be processed when the current front is finished. To fully exploit the increasing core count of servers based on Intel Xeon processors, these algorithms need to be revised to increase their parallelism.
An example of this is already implemented in bfs. Instead of looking for neighbors of the current front and checking whether they’ve already been visited (forward algorithm), all nonvisited vertices are considered, and they are checked for whether they are a neighbor of a vertex in the current front (backward algorithm). Because there are more nonvisited vertices than vertices in the front, there’s more parallelism to exploit. The downside is that the backward algorithm sometimes does unnecessary work when most of the nonvisited vertices are not neighbors of the current front.
At each step of the algorithm, you can choose between the two methods. This choice is currently done using the characteristics of the current front and the remaining vertices, but it should also include the available parallelism (core or thread count) to better exploit the capacity of the processor.
Caches and Input Graphs
Graph workloads generally don’t generate cachefriendly access patterns. The one exception is when the graph, or the most accessed data structure of the graph (for example, the attributes of vertices), fits in the lastlevel cache (which is up to 38 MB per socket on highend Intel Xeon processors). From these experiments, the performance (expressed in giga traversed edges per second [GTEPS]) decreases with increasing graph size, since less and less of the data fits into the cache. Because of the nature of graphs and graph algorithms, methods to improve cache locality either don’t work well or take too much time to reorganize the graph, often more than the algorithm itself.
Distributed Graph Processing
The GAP benchmarks are designed for only running single nodes (using OpenMP parallelization). However, based on the insights from the study, let's discuss the impact on distributed graph processing (in other words, using multiple nodes). For running highperformance multinodes, it’s crucial to minimize the communication and maximize local data and computation. This is challenging for graph applications because of the irregular and nonlocalized access pattern. Partitioning a graph to minimize the number of edges between partitions is an NPcomplete problem in itself, and often leads to more compute time than the algorithm itself. So, when you’re deploying a graph analysis algorithm on multiple nodes, connect the nodes with a highbandwidth, lowlatency network such as Intel® OmniPath Architecture to deal with the unavoidably high amount of communication between the nodes.
Boost Performance through Analysis
Graph applications form a challenging workload for current processors because of their memory intensiveness and irregularity. By carefully crafting their implementation and exploiting vector units and thread parallelism, you can increase performance significantly. However, more investigation is needed, including redesigning the algorithms to fully exploit the capabilities of a server based on an Intel Xeon processor, especially when moving to distributed processing.
References
^{1} GAP Benchmark Suite
^{2} Manycore Graph Workload Analysis
______
You May Also Like
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.