## Accelerate the Floyd Warshall algorithm using parallel blocked processing and the Intel® oneAPI DPC++/C++ Compiler

The all-pairs shortest paths problem refers to finding the shortest paths between all the pairs of nodes or vertices in a graph. The Floyd Warshall algorithm provides an approach to solving this problem by comparing many possible paths through directed and undirected weighted graphs, thus enabling it to identify the shortest.

It is an important computational tool for many real-world use cases and applications, as it assists with finding the shortest paths for practical everyday scenarios in navigation, gaming, manufacturing, communications, and research.

Some of the most common usage domains are:

**Robots and autonomous vehicles**: to navigate through environments with obstacles**Telecommunication and computer networks**: for efficient data routing**Geographical Information Systems (GIS)**: for mapping and location-based services**Transportation systems**: to optimize routes such as flight connections or road networks**Circuit design**: to optimize the layout of electronic circuits**Video game development:**to optimize pathfinding algorithms for gaming entities

Using the Floyd Warshall algorithm an initial estimation is incrementally improved upon in iterative calculations that scale with the complexity of the graph. As this can require considerable parallel computational performance, implementing the Floyd Warshall algorithm lends itself to offloading its compute-intensive parts to a GPU using SYCL* kernels and the Intel® oneAPI DPC++/C++ Compiler.

Before we delve deeper into the code sample available in the oneAPI GitHub repository, let us look a bit closer at the Intel DPC++/C++ Compiler and the Floyd Warshall algorithm to highlight how the advanced C++ and SYCL based multi-architecture parallelism of the compiler benefits the execution of such algorithms on heterogeneous hardware platforms.

**An Overview: Intel® oneAPI DPC++/C++ Compiler**

Intel oneAPI DPC++/C++ Compiler is an industry-standard, cross-architecture compiler designed to compile ISO C/C++ and SYCL applications. It helps you achieve standards-compliant high performance leveraging the LLVM technology. It supports accelerated compute frameworks, including OpenMP* and the latest generation SYCL. It allows you to take advantage of oneAPI libraries such as oneTBB and oneDPL. It enables flexible performance optimization and offload compute acceleration with code reusability across diverse architectures, including x86 CPUs, GPUs, and FPGAs.

► Learn how to compile your C/C++ and SYCL code across diverse architectures.

**Understand the Floyd Warshall Algorithm**

Using a dynamic programming approach, the Floyd Warshall algorithm constructs a matrix of the shortest paths (distances) between all the pairs of nodes in a directed or undirected weighted graph. It can handle graphs with negative edge weights and cycles (if the sum of the edges in a cycle is not negative).

The algorithm works as follows:

- Initialize the result matrix as the input graph matrix.
- Consider all the nodes one by one; update the shortest path between every pair of nodes, including the chosen node as an intermediate node.
- Choosing a kth node as an intermediate node means that {0,1,…,k-1} nodes have already been chosen as intermediate nodes.
- Suppose the matrix's d[i][j] element denotes the shortest distance from a source node i to a destination node j. For each pair of nodes (i,j), either of the following holds:
- if k is not an intermediate node in the shortest path from i to j, then d[i][j] is kept unchanged.
- else, d[i][j] is updated to (d[i][k] + d[k][j]), provided d[i][j] > (d[i][k] + d[k][j]).

**About the Code Sample**

The all-pairs shortest paths code sample demonstrates how to implement Floyd Warshall algorithm to compute a matrix that shows the minimum distance from each node to all other nodes in the graph. The sample uses a parallel blocked processing technique in which multiple blocks of code, each performing a specific task, are executed in parallel. The compute-intensive blocks are offloaded to the GPU for faster results, and the code is compiled using the Intel oneAPI DPC++/C++ Compiler. The sample also performs sequential implementation of the Floyd Warshall algorithm and compares its execution time with that of the accelerated parallel blocked implementation. It uses SYCL concepts such as kernels, device selector, unified shared memory, and command groups.

The inner loop in the sequential implementation is of the form:

`g[i][j] = min(g[i][j], g[i][k] + g[k][j])`

The inner loop in the parallel blocked implementation looks like this:

```
void BlockedFloydWarshallCompute(nd_item<1> &item, const LocalBlock &C,
const LocalBlock &A, const LocalBlock &B,
int i, int j) {
for (int k = 0; k < block_length; k++) {
if (C[i][j] > A[i][k] + B[k][j]) {
C[i][j] = A[i][k] + B[k][j];
}
}
}
```

In the parallel implementation, each thread handles one cell of a block. For each block, the above function is invoked as many times as there are cells in the block. Each such invocation performs as many iterations as there are blocks along a single dimension. All the threads that simultaneously operate on a block synchronize with each other at the end of the iteration, ensuring the correctness of the result.

The parallel blocked technique comprises three phases such that each phase implements the above loop on the blocks at specific positions in the matrix, such that:

- Phase 1 operates on blocks on the diagonal of the graph's adjacency matrix.
- Phase 2 operates on blocks that are either on the same row or on the same column of a diagonal block.
- Phase 3 operates on blocks other than those handled by phase 1 and phase 2.

If a prior round of these phases is completed, for any subsequent round,

- Phase 1 is an independent phase.
- Phase 2 executes only after the completion of phase 1.
- Phase 3 executes only after the completion of phase 2.

► For steps to build and execute the code sample, check out the key implementation details.

**Sample Output**

An example output of running the code on Intel® Processor Graphics Gen9 or newer looks like:

Device: Intel(R) Gen9 Repeating computation 8 times to measure run time ... Iteration: 1 Iteration: 2 Iteration: 3 Iteration: 4 Iteration: 5 Iteration: 6 Iteration: 7 Iteration: 8 Successfully computed all pairs' shortest paths in parallel! Time sequential: 0.583029 sec Time parallel: 0.159223 sec

**Next Steps**

Try out the all-pairs shortest paths code sample for parallel implementation of the Floyd Warshall algorithm. Check out the many other code samples available in the oneAPI GitHub repository. Get started with the Intel oneAPI DPC++/C++ Compiler today to compile C/C++ and SYCL applications across heterogeneous architectures efficiently.

Also, explore other AI, HPC, and Rendering tools in Intel’s oneAPI-powered software portfolio.

**Get the Software**

You can install the Intel oneAPI DPC++/C++ Compiler as a part of Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit. You can also download a standalone version of the compiler or test it across Intel® CPUs and GPUs on the Intel® Developer Cloud platform.