Success Story: A Repository for Evaluating Performance and...

A GitHub* project offers many benchmark source codes written in oneAPI for independent software vendors (ISVs).

Challenge

New ideas need methods to evaluate their effectiveness and value. Different parallel programming models have emerged over the last several years, including OpenMP*, OpenCL™ standard, Heterogeneous-Computing Interface for Portability (HIP), SYCL*, and most recently oneAPI.

oneAPI is gaining traction across the software ecosystem, both with individual developers¹ and with ISVs who build compilers and tools to deploy SYCL and Data Parallel C++ (DPC++) code to targeted devices. Besides oneAPI support for Intel® XPUs, the ecosystem is expanding to other accelerators. There’s the Intel® DPC++ Compatibility Tool used to assist migrating CUDA* to DPC++. Recent contributions from Codeplay* to the open source DPC++ with LLVM* project allow developers to target NVIDIA* GPUs using SYCL code without intermediate layers.² Work at Heidelberg University³ is underway to create a SYCL and DPC++ compiler for AMD* GPUs. As more ISVs develop products to the oneAPI standard, they need benchmarks written in different ways to measure and compare their performance and productivity.

Solution

Zheming Jin, formerly a postdoctoral researcher at Argonne National Laboratory and now at Oak Ridge National Laboratory, has created a rich repository of such benchmarks. The project specifically evaluates oneAPI direct programming based on SYCL and DPC++ and OpenMP (see Table 1). His project is posted on Intel® DevMesh.

“oneAPI is an interesting model worth studying,” explained Zheming. “But we don’t have a lot of open source programs to study SYCL and DPC++ implementations. We have many CUDA and OpenMP programs targeting different devices, like NVIDIA GPUs, but not enough that can be used to compare against SYCL and DPC++.”

Currently, his GitHub project contains more than 90 different codes (see Table 1) written in SYCL, DPC++, CUDA, and OpenMP. Some of these come from the Rodinia benchmark suite, including Kmeans, Heart Wall, Particle Filter, and Back Propagation, among others. Refer to the full GitHub list for sources and other information important to the repository.

Table 1. List of test codes at oneAPI-DirectProgramming

affine	Affine transformation
all-pairs-distance	All-pairs distance calculation
amgmk	The relax kernel in the AMGmk benchmark
aobench	A lightweight ambient occlusion renderer
atomicIntrinsics	Atomic add, subtract, min, max, AND, OR, and XOR
axhelm	Helmholtz matrix-vector product
backprop	Train weights of connecting nodes on a layered neural network
bezier-surface	Bezier surface
bfs	Breadth-first search
bitonic-sort	Bitonic sorting
black-scholes	Black-Scholes simulation
bsearch	Classic and vectorizable binary search algorithms
bspline-vgh	Bspline value gradient hessian
b+tree	B+Tree search
ccsd-trpdrv	The CCSD tengy kernel in NWChem
cfd	Solver for the 3D Euler equations for compressible flow
chi2	Chi-square 2-df test
clenergy	Direct coulomb summation kernel
clink	Compact LSTM inference kernel
cobahh	Simulation of random network of Hodgkin and Huxley neurons with exponential synaptic conductances
compute-score	Document filtering
diamond	Mask sequences kernel in Diamond
divergence	Divergence comparison between a CPU and a GPU
easyWave	Simulation of tsunami generation and propagation in the context of early warning
extend2	Smith-Waterman extension in Burrow-wheeler aligner for short-read alignment
filter	Filtering by a predicate
fft	Fast Fourier transform
floydwarshall	Floyd-Warshall pathfinding sample
fpc	Frequent-pattern compression
gamma-correction	Gamma correction
gaussian	Gaussian elimination
geodesic	Geodesic distance
haccmk	The HACC microkernel
heartwall	Track the movement of a mouse heart over a sequence of ultrasound images
heat	A heat equation solver
heat2d	Discrete 2D Laplacian operations on a given vector
histogram	Histogram computation
hmm	Hidden Markov model
hotspot3D	Simulate microprocessor temperature on a 3D input grid
hybridsort	A combination of bucketsort and mergesort
interleave	Interleaved and non-interleaved global memory accesses
inversek2j	The inverse kinematics for 2-joint arm
ising	Monte-Carlo simulations of 2D Ising model
iso2dfd	A finite difference stencil kernel for solving the 2D acoustic isotropic wave equation
jenkins-hash	Bob Jenkins lookup3 hash function
keccaktreehash	A Keccak tree hash function
kmeans	Fuzzy k-means clustering
knn	Fast k-nearest neighbor search
laplace	A Laplace solver using red-black Gaussian Seidel with SOR solver
lavaMD	Calculate particle potential and relocation between particles within a large 3D space
leukocyte	Detect and track rolling leukocytes
lid-driven-cavity	A GPU solver for a 2D lid-driven cavity problem
lombscargle	Lomb-Scargle periodogram
lud	LU decomposition
mandelbrot	Calculate the Mandelbrot set
matrix-mul	Single-precision floating-point matrix multiply
matrix-rotate	In-place matrix rotation
maxpool3d	3D maxpooling
md	Computation of Lennard-Jones potential using a neighbor-list algorithm
md5hash	MD5 hash function
memcpy	Memory copies from a host to a device
miniFE	A proxy application for unstructured implicit finite element codes
mixbench	A read-only version of Mixbench
mkl-sgemm	Single-precision floating-point matrix multiply using Intel® Math Kernel Library
murmurhash3	MurmurHash3 yields a 128-bit hash value
myocyte	Model cardiac myocyte and simulate its behavior
nbody	N-body simulation
nn	k-nearest neighbors from an unstructured data set of coordinates
nw	A nonlinear global optimization method for DNA sequence alignments
page-rank	PageRank
particle-diffusion	Monte-Carlo simulation of the diffusion of water molecules in tissue
particlefilter	Statistical estimator of the location of a target object
pathfinder	Find a path on a 2D grid with the smallest accumulated weights
projectile	Projectile motion is a program that implements a ballistic equation
qtclustering	Quality-threshold clustering
quicksort	Quicksort
randomAccess	Random memory accesses
reduction	Integer sum reduction using atomics
reverse	Reverse an input array of size 256 using shared memory
rng-wallace	Random number generation using the Wallace algorithm
rsbench	A proxy application for full neutron transport application that supports multipole cross-section representations
rtm8	A structured grid application in the oil and gas industry
s3d	Chemical rates computation used in the simulation of combustion
scan	A block-level scan using shared memory
simplemoc	The attenuation of neutron fluxes across an individual geometrical segment
softmax	Softmax function
sort	Radix sort
sph	The simple n2 SPH simulation
srad	Speckle reducing anisotropic diffusion
sssp	The single-source shortest path
stencil	1D stencil using shared memory
streamcluster	Online clustering of an input stream
su3	Lattice QCD SU(3) matrix-matrix multiply microbenchmark
transpose	Tensor transpose example
xsbench	A proxy application for full neutron transport application like OpenMC

History

“Rodinia is a popular benchmark suite used by researchers that target accelerators. Most codes in Rodinia include OpenMP, CUDA, and OpenCL [code] implementations,” added Zheming. “But codes within the Rodinia suite didn’t have SYCL implementations to compare performance or productivity across different programming models. I started with some codes from Rodinia.”

Zheming wanted benchmarks between different implementations to be as fair as possible. So, his goal was to create codes that are as functionally equivalent as possible, that is, without special optimizations that might cater to a particular manufacturer’s device.

OpenMP is widely used across the ecosystem for shared-memory parallelism. The OpenMP standard continues to evolve. Since 2015, it has added support for targeted device offloading to GPUs, and ISVs, including Intel, continue to enhance their compilers for it.

For his repository, Zheming created CUDA versions from open source files, and then ported those codes to DPC++ using the Intel DPC++ Compatibility Tool. The tool helps developers migrate CUDA code to DPC++ with 80 to 90 percent of the code automatically transitioning to DPC++. CUDA language kernels and library API calls are also ported over.

“There are many CUDA programs for parallel computation,” added Zheming. “It’s important that we have a tool to help port them to SYCL/DPC++ so they can be run across different architectures.”

The Intel DPC++ Compatibility Tool is part of the Intel® oneAPI Base Toolkit, which Zheming used. Because different CUDA implementations can access data either through device buffers or shared memory, Zheming produced two versions of the converted code. For codes in the repository with “‑dpct” suffixes, memory management migration was implemented using both the explicit and restricted Unified Shared Memory extension. Codes with the “-dpct-h” suffix use Intel DPC++ Compatibility Tool header files.

Additionally, he wrote SYCL versions using SYCL buffers. Then he ran the codes on Intel® processor-based platforms.

Evaluating the Codes on Intel® CPUs and GPUs

Zheming evaluated the four codes (SYCL, both DPCT-generated device buffer or shared memory versions, and OpenMP) on two different Intel processor-based platforms with Intel integrated graphics:

Intel® Xeon® E3-1284L processor with Intel® P6300 graphics (Gen8)
Intel® Xeon® E2176G processor with Intel® UHD630 graphics (Gen9.5)

He used the OpenCL standard intercept layer to monitor the following using the OpenCL standard plug-in interface:

Total enqueue—indicates the total number of low-level OpenCL standard enqueue commands called by a parallel program. These enqueue commands include clEnqueueNDRangeKernel, clEnqueueReadBuffer, and clEnqueueWriteBuffer.
Host timing—the total elapsed time of executing OpenCL API functions on a CPU host.
Device timing—the total elapsed time of executing OpenCL API functions on a GPU device.

The results from both platforms are listed in the repository starting at Results on Platform 1.

The results show that for most of these codes, performance across the Intel DPC++ Compatibility Tool and SYCL implementations differs only slightly. This is not surprising considering they are similar in approach. However, compared to OpenMP code, the oneAPI codes had fewer calls and ran faster.

Figure 1 compares results of a few of these benchmark codes. Because the measurements differ in units (number of calls, seconds, and milliseconds), the table is normalized to the OpenMP standard implementation.

Figure 1. Partial benchmark results from repository codes normalized to OpenMP code. See the full list at GitHub.

Zheming’s repository is an ongoing work that began with studying the Rodinia benchmark to understand how to migrate programs to oneAPI and evaluate the benefit of the new programming model on Intel architectures. These studies are documented in several reports,⁴ which show improved performance using SYCL on Intel xPUs and higher coding productivity. For example, comparing SYCL and OpenCL standard implementations of Heart Wall and Particle Filter, a SYCL implementation of Heart Wall ran 15 percent faster on Intel® Iris® Plus graphics than an OpenCL standard implementation, while the Particle Filter SYCL code ran 4.5X faster on an Intel® Xeon® processor with four cores.⁴ Both programs required 52 percent (for Heart Wall) and 38 percent (for Particle Filter) fewer lines of code, making actual programming more productive.⁴

Zheming encourages other developers creating similar benchmarks to consider merging GitHub projects for the community.

Enabling Technologies

Zheming used the following tools and technologies to help build his repository.

Software:

Intel® oneAPI Base Toolkit
Intel® oneAPI HPC Toolkit
Intercept Layer for OpenCL™ Applications
Intel® Parallel Studio and Intel® VTune™ Profiler tools

Hardware:

Intel Xeon E3-1284L processor and Intel Xeon E2176G processor
P6300 graphics (Gen8) from Intel and Intel® UHD Graphics 630 (Gen9.5)
Intel® DevCloud

Resources and Recommendations

Develop, test, and run your solution with access to the latest Intel® hardware and oneAPI software on Intel DevCloud for oneAPI
SYCL Overview

Footnotes

1. See projects at Intel DevMesh

2. Codeplay Contribution Brings NVIDIA Support For SYCL Developers

3. Intel oneAPI Is Coming to AMD Radeon* GPUs

4. Improve the Performance of Medical Imaging Applications Using SYCL. Also see A Case Study with the HACCmk Kernel in SYCL and A Case Study of k-means Clustering Using SYCL

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

A Repository for Evaluating Performance and Productivity of oneAPI