Intel® OPA Performance

Determining Performance

In high-performance computing (HPC), Message Passing Interface (MPI) benchmarks are used to demonstrate the performance capability of the cluster network. While application performance is the most important result, benchmarking generally starts with standard micro-benchmarks that determine best-case MPI latency, bandwidth, and message rate. An HPC cluster is only as fast as the communication between its individual servers. Intel® Omni-Path Architecture (Intel® OPA) has been designed to meet the requirements of clusters from small to large scale, including ground-breaking quality of service (QoS) features meant to keep latency low and bandwidth and message rate high, even at scale.

The following three benchmarks compare MPI latency, bandwidth, and message rate between two nodes using Intel® OPA and EDR InfiniBand* (IB). The Intel® OPA measurements use the Intel® MPI Library, and the EDR IB* measurements use Open MPI 3.1.0 and HPC-X 2.1.0 [1][2].

MPI Latency

This figure compares point-to-point latency for Intel® OPA vs. EDR IB* using the IMB PingPong test run with the Intel® MPI Library. Throughout the entire message-size curve, Intel® OPA performance is competitive with or faster than EDR IB*. Unlike EDR IB*, which automatically disables forward error correction (FEC) on copper cables of 2m or shorter [3], Intel® OPA always has error detection and correction enabled through a feature known as Packet Integrity Protection (PIP). Intel® OPA delivers this low latency even with error detection and correction enabled. EDR IB* has a bit error ratio (BER) of 1e-15 when FEC is not enabled [4], and all errors are end to end [5]. In comparison, Intel® OPA has an effective BER of 3e-29, so the EDR IB* end-to-end error rate is up to 33 trillion times higher than Intel® OPA (1e-15 / 3e-29 ≈ 3.3 x 10^13), ultimately affecting job completion times.
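
For context on what the PingPong test measures: one rank on each node bounces a small message back and forth, and the reported latency is half the average round-trip time. The following minimal MPI sketch illustrates the idea; it is not the IMB source code, and the message size, iteration count, and warm-up count are illustrative assumptions.

/* pingpong.c - minimal one-way latency sketch (conceptual stand-in for IMB PingPong).
 * Run with exactly two ranks, one per node. All counts below are assumed values. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int warmup = 100, iters = 1000;   /* assumed counts */
    char buf[8] = {0};                      /* small 8-byte message */
    double t0 = 0.0;

    for (int i = -warmup; i < iters; i++) {
        if (i == 0) t0 = MPI_Wtime();       /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* one-way latency = half of the average round-trip time */
    if (rank == 0)
        printf("latency: %.2f usec\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

In practice the packaged Intel® MPI Benchmarks binary (IMB-MPI1, typically launched with the PingPong argument and one rank per node) is used rather than hand-written code like the above.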

MPI Bandwidth

Using the streaming IMB Uniband benchmark with only one core per node, both fabrics quickly ramp up to nearly the full 100 Gbps line rate.
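
The Uniband pattern achieves this by keeping a window of non-blocking sends in flight in one direction before synchronizing, so the link stays saturated instead of waiting on each message in turn. The sketch below illustrates the idea; the window size, message size, and burst count are illustrative assumptions, not the IMB defaults.

/* uniband_bw.c - minimal one-direction streaming-bandwidth sketch (Uniband-style).
 * Rank 0 posts a window of non-blocking sends; rank 1 posts matching receives. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                              /* outstanding messages per burst (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg_size = 1 << 20;              /* 1 MiB per message (assumed) */
    const int bursts = 100;
    char *buf = malloc((size_t)WINDOW * msg_size);
    MPI_Request req[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);               /* start both ranks together */
    double t0 = MPI_Wtime();
    for (int b = 0; b < bursts; b++) {
        for (int w = 0; w < WINDOW; w++) {
            char *p = buf + (size_t)w * msg_size;   /* separate buffer per pending message */
            if (rank == 0)
                MPI_Isend(p, msg_size, MPI_CHAR, 1, w, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(p, msg_size, MPI_CHAR, 0, w, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double gbytes = (double)msg_size * WINDOW * bursts / 1e9;
        printf("bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}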

Message Rate

Intel® OPA delivers a significantly higher 8-byte MPI message rate than EDR IB*. These tests use the IMB Uniband and Biband benchmarks, with every core on one node sending to and receiving from a partner core on the neighboring node.
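
Message rate is measured the same way as streaming bandwidth but with tiny 8-byte payloads, so the result reflects how many sends per second the fabric can process rather than how many bytes it can move. The sketch below shows a single sender/receiver pair; in the measurement described above, one such pair runs per physical core and the per-pair rates are summed. The window and burst counts are illustrative assumptions.

/* msgrate.c - minimal 8-byte message-rate sketch (Uniband-style, one rank pair). */
#include <mpi.h>
#include <stdio.h>

#define WINDOW 256                /* outstanding 8-byte messages per burst (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[WINDOW][8] = {{0}};  /* one small buffer per pending message */
    MPI_Request req[WINDOW];
    const int bursts = 10000;

    MPI_Barrier(MPI_COMM_WORLD);  /* start both ranks together */
    double t0 = MPI_Wtime();
    for (int b = 0; b < bursts; b++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf[w], 8, MPI_CHAR, 1, w, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(buf[w], 8, MPI_CHAR, 0, w, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("message rate: %.2f million msg/s\n",
               (double)WINDOW * bursts / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}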

Application Performance

While micro-benchmarks demonstrate the potential capability of a high-performance computing (HPC) fabric through isolated bandwidth, message rate, and latency tests, real application performance is the end goal in HPC. Below are examples of how Intel® Omni-Path Architecture (Intel® OPA) helps maximize cluster performance across a wide range of applications.

This figure compares application performance for Intel® OPA relative to InfiniBand* Enhanced Data Rate (EDR) using 16 dual-socket server nodes, each populated with two Intel® Xeon® Platinum 8170 processors. The 51 workloads are sorted by application segment/vertical. Overall, Intel® OPA and EDR IB* performance is very competitive: averaged across the applications shown, Intel® OPA performance is 2% higher than EDR IB*. While the application performance of Intel® OPA and EDR IB* is comparable, Intel® OPA offers a better total cost of ownership (TCO), allowing more compute or storage hardware within a fixed cluster budget [6].

Configuration Details (Internal Testing – Single Switch)

General/Common Configurations (EDR IB* and Intel® OPA)

Processor and memory: Dual-socket Intel® Xeon® Platinum 8170 processor nodes, 192 GB 2666 MHz DDR4 memory per node. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. Unless otherwise noted, one MPI rank per physical CPU core is used.

Kernel & CPU microcode: 3.10.0-693.21.1.el7.x86_64, microcode 0x2000043. Variants 1, 2, and 3 mitigated.

Network software: EDR IB*: MLNX_OFED_LINUX-4.3-1.0.1.0. Intel® OPA: Intel® Fabric Suite 10.6.1.

Network hardware: Intel® OPA: Intel Corporation Device 24f0, 100 Series Host Fabric Interface (HFI), with a 100 Series 48-port Edge switch. EDR IB*: Mellanox Technologies MT27800 Family [ConnectX-5], with a Mellanox MSB7800 36-port EDR InfiniBand* switch. All measurements for both fabrics use 2m copper cables between hosts and switches.

Operating system: Red Hat Enterprise Linux* Server release 7.4.

MPI library & compilers (applications): Intel® MPI Library 2018 Update 1; compilers from Intel parallel_studio_xe_2018.1.038. Intel® OPA: the better performing of I_MPI_FABRICS=[shm:tmi or tmi]. EDR IB*: the better performing of I_MPI_FABRICS=[shm:ofa, ofa, shm:dapl, or dapl], with I_MPI_DAPL_TRANSLATION_CACHE=1, I_MPI_DAPL_UD_TRANSLATION_CACHE=1, I_MPI_OFA_TRANSLATION_CACHE=1.

MPI library & compilers (latency, bandwidth, message rate): EDR IB*: Open MPI 3.1.0 and HPC-X 2.1.0 [1], default run flags. Intel® OPA: Intel® MPI Library 2018 Update 1, I_MPI_FABRICS=ofi, libfabric 1.5.3.

Storage: All input/output performed over NFSv3 with 1GbE to Intel SSDSC2BB48 drives.

Application-specific Configurations 

BSMBench - An HPC Benchmark for BSM Lattice Physics Version 1.0. 32 ranks per node. Parameters: global size is 64x32x32x32, proc grid is 8x4x4x4. Machine config build file: cluster.cfg.

GROMACS version 2016.2, http://www.prace-ri.eu/UEABS/GROMACS/1.2/GROMACS_TestCaseB.tar.gz, lignocellulose-rf benchmark. Build flags: -g -static-intel, CC=mpicc CXX=mpicxx, -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX512, GMX_OPENMP_MAX_THREADS=256. Run detail: gmx_mpi mdrun -s run.tpr -gcom 20 -resethway -noconfout.

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), Feb 16, 2016 stable version release. Official Git mirror for LAMMPS (http://lammps.sandia.gov/download.html). 52 ranks per node and 2 OMP threads per rank. Common parameters: I_MPI_PIN_DOMAIN=core. Run detail: number of time steps=100, warm-up time steps=10 (not timed); number of copies of the simulation box in each dimension: 8x8x4; problem size: 8x8x4x32k = 8,192k atoms. Build parameters: Modules: yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule yes-mpiio yes-opt yes-replica yes-rigid yes-user-omp yes-user-intel. Binary to be built: lmp_intel_cpu. Runtime LAMMPS parameters: -pk intel 0 -sf intel -v n 1.

NWCHEM release 6.6. Binary: nwchem_armci-mpi_intel-mpi_mkl with MPI-PR run over MPI-1. Workload: siosi3 and siosi5. http://www.nwchem-sw.org/index.php/Main_Page. 2 ranks per node, 1 rank for computation and 1 rank for communication. -genv CSP_VERBOSE 1 -genv CSP_NG 1 -genv LD_PRELOAD libcasper.so. 

LS-DYNA, A Program for Nonlinear Dynamic Analysis of Structures in Three Dimensions. Example pfile: gen { nodump nobeamout dboutonly } dir { global one_global_dir local /tmp/3cars }. Higher performance shown with mpp s R8.1.0 Revision 105896 or mpp s R9.1.0 Revision 113698.

NAMD version 2.10b2, stmv and apoa1 benchmark. Build detail: CHARM 6.6.1. FFTW 3.3.4. Relevant build flags: ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --cxx icpc --cc icc --with-fftw3.

OpenFOAM* is a free, open-source CFD software package developed primarily by OpenCFD (http://www.openfoam.com). Version v1606+. GCC version 4.8.5 with the Intel® MPI Library. All default make options. This offering is not approved or endorsed by OpenCFD Limited, producer and distributor of the OpenFOAM* software via www.openfoam.com, and owner of the OpenFOAM* and OpenCFD trade marks. OpenFOAM* is a registered trade mark of OpenCFD Limited, producer and distributor of the OpenFOAM* software via www.openfoam.com.

Quantum ESPRESSO is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. http://www.quantum-espresso.org/. Build: ./configure --enable-openmp --enable-parallel. BLAS_LIBS=-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core, ELPA_LIBS_SWITCH=enabled, SCALAPACK_LIBS=$(TOPDIR)/ELPA/libelpa.a -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64, DFLAGS=-D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__ELPA -D__OPENMP $(MANUAL_DFLAGS). AUSURF112 benchmark, all default options.

SPECFEM3D_GLOBE simulates three-dimensional global and regional seismic wave propagation based on the spectral-element method (SEM). It is a time-stepping code that simulates the propagation of seismic waves given the initial conditions and the mesh coordinates/details of the earth's crust. small_benchmark_run_to_test_more_complex_Earth benchmark, default input settings. specfem3d_globe-7.0.0. Build: FC=mpiifort CC=mpiicc MPIFC=mpiifort FCFLAGS=-g -xCORE-AVX2 CFLAGS=-g -O2 -xCORE-AVX2. Run scripts: run_this_example.sh and run_mesher_solver.sh, NCHUNKS=6, NEX_XI=NEX_ETA=80, NPROC_XI=NPROC_ETA=10. 600 cores used, 52 cores per node.

SPEC MPI2007, https://www.spec.org/mpi/. Intel internal measurements, marked as estimates until published. Compiler options: -O3 -xCORE-AVX2 -no-prec-div. Intel® MPI Library: mpiicc, mpiifort, mpiicpc. Open MPI: mpicc, mpifort, mpicxx. Run detail: mref and lref suites, 3 iterations. 121.pop2: CPORTABILITY=-DSPEC_MPI_CASE_FLAG. 126.lammps: CXXPORTABILITY=-DMPICH_IGNORE_CXX_SEEK. 127.wrf2: CPORTABILITY=-DSPEC_MPI_CASE_FLAG -DSPEC_MPI_LINUX. 129.tera_tf=default=default=default: srcalt=add_rank_support. 130.socorro=default=default=default: srcalt=nullify_ptrs, FPORTABILITY=-assume nostd_intent_in, CPORTABILITY=-DSPEC_EIGHT_BYTE_LONG, CPORTABILITY=-DSPEC_SINGLE_UNDERSCORE.

WRF - Weather Research & Forecasting Model (http://www.wrf-model.org/index.php) version 3.5.1. Compiler options: -xCORE-AVX2 -O3. NetCDF 4.4.1.1 built with icc. NetCDF-Fortran version 4.4.4 built with icc.

Product and Performance Information

[1] Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
[2] Performance results are based on testing as of July 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel provides these materials as-is, with no express or implied warranties. Intel, the Intel logo, Intel Core, Intel Optane, Intel Inside, Pentium, Xeon and others are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
[3] SwitchIB-FW-11_1200_0102-release_notes.pdf: “Removed out-of-the-box FEC, reaching 90ns latency, on Mellanox GA level copper cables equal to or shorter than 2m.”
[4] https://community.mellanox.com/docs/DOC-2725
[5] https://cw.infinibandta.org/document/dl/7141
[6] Configurations: Intel® Omni-Path Architecture: Configuration assumes a 750-node cluster, and the number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully populated 768-port director switch, and the Mellanox EDR solution uses a combination of director switches and edge switches. Includes hardware acquisition costs (server and fabric), 24x7 3-year support (Mellanox Gold support), and 3-year power and cooling costs. Mellanox and Intel® OPA component pricing from www.kernelsoftware.com, with prices as of March 20, 2018. Mellanox power data based on the Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-5 VPI adapter card product briefs posted on www.mellanox.com as of August 15, 2017. Intel® OPA power data based on product briefs posted on www.intel.com as of April 4, 2017. Power and cooling costs based on $0.1071 per kWh, and assume server power costs and server cooling costs are equal and additive.