

# Intel® Omni-Path Fabric Software Architecture Overview

Todd Rimmer, DCG Architecture

**Intel Corporation** 

November, 2016

# Legal Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.



# Intel® Scalable System Framework









# MANY WORKLOADS - ONE FRAMEWORK

A Flexible Framework for Today & Tomorrow



Enabling Breakthrough System Performance

## Intel® Omni-Path Architecture

## Evolutionary Approach, Revolutionary Features, End-to-End Solution

#### **Building on the industry's best technologies**

- Highly leverages existing Aries and Intel® True Scale fabric
- Adds innovative new features and capabilities to improve performance, reliability, and QoS
- Re-use of existing OpenFabrics Alliance\* software

#### Robust product offerings and ecosystem

- End-to-end Intel product line
- >100 OEM designs<sup>1</sup>
- Strong ecosystem with 70+ Fabric Builders members







OEM custom designs **HFI and Switch ASICs** 

Intel® Omni-Path Architecture HFI: HFI silicon, up to 2 ports (50 GB/s total b/w)

Intel® Omni-Path Architecture 48 Radix Switch: switch silicon, up to 48 ports (1200 GB/s total b/w)
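The two "total b/w" figures can be reproduced with simple arithmetic. The sketch below assumes a 100 Gb/s link rate per port per direction and assumes that "total b/w" counts both directions; neither assumption is stated on the slide, and the function name is illustrative only.

```python
# Assumption (not stated on the slide): each port runs at 100 Gb/s per
# direction, and "total b/w" sums transmit and receive.
LINK_GBITS_PER_DIRECTION = 100


def total_bandwidth_gbytes(ports, link_gbits=LINK_GBITS_PER_DIRECTION):
    """Aggregate bidirectional bandwidth in GB/s for a device with `ports` ports."""
    per_direction_gbytes = link_gbits / 8      # 100 Gb/s -> 12.5 GB/s
    return ports * per_direction_gbytes * 2    # count both directions


print(total_bandwidth_gbytes(2))    # HFI silicon, up to 2 ports  -> 50.0
print(total_bandwidth_gbytes(48))   # switch silicon, up to 48 ports -> 1200.0
```

Under those assumptions the results match the 50 GB/s and 1200 GB/s quoted for the HFI and switch silicon.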

#### Software

Open Source Host Software and Fabric Manager



#### Cables

Third Party Vendors
Passive Copper
Active Optical



<sup>1</sup> Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans. \*Other names and brands may be claimed as property of others.

## Intel<sup>®</sup> Omni-Path Architecture

## Disruptive innovations to knock down the "I/O Wall"



#### **HIGHER PERFORMANCE**

Accelerates discovery and innovation

Up to 21% lower latency at scale, up to 17% higher messaging rate, and up to 9% higher application performance than InfiniBand EDR<sup>1</sup>



#### **BETTER ECONOMICS**

Reduces size of fabric budgets.
Use savings to purchase more compute



Better price-performance than InfiniBand\* EDR reduces fabric spend for a given cluster size. Use savings to get more compute nodes with the same total budget<sup>2</sup>



#### **MORE POWER EFFICIENT**

More efficient switches and cards, and a reduction in switch count and cables due to the 48-port chip architecture
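Why a higher-radix switch chip reduces switch count can be illustrated with the standard two-tier, full-bisection fat-tree sizing rule. This is a simplified sketch, not Intel's configuration tool: it counts whole switches rather than the chips inside a director, and the function names are illustrative.

```python
import math


def max_hosts_two_tier(radix):
    """Largest full-bisection two-tier fat tree: `radix` edge switches,
    each using half its ports for hosts and half for uplinks."""
    return radix * radix // 2


def switches_two_tier(hosts, radix):
    """Edge plus core switches needed for `hosts` nodes in two tiers."""
    if hosts > max_hosts_two_tier(radix):
        raise ValueError("cluster does not fit in two tiers at this radix")
    half = radix // 2
    edge = math.ceil(hosts / half)            # half the ports face hosts
    core = math.ceil(edge * half / radix)     # cores terminate all uplinks
    return edge + core


print(max_hosts_two_tier(48))       # 1152
print(max_hosts_two_tier(36))       # 648
print(switches_two_tier(750, 48))   # 48
```

With these assumptions a 750-node cluster fits in two tiers of 48-port switches, while 36-port switches top out at 648 nodes in two tiers and would need a third tier, which adds switches and cables.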





#### **GREATER RESILIENCY**

"No compromise" error detection; maintains link continuity with lane failures

No additional latency penalty for error detection with Packet Integrity Protection<sup>4</sup>

¹ Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, Snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package. EDR IB® testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch. Between the MLNX\_OFED\_Linux-3.2.x. OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v. 5.0. osu\_mbw\_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time, average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM\_TLS=self,rc, and -mca pml yalla tunings. All measurements include one switch hop. Latency claim: HPCC 1.4.3 Random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion\_channel benchmark. 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Intel® MPI Library 2017.0.064. Additional configuration details available on request.

² Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switches and 36-port edge switches. Intel and Mellanox Component pricing from www.kernelsoftware.com, with prices as of October 20, 2016. Assumes \$6,200 for a 2-socket Intel® Xeon® processor based compute node.

³ Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configu



# **New Intel<sup>®</sup> OPA Fabric Features:** Fine-grained Control Improves Resiliency and Optimizes Traffic Movement







|                                | Description                                                                                                                                                                                                                       | Benefits                                                                                                                                                                                  |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Traffic Flow<br>Optimization   | <ul> <li>Optimizes Quality of Service (QoS) in mixed traffic environments, such as storage and MPI</li> <li>Transmission of lower-priority packets can be paused so higher-priority packets can be transmitted</li> </ul>         | <ul> <li>Ensures high-priority traffic is not delayed → Faster time to solution</li> <li>Deterministic latency → Lowers run-to-run timing inconsistencies</li> </ul>                       |
| Packet Integrity<br>Protection | <ul> <li>Allows for rapid and transparent recovery of transmission errors on an Intel® OPA link without additional latency</li> <li>Resends 1056-bit bundle w/errors only instead of entire packet (based on MTU size)</li> </ul> | <ul> <li>Fixes happen at the link level rather than end-to-end level</li> <li>Much lower latency than Forward Error Correction (FEC) defined in the InfiniBand* specification¹</li> </ul> |
| Dynamic Lane<br>Scaling        | <ul> <li>Maintains link continuity in the event of a failure of one or more physical lanes</li> <li>Operates with the remaining lanes until the failure can be corrected at a later time</li> </ul>                                | <ul> <li>Enables a workload to continue to completion. Note: InfiniBand will shut down the entire link in the event of a physical lane failure</li> </ul>                     |

<sup>1</sup> Lower latency based on the use of InfiniBand with Forward Error Correction (FEC) Mode A or C in the public presentation titled "Option to Bypass Error Marking (supporting comment #205)," authored by Adee Ran (Intel) and Oran Sela (Mellanox), January 2013. Mode A modeled to add as much as 140ns latency above baseline, and Mode C can add up to 90ns latency above baseline. Link: www.ieee802.org/3/bj/public/jan13/ran_3bj_01a_0113.pdf
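To make the Packet Integrity Protection row concrete, the sketch below compares how many bits are resent when only the damaged 1056-bit bundle is retransmitted against resending the whole packet. The 8192-byte packet size is an assumption chosen for illustration, headers are ignored, and the function name is hypothetical.

```python
BUNDLE_BITS = 1056  # bundle size quoted in the table above


def resend_cost_bits(packet_bytes, bundles_in_error=1):
    """Return (bits resent with bundle-level retry, bits resent with
    whole-packet retry). Packet headers are ignored for simplicity."""
    whole_packet = packet_bytes * 8
    bundles_only = bundles_in_error * BUNDLE_BITS
    return bundles_only, whole_packet


# Assumed example: one damaged bundle inside an 8192-byte packet.
bundle_bits, packet_bits = resend_cost_bits(8192)
print(bundle_bits, packet_bits)             # 1056 65536
print(round(packet_bits / bundle_bits, 1))  # roughly 62x fewer bits resent
```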



# Intel® Omni-Path Architecture Rapid Ramp



It's been quite a year since launch at Supercomputing 2015



## Major wins across the globe:





























50% MSS of 100Gb systems on the June '16 Top500 list<sup>1</sup>

Expect dramatic increase for November '16 list next week

Gained significant momentum in the market<sup>2</sup>

# Intel® Omni-Path Architecture: Software Components and Usage Model

Figure labels: Station, Director, Switches, Intel® OPA Fabric, Compute Nodes, Mgmt Node 2

#### **Element Management Stack**

- Runs on embedded Intel Atom processor included in managed switches
- "Traditional System Mgmt" e.g. Signal integrity, Thermal monitoring, voltage monitoring, etc.

#### **Host Software Stack**

- Runs on all Intel® OPA-connected host nodes
- High performance, highly scalable MPI implementation via PSM and extensive set of upper layer protocols
- Boot over Fabric

#### Fabric Management GUI

- Runs on workstation with a local screen/keyboard
- Provides interactive GUI access to Fabric Management TCO features (configuration, monitoring, diagnostics, element management drill down)

#### **Fabric Management Stack**

- Runs on OPA-connected management nodes or switch embedded Atom processor
- Initializes, configures and monitors the fabric routing, QoS, security, and performance
- Includes toolkit for TCO functions: configuration, monitoring, diags, and repair



## Intel® Omni-Path Architecture

## Optimized host implementation



# Host Strategy: Leverage OpenFabrics Alliance\* (OFA)

- OpenFabrics Alliance compliant: Off-the-shelf application compatibility
- Provides an extensive set of mature upper layer protocols
- Integrates 4<sup>th</sup> generation proven, scalable PSM capability for HPC
- OpenFabrics Interface (OFI) API aligned with application requirements

#### Access: Open Source Key Elements

- Host software stack via OFA
- Intel® Omni-Path FastFabric Tools, Fabric Manager, and GUI

#### Channels: Integrate into Linux\* Distributions

- Intel® Omni-Path Architecture support included in standard distributions
  - Starting with RHEL 7.3 and SLES 12sp2
- Delta distribution of OFA stack atop Linux distributions as needed

## Maintains existing HPC fabric software approach



# Onload vs. Offload: InfiniBand HCA vs. Intel® OPA HFI

Low latency, consistent access to all connection state



# Choosing the Right Data Transfer Approach: MPI and Storage Traffic Performance Needs



# Multimodal Data Acceleration

## Highest performance small message transfer



#### Host Driven Send

- Optimizes latency and message rate for high priority messages
- Transfer time lower than memory handle exchange, memory registration

#### Receive Buffer Placement

- Data placed in receive buffers
- Buffers copied to application buffer

## Multimodal Data Acceleration

## Lowest overhead RDMA-based large message transfer



#### Send DMA (SDMA) Engine

- Stateless offloads on send side
- DMA setup required

#### Direct Data Placement

- Direct data placement on receive side
- Eliminates memory copy
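The two modes on these slides amount to a size-based choice of send path: small messages go out via host-driven send and are copied out of receive buffers, while large messages use the SDMA engine with direct data placement. The following is a minimal sketch of that decision only. The 8 KiB crossover is a made-up placeholder, not a documented PSM or driver default, and the names are hypothetical.

```python
from enum import Enum


class SendPath(Enum):
    HOST_DRIVEN = "host-driven send, receive buffer copied to application buffer"
    SDMA = "SDMA engine send, direct data placement on receive"


# Hypothetical crossover point; the real value depends on the software stack
# and its tuning and is not given in the slides.
THRESHOLD_BYTES = 8 * 1024


def choose_send_path(message_bytes, threshold=THRESHOLD_BYTES):
    """Pick the low-latency path for small messages and the low-overhead
    path for large ones."""
    if message_bytes <= threshold:
        return SendPath.HOST_DRIVEN   # setup cost would exceed transfer time
    return SendPath.SDMA              # DMA setup amortized over a large payload


print(choose_send_path(64).name)        # HOST_DRIVEN
print(choose_send_path(1 << 20).name)   # SDMA
```

The design point is the one stated on the slide: for small messages the transfer itself takes less time than memory registration and handle exchange, so skipping that setup wins even though a copy is needed.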

# Intel® Omni-Path Architecture (Intel® OPA)

Application Performance - Intel® MPI - 16 Nodes



\*Spec MPI2007 results estimates until published

No Intel® OPA or EDR specific optimizations applied to any workloads except LS-DYNA and ANSYS Fluent: Intel® OPA HFI driver parameter: eager\_buffer\_size=8388608. WIEN2k comparison is for 8 nodes because EDR IB\* measurements did not scale above 8 nodes.



\*\*see following slide for system configurations

# Intel® Omni-Path Architecture (Intel® OPA)

Application Performance Per Fabric Dollar\* - Intel® MPI - 16 Nodes





\*Spec MPI2007 results estimates until published

\*\*see following slide for system configurations

No Intel® OPA or EDR specific optimizations applied to any workloads except LS-DYNA and ANSYS Fluent: Intel® OPA HFI driver parameter: eager\_buffer\_size=8388608. WIEN2k comparison is for 8 nodes because EDR IB\* measurements did not scale above 8 nodes.

\*All pricing data obtained from www.kernelsoftware.com May 4, 2016. All cluster configurations estimated via internal Intel configuration tool. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Fabric hardware assumes one edge switch, 16 network adapters and 16 cables.



## System & Software Configuration [include with previous slides]

Common configuration for bullets 1-11 unless otherwise specified: Intel® Xeon® Processor E5-2697A v4 dual socket servers. 64 GB DDR4 memory per node, 2133 MHz. RHEL 7.2. BIOS settings: Snoop hold-off timer = 9, Early snoop disabled, Cluster on die disabled. IOU Non-posted prefetch disabled. Intel® Omni-Path Architecture (Intel® OPA): Intel Fabric Suite 10.0.1.0.50. Intel Corporation Device 24f0 – Series 100 HFI ASIC (Production silicon). OPA Switch: Series 100 Edge Switch – 48 port (Production silicon). EDR Infiniband: MLNX\_OFED\_LINUX-3.2-2.0.0 (OFED-3.2-2.0.0), Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch.

- 1. WIEN2k version 14.2. http://www.wien2k.at/. http://www.wien2k.at/reg\_user/benchmark/. Run command: "mpirun ... lapw1c\_mpi lapw1.def". Intel Fortran Compiler 17.0.0 20160517. Compile flags: -FR -mp1 -w -prec\_div -pc80 -pad -ip -DINTEL\_VML -traceback -assume buffered\_io -DFFTW3 -I/opt/intel/compilers\_and\_libraries\_2017.0.064/linux/mkl/include/fftw/ -DParallel. shm:tmi fabric used for Intel® OPA and shm:dapl fabric used for EDR IB\*.
- 2. GROMACS version 5.0.4. Intel Composer XE 2015.1.133. Intel MPI 5.1.3. FFTW-3.3.4. ~/bin/cmake .. -DGMX\_BUILD\_OWN\_FFTW=OFF -DREGRESSIONTEST\_DOWNLOAD=OFF -DCMAKE\_C\_COMPILER=icc -DCMAKE\_CXX\_COMPILER=icpc -DCMAKE\_INSTALL\_PREFIX=~/gromacs-5.0.4-installed. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl
- 3. NWChem release 6.6. Binary: nwchem\_comex-mpi-pr\_mkl with MPI-PR run over MPI-1. Workload: siosi3 and siosi5. Intel® MPI Library 2017.0.064. 2 ranks per node, 1 rank for computation and 1 rank for communication. shm:tmi fabric for Intel® OPA and shm:dapl fabric for EDR, all default settings. Intel Fabric Suite 10.2.0.0.153. http://www.nwchem-sw.org/index.php/Main\_Page
- 4. LS-DYNA MPP R8.1.0 dynamic link. Intel Fortran Compiler 13.1 AVX2. Intel® OPA Intel MPI 2017 Library Beta Release Candidate 1. mpi.2017.0.0.BETA.U1.RC1.x86\_64.ww20.20160512.143008. MPI parameters: I\_MPI\_FABRICS=shm:tmi. HFI driver parameter: eager\_buffer\_size=8388608. EDR MPI parameters: I\_MPI\_FABRICS=shm:ofa.
- 5. ANSYS Fluent v17.0, Rotor\_3m benchmark. Intel® MPI Library 5.0.3 as included with Fluent 17.0 distribution, and libpsm\_infinipath.so.1 added to the Fluent syslib library path for PSM/PSM2 compatibility. Intel® OPA MPI parameters: -pib.infinipath, EDR MPI parameters: -pib.dapl
- 6. NAMD: Intel Composer XE 2015.1.133. NAMD V2.11, Charm 6.7.0, FFTW 3.3.4. Intel MPI 5.1.3. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl
- 7. Quantum Espresso version 5.3.0. Intel Compiler 2016 Update 2. ELPA 2015.11.001 (http://elpa.mpcdf.mpg.de/elpa-tar-archive). Minor patch set for QE to accommodate latest ELPA. Most optimal NPOOL, NDIAG, and NTG settings reported for both OPA and EDR. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:dapl
- 8. CD-adapco STAR-CCM+\* version 11.04.010. Workload: lemans\_poly\_17m.amg.sim benchmark. Intel MPI version 5.0.3.048. 32 ranks per node. OPA command: \$ /starccm+ -ldlibpath /STAR-CCM+11.04.010/mpi/intel/5.0.3.048/linux-x86\_64/lib64 -ldpreload /usr/lib64/psm2-compat/libpsm\_infinipath.so.1 -mpi intel -mppflags "-env I\_MPI\_DEBUG 5 -env I\_MPI\_FABRICS shm:tmi -env I\_MPI\_TMI\_PROVIDER psm" -power -rsh ssh -np 512 -machinefile hosts -benchmark:"-nps 512,256,128,64,32 -nits 20 -preits 40 -tag lemans\_poly\_17m.amg.sim. EDR command: \$ /starccm+ -mpi intel -mppflags "-env I\_MPI\_DEBUG 5" -power -rsh ssh -np 512 -machinefile hosts -benchmark:"-nps 512,256,128,64,32 -nits 20 -preits 40 -tag lemans\_edr\_n16" lemans\_poly\_17m.amg.sim
- 9. LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) Feb 16, 2016 stable version release. MPI: Intel® MPI Library 5.1 Update 3 for Linux. Workload: Rhodopsin protein benchmark. Number of time steps=10, warm up time steps=10 (not timed). Number of copies of the simulation box in each dimension: 8x8x4 and problem size: 8x8x4x32k = 8,192k atoms. Intel® OPA: MPI parameters: I\_MPI\_FABRICS=shm:tmi, I\_MPI\_PIN\_DOMAIN=core. EDR: MPI parameters: I\_MPI\_FABRICS=shm:dapl, I\_MPI\_PIN\_DOMAIN=core
- 10. WRF version 3.5.1, Intel Composer XE 2015.1.133. Intel MPI 5.1.3. NetCDF version 4.4.2. FCBASEOPTS=-w -ftz -align all -fno-alias -fp-model precise. CFLAGS\_LOCAL = -w -O3 -ip. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl
- 11. Spec MPI 2007: 16 nodes, 32 MPI ranks/node. SPEC MPI2007, Large suite, https://www.spec.org/mpi/. \*Intel Internal measurements marked estimates until published. Intel MPI 5.1.3. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl
- Common configuration for bullets 12-13: Intel® Xeon® Processor E5-2697 v4 dual socket servers. 128 GB DDR4 memory per node, 2400 MHz. RHEL 6.5. BIOS settings: Snoop hold-off timer = 9. Intel® OPA: Intel Fabric Suite 10.0.1.0.50. Intel Corporation Device 24f0 Series 100 HFI ASIC (Production silicon). OPA Switch: Series 100 Edge Switch 48 port (Production silicon). IOU Non-posted prefetch disabled. Mellanox EDR based on internal measurements: Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 36 Port EDR Infiniband switch. IOU Non-posted prefetch enabled.
- 12. MiniFE 2.0, Intel compiler 16.0.2. Intel® MPI Library version 5.1.3. Build settings: -O3 -xCORE-AVX2 -DMINIFE\_CSR\_MATRIX -DMINIFE\_GLOBAL\_ORDINAL="long long int", mpirun -bootstrap ssh -env OMP\_NUM\_THREADS 1 perhost 36 miniFe.x nr. 2200 nr. 2200, 200x200x200 grid using 36 MPI ranks pinned to 36 cores per node. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl. Intel® Turbo Mode technology and Intel® Hyper threading technology disabled.
- 13. VASP (developer branch). MKL: 11.3 Update 3 Product build 20160413. Compiler: 2016u3. Intel MPI-2017 Build 20160718. elpa-2016.05.002. Intel® OPA MPI parameters: I\_MPI\_FABRICS=shm:tmi, EDR MPI parameters: I\_MPI\_FABRICS=shm:dapl, I\_MPI\_DAPL\_PROVIDER=ofa-v2-mlx5\_0-1u, I\_MPI\_DAPL\_DIRECT\_COPY\_THRESHOLD=331072. Intel® Turbo Mode technology disabled. Intel Hyper Threading technology enabled.



## **Intel® OPA Management Software**



- Intel® OPA leverages existing stacks for each type of management
- Assorted 3<sup>rd</sup> party unified management consoles
- Intel® OPA provides a scalable centralized fabric management stack

# **Advanced Switching Features**

- Route Balancing per Device Group
  - Scalable unit and cross sub-cluster storage vs compute optimization
- Self discovering FatTree
  - Flexibility in cable placement, option to handle HFIs at core
- Medium Grained Adaptive Routing
  - Can specify any and all potential alternate routes
- Dispersive Routing
  - Maximized dispersion and resiliency
- Congestion Control



## **Fabric Diagnostic and Debug Features**

In a typical cluster the majority of fabric FRUs are cables

Managing cable FRUs is one of the biggest sysadmin challenges



## Cable FRU management: Intel® OPA Fabric innovations

- Link Quality Indicator: "5 bars" instantaneous view of link quality
  - In every HW port, monitored by FM, FastFabric Tools, FM GUI
- Topology verification: are cables in the correct places?
  - FM can warn or quarantine incorrect links
  - FastFabric online and offline topology analysis
- Port type information
  - QSFP/Standard, Fixed/Backplane, Variable, Disconnected, ...
- QSFP CableInfo shows all key cable/transceiver info
  - Vendor, model, length, technology, date, etc.
  - Fully integrated into FM, FastFabric tools, FM GUI
- Link Down Reason
  - LinkDownReason and NeighborLinkDownReason report the most recent reason the link went down
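Topology verification boils down to comparing the planned cabling against what is discovered on the fabric. The sketch below shows that comparison on a hypothetical data model, where each link is a pair of (node name, port number) endpoints. It does not reflect the FM or FastFabric file formats or APIs.

```python
def normalize(link):
    """Order the two endpoints so that A-B and B-A compare equal."""
    return tuple(sorted(link))


def verify_topology(expected, discovered):
    """Return links that are planned but absent, and links found but unplanned."""
    exp = {normalize(link) for link in expected}
    got = {normalize(link) for link in discovered}
    return {"missing": sorted(exp - got), "unexpected": sorted(got - exp)}


# Hypothetical plan and discovery results; node02 is cabled to the wrong port.
expected = [(("node01", 1), ("edge1", 1)), (("node02", 1), ("edge1", 2))]
discovered = [(("edge1", 1), ("node01", 1)), (("node02", 1), ("edge1", 3))]

report = verify_topology(expected, discovered)
print(report["missing"])      # [(('edge1', 2), ('node02', 1))]
print(report["unexpected"])   # [(('edge1', 3), ('node02', 1))]
```

A tool built on this idea could warn on either list or, as the slide describes for the FM, quarantine the unexpected links.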





## **Fabric Diagnostic and Debug Features**

Fabric utilization and performance monitoring is critical to fabric operations

#### Intel® OPA Fabric Innovations

- FM monitors fabric and maintains history of fabric performance and errors
  - Over 130 performance counters per port
  - Including utilization, packet rate and congestion per VL
  - 64-bit counters (many decades to rollover)
- PM Device Groups allow performance analysis for a specific set of devices
  - Storage vs compute, etc.
- PM/PA database sync: PM data retained during FM failover
- Statistics available via FastFabric CLI, FM GUI, PA APIs
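The "many decades to rollover" claim for 64-bit counters can be sanity-checked with a back-of-the-envelope calculation. The sketch assumes a counter that increments once per byte on a link sustaining 12.5 GB/s (100 Gb/s) in one direction. The actual counter units are not given in the slides, and a coarser unit would only lengthen the time.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_rollover(units_per_second, counter_bits=64):
    """Years until a counter of `counter_bits` wraps at a constant rate."""
    return (2 ** counter_bits) / units_per_second / SECONDS_PER_YEAR


# Assumption: one count per byte at a sustained 12.5 GB/s.
BYTES_PER_SECOND = 12.5e9

print(f"64-bit: about {years_to_rollover(BYTES_PER_SECOND):.0f} years")
seconds_32 = years_to_rollover(BYTES_PER_SECOND, 32) * SECONDS_PER_YEAR
print(f"32-bit: about {seconds_32:.2f} seconds")
```

Under the same assumption a 32-bit counter would wrap in well under a second, which is why wide counters matter for a monitor that polls periodically.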







## **Omni-Path scalable FM GUI**

Advanced User Interface

Exposes New OPA Features



## **Fabric Security**





#### Intel® OPA Fabric Innovations

- SMA/PMA protection
  - Specific enablement of Mgmt. Nodes by switches
  - SMA and PMA protocols strictly protected
  - SMA pacing at non-mgmt. hosts limits SMA denial of service attacks
- Host spoofing prevention
  - SLID verification at neighbor switch
  - NodeType, NodeGUID and PortGUID verification (hardware assisted)
  - Topology verification by SM
    - Catches unexpected devices or attempts to spoof other devices
- Cluster information restricted
  - SA limits data available to non-mgmt. nodes
  - SSL security for FM GUI connection to FM





## **Virtual Fabrics Overview**

## Unifying Concept for Security and QoS

- Allow multiple applications on cluster at once
- Allow sysadmin to control degree of isolation
- Composed of application, device group, and policy

## **Applications**

Identified by PathRecord ServiceID or Multicast MGID

## Device Group

Identified by Node Names or other mechanisms

#### **Policies**

QoS settings (SL, BW, TFO, etc), security settings (Pkey)



## Virtual Fabrics Address Resolution

## Transparently Integrated into OFA mechanisms

- PathRecord query, RDMA CM connection establishment
- Multicast Join, IPoIB

## Implemented in FM's SA

- Resolves request, finds matching application, device group, and associated vFabric
- Returns proper SL, PKey, etc

## Supports Other Mechanisms

- VF Info queries to get SL, PKey, etc for a given vFabric
- Useful for scalable MPI job launch
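The resolution step described above (match the application, match the device group, return the vFabric's SL and PKey) can be sketched as a simple lookup. Everything below is illustrative: the class, the ServiceID values, the node names, and the SL/PKey numbers are assumptions, and a real SA resolves PathRecord queries with far more matching rules.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class VirtualFabric:
    name: str
    service_ids: FrozenSet[int]   # "application": identified by ServiceID
    members: FrozenSet[str]       # "device group": identified by node name
    sl: int                       # policy: service level
    pkey: int                     # policy: partition key


def resolve(vfabrics, service_id, src, dst):
    """Return (SL, PKey) of the first vFabric matching the application and
    containing both endpoints, or None if nothing matches."""
    for vf in vfabrics:
        if service_id in vf.service_ids and src in vf.members and dst in vf.members:
            return vf.sl, vf.pkey
    return None


# Hypothetical configuration for illustration only.
VFABRICS = [
    VirtualFabric("storage", frozenset({0x1000}), frozenset({"node01", "oss01"}), sl=1, pkey=0x8002),
    VirtualFabric("compute", frozenset({0x2000}), frozenset({"node01", "node02"}), sl=0, pkey=0x8003),
]

print(resolve(VFABRICS, 0x1000, "node01", "oss01"))   # (1, 32770)
print(resolve(VFABRICS, 0x1000, "node02", "oss01"))   # None
```

Because the answer comes from central configuration, applications that use the standard query paths pick up the right SL and PKey without hard-coding either.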



## **Virtual Fabrics Examples**

## **Default Virtual Fabrics Configuration**



## Simple QoS Virtual Fabrics Configuration



## **Virtual Fabrics Examples**

## Simple Security Virtual Fabrics Configuration



## **Virtual Fabrics Ground Rules**

## Always get the SL and PKey from the FM

- PathRecord query, Multicast Join, RDMA CM
- VF Info query and related scripts (opagetvf)
- A priori discussion with sysadmin and direct application parameters for SL and PKey

## Use scalable PathRecord query mechanisms

- RDMA CM, ibacm, kernel SA query
- Do not hand build your own SA MADs via umad
  - Often secured to root access and will not scale on large fabrics

## Be aware: 0x7fff/0xffff is the admin PKey

- Secured by default, not for use by application traffic
- Only includes SMA, PMA, SA, PA "applications"
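An application that accepts a PKey as a parameter can guard against accidentally using the admin partition. The sketch below assumes the InfiniBand P_Key layout, in which the top bit marks full membership and the low 15 bits identify the partition, so 0x7fff and 0xffff are the limited and full forms of the same key. The helper names are hypothetical.

```python
ADMIN_PKEY_BASE = 0x7FFF    # low 15 bits of the admin partition key
FULL_MEMBER_BIT = 0x8000    # top bit set: full member; clear: limited member


def is_admin_pkey(pkey):
    """True for both 0x7fff (limited) and 0xffff (full) admin PKeys."""
    return (pkey & 0x7FFF) == ADMIN_PKEY_BASE


def is_full_member(pkey):
    return bool(pkey & FULL_MEMBER_BIT)


def check_application_pkey(pkey):
    """Reject the admin PKey for application traffic; return the PKey otherwise."""
    if is_admin_pkey(pkey):
        raise ValueError(f"PKey {pkey:#06x} is the admin PKey; get one from the FM")
    return pkey


print(is_admin_pkey(0xFFFF), is_admin_pkey(0x7FFF))   # True True
print(hex(check_application_pkey(0x8002)))            # 0x8002
```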

# **INTEL® FABRIC BUILDERS**

An ecosystem working together to enable world class solutions based on Intel® Omni-Path Fabric



https://fabricbuilders.intel.com/



## Intel® Omni-Path Architecture Summary

Next generation HPC fabric that builds on Intel® True Scale Fabric

- Full end-to-end solution: Switches, adapters, host software, management software, cabling, silicon
- Optimized CPU, host and fabric architecture that cost effectively scales from entry to extreme deployments
- Comprehensive & mature host software that is compatible with existing Intel True Scale Fabric and OpenFabrics Alliance\* (OFA) APIs
- Many advanced fabric administration features
- An established ecosystem and rapidly growing customer base

