Accelerate Distributed Training Performance Using Elastic Fabric Adapter (EFA) Peer Direct



Overview

Amazon Web Services (AWS)* and Intel collaborated to enable EFA Peer Direct support on Amazon Elastic Compute Cloud (Amazon EC2*) DL1 instances based on Intel® Gaudi® accelerators. This solution offers users significant improvement in multi-instance model training performance.

Multinode scale-out of Intel Gaudi accelerators through the host network interface controller (NIC) traditionally requires copying data from the device to the host CPU before sending it across the network to other devices. This adds CPU overhead, creates latency bottlenecks, and lowers training performance. The problem becomes more pronounced when scaling to larger models and more devices, which require more network communication and further impact overall model training.

To remove this overhead, we collaborated with AWS to enable EFA Peer Direct for multinode communication on Amazon EC2 DL1 instances. EFA is an Elastic Network Adapter (ENA) for Amazon EC2 instances that supports bypassing the operating system kernel to communicate directly with the NIC. Peer Direct lets the NIC read and write device memory directly, bypassing the CPU, which removes unnecessary memory copies, decreases CPU overhead, and reduces latency. EFA Peer Direct significantly improves performance for collective operations like All-Reduce and All-Gather, which are pivotal for large-scale distributed training. Support for EFA Peer Direct on Amazon EC2 DL1 instances became available with the SynapseAI* 1.8.0 release in February 2023. For more information on EFA Peer Direct, see Elastic Fabric Adapter.
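As background, All-Reduce leaves every device holding the element-wise sum of all devices' data, while All-Gather leaves every device holding the concatenation of all devices' data. A minimal pure-Python sketch of these semantics (devices simulated as lists; real implementations pipeline the transfers over the network, and with Peer Direct the data moves between device memories without staging through the host):

```python
# Toy sketch of collective-operation semantics, with "devices" simulated
# as Python lists. This illustrates what the operations compute, not how
# a real collectives library transfers the data.

def all_reduce(per_device_data):
    """Every device ends up with the element-wise sum of all inputs."""
    summed = [sum(vals) for vals in zip(*per_device_data)]
    return [list(summed) for _ in per_device_data]

def all_gather(per_device_data):
    """Every device ends up with the concatenation of all inputs."""
    gathered = [x for chunk in per_device_data for x in chunk]
    return [list(gathered) for _ in per_device_data]

# Example with 4 "devices", each holding 2 values:
devices = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(all_reduce(devices)[0])   # [16, 20]
print(all_gather(devices)[0])   # [1, 2, 3, 4, 5, 6, 7, 8]
```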

The steps to enable EFA Peer Direct are:

  1. Prepare an EFA-enabled security group.
  2. Launch Amazon EC2 DL1 instances on Intel-supported Amazon Machine Images (AMI).

    Note: Intel-supported AMIs require no additional package installation to run EFA.
     
  3. Enable EFA NICs. Amazon EC2 DL1 instances support up to four EFA-enabled NICs.
  4. Set the environment variables in the training scripts using the following code:
    RDMAV_FORK_SAFE=1
    FI_EFA_USE_DEVICE_RDMA=1
  5. Run multinode training on Amazon EC2 DL1 instances.

For more information, refer to Scale-Out via Host NIC.
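Steps 4 and 5 above can be sketched as a small launcher fragment. The environment setup is taken from the steps; the training command itself is a placeholder, since the actual invocation depends on your framework and cluster setup:

```python
import os

# Step 4: environment variables required for EFA Peer Direct.
# RDMAV_FORK_SAFE=1 keeps RDMA resources safe across fork();
# FI_EFA_USE_DEVICE_RDMA=1 tells libfabric's EFA provider to use
# device RDMA (Peer Direct) instead of staging through host memory.
def configure_efa_env(env=None):
    env = dict(os.environ if env is None else env)
    env["RDMAV_FORK_SAFE"] = "1"
    env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    return env

env = configure_efa_env({})
# Step 5 (placeholder): launch multinode training with this environment,
# e.g. subprocess.run([...], env=env) with your actual mpirun/torchrun
# command line for the DL1 cluster.
print(env["RDMAV_FORK_SAFE"], env["FI_EFA_USE_DEVICE_RDMA"])  # 1 1
```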

Collective Operations Performance on 32 Intel Gaudi Accelerators

Table 1 shows performance advances using EFA Peer Direct communication. We measured the performance of All-Reduce, All-Gather, and Reduce-Scatter collective operations for 32 MB messages on 32 Intel Gaudi accelerators (that is, four Amazon EC2 DL1 instances).

Table 1. Collective operations performance on 32 Intel Gaudi accelerators (four Amazon EC2 DL1 instances)

| Collective Operator | EFA Without Peer Direct (MB/s) | EFA with Peer Direct (MB/s) | Performance Improvement |
|---|---|---|---|
| All-Reduce | 12,110 | 21,136 | 154% |
| All-Gather | 20,299 | 41,082 | 202% |
| Reduce-Scatter | 9,249 | 17,418 | 188% |

Running with EFA Peer Direct yields a 1.5x to 2x performance increase across collective operators for 32 MB messages. Eliminating the extra copies through host memory provides a significant performance boost during training; the size of the boost depends on how many collective operations are performed, which varies by model and framework.
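As a quick sanity check on Table 1, the relative improvement is simply the ratio of the two measured bandwidths; for example, using the All-Gather and Reduce-Scatter rows:

```python
def speedup(without_pd_mbps, with_pd_mbps):
    """Ratio of Peer Direct bandwidth to the host-staged baseline."""
    return with_pd_mbps / without_pd_mbps

# All-Gather from Table 1: 20,299 MB/s -> 41,082 MB/s
print(round(speedup(20_299, 41_082), 2))  # 2.02
# Reduce-Scatter from Table 1: 9,249 MB/s -> 17,418 MB/s
print(round(speedup(9_249, 17_418), 2))   # 1.88
```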

To see the impact across small and medium message sizes, we ran an additional experiment on 32 Intel Gaudi accelerators with the All-Reduce operation. This provides additional insight across different scenarios as well as the scalability improvements gained through EFA Peer Direct.

Figure 1. Amazon EC2 DL1 All-Reduce EFA Peer Direct performance

As shown in figure 1, without EFA Peer Direct, throughput plateaus around 12 GB/s for larger message sizes, primarily because CPU overhead becomes the bottleneck. With EFA Peer Direct, the bottleneck is removed, leading to a significant throughput boost for larger message sizes (for instance, up to 1.76x for a 256 MB message size). For models with a large number of collective operations such as All-Reduce, EFA Peer Direct provides a significant improvement in inter-instance communications.
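One way to see why host staging caps large-message throughput is a toy cost model (an illustration, not a measurement from this article): if each message must first be copied through host memory and then sent over the network, the per-byte times add, so the effective bandwidth is the harmonic combination of the copy and network bandwidths and plateaus below the network peak no matter how large the message gets.

```python
def effective_bw(net_gbps, copy_gbps=None):
    """Toy model of scale-out bandwidth.

    With host staging, the per-byte cost is copy time plus network time,
    so effective bandwidth is 1 / (1/net + 1/copy). With Peer Direct the
    copy term disappears and the network peak is reachable.
    """
    if copy_gbps is None:  # Peer Direct: no host-memory staging
        return net_gbps
    return 1.0 / (1.0 / net_gbps + 1.0 / copy_gbps)

# Illustrative numbers only (assumed, not measured):
print(effective_bw(25.0))        # 25.0 -> direct path reaches the peak
print(effective_bw(25.0, 25.0))  # 12.5 -> staged path plateaus lower
```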

Accelerate Distributed Model Training with EFA Peer Direct

Next, let's look at model training performance, specifically multi-instance distributed training. Our candidate model is the Microsoft DeepSpeed* BERT 5B parameter model running on 16 Amazon EC2 DL1 instances (128 Intel Gaudi accelerators). This model uses the ZeRO-2 optimizer to reduce memory consumption at the cost of increased collective operations, a perfect candidate to demonstrate the EFA Peer Direct capabilities.

We used the model configuration shown in table 2 and measured performance with SynapseAI version 1.8.0.

Table 2. DeepSpeed BERT 5B model configuration

| Field | Value |
|---|---|
| Number of EFA NICs | 4 |
| Model | DeepSpeed BERT 5B |
| ZeRO optimization stage | ZeRO-2 |
| Global batch size | 12,288 |
| Micro batch size (per Intel Gaudi accelerator) | 24 |
| Gradient accumulation steps | 4 |
| Activation checkpointing | Yes |
| Optimizer | LANS |
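Expressed as a DeepSpeed configuration, the Table 2 settings map roughly onto the fragment below. This is a sketch using DeepSpeed's standard config fields; the optimizer entry and activation-checkpointing key are assumptions based on the table, not taken from the original training recipe. Note that the global batch size is consistent with the other settings: 24 (micro batch) x 4 (gradient accumulation) x 128 (accelerators) = 12,288.

```python
import json

# Hypothetical DeepSpeed config consistent with Table 2.
ds_config = {
    "train_batch_size": 12288,             # global batch size
    "train_micro_batch_size_per_gpu": 24,  # per Intel Gaudi accelerator
    "gradient_accumulation_steps": 4,
    "zero_optimization": {"stage": 2},     # ZeRO-2
    "optimizer": {"type": "LANS"},         # assumed mapping from Table 2
    "activation_checkpointing": {"partition_activations": True},
}

# Global batch must equal micro-batch x grad-accum x device count:
world_size = 16 * 8  # 16 DL1 instances x 8 Gaudi accelerators each
assert ds_config["train_batch_size"] == 24 * 4 * world_size

print(json.dumps(ds_config, indent=2))
```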

 

The following table shows the model throughput performance with and without Peer Direct enabled.

Table 3. DeepSpeed BERT 5B performance results

| Scale-Out Mode | Throughput (Samples per Second) | Performance Improvement |
|---|---|---|
| Without Peer Direct | 391.44 | (baseline) |
| With Peer Direct | 674.73 | 153% |

We see that EFA Peer Direct increases model throughput by about 1.5x, enabling faster training and convergence with minimal additional effort.
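A throughput gain translates directly into wall-clock savings for a fixed amount of training work; the conversion is simple arithmetic (not an additional measurement):

```python
def time_saving(speedup):
    """Fraction of training time saved for a given throughput speedup,
    assuming the same total number of samples must be processed."""
    return 1.0 - 1.0 / speedup

# A ~1.5x throughput improvement cuts ideal training time by about a third:
print(round(time_saving(1.5), 2))  # 0.33
```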

EFA Peer Direct on Amazon EC2 DL1 instances based on Intel Gaudi accelerators provides significant performance improvements for multi-instance distributed training jobs.

Take advantage of the benefits of EFA Peer Direct on Amazon EC2 DL1 instances with Intel Gaudi accelerators.