Overview
Amazon Web Services (AWS)* and Intel collaborated to enable EFA Peer Direct support on Amazon Elastic Compute Cloud (Amazon EC2*) DL1 instances based on Intel® Gaudi® accelerators. This solution offers users a significant improvement in multi-instance model training performance.
Multinode scale-out of Intel Gaudi accelerators via the host network interface controller (NIC) traditionally requires copying memory from the device to the host CPU and then across the network to other devices. This increases CPU overhead, creates latency bottlenecks, and lowers training performance. The issue becomes more pronounced when scaling up to larger models and more devices, which require more network communication and therefore have a greater impact on overall model training.
To remove this overhead, we collaborated with AWS to enable EFA Peer Direct for multinode communication on Amazon EC2 DL1 instances. EFA is an Elastic Network Adapter (ENA) with added capabilities for Amazon EC2 DL1 instances: it allows applications to bypass the operating system kernel and communicate directly with the NIC. Peer Direct lets the NIC read and write accelerator memory directly, bypassing the CPU. This removes unnecessary memory copies, decreases CPU overhead, and reduces latency. EFA Peer Direct significantly improves performance for collective operations like All-Reduce and All-Gather, which are pivotal for large-scale distributed training. Support for EFA Peer Direct on Amazon EC2 DL1 instances was introduced in the SynapseAI* 1.8.0 release in February 2023. For more information on EFA Peer Direct, see Elastic Fabric Adapter.
The steps to enable EFA Peer Direct are:
- Prepare an EFA-enabled security group.
- Launch Amazon EC2 DL1 instances from Intel-supported Amazon Machine Images (AMIs).
  Note: Intel-supported AMIs require no additional packages to be installed to run EFA.
- Enable the EFA NICs. Amazon EC2 DL1 instances support up to four EFA-enabled NICs.
- Set the following environment variables in the training scripts: `RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1`
- Run multinode training on Amazon EC2 DL1 instances (a minimal training-script sketch follows these steps).
For more information, refer to Scale-Out via Host NIC.
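As a concrete illustration of the last two steps, here is a minimal sketch of how a PyTorch training script on a DL1 instance might set these variables and initialize multinode communication. It assumes a SynapseAI PyTorch installation that provides the habana_frameworks package with its HCCL backend; in practice, the environment variables are typically exported by the launcher (for example, passed through mpirun) rather than set inside the script.

```python
import os

# EFA Peer Direct settings from the step above; setdefault keeps any values
# already exported by the launcher.
os.environ.setdefault("RDMAV_FORK_SAFE", "1")
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")

import torch
import torch.distributed as dist

# SynapseAI's HCCL integration for PyTorch; initialize_distributed_hpu() reads
# the rank and world-size information provided by the launcher.
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

device = torch.device("hpu")  # Gaudi device exposed by habana_frameworks

# ... build the model on the "hpu" device and run the usual training loop ...
```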
Collective Operations Performance on 32 Intel Gaudi Accelerators
Table 1 shows the performance gains from EFA Peer Direct communication. We measured the performance of the All-Reduce, All-Gather, and Reduce-Scatter collective operations for 32 MB messages on 32 Intel Gaudi accelerators (that is, four Amazon EC2 DL1 instances).
Table 1. Collective operations performance on 32 Intel Gaudi accelerators (four Amazon EC2 DL1 instances)
| Collective Operator | Performance Using EFA Networking Without Peer Direct (MB/s) | Performance Using EFA with Peer Direct (MB/s) | Performance Improvement with EFA Peer Direct |
|---|---|---|---|
| All-Reduce | 12,110 | 21,136 | 154% |
| All-Gather | 20,299 | 41,082 | 202% |
| Reduce-Scatter | 9,249 | 17,418 | 188% |
Running with EFA Peer Direct enables a performance increase of 1.5x to 2x across the different collective operators for 32 MB message sizes. Eliminating the extra copies to host memory provides a significant performance boost during training; the degree of improvement depends on the number of collective operations performed and varies by model and framework.
To see the impact across small and medium message sizes, we ran an additional experiment on 32 Intel Gaudi accelerators with the All-Reduce operation. This provides additional insight into different scenarios and into the scalability improvements gained through EFA Peer Direct.
Figure 1. Amazon EC2 DL1 All-Reduce EFA Peer Direct performance
As shown in figure 1, without EFA Peer Direct, throughput plateaus at approximately 12 GB/s for larger message sizes, primarily because CPU overhead becomes a bottleneck and hinders overall performance. With EFA Peer Direct, that bottleneck is removed, leading to a significant boost in throughput for larger message sizes (for instance, up to 1.76x for a 256 MB message size). For models with a large number of collective operations such as All-Reduce, EFA Peer Direct provides a significant improvement in inter-instance communication.
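As a rough illustration of how such numbers can be gathered, the sketch below times All-Reduce on Gaudi devices using PyTorch's hccl backend and reports an approximate bandwidth per message size. It is a simplified micro-benchmark under assumed settings (a SynapseAI PyTorch environment with habana_frameworks, lazy-mode execution flushed with mark_step(), and a launcher that supplies rank information); it is not the harness used to produce the figures above.

```python
import os
import time

os.environ.setdefault("RDMAV_FORK_SAFE", "1")
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")

import torch
import torch.distributed as dist
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.hpu as hthpu
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
device = torch.device("hpu")

for size_mb in (1, 4, 32, 256):
    numel = size_mb * 1024 * 1024 // 4  # float32 elements for a size_mb MB buffer
    buf = torch.ones(numel, dtype=torch.float32, device=device)

    # Warm up, flush the lazy-mode graph, and align all ranks before timing.
    for _ in range(5):
        dist.all_reduce(buf)
    htcore.mark_step()
    hthpu.synchronize()
    dist.barrier()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    htcore.mark_step()   # flush queued work (lazy mode)
    hthpu.synchronize()  # wait for the device before stopping the timer
    elapsed = time.perf_counter() - start

    if rank == 0:
        # Approximate per-rank algorithm bandwidth: message MB processed per second.
        mb_per_s = (size_mb * iters) / elapsed
        print(f"All-Reduce {size_mb} MB: {mb_per_s:,.0f} MB/s")
```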
Accelerate Distributed Model Training with EFA Peer Direct
Next, let's look at model training performance, specifically multi-instance distributed training. Our candidate model is the Microsoft DeepSpeed* BERT 5B-parameter model running on 16 Amazon EC2 DL1 instances (128 Intel Gaudi accelerators). This model uses the ZeRO-2 optimizer to reduce memory consumption at the cost of additional collective operations, making it a perfect candidate to demonstrate EFA Peer Direct capabilities.
We used the model configuration shown in table 2 and measured performance with SynapseAI version 1.8.0.
Table 2. DeepSpeed BERT 5B model configuration
| Field | Value |
|---|---|
| Number of EFA NICs | 4 |
| Model | DeepSpeed BERT 5B |
| ZeRO optimization stage | ZeRO-2 |
| Global batch size | 12,288 |
| Micro batch size (per Intel Gaudi accelerator) | 24 |
| Gradient accumulation steps | 4 |
| Activation checkpointing | Yes |
| Optimizer | LANS |
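For reference, the settings in table 2 map onto a DeepSpeed configuration roughly as sketched below. This is an illustrative sketch only: the key names come from the public DeepSpeed configuration schema, the build_bert_5b and build_lans_optimizer helpers are hypothetical placeholders, and the actual DeepSpeed BERT 5B recipe may organize its configuration and activation checkpointing differently.

```python
# Illustrative DeepSpeed configuration sketch mirroring table 2; not the exact
# recipe used for the published measurements.
ds_config = {
    # 24 micro batch x 4 gradient-accumulation steps x 128 accelerators = 12,288.
    "train_batch_size": 12288,
    "train_micro_batch_size_per_gpu": 24,
    "gradient_accumulation_steps": 4,
    # ZeRO-2 partitions optimizer states and gradients across ranks, trading
    # memory savings for extra collective communication, which is exactly the
    # traffic that EFA Peer Direct accelerates.
    "zero_optimization": {"stage": 2},
}

# The LANS optimizer is assumed to be constructed in the training script and
# passed to DeepSpeed, for example:
#
#   model = build_bert_5b()                  # hypothetical model builder
#   optimizer = build_lans_optimizer(model)  # hypothetical LANS constructor
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, optimizer=optimizer, config=ds_config
#   )
```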
The following table shows the model throughput performance with and without Peer Direct enabled.
Table 3. DeepSpeed BERT 5B performance results
| Scale-Out Mode | Throughput (Samples per Second) | Performance Improvement |
|---|---|---|
| Without Peer Direct | 391.44 | |
| With Peer Direct | 674.73 | 153% |
We see that with EFA Peer Direct, model throughput increases by about 1.5x. This improvement enables faster training and convergence with minimal additional effort.
EFA Peer Direct on Amazon EC2 DL1 instances based on Intel Gaudi accelerators provides significant performance improvements for multi-instance distributed training jobs.
Take advantage of the benefits of EFA Peer Direct on Amazon EC2 DL1 instances with Intel Gaudi accelerators.
- To get started with EFA on Amazon EC2, see Get Started.
- To access the first-generation Intel Gaudi accelerator on an Amazon EC2 DL1 instance, see Amazon EC2 DL1 Instances.