Intel® oneAPI Collective Communications Library
Scalable & Efficient Distributed Training for Deep Neural Networks
Implement Multi-Node Communication Patterns
The Intel® oneAPI Collective Communications Library (oneCCL) helps developers and researchers train newer and deeper models more quickly. It does this by using optimized communication patterns to distribute model training across multiple nodes.
The library is designed for easy integration into deep learning frameworks, whether you are implementing them from scratch or customizing existing ones.
- Built on top of lower-level communication middleware. Message Passing Interface (MPI) and libfabric transparently support many interconnects, such as Cornelis Networks*, InfiniBand*, and Ethernet.
- Optimized for high performance on Intel CPUs and GPUs.
- Lets you trade compute resources for communication performance to improve the scalability of communication patterns.
- Enables efficient implementations of collectives that are heavily used for neural network training, including all-gather, all-reduce, and reduce-scatter.
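To make the three collectives named above concrete, here is a minimal pure-Python sketch of what each one computes. It is a single-process simulation, not the oneCCL API: the function names and the list-of-lists layout (one inner list per rank) are assumptions made for illustration only.

```python
from typing import List

# buffers[r] is the data held by rank r (hypothetical layout for this sketch).
Buffers = List[List[float]]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends up with the concatenation of all ranks' data."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def reduce_scatter(buffers: Buffers) -> Buffers:
    """The element-wise sum is split into equal chunks, one chunk per rank."""
    n = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    # This sketch assumes the buffer length divides evenly among the ranks.
    assert len(total) % n == 0, "buffer length must be divisible by rank count"
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


# Two ranks, each holding a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0]]
reduced = all_reduce(grads)
scattered = reduce_scatter(grads)
```

In this model, a reduce-scatter followed by an all-gather gives the same result as an all-reduce. That is one reason the three operations tend to appear together in distributed training.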
Download as Part of the Toolkit
oneCCL is included as part of the Intel® oneAPI Base Toolkit, which is a core set of tools and libraries for developing high-performance, data-centric applications across diverse architectures.
Download the Stand-Alone Version
A stand-alone download of oneCCL is available. You can download binaries from Intel or choose your preferred repository.
Develop in the Intel® Tiber™ Developer Cloud
Build and optimize oneAPI multiarchitecture applications using the latest Intel-optimized oneAPI and AI tools, and test your workloads across Intel® CPUs and GPUs. No hardware installations, software downloads, or configuration necessary.
Help oneCCL Evolve
oneCCL is part of the oneAPI industry standards initiative. We welcome you to participate.
Features
Common APIs to Support Deep Learning Frameworks
oneCCL exposes a collective API that supports:
- Commonly used collective operations found in deep learning and machine learning workloads
- Interoperability with SYCL* from the Khronos* Group
Deep Learning Optimizations
The runtime implementation enables several optimizations, including:
- Asynchronous progress for compute-communication overlap
- Dedication of one or more cores to ensure optimal network use
- Message prioritization, persistence, and out-of-order execution
- Collectives in low-precision data types
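Two of the optimizations listed above, compute-communication overlap and low-precision collectives, can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not how the oneCCL runtime is implemented: the all-reduce is simulated in-process, a background thread stands in for asynchronous progress, and the names `to_bf16`, `all_reduce_sum`, and `train_step` are hypothetical.

```python
import struct
from concurrent.futures import ThreadPoolExecutor
from typing import List


def to_bf16(x: float) -> float:
    """Truncate a value to bfloat16 precision.

    Keeps only the top 16 bits of the float32 encoding (truncation, not
    round-to-nearest), which halves the bytes that would be sent.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def all_reduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Simulated all-reduce: element-wise sum over all ranks' buffers."""
    return [sum(vals) for vals in zip(*per_rank)]


def train_step(layer_grads: List[List[List[float]]]) -> List[List[float]]:
    """layer_grads[l][r] is rank r's gradient for layer l, in backward order.

    Each layer's all-reduce is submitted to a background worker as soon as
    its gradient is ready, so the main thread is free to continue with the
    backward pass for the next layer while communication is in flight.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = []
        for per_rank in layer_grads:
            # Compress to low precision before "sending".
            compressed = [[to_bf16(g) for g in buf] for buf in per_rank]
            pending.append(pool.submit(all_reduce_sum, compressed))
            # The backward compute for the next layer would run here.
        # Wait for all outstanding collectives before the optimizer step.
        return [f.result() for f in pending]


# Two layers, two ranks.
result = train_step([[[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.25]]])
```

The single worker thread plays the role of a core dedicated to communication. The values in the example are exactly representable in bfloat16, so the reduced sums are exact. In general, truncation introduces a small relative error in exchange for the reduced message size.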
Documentation & Code Samples
Documentation
Code Samples
View All Code Samples (GitHub)
Training
Understanding oneCCL
oneAPI Collective Communications Library [5:07]
Distributed AI Acceleration
Accelerate Distributed AI with a oneCCL Framework [3:24]
Distributed Deep Learning Optimization
Optimize a Deep Learning Recommendation Model by Using PyTorch* with a oneCCL Back End
Efficient Model Training on Multiple CPUs
Specifications
Processors:
- Intel® Xeon® processors
GPUs:
- Intel® Processor Graphics Gen9 and higher
- Xe Architecture
Memory:
- Dynamic RAM
- Intel® Optane™ DC persistent memory
Operating system:
- Linux*
Target operating system:
- Linux
Languages:
- SYCL
- C and C++
For more information, see the system requirements.
Compilers:
- GNU Compiler Collection (GCC)*
- Intel® oneAPI DPC++/C++ Compiler
Distributed environments:
- OFI
Get Help
Your success is our success. Access these forum and GitHub resources when you need assistance.