Get Started Guide

  • 2021.3
  • 06/28/2021
  • Public Content

Get Started with Intel® oneAPI Collective Communications Library

Intel® oneAPI Collective Communications Library (oneCCL) is a scalable and high-performance communication library for Deep Learning (DL) and Machine Learning (ML) workloads. It develops the ideas that originated in the Intel® Machine Learning Scaling Library and expands the design and API to encompass new features and use cases.

oneCCL features include:
  • Built on top of lower-level communication middleware: MPI and libfabric.
  • Optimized to drive scalability of communication patterns by allowing you to easily trade off compute for communication performance.
  • Enables a set of DL-specific optimizations, such as prioritization, persistent operations, and out-of-order execution.
  • Works across various interconnects: Intel® Omni-Path Architecture, InfiniBand*, and Ethernet.
  • Provides a common API suitable for popular Deep Learning frameworks (Caffe*, nGraph*, Horovod*, etc.).

Before You Begin

Before you start using oneCCL, make sure to set up the library environment. There are two ways to do it:
  1. Using the standalone oneCCL package:
    source <install-dir>/bin/cclvars.sh
  2. Using oneCCL from the Intel® oneAPI Base Toolkit:
    source <install-dir>/setvars.sh
Here <install-dir> is the oneCCL installation directory. By default, oneCCL is installed in /opt/intel/oneapi.
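
After sourcing either script, the CCL_ROOT environment variable should point to the oneCCL installation directory; the build command in the Sample Application section below relies on it. A quick sanity check (assuming a bash-compatible shell):

echo ${CCL_ROOT}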

System Requirements

Refer to the oneCCL System Requirements page.

Basic Usage Scenario

Below is a generic flow for using the C++ API of oneCCL:
  1. Initialize the library:
    ccl::environment::instance();
    Alternatively, you can just create communicator objects:
    ccl::communicator_t comm = ccl::environment::instance().create_communicator();
  2. Execute the collective operation of your choice on this communicator:
    auto request = comm.allreduce(...);
    request->wait();
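
The C API used in the Sample Application below follows the same initialize/execute/wait flow. The following sketch is illustrative only, not part of the official documentation: it reuses the calls that appear in the sample (ccl_init, ccl_get_comm_rank, ccl_get_comm_size, ccl_allreduce, ccl_wait, ccl_finalize) and assumes that NULL attribute, communicator, and stream arguments are acceptable for plain host buffers:

#include <stdio.h>
#include "ccl.h"

#define COUNT 128

int main(void)
{
    size_t rank = 0, size = 0;
    int sendbuf[COUNT], recvbuf[COUNT];
    ccl_request_t request;

    ccl_init();                      /* step 1: initialize the library */
    ccl_get_comm_rank(NULL, &rank);  /* NULL selects the global communicator */
    ccl_get_comm_size(NULL, &size);

    for (int i = 0; i < COUNT; i++) {
        sendbuf[i] = (int)rank;      /* each rank contributes its own rank value */
    }

    /* step 2: run the collective; a NULL stream is assumed for host memory */
    ccl_allreduce(sendbuf, recvbuf, COUNT, ccl_dtype_int, ccl_reduction_sum,
                  NULL, NULL, NULL, &request);
    ccl_wait(request);

    /* every element should now hold the sum of all rank values */
    printf("rank %zu: recvbuf[0] = %d\n", rank, recvbuf[0]);

    ccl_finalize();
    return 0;
}

Like the full sample, this would be built with clang++ and launched with mpiexec.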

Sample Application

Below is a complete sample that shows how the oneCCL API can be used to perform allreduce communication for SYCL* buffers:
#include <iostream>
#include <stdio.h>

#include <CL/sycl.hpp>

#include "ccl.h"

#define COUNT     (10 * 1024 * 1024)
#define COLL_ROOT (0)

using namespace std;
using namespace cl::sycl;
using namespace cl::sycl::access;

int main(int argc, char** argv)
{
    int i = 0;
    size_t size = 0;
    size_t rank = 0;

    cl::sycl::queue q;
    cl::sycl::buffer<int, 1> sendbuf(COUNT);
    cl::sycl::buffer<int, 1> recvbuf(COUNT);

    ccl_request_t request;
    ccl_stream_t stream;

    ccl_init();
    ccl_get_comm_rank(NULL, &rank);
    ccl_get_comm_size(NULL, &size);

    // create CCL stream based on SYCL* command queue
    ccl_stream_create(ccl_stream_sycl, &q, &stream);

    /* open sendbuf and initialize it on the CPU side */
    auto host_acc_sbuf = sendbuf.get_access<mode::write>();
    for (i = 0; i < COUNT; i++) {
        host_acc_sbuf[i] = rank;
    }

    /* open sendbuf and modify it on the target device side */
    q.submit([&](cl::sycl::handler& cgh) {
        auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
        cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
            dev_acc_sbuf[id] += 1;
        });
    });

    /* invoke ccl_allreduce on the CPU side */
    ccl_allreduce(&sendbuf, &recvbuf, COUNT, ccl_dtype_int, ccl_reduction_sum,
                  NULL, NULL, stream, &request);
    ccl_wait(request);

    /* open recvbuf and check its correctness on the target device side */
    q.submit([&](handler& cgh) {
        auto dev_acc_rbuf = recvbuf.get_access<mode::write>(cgh);
        cgh.parallel_for<class allreduce_test_rbuf_check>(range<1>{COUNT}, [=](item<1> id) {
            if (dev_acc_rbuf[id] != size * (size + 1) / 2) {
                dev_acc_rbuf[id] = -1;
            }
        });
    });

    /* print out the result of the test on the CPU side */
    if (rank == COLL_ROOT) {
        auto host_acc_rbuf_new = recvbuf.get_access<mode::read>();
        for (i = 0; i < COUNT; i++) {
            if (host_acc_rbuf_new[i] == -1) {
                cout << "FAILED" << endl;
                break;
            }
        }
        if (i == COUNT) {
            cout << "PASSED" << endl;
        }
    }

    ccl_stream_free(stream);
    ccl_finalize();

    return 0;
}
Intel® MPI Library is required to run this sample.
To build and run the sample:
  1. Build oneCCL with SYCL* support.
  2. Set up the library environment (see Before You Begin).
  3. Use the clang++ compiler to build the sample:
    clang++ -I${CCL_ROOT}/include -L${CCL_ROOT}/lib/ -lsycl -lccl -o ccl_sample ccl_sample.cpp
  4. Use the following command to run the application sample:
    mpiexec <parameters> ./ccl_sample
    where <parameters> stands for optional mpiexec parameters such as node count, processes per node, hosts, and so on.
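
For example, a minimal local run on two processes (using the standard mpiexec -n option to set the total process count) would be:

mpiexec -n 2 ./ccl_sample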

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.