Added Capabilities Ease Parallel Programming across Heterogeneous Architectures
In past decades, “one workload, one processor” was the conventional wisdom. Today, that legacy perception is falling by the wayside with the rise of heterogeneous computing. The latest technologies can now provide the exceptional speed and scalability levels necessary to usher in an era of high-performance computing (HPC), AI, advanced rendering, and IoT systems. While this evolution creates enormous opportunities across the industry, it also comes with application development challenges. The industry is responding with the oneAPI initiative to facilitate software development productivity while delivering performance across different architectures.
Data-centric workloads and applications have continued to diversify in recent years. Different hardware choices come with separate silos of tools and development languages. Proprietary solutions and the variety of development languages limit the amount of reusable code. Data Parallel C++ (DPC++) offers an evolution of ISO C++ to help alleviate that issue. It increases coding productivity and changes the paradigm for programming across diverse architectures. oneAPI, with DPC++ as its primary language, makes programming across those architectures much easier.
Open specifications are critical. Without them, developers cannot predict a language’s evolution over time and its future compatibility. Therefore, innovation requires open development. When “walled gardens” are in place, innovation stagnates. The walls also restrict performance across architectures. C++ is a broadly adopted and robust language. However, it has some limitations because it has been defined primarily with CPUs in mind. Today’s standard cannot accommodate heterogeneous programming without additions.
The Khronos Group* led the innovation of the OpenCL™ standard, enabling low-level and detailed programming across multiple architectures. Over time, though, members of the consortium for the OpenCL standard developed their own extensions, so the standard effectively diverged, and proprietary languages held sway. Based on experience with the OpenCL standard, the Khronos Group innovated to simplify the process further with the SYCL* programming model. Like the OpenCL standard, it supports heterogeneous programming, but it uses standard C++ constructs with single-source host and accelerator code.
Joe Curley, senior director for oneAPI products at Intel, said, “Accelerated computing has diversified over the past several years given advances in CPU, GPU, FPGA, and AI technologies. This innovation drives the need for an open and cross-platform language that allows developers to realize the potential of new hardware, minimizes development cost and complexity, and maximizes reuse of their software investments.”
Today, open-source implementations of SYCL from the Khronos Group support hardware from vendors like Advanced Micro Devices (AMD*), Arm Mali, NVIDIA*, and Intel.
Ease Programming for Heterogeneous Architectures
DPC++, a cross-architecture language that is part of the industry’s oneAPI initiative, complements existing efforts to ease the heterogeneity challenge. DPC++ combines three elements (C++, SYCL, and extensions) to facilitate programming across architectures. It also serves as a standards-based language implementation that opens the door to introducing new features in future revisions of the SYCL specification. Intel, as a key participant in that effort, is also building implementations of oneAPI.
Joe Curley reinforced the commitment to openness in the language. “DPC++ builds on open standards: C++ and SYCL. But the language also provides the ability to rapidly experiment and innovate through extensions, develop them, and establish a virtuous cycle into open standards bodies—like [SYCL from the Khronos Group]. DPC++ provides an open mechanism for the developers to evolve data parallel programming rapidly.”
DPC++ adoption is gaining steam already, as illustrated by the recently announced DPC++ compiler for NVIDIA GPUs from Codeplay*.
New DPC++ Extensions Unleash SYCL and C++
DPC++ offers nearly 30 extensions to augment tools. The provisional SYCL 2020 specification incorporates many of them:
- Unified Shared Memory (USM) defines pointer-based memory accesses and management interfaces. It provides the ability to create allocations that are visible and have consistent pointer values across hosts and devices. Different USM capability levels are defined, corresponding to varying degrees of device and implementation support.
- In-order queues define semantics for queues to streamline common coding patterns.
- Optional lambda name eliminates the need to name kernel lambdas manually. It also simplifies coding and enables composability with libraries. You still have the option to name lambdas manually in scenarios like debugging or interfacing with a sycl::program object.
- Deduction guides simplify common code patterns and reduce code verbosity and length by enabling Class Template Argument Deduction (CTAD) from modern C++.
- Reductions improve productivity with a common reduction pattern without explicit coding. Building them into the language enables optimized implementations to exist for combinations of device, runtime, and reduction properties.
- Subgroups define a grouping of work items within a work group. Work items in a subgroup can synchronize independently of work items in other subgroups. In addition, subgroups, which commonly map to SIMD hardware, expose communication operations across the work items in the subgroup.
- Subgroup algorithms define the collective operations across work items in a subgroup that are available only for subgroups. They also enable algorithms from the more generic group algorithm extension as subgroup aggregate operations.
- Enqueued barriers ease dependence creation and tracking for some common programming patterns. This benefit allows coarser-grained synchronization within a queue without the need for manual creation of fine-grained dependencies.
- Extended atomics offer atomic operations aligned with C++20, including support for floating-point types and shorthand operators.
- Group algorithms define collective operations that work across groups of work items, including broadcast, reduce, and scan. They improve productivity by providing algorithms that do not need explicit coding, and they allow optimized implementations to exist for combinations of device and runtime.
- Group mask defines a type that can represent a set of work items from a group and also collective operations that create or operate on that type, such as ballot and count.
- Restrict all arguments defines an attribute that can apply to kernels (including lambda definitions of kernels). The restriction signals that there is no memory aliasing between any pointer arguments that are passed to or captured by a kernel. When the developer knows more about the kernel arguments than a compiler can infer or safely assume, this optimization attribute is most beneficial.
- Relaxed data layout removes the requirement of C++ standard layout types for data shared by a host and devices. It requires device compilers to validate layout compatibility too.
- Queue shortcuts define kernel invocation functions directly on the queue classes. When dependencies and accessors do not need to be created within the additional command group scope, queue shortcuts simplify code patterns.
- Required work-group size defines an attribute that can apply to kernels (including lambda definitions of kernels). The attribute signals that the kernel will be invoked only with a specific work-group size, which enables optimizations based on this additional user-provided information.
- Data flow pipes enable efficient first-in, first-out (FIFO) communication in DPC++, a mechanism commonly used when describing algorithms for spatial architectures such as FPGAs.
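To make the deduction guides item in the list above more concrete, here is a minimal sketch of the C++17 mechanism it builds on, Class Template Argument Deduction. It deliberately uses plain standard C++ rather than SYCL headers, so it should build with any C++17 compiler. `Buffer` and `sum_with_ctad` are hypothetical names introduced only for this illustration; `Buffer` is a stand-in for a buffer-like class, not the actual SYCL type.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a buffer-like class (assumption: not sycl::buffer).
template <typename T>
class Buffer {
public:
    // Copies the elements of an existing vector. The compiler derives an
    // implicit deduction guide from this constructor, so T is deduced.
    explicit Buffer(const std::vector<T>& src) : data_(src) {}

    // Copies the elements of an iterator range. T does not appear in the
    // parameter list, so this constructor alone cannot drive deduction.
    template <typename It>
    Buffer(It first, It last) : data_(first, last) {}

    std::size_t size() const { return data_.size(); }

    // Sums all stored elements, starting from a value-initialized T.
    T sum() const { return std::accumulate(data_.begin(), data_.end(), T{}); }

private:
    std::vector<T> data_;
};

// User-defined deduction guide: maps an iterator pair to its element type.
template <typename It>
Buffer(It, It) -> Buffer<typename std::iterator_traits<It>::value_type>;

// Builds two buffers without writing Buffer<int> and returns their combined sum.
inline int sum_with_ctad() {
    std::vector<int> v{1, 2, 3, 4};

    Buffer from_vector{v};                  // deduced through the implicit guide
    Buffer from_range{v.begin(), v.end()};  // deduced through the user-defined guide

    static_assert(std::is_same_v<decltype(from_vector), Buffer<int>>);
    static_assert(std::is_same_v<decltype(from_range), Buffer<int>>);
    assert(from_vector.size() == from_range.size());

    return from_vector.sum() + from_range.sum();
}
```

Neither declaration in `sum_with_ctad` spells out the element type, which is the reduction in verbosity the list item describes. A library class that ships deduction guides can be declared in the same style.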
“We've been working closely with Intel on defining oneAPI and using oneAPI for our own internal development and testing,” explained Hal Finkel, lead for Compiler Technology and Programming Languages at Argonne National Laboratory’s Leadership Computing Facility. “oneAPI provides extended capabilities, such as supporting unified memory and reductions, above what is available in the current SYCL 1.2.1 specification, and these capabilities are essential for us. Our development of a Kokkos back end for DPC++ [and] oneAPI, for example, relies on these additional features. We're looking forward to updates to the SYCL specification, which we trust will contain important new features from DPC++ that address specific needs identified during these development activities.”
Added Joe Curley, “In only a few months, the DPC++ community has made enormous progress in both language design, architecture, and implementation. We encourage the community to join in the effort to open accelerated programming.”