# Intel® oneAPI DPC++/C++ Compiler Release Notes

Version: 2023.0   Published: 11/02/2020

Last Updated: 02/09/2023

This document summarizes new and changed product features and includes notes about features and problems not described in the product documentation.

## oneAPI 2023.0, Compiler Release 2023.0

### New Features and Improvements

• The compiler now uses C++17 as the default C++ language standard. To use an older standard, specify it with a compiler option; for example, use -std=c++14 to select C++14.
• Added support for the FPGA IP authoring flow. It allows you to target your SYCL* code to generate standalone IP components on different targets and integrate them into a custom Intel® Quartus® Prime project. You can target your compilation to a supported Intel® FPGA device family or part number instead of a specific acceleration platform.
• FPGA optimization reports now support user-defined loop labels replacing the system-generated loop labels. For example:
LOOP1: for( int i = 0; i < 12; i++ ) {
...
}

• Added support for the standalone Intel® oneAPI FPGA Reports tool.
• Added support for using latency controls with a stall-free loop in FPGA.
• Added support to view simulation waveforms in the simulators supported by FPGA.
• Added ability to enforce stateless memory accesses for ESIMD.
• Added support for -fsycl-force-target compiler option.
• Added support for -fsycl-link-huge-device-code compiler option, which allows linking object files larger than 2GB.
• Implemented group collective built-in functions for more integral types.
• Implemented SYCL 2020 callable device selectors.
• Implemented SYCL 2020 standalone device selectors.
• Added SYCL 2020 property interfaces for local_accessor, usm_allocator, accessor, and host_accessor classes.
• Added support for fpga_simulator_selector
• Added support for local_accessor. Deprecated target::local.
• Added support for querying free device memory on Level Zero backend.
• Implemented bfloat16 conversions from/to float for host.
• Added support for ext::oneapi::property::queue::discard_events to Level Zero PI plugin.
• Added lsc_atomic support on ESIMD emulator.
• Added dpas support on ESIMD emulator.
• Added C++ API for imf libdevice built-ins.
• Introduced predicates for ESIMD lsc_block_store/load.
• Added experimental set_kernel_properties API and use_double_grf property for ESIMD.
• Added "eager initialization" mode to Level Zero PI plugin. It might result in unnecessary work done by the plugin, but it ensures the fastest possible execution on hot and reportable paths.
• Implemented group::get_linear_id(int) method.
• Ensured that the correct errc is thrown for an unassociated placeholder accessor.
• Removed dependency on OpenCL ICD Loader from the runtime.
• Added support for ZEBIN format to persistent caching mechanism.
• Added identification mechanism for binaries in the newer ZEBIN format.
• Switched to use struct information descriptors in accordance with SYCL 2020. Removed some deprecated information queries.
• Updated kernel_device_specific::max_sub_group_size query to match SYCL 2020 spec. Deprecated the old variant.
• Deprecated SYCL 1.2.1 device selectors.
• Improved error messages reported for unsupported device partitioning.
• Made device and platform default to default_selector_v
• Deprecated address_space::constant_space
• Marked sycl::exception::has_context as noexcept
• Improved range reduction performance on CPU.
• Made sycl::exception nothrow copy constructible.
• Marked has_property methods as noexcept
• Improved sycl::event::get_profiling_info exception message when event is default constructed.
• Added a diagnostic (in the form of static_assert) about kernel lambda size mismatch between host and device.
• Updated pipes class to throw exceptions if used on the host.
• Updated ESIMD Emulator PI plugin to report support for cl_khr_fp64 extension.
• Updated Level Zero plugin to prefer copy engine for memory read/write operations.
• Optimized some memory transfers.
• Enabled event caching in the Level Zero PI plugin.
• Optimized some reductions for parallel_for accepting sycl::range for discrete GPUs.
• Added ability to use descendent devices of context members within that context. Not supported with the OpenCL backend yet.
• Limited allowed argument types for rol/ror ESIMD functions to better represent HW capabilities.
• Implemented lazy mechanism of setting the context for default-constructed events.
• Improved performance for multi-dimensional accessors with multiple accesses in a kernel.
• Increased max _BitInt size to 4096 for FPGA target.
• Removed deprecation message for [[intel::disable_loop_pipelining]] attribute.
• Allowed __builtin_assume_aligned to be called from device code.
• Improved link step performance when per_kernel device code split is used.
• Added support for SYCL_EXTERNAL on device_global variables.
• Updated __builtin_intel_fpga_mem to accept more parameters.
• Updated ivdep attribute to allow safelen = 0
• Improved linking with sycl.lib on Windows.
• Implemented more diagnostics for incorrect device_global usages.
• Improved library resolution for libsycl.so.
• Improved diagnostics when linking with mismatched objects.
• Added a warning for floating-point size changes after implicit conversions.
• Made invoke_simd convert its argument to appropriate types.
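
As a rough illustration of the C++17 default noted in the first item of this list, the following sketch uses a few C++17-only constructs (`if constexpr`, `std::optional`, structured bindings). It is plain host C++ with hypothetical helper names (`describe`, `find_even`, `sum_pair`) and is not taken from the product documentation. Under the new default it should build without any -std option, and it would be expected to fail with -std=c++14.

```cpp
#include <optional>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper: selects a label at compile time with `if constexpr` (C++17).
template <typename T>
std::string describe(const T&) {
  if constexpr (std::is_integral_v<T>) {
    return "integral";
  } else {
    return "other";
  }
}

// Hypothetical helper: returns the first even value, if any, via std::optional (C++17).
inline std::optional<int> find_even(const std::vector<int>& values) {
  for (int v : values) {
    if (v % 2 == 0) return v;
  }
  return std::nullopt;
}

// Hypothetical helper: structured bindings (C++17) unpack a pair in one declaration.
inline int sum_pair(const std::pair<int, int>& p) {
  auto [a, b] = p;
  return a + b;
}
```

For example, `describe(42)` yields "integral", and `find_even({1, 3})` yields an empty optional.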

### Bug Fixes

• Removed deprecated kernel::get_work_group_info.
• Removed deprecated get_native class method.
• Removed support for intel::fpga_pipeline attribute.
• Added MAJOR_VERSION to the name of the SYCL library on Windows.
• Removed sycl::program class.
• Removed ext::oneapi::reduction
• Removed deprecated address_space enum values.
• Removed event::get method.
• Removed using namespace experimental inside ext::intel
• Made intel-specific device info descriptors namespace-qualified.
• Removed deprecated make_queue API.
• Aligned return types of sycl::get_native and interop::get_native_mem functions to be in conformance with SYCL 2020 spec.
• Aligned sycl::buffer_allocator interface with SYCL 2020 spec.
• Removed cl namespace from sycl/sycl.hpp header.
• Dropped support for compiling SYCL in less than C++17 mode.
• Many other ABI-breaking changes resulting from internal refactoring.
• When compiling for FPGA, you can now use a system installed with Intel® FPGA PAC D5005 to compile a SYCL application that targets Intel® PAC with Intel® Arria® 10 FX FPGA.
• When compiling for FPGA emulator flow on Windows system, an issue leading to the failure to launch device kernels has been fixed.
• Fixed a compilation issue where it wasn't possible to pass an initializer list for dependency events vector in queue shortcuts with offset parameter.
• Fixed sycl::get_pointer_device throwing an exception when passed a descendent device (sub-device) instead of a root device.
• Fixed memory leak happening when kernel bundles are linked.
• Fixed USM free throwing an exception when passed a context created for a descendent device.
• Fixed a compilation issue when using multi-dimensional accessor's subscript operator.
• Fixed "definition with the same mangled name" error happening when using multiple buffer reductions in a kernel.
• Fixed a compilation issue with SYCL math built-ins when GCC < 11.1 is used as a host compiler.
• Fixed a compilation issue with SYCL math built-ins (such as sycl::modf) not accepting pointers to half.
• Fixed an issue with reductions when MSVC is used as the host compiler.
• Fixed a compilation issue when fully specialized sycl::span is initialized from an array.
• Fixed a crash in Level Zero PI plugins caused by specialization constants not being used on the device side, but present in a program.
• Fixed event leak in the Level Zero plugin.
• Fixed an issue with sub-sub-devices in the Level Zero plugin.
• Fixed an issue with incorrect half conversion on ESIMD emulator.
• Fixed a compilation issue with abs ESIMD function.
• Fixed some warnings coming out of SYCL headers when compiled in C++20 mode.
• Fixed a compilation issue when using multiple bitwise shift operations in ESIMD.
• Fixed a crash in Level Zero PI plugin, which occurs when the runtime tries to reset a command list that does not have a synchronization fence associated with it.
• Fixed a compilation issue with sycl::get_native<sycl::backend::ext_oneapi_cuda>(sycl::device) free function (#6653).
• Fixed a synchronization issue for explicit dependencies (depends_on usage) that are blocked by a host task or host accessor.
• Fixed an issue in the Level Zero plugin, which could cause barriers not to be correctly applied for an entire queue.
• Fixed accessor so gdb can parse its template parameters correctly.
• Fixed uses of common macro names in the implementation's header files.
• Fixed a performance regression related to the command list in the Level Zero backend.
• Fixed cleanup of temporary files produced by unbundling archives.
• Fixed optimizing out device_global variables with internal linkage.
• Fixed an issue when compiling and linking with different optimization levels that could cause runtime errors.
• Fixed description of -f[no-]sycl-unnamed-lambda compiler option.
• Fixed an issue when building SYCL programs in Debug mode with Windows-Clang.cmake.
• Fixed an issue causing incorrect conversions involving unsigned types in ESIMD.
• Fixed a crash in applications containing a mix of unnamed ESIMD and non-ESIMD kernels.
• Fixed an issue when op[] was called with a typedef argument under gdb.

### Known Issues and Limitations

• Customers might see "fatal error: 'iostream' file not found" when trying to compile a simple program with Intel® oneAPI DPC++/C++ Compiler on a Linux* machine if a matching GNU g++ package is not installed. For further details, see the article "fatal error: <C++ header> file not found with Intel® oneAPI DPC++/C++ Compiler".
• This release is not backward compatible with previous releases, which means that existing SYCL applications won't work with the newer runtime without re-compilation.
• There is a potential for incorrect results using OpenMP pragmas to offload to Intel GPUs where a parallel loop nested inside a TEAM construct is using a variable in a REDUCTION clause and the TEAM construct does not have the same REDUCTION clause. To avoid incorrect results, compile with -mllvm -vpo-paropt-atomic-free-reduction-slm=true to disable global memory buffers.
• There is a known issue with using opt-reports with programs containing OpenMP loop constructs with "schedule(dynamic)", which may cause the compiler to emit an error. In this case, it is recommended that the user remove -qopt-report from their compilation.
• Intel® oneAPI DPC++ Compiler 2023.0.0 may not include all the latest functional and security updates. A new version of Intel® oneAPI DPC++/C++ Compiler is targeted to be released by March 2023 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
• If your design has nested loops and data is carried across the loops, you should run simulation to verify that the output is correct. In very rare circumstances, the RTL generated by the compiler for such loop nests is functionally incorrect. If there are any errors in the simulation output, you might be affected by this issue. You can work around the issue by removing the loop nest, either by using the loop-coalesce attribute or by manually changing the code. This issue is scheduled to be fixed in a future version of oneAPI.
• If you use SUSE15 U3 and include the <complex.h> header, you might run into an error: "expanded from macro 'I'". The <complex.h> header defines the macro ‘I’ (https://en.cppreference.com/w/c/numeric/complex/I), but the identifier ‘I’ is widely used in SYCL headers. The error appears on SUSE15 U3 but not on other operating systems because the provided C/C++ headers vary between operating systems.
• When compiling with the following options, -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device xxx" -fopenmp-device-code-split=per_kernel, i.e. Ahead of Time (AOT), and the offload kernel contains print statements, the program will stop with a runtime failure.
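
For the nested-loop item above, the sketch below shows one possible way of "manually changing the code" to remove a loop nest: the two loops are fused into a single loop whose index is split back into a row and a column. It is plain host C++ with hypothetical names (`nested_fold`, `flattened_fold`, `kRows`, `kCols`) and only demonstrates that the flattened form computes the same loop-carried value. It is not an FPGA kernel and does not use the loop-coalesce attribute.

```cpp
#include <array>

constexpr int kRows = 4;  // hypothetical sizes, for illustration only
constexpr int kCols = 3;
using Grid = std::array<std::array<int, kCols>, kRows>;

// Original shape: a loop nest with `acc` carried across both loops.
inline int nested_fold(const Grid& g) {
  int acc = 0;
  for (int r = 0; r < kRows; ++r) {
    for (int c = 0; c < kCols; ++c) {
      acc = acc * 2 + g[r][c];
    }
  }
  return acc;
}

// Flattened shape: a single loop; row and column are derived from one index,
// so the elements are visited in the same order as in the loop nest.
inline int flattened_fold(const Grid& g) {
  int acc = 0;
  for (int i = 0; i < kRows * kCols; ++i) {
    const int r = i / kCols;
    const int c = i % kCols;
    acc = acc * 2 + g[r][c];
  }
  return acc;
}
```

Because both versions visit the elements in the same row-major order, they should return the same value for any input grid.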

• SYCL built-in group algorithms may produce wrong results on CPU or FPGA emulator devices if all of the following conditions are met:
• The work-group size on the highest dimension is larger than the sub-group size
• The group algorithm is applied to the work-group
• The group algorithm produces the same result for all work items in the work group (e.g. all_of_group, any_of_group, group_broadcast, reduce_over_group)
• The group algorithm is used in a loop, and the result may change due to input changes. For example, the following kernel code would produce wrong results (the while loop may not exit or acc[gid] may not be set for all work items due to the known issue):
cgh.parallel_for(
sycl::nd_range<1>(8, 8),
[=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { // work-group size > sub-group size
bool predicate = true;
int gid = item.get_global_id(0);
while (sycl::all_of_group(item.get_group(), predicate)) { // applying all_of_group to the work-group
// and all_of_group is expected to produce same result for all work-items in the group
// and is used inside a loop
acc[gid] = 1;
predicate = false; // the result of all_of_group would change on the second loop iteration because predicate is changing
}
});

The workaround is to set the work-group size equal to the sub-group size.

• SYCL 2020 barriers show worse performance than SYCL 1.2.1 do.
• When using fallback assert in a separate compilation flow, explicit linking against lib/libsycl-fallback-cassert.o or lib/libsycl-fallback-cassert.spv is required.
• Alignment of allocation requests is limited to 64KB, which is the only alignment supported by Level Zero.
• In the following scenario on the Level Zero backend:
1. Kernel A, which uses buffer A, is submitted to queue A.
2. Kernel B, which uses buffer B, is submitted to queue B.
3. queueA.wait().
4. queueB.wait().
The DPCPP runtime used to treat unmap/write commands for buffer A/B as host dependencies (i.e., they were waited for before enqueueing any command that's dependent on them). This allowed the Level Zero plugin to detect that each queue is idle on steps 1/2 and submit the command list immediately. This is no longer the case since these dependencies are now passed in an event waitlist, and the Level Zero plugin attempts to batch these commands, so the execution of kernel B starts only on step 4. The workaround restores the old behavior in this case until this is resolved.
• User-defined functions with the name and signature matching those of any OpenCL C built-in function (i.e., an exact match of arguments, return type doesn't matter) can lead to Undefined Behavior.
• A DPC++ system that has FPGAs installed does not support multi-process execution. Creating a context opens the device associated with the context and places a lock on it for that process. No other process may use that device. Some queries about the device through device.get_info<>() also open up the device and lock it to that process since the runtime needs to query the actual device to obtain that information.
• The format of the object files produced by the compiler can change between versions. The workaround is to rebuild the application.
• Using the sycl::kernel_bundle API to refer to a kernel defined in another translation unit leads to undefined behavior.
• Linkage errors with the following message: error LNK2005: "bool const std::_Is_integral<bool>" (??_Is_integral@_N@std@@3_NB) already defined can happen when a SYCL application is built using MS Visual Studio 2019 version below 16.3.0 and the user specifies -std=c++14 or /std:c++14.
• Printing internal defines is not supported on Windows.
• The usage of new -ax (auto cpu dispatch) is not currently supported when building libraries with -fpic option.
• /Fo<file or dir/> flag no longer accepts directory arguments. Using this flag with a directory argument results in an error message: clang-offload-bundler command failed with exit code 1. Fix is not available in this release.
• Having MESA OpenCL implementation, which provides no devices on a system, may cause incorrect device discovery. As a workaround, such an OpenCL implementation can be disabled by removing /etc/OpenCL/vendor/mesa.icd.
• Compilation may fail on Windows in debug mode if a kernel uses std::array. This happens because the debug version of std::array in Microsoft STL C++ headers calls functions that are illegal for the device code. As a workaround, the following can be done:
1. Dump compiler pipeline execution strings by passing -### option to the compiler. The compiler will print the internal execution strings of compilation tools. The actual compilation will not happen.
2. Modify the (usually) first execution string (it should have -fsycl-is-device option) by adding -D_CONTAINER_DEBUG_LEVEL=0 -D_ITERATOR_DEBUG_LEVEL=0 options to the end of the string.
3. Execute all strings one by one.
• -fsycl-dead-args-optimization cannot eliminate the offset of the accessor even though it is created with no offset specified.
• The compile times can be significant when compiling for FPGA and using a read-only accessor for a very wide struct. As a workaround, use a read-write accessor instead to address long compile times.
• When you perform FPGA compile and link stages with a single dpcpp command (for example, dpcpp -fintelfpga <other arguments> -Xshardware src/kernel.cpp), if the source code is not located in the current directory, you might observe that the source code browser is missing in the generated FPGA optimization reports. To work around this issue, compile and link the executable in separate stages, as follows:
icpx -fsycl -fintelfpga <other arguments> -Xshardware -c src/kernel.cpp -o kernel.o
icpx -fsycl -fintelfpga <other arguments> -Xshardware kernel.o
• When compiling for FPGA, the debug support on Windows is unavailable when using device-side libraries. To avoid this issue, do not run a debugger on the emulator platform on Windows.
• The modulefiles-setup.sh script is not supported for FPGA in this release. As a workaround, use the setvars.sh script.

• On Windows, compiling FPGA designs in a directory with a long path name might fail, and you might see the following error:
dpcpp: error: fpga compiler command failed with exit code 1 (use -v to see invocation)
NMAKE : fatal error U1077: '…\oneAPI\compiler\latest\windows\bin\dpcpp.EXE' : return code '0x1'

As a workaround, either compile the design in a directory with a short path name or reset TMP and TEMP environment variables to point to a shorter path (for example, C:\temp).

• When using the atomic_fence function for FPGA, the memory_scope::system constraint is not supported. The broadest scope supported is the memory_scope::device constraint. There is no workaround available for this currently.

• When compiling for FPGA, the compiler might produce a different intermediate representation (IR) on Windows than Linux. Misaligned structs cause this issue. As a result, some designs that compile with an II=1 on Linux might have, for example, II=10 on Windows. As a workaround, force an alignment on the misaligned structs, as shown in the following example:

//Code with misaligned struct
struct Item {
bool valid;
int value1;
unsigned char value2;
};

//Forced alignment of the struct
struct Item {
bool valid;
bool __empty__[3];
int value1;
unsigned char value2;
unsigned char __empty2__[3];
};
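
A minimal sketch of how the forced layout above could be checked at compile time, assuming a typical ABI with a 1-byte bool and a 4-byte, 4-byte-aligned int (common on x86-64, but an assumption here). The struct name `ItemPadded` and the padding field names `pad0` and `pad1` are hypothetical; they stand in for `__empty__` and `__empty2__` because identifiers containing double underscores are reserved in C++.

```cpp
#include <cstddef>

// Explicitly padded variant of the struct shown above (hypothetical names).
struct ItemPadded {
  bool valid;
  bool pad0[3];           // fills the gap the compiler would otherwise insert before value1
  int value1;
  unsigned char value2;
  unsigned char pad1[3];  // fills the tail padding after value2
};

// Assumption: 1-byte bool, 4-byte int with 4-byte alignment.
static_assert(offsetof(ItemPadded, value1) == 4, "value1 should start at byte 4");
static_assert(offsetof(ItemPadded, value2) == 8, "value2 should start at byte 8");
static_assert(sizeof(ItemPadded) == 12, "no compiler-inserted padding expected");
```

If one of these assertions fails on a given platform, the assumed sizes or alignments do not hold there and the padding arrays would need to be adjusted.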
• The FPGA emulator does not recognize different Avalon interfaces when defining a host pipe. This can lead to unexpected behavior when specifying the Avalon interface type. There is no known workaround for this issue.

• When compiling for FPGA and trying to reduce the II of the II-critical path, the scheduler may return an incorrect II-critical path. This means the compiler reduces the II of the wrong path, and the II goal is not achieved. You might observe this issue only when there are multiple negative cycles in the LSU's critical path. There is no known workaround for this issue. However, your design’s functionality stays unaffected. Performance (QoR) might get degraded slightly.

• When simulating FPGA designs, a design with a host channel might produce two signal mismatch errors, dataBitsPerSymbol and firstSymbolInHighOrderBits:

• dataBitsPerSymbol error can occur in the FPGA IP authoring flow when you specify a dataBitsPerSymbol value that is not equal to 8. As a workaround, set the dataBitsPerSymbol to 8.

• firstSymbolInHighOrderBits error can occur in the FPGA IP authoring flow when you set firstSymbolInHighOrderBits to false. As a workaround, set the firstSymbolInHighOrderBits to true.

• With the FPGA IP Authoring flow, you can intuitively integrate your design into the Platform Designer by copying the generated .prj folder into your Intel® Quartus® Prime project directory. The Platform Designer detects the project automatically. However, there is a known issue with the generated hw.tcl file, which is not mapping the signals correctly. To work around this issue, follow these steps on both Linux and Windows systems:

1. Add python to your PATH environment variable to run python from your command line.

2. Execute the following commands to run the <kernel-name>_di_hw_tcl_adjustment_script.py python script generated in your .prj directory before integrating your IP authoring kernel into the Platform Designer:

$ cd <kernel_name>.prj
$ python <kernel-name>_di_hw_tcl_adjustment_script.py

• When compiling an FPGA kernel that calls the sycl::ext::oneapi::experimental::printf() function, the compiler issues the following warning message:
compiler warning: argument 'llvm_fpga_printf_buffer_start' on component '<your kernel name>' is never used by the component. Note that the compiler may optimize it away.
There is no known workaround for this issue. However, you can ignore this warning since it does not impact the kernel’s functionality.

• When compiling for FPGA, if your SYCL code contains the std::popcount function inside a fixed-size loop (bit-widths not in 8, 16, 32, or 64), it gets mapped directly into llvm.ctpop, and the compilation fails with an error message. There is no known workaround for this issue. However, Intel recommends avoiding the use of the std::popcount function inside loops.

• On the Windows system, the standalone Intel® oneAPI FPGA Reports Tool application might fail to run on a mapped network drive and display "GPU process launch failed" error message on the console. As a workaround for this issue, copy the Intel® oneAPI FPGA Reports Tool application from the mapped network drive to your local computer and run it locally.

• The Intel FPGA IP authoring encryption flow is not fully supported on Windows systems.

• In the Intel FPGA IP authoring flow, the fpga_tools::UnrolledLoop utility defined in the unrolled_loop.hpp code sample header file does not support the kernel argument interface macros (mmhost, conduit_mmhost, and register_map_mmhost). For example:

fpga_tools::UnrolledLoop<ROWS>([&](auto row) {
#pragma unroll
for (int i = COLS - 1; i > 0; i--) {
shift_reg[row][i] = shift_reg[row][i - 1];
}
shift_reg[row][0] = MA[col * ROWS + row];
});

As a workaround, use the #pragma unroll before a for loop, as shown in the following example:

#pragma unroll
for (int row = 0; row < ROWS; row++) {
#pragma unroll
for (int i = COLS - 1; i > 0; i--) {
shift_reg[row][i] = shift_reg[row][i - 1];
}
shift_reg[row][0] = MA[col * ROWS + row];
}

## Notices and Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from a course of performance, course of dealing, or usage in trade.

#### Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.