Intel® oneAPI Toolkits Installation Guide for HPC Clustered Environments

Introduction

This guide provides instructions for administrators of HPC/cluster systems to help them install Intel® oneAPI Toolkits in a multi-user environment.

The installation workflow consists of the following steps:

  • Step 1: Download and install Intel® oneAPI toolkits packages
  • Step 2: Install the Intel GPU drivers if you plan to use GPU Accelerator support
  • Step 3: Set up the Intel® Graphics devices
  • Step 4: Generate and Set Up Module Files

Preinstallation Planning for Multi-machine Environments

Before installing Intel® oneAPI toolkits in an environment more complicated than a single machine under the user's super-user control (for example, a multi-machine, multi-user cluster with a shared file system), there are some things you need to know and plan for.

Operating System Requirements

Please check the System Requirements page to learn more about compatibility details for the Intel® oneAPI Base and HPC toolkits. Note that the list of supported operating systems differs depending on whether you use Intel® oneAPI Toolkits with Intel GPUs, or without. You may need to defer installing Intel® oneAPI toolkits until after you have upgraded your cluster to one of the supported operating systems, or until after you have set aside a dedicated subcluster running a supported operating system.

File System Requirements

By default, the Intel® oneAPI toolkits installers place the product files in the following directories:

  • root: /opt/intel/oneapi
  • user: ~/intel/oneapi 

Subsequent releases are installed into the same location by default, although this can be worked around. Depending on how you expose other software packages on your cluster, you can:

  • Install the releases entirely in parallel with each other. Each major version of the software will exist in its own directory tree, resulting in something like /opt/intel/oneapi_<version1>/<tool> and /opt/intel/oneapi_<version2>/<tool>. Currently, to install this way and maintain the ability to uninstall the software, you need to complete a number of manual steps.

  • Install the releases in the same directory tree. In this case, versions of the software that can coexist end up installed in directories like /opt/intel/oneapi/<tool1>/<version1> and /opt/intel/oneapi/<tool1>/<version2>, while those that cannot coexist end up with only a single version of the tool installed (for example, after installing the next version of the Intel oneAPI toolkit, one of the tool directories may contain only the /opt/intel/oneapi/<tool2>/<version2> directory). This is the default method for installing the Intel® oneAPI toolkits.

Likewise, the user-mode Intel GPU drivers install their components to /etc/OpenCL/vendors and /usr/lib/x86_64-linux-gnu or similar. Rather than install these drivers on every machine in your cluster or update the boot image for your cluster nodes every time you update your Intel GPU drivers, you can put these files on a shared file system where they are available to all nodes without individual installs or boot image rebuilds.  Please decide how you will expose the GPU drivers to your users.

The kernel-mode GPU driver needs to be installed on every machine in your cluster, likely via the boot image you create for your cluster nodes.  This will rarely require updates (likely only when you update the operating system on your cluster).

Network Access Requirements

All but one of the installation methods require a connection to the Internet on the installation machine. The exception is the local installer, which can be downloaded and then copied to the installation machine. To install software on your cluster using a package manager like YUM, APT, or Zypper, download the packages on an Internet-connected machine and then copy them to the installation machine.

System Image Requirements

Systems with GPUs need to boot in a special manner, have additional software installed, and have some system variables and permissions set differently from non-GPU systems. Refer to Steps 2 and 3 to understand the necessary changes in your system images or post-boot scripts for cluster nodes.

Step 1: Download and Install Intel® oneAPI Toolkits Packages

You can obtain and install Intel oneAPI packages using one of the following options:

Installing Intel® oneAPI Toolkits Packages to Non-default Directories Using the Intel Installer

The installer supports using a non-default installation directory for the first installation when you use the custom or silent installation methods. However, after the first installation, the installer leaves behind a database with the directory information. This may prevent you from installing other toolkits in directories other than the one used for the first toolkit installation, or from installing components separately to different directories.

The Intel® oneAPI Toolkits Installation Guide for Linux* OS provides the instructions for installing packages or components to different directories by deleting the installer database.  

Note If you decide to install different Intel oneAPI releases into different directories (for example, /opt/intel/oneapi_<version1>/<tool> and /opt/intel/oneapi_<version2>/<tool>), we recommend that you install all toolkits of a given release into the same directory. This ensures that there are no unexpected dependencies between one of the toolkits and the Base Toolkit.

Please be aware that removing the installer database breaks the product uninstall feature. However, you can still uninstall the product manually.

Note After installing a product, the installer creates the database again, and you must remove it again before installing the next product into a different directory.

For example:

  • Install Beta09 in the /opt/intel/beta09 directory
  • Remove /var/intel/installercache/packagemanager.db
  • Install Beta10 in another directory, for example, /opt/intel/beta10
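
This sequence might look like the following minimal sketch. The installer file names are hypothetical, and the -a, --silent, --eula, and --install-dir options are assumptions to verify against the command-line options of your installer:

sudo sh ./l_HPCKit_beta09_offline.sh -a --silent --eula accept --install-dir /opt/intel/beta09
sudo rm /var/intel/installercache/packagemanager.db
sudo sh ./l_HPCKit_beta10_offline.sh -a --silent --eula accept --install-dir /opt/intel/beta10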

Note A less destructive way of supporting multiple parallel installations of Intel® oneAPI toolkits using the Intel installers (as opposed to a package manager) is to rename the packagemanager.db file to a name that reflects the version of the Intel® oneAPI toolkit it belongs to, as sketched below. Then, when it comes time to uninstall a version of the Intel® oneAPI toolkit, copy the renamed database back into the /var/intel/installercache/ directory with the expected name (packagemanager.db) and uninstall using the instructions in TBD link: Uninstalling oneAPI Toolkits and Components.
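
A minimal sketch of the rename-and-restore approach (the versioned file name is hypothetical):

# After installing a given release, set its installer database aside under a versioned name
sudo mv /var/intel/installercache/packagemanager.db /var/intel/installercache/packagemanager.db.beta09
# Before uninstalling that release, restore the database under its expected name
sudo mv /var/intel/installercache/packagemanager.db.beta09 /var/intel/installercache/packagemanager.db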

Installing oneAPI Packages to Non-default Directories Using Relocatable RPMs

TBD

Step 2: Install Intel GPU Drivers if You Plan to Use GPU Accelerator Support

If you plan to use Intel® oneAPI with Intel GPUs, you need to install the GPU drivers. The drivers are broken down into two parts: kernel-mode drivers (KMD) and the user-mode (UMD) Intel GPU driver.

Kernel-mode Drivers (KMD) and Firmware 

On every node that will host a GPU, a recent enough kernel-mode driver (for example, the i915 kernel module) and the related GPU firmware need to be installed. For built-in kernel support for Intel GPUs, refer to the hardware support matrix of your Linux distribution. If the Linux kernel used by the distribution does not support your GPU hardware, there are two options:

  • Upgrade the whole kernel to a more recent one
  • Recompile a backported i915 kernel module and load it into the kernel through Linux dynamic kernel module support (DKMS)
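
A quick way to confirm that the kernel-mode driver is in place on a node (a minimal sketch):

lsmod | grep i915     # the i915 module should be loaded
dkms status           # only relevant if you use a backported DKMS module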

User-mode Drivers (UMD)

The user-mode OpenCL* ICD Loader installs into the system directories and needs to be installed on every node that will host a GPU. To install it, along with the clinfo utility, use the following command:

sudo apt-get install clinfo ocl-icd-libopencl1

By default, the user-mode Intel GPU drivers install their components to /etc/OpenCL/vendors and /usr/lib/x86_64-linux-gnu, or similar. Rather than install these files on every machine in your cluster or update the boot image for your cluster nodes every time you update your Intel GPU drivers, you can put these files on a shared file system where they are available to all nodes without individual installs or boot image rebuilds.

There are multiple ways to get these files to the desired location on the shared file system:

  • Install on one system and then copy the UMD files to the shared file system
  • Unpack the *.deb/*.rpm files that make up the driver using dpkg or rpm2cpio and copy the results to the shared file system
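
A minimal sketch of the second approach, unpacking the driver packages into a versioned directory on the shared file system (the package file names and target path are hypothetical):

dpkg-deb -x intel-opencl-icd_<version>_amd64.deb <cluster_app_directory>/IntelGPUDrivers/<version>
# On RPM-based systems:
#   cd <cluster_app_directory>/IntelGPUDrivers/<version> && rpm2cpio intel-opencl-<version>.rpm | cpio -idm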

There are two classes of files that need to be moved:

  • *.icd files, which are installed by default in /etc/OpenCL/vendors
    • When moving these files into a directory on your shared file system, it is recommended that you put them in a versioned location, even though the contents rarely change from release to release
    • Inspect each *.icd file and remove any paths it may contain, leaving just the bare file name. For example, change /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so to libigdrcl.so. Alternatively, adjust the full path to point to the new location of the file in your shared file system.
    • Set up the following environment variables for your users when they invoke a particular driver version, so that these *.icd files are found:
      • OPENCL_VENDOR_PATH – should point to the directory on the shared file system with the *.icd files for this driver version
      • OCL_ICD_VENDORS – should point to the directory on the shared file system with the *.icd files for this driver version
  • Shared libraries (contained in multiple directories) that make up the actual driver version, which are installed by default in /usr/lib/x86_64-linux-gnu, or similar
    • To more easily identify the needed versions of the files, unpack the *.deb/*.rpm files that make up the driver to a local directory and then copy them to the versioned driver directory on your shared file system
    • Prepend the location of the versioned driver directory to LD_LIBRARY_PATH. For example:
      export DRIVERLOC=<cluster_app_directory>/IntelGPUDrivers/21.19.32768
      export LD_LIBRARY_PATH=$DRIVERLOC:$LD_LIBRARY_PATH

      Optionally, you can append the location in your shared file system and the name of the main OpenCL* driver library noted in the ICD files to OCL_ICD_FILENAMES. For example:

      export DRIVERLOC=<cluster_app_directory>/IntelGPUDrivers/21.19.32768
      export OCL_ICD_FILENAMES=/opt/intel/oneapi/compiler/latest/linux/lib/x64/libintelocl.so:$DRIVERLOC/intel-opencl/libigdrcl.so

      A consolidated sketch of these settings follows this list.
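
The combined environment an administrator might provide to users (for example, in a sourceable script or a module file) for one driver version could look like this; the <cluster_app_directory>, <driver_version>, and vendors subdirectory names are placeholders for your shared file system layout:

export DRIVERLOC=<cluster_app_directory>/IntelGPUDrivers/<driver_version>
export OPENCL_VENDOR_PATH=$DRIVERLOC/vendors
export OCL_ICD_VENDORS=$DRIVERLOC/vendors
export LD_LIBRARY_PATH=$DRIVERLOC:$LD_LIBRARY_PATH
# Optional, as described above:
export OCL_ICD_FILENAMES=/opt/intel/oneapi/compiler/latest/linux/lib/x64/libintelocl.so:$DRIVERLOC/intel-opencl/libigdrcl.so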

       

Step 3: Set Up the Intel® Graphics Devices

You need to modify system parameters on the nodes that use Intel GPUs for computing tasks. Some of these actions only need to be done once; others need to be repeated at every reboot.

GPU: Disable Hangcheck

By default, the GPU driver assumes that tasks will run on the GPU for only a few seconds. Many cluster applications tend to run for longer than this, and, if they need to use the GPU as part of their execution, the systems with GPUs will need to be modified to allow long-running GPU tasks.

For instructions on how to tell the Intel GPU driver to allow long-running tasks on the GPU, refer to this document (TBD: link to Section 8 of Linux IG).
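
A common way to do this, shown here as a hedged sketch (the linked instructions are authoritative; the sysfs path applies to the i915 driver):

# Disable hangcheck until the next reboot
sudo sh -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
# To make the change persistent, add i915.enable_hangcheck=0 to the kernel boot
# parameters (for example, in the GRUB configuration) of your GPU node image.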

Set Permissions on Graphics Devices

Before users can make use of an Intel GPU, they need to be given the permission to access the /dev/dri/card* and /dev/dri/renderD* devices.  The /dev/dri/card* device is used for direct rendering and provides full privileged access to the GPU hardware.  The /dev/dri/renderD* device gives users non-privileged access to the GPU hardware, which is typically all that is needed for compute.

By default, access to these two devices is limited to the local "render" group (on Ubuntu* 19 and higher, CentOS* 8, and Fedora* 31) or "video" group (on Ubuntu* 18, Fedora* 30, and SLES* 15 SP1). The "render" group was introduced on RHEL* 8.x and Ubuntu* 19.x for users requiring less-privileged use of the GPU for things like computation. The "video" group gives much more privileged access to the GPU.
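
A quick way to check which group currently owns the devices on a node (a minimal sketch):

ls -l /dev/dri/card* /dev/dri/renderD*
getent group render video    # shows which of the two groups exists and who belongs to it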

To allow non-root users to run compute tasks on nodes with Intel GPUs, you have three options:

  1. Assign each user who might want to compute on Intel GPUs to the local "render" or "video" group (depending on the OS version) on every node with an Intel GPU. This may be impractical if you have a large number of GPU nodes or use a system image for your cluster nodes that is not updated often (updates are the only time you could add additional users to the "render" or "video" groups). See the sketch after this list.
  2. Assign each user who might want to compute on Intel GPUs to a network group (coming from LDAP or similar). Then, on every node with an Intel GPU, either assign this group to the /dev/dri/card* and /dev/dri/renderD* devices with a script that runs after the node boots (for example, with sudo chown root:<network_group> /dev/dri/renderD128), or use the udev device manager to do this at boot time using a udev rule:
    • As root, create a rules file in the /etc/udev/rules.d directory.
    • Put the following in this file:
      SUBSYSTEM=="drm", KERNEL=="card*", GROUP="<network_group>"
      SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="<network_group>"
      

      Note The network may not be fully up when the udev rule runs, therefore you must use the numerical equivalent of the group name in the udev rule. For example, in the output of the command getent group <network_group>, use the number that appears just before the list of users: <network_group>:VAS:29647:user1…

  3. Make the Intel GPU accessible to any user who logs into a node with a GPU. This can be most practical in systems where users can access a node only by going through a reservation system that controls what nodes each user can access. This also means you do not need to update the system or system image every time a user is authorized/removed from your cluster, or when their access rights change. In this case, the contents of the /etc/udev/rules.d/99-dri-change-group.rules file would be something like this:
    SUBSYSTEM=="drm", KERNEL=="card*", GROUP="users", MODE="0666"
    SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="users", MODE="0666"
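
A short follow-up sketch for the options above (the user and group names are placeholders):

# Option 1: add a user to the local group that owns the devices
sudo usermod -aG render <user_name>   # use "video" instead of "render" on older distributions
# Options 2 and 3: apply a new or changed udev rules file without rebooting
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=drm
ls -l /dev/dri                        # confirm the expected group and permissions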

     

Set Up for GPU Profiling Using Intel® VTune™ Profiler

For successful use of Intel® VTune™ Profiler on a cluster node with an Intel GPU, you need to perform the actions listed below. Some of them survive a system reboot; others need to be repeated after each boot with a post-boot script.

  1. Enable collection of GPU hardware metrics by non-privileged users. This is managed by the dev.i915.perf_stream_paranoid sysctl option. For instructions on how to modify it, refer to https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/installation/set-up-system-for-gpu-analysis.html.
  2. Allow non-privileged users of the Intel® VTune™ Profiler to access the Linux* FTrace file system. To do this, use the /opt/intel/oneapi/vtune/latest/bin64/prepare-debugfs.sh script. You can find additional information about this at https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/custom-analysis/custom-analysis-options/linux-and-android-kernel-analysis.html#linux-and-android-kernel-analysis_LIMITATION.
    When running the script, we recommend that you use the -i option, and also the -g option to grant access to the same group that was used when installing the VTune sampling driver (that is, the group passed to insmod-sep with its -g option). See https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/set-up-analysis-target/linux-targets/building-and-installing-the-sampling-drivers-for-linux-targets.html in the Install the Sampling Drivers section for more details and instructions on how to build the sampling drivers for the OS on your cluster nodes.
  3. If your nodes are running Ubuntu*, set the value of the kernel.yama.ptrace_scope sysctl variable to 0. Refer to https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/troubleshooting/error-ptrace-sys-call-scope-limited.html for instructions on how to do this.
  4. Despite the warnings seen by most VTune users, it is not necessary to rebuild the i915 driver with CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y, as suggested by https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/installation/set-up-system-for-gpu-analysis.html. Adequate information about compute jobs can be obtained without it.
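
A consolidated post-boot sketch of items 1, 2, and 3 above (the VTune installation path and group name are placeholders; the linked pages are authoritative):

sudo sysctl -w dev.i915.perf_stream_paranoid=0   # item 1: allow GPU hardware metric collection
sudo /opt/intel/oneapi/vtune/latest/bin64/prepare-debugfs.sh -i -g <vtune_group>   # item 2
sudo sysctl -w kernel.yama.ptrace_scope=0        # item 3: Ubuntu* only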

Set Up for GPU Debugging Using Intel® Distribution for GDB*

Before you can use the Intel® Distribution for GDB* to debug code running on an Intel GPU, you need to install and load the debug companion driver.   You may need to load it after each reboot. You can find information on how to do this at https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-debugging-dpcpp-linux/top.html.

Step 4: Generate and Set Up Module Files

Generate and Set Up Modules for oneAPI Components

Follow the instructions provided in Using Module Files to Set Up Your Development Environment for use with oneAPI Toolkits.

Generate and Set Up Modules for Compute Drivers for Intel GPUs

TBD

Testing Your Installation and Helping Your Users

You have now installed and set up Intel® oneAPI toolkits.  To support your users, you may need some of the following information.

What Graphics Driver Is Installed/Visible/Working on the System

If you installed clinfo as instructed in the User-mode Drivers (UMD) section, you can test that your users can see the proper OpenCL* devices and drivers:

  • First, as a non-privileged user, set up your oneAPI environment by sourcing the setvars.sh file in your equivalent of the /opt/intel/oneapi directory.
  • Run clinfo -l to determine what OpenCL* devices are available.  Missing devices are usually a symptom of missing or duplicated *.icd files, bad paths in the *.icd files, or incorrectly set OPENCL_VENDOR_PATH, OCL_ICD_VENDORS, OCL_ICD_FILENAMES environment variables.
  • Run clinfo without any arguments and pipe the result to a file. In this file you will find details about the discovered OpenCL* devices, in particular hardware characteristics and the version number of the installed driver, which you can cross-check against the driver version you expect from how you set up the environment. Grep commands can be used to extract condensed information when clinfo is run without arguments, as shown in the sketch after this list. Some interesting strings to search for are:
    • Device Name
    • Device Version
    • Max compute units
    • Max number of sub-devices
    • Max sub-groups 
    • Sub-group sizes
    • Global memory size
    • Max memory allocation
    • Preferred work group size multiple
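
For example, a condensed report might be produced like this (a sketch; adjust the pattern to the strings you care about):

clinfo > clinfo_full.txt
grep -E "Device Name|Device Version|Max compute units|Global memory size" clinfo_full.txt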

Basic Admin Checks

  • View permissions on the compute devices
    ls -l /dev/dri

     

  • Check that Level Zero is installed
    dpkg -l | grep -i level-zero
    This may not work if you did not install the drivers using a package manager, but instead point to them via environment variables and custom ICD files.
  • Check the driver packages installed on the system
    dpkg -l | grep -i graphics
    This may not work if you did not install the drivers using a package manager, but instead point to them via environment variables and custom ICD files.
  • Make sure the device shows up on the system
    sudo lspci -k | grep "VGA compatible"
    sudo cat /sys/kernel/debug/dri/0/i915_capabilities | grep "platform:"

     

  • Check that the expected GPU generation is reported by OpenCL*, for example:
    clinfo | grep -i gen12

     

Resolving GPU Issues

  • If your users report that their GPU jobs are aborting unexpectedly, make sure you have disabled hangcheck, as described in Step 3. The kernel log can be checked for GPU driver reset messages as follows:
    $ dmesg |grep i915 | tail -n1
    [3481725.904059] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

     

Mixing and Matching oneAPI Components

If you installed multiple oneAPI versions in the same directory tree, your users can switch between them via custom configuration files. You can find more information on how to do this at https://software.intel.com/content/www/us/en/develop/documentation/using-configuration-file-for-setvars-sh/top.html.
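
A hedged example of such a configuration file and of how a user might apply it. The file name, component names, and versions are placeholders; see the linked page for the exact syntax supported by setvars.sh:

# Hypothetical contents of $HOME/oneapi-mix.cfg:
#   mkl=<version2>
#   tbb=<version3>
source /opt/intel/oneapi/setvars.sh --config="$HOME/oneapi-mix.cfg"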

Your users can also source the configuration files for each particular product they want to run together. For example, they can source files including:

source /opt/intel/oneapi/compiler/<version1>/env/vars.sh 
source /opt/intel/oneapi/dpcpp-ct/<version1>/env/vars.sh 
source /opt/intel/oneapi/mkl/<version2>/env/vars.sh  
source /opt/intel/oneapi/tbb/<version3>/env/vars.sh 

If you are using module files, you need to write your own instructions on how to do this and ensure that products that are not compatible are not combined together.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.