User Guide

  • 2021.6.0
  • 07/20/2022
  • Public Content

Getting Started

Prerequisites

  • Intel® Cluster Checker must be accessible by the same path on all nodes.
  • A readable, writable shared directory must be available from the same path on all nodes for temporary file creation.
    • $HOME is used as the shared directory by default, but you can change this by setting the environment variable $CLCK_SHARED_TEMP_DIR to the shared directory.
    • For admin privileged users, such as root, the environment variable $CLCK_SHARED_TEMP_DIR must be explicitly set.
  • Determine whether passwordless ssh access to all nodes is set up (for example, test whether the following command responds with a valid hostname without asking for ‘Password:’). If passwordless ssh to all nodes is available, go ahead with Environment Setup and Running using Slurm below.
    • By default, Intel® Cluster Checker is configured to use passwordless ssh (through the pdsh command) to launch remotely on the nodes of the cluster. Note: you may need to enable passwordless access in your local ssh configuration.
    • Intel® Cluster Checker can communicate via MPI rather than pdsh. Using this feature requires the Intel® MPI Library to be set up, plus an edit to an XML configuration file.
      • Locate and copy clck.xml, found in <installdir>/clck/<version>/etc/clck.xml
      • In the <collector> section, uncomment <extension>mpi.so</extension> by removing the commenting statements: the <!-- in the line above and the --> in the line after.
      • When launching clck, use the -c flag to point to your new copy of clck.xml, and it will now communicate via MPI rather than pdsh.
      • Note: In some scenarios you may need to include the Intel® MPI environment variable I_MPI_HYDRA_BOOTSTRAP=<arg> with the appropriate bootstrap agent. See the Intel® MPI Library documentation for details on the options for this variable.
      • To revert to pdsh, either do not use the -c flag and use the default clck.xml, or put the comments around <extension>mpi.so</extension> again.
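Sketched below is the relevant fragment of clck.xml before and after the edit (only the <collector> and <extension> elements named above are shown; the surrounding file contents are elided):

```xml
<!-- Before the edit: the MPI collector extension is commented out -->
<collector>
  <!--
  <extension>mpi.so</extension>
  -->
</collector>

<!-- After the edit: the comment markers are removed so clck loads mpi.so -->
<collector>
  <extension>mpi.so</extension>
</collector>
```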
For example, to run the health_base checks using your edited configuration:
clck -c <path/to/local/copy/of/clck.xml> -F health_base -f ./nodefile
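To sanity-check passwordless ssh across all nodes before a run, a minimal sketch such as the following can help (the function name and nodefile format are assumptions; BatchMode=yes makes ssh fail instead of prompting for a password):

```shell
# check_passwordless_ssh <nodefile>
# Prints "ok: <node>" when ssh works without a password, "FAIL: <node>" otherwise.
check_passwordless_ssh() {
  while read -r node _; do
    # skip blank lines and comment lines
    case "$node" in ''|'#'*) continue ;; esac
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1; then
      echo "ok: $node"
    else
      echo "FAIL: $node"
    fi
  done < "$1"
}
```

Any "FAIL" lines must be resolved before relying on the default pdsh launch mechanism.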

Environment Setup

Before you start using Intel® Cluster Checker, you will need to establish the proper runtime environment. If you are new to Linux, this means making sure the command line is set up to find the applications just installed. Helper scripts are provided to accomplish this. For full functionality, Intel® Cluster Checker expects the following items to be loaded in the environment correctly: Intel® Cluster Checker, Intel® MPI Library, Intel® Math Kernel Library, and Intel® Distribution for Python.
  • If using the Intel® oneAPI HPC Toolkit
    • The oneAPI toolkit includes a setvars.sh|csh script in the installation folder that will analyze all software installed from oneAPI and add it to your path.
    • Each Intel® oneAPI tool also includes an individual environment setup script in its env folder: oneapi/<tool>/<version>/env/vars.(c)sh
source /opt/intel/oneapi/setvars.sh
  • Or, if you would rather source individual packages directly, there are vars.sh|vars.csh scripts in /opt/intel/oneapi/<tool>/<version>/env/. Please note that sourcing MPI last is important because the Intel® Distribution for Python also includes an mpirun; we want to ensure the Intel® MPI Library is being used for MPI. You can validate which MPI is in use with the command which mpirun, looking for the path to oneapi/mpi/latest/bin/mpirun
source /opt/intel/oneapi/mkl/latest/env/vars.sh
source /opt/intel/oneapi/intelpython/latest/vars.sh
source /opt/intel/oneapi/clck/latest/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
If using Intel® Parallel Studio XE rather than oneAPI, source its combined environment script instead:
source psxevars.[sh | csh]
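Because the sourcing order matters, you can confirm which mpirun won on $PATH with a quick sketch (the oneapi/mpi path pattern matches the default install location; adjust it for custom installs):

```shell
# Report whether the mpirun found first on $PATH comes from the Intel MPI Library.
check_mpirun() {
  p=$(command -v mpirun 2>/dev/null) || { echo "mpirun not found on PATH"; return 1; }
  case "$p" in
    */oneapi/mpi/*) echo "Intel MPI in use: $p" ;;
    *) echo "WARNING: unexpected mpirun: $p" ;;
  esac
}
```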
  • An alternative to these scripts is ‘modulefiles’ to set up your runtime environment.
    • Versioned modulefiles for all above components can be installed and loaded with Intel® oneAPI.
    • Additionally the Intel® Cluster Checker modulefile is available using the module commands
module use <install_directory>/clck/<version>/modulefiles
module load clck
If the syscfg system configuration utility or the ‘OSU micro-benchmarks’ were installed, make sure these were also added to the environment path variable $PATH.
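A quick way to confirm the optional tools are reachable is a small $PATH check (the helper name and the osu_latency example are assumptions):

```shell
# check_on_path <tool>... - report whether each named tool resolves via $PATH.
check_on_path() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}
# e.g. check_on_path syscfg osu_latency
```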
Running using an Individual Nodefile
The command line for Intel® Cluster Checker is clck. If you type clck at the Linux command line, hit enter, and it returns “command not found”, then the environment setup is not correct.
A nodefile specifies which nodes to include and, if applicable, their roles. Intel® Cluster Checker contains a set of pre-defined roles. A separate hostname appears on each line. If no role is specified for a node, that node is considered a compute node. The following example includes four compute nodes.
[user]# cat nodefile
node1
node2
node3
node4
A cluster with a single node would include only one hostname in the nodefile. localhost is not a recommended hostname; use the values returned by the hostname command on the servers themselves, and make sure they are network resolvable.
You can then do your first run for Intel® Cluster Checker by running
clck -f <nodefile>
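Since each hostname must be network resolvable, you can screen a nodefile before the first run with a sketch like this (it uses getent, so a Linux node is assumed; the function name is an assumption):

```shell
# check_nodefile <nodefile> - flag hostnames that do not resolve from this node.
check_nodefile() {
  while read -r node _; do
    # skip blank lines and comment lines
    case "$node" in ''|'#'*) continue ;; esac
    if getent hosts "$node" >/dev/null 2>&1; then
      echo "resolvable: $node"
    else
      echo "UNRESOLVABLE: $node"
    fi
  done < "$1"
}
```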
Running using Slurm
Regardless of whether you are using a batch script (sbatch) or allocating nodes (salloc), Intel® Cluster Checker uses the list of nodes allocated through Slurm automatically, unless you override it with the individual nodefile option -f.
Do not use the command srun to start Intel® Cluster Checker. Only use the clck command (or clck-collect, clck-analyze, etc.), as a parallel job for remote data collection is already built in.
If running on the command line with a salloc Slurm resource allocation, remember to have set up the environment. You can then launch Intel® Cluster Checker by running the command:
clck
If running with sbatch, you can run Intel® Cluster Checker using a Slurm script that includes the environment setup above, through your choice of environment setup script(s) or module commands:
source /opt/intel/oneapi/setvars.sh
clck
or for specific components:
source /opt/intel/oneapi/intelpython/latest/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
source /opt/intel/oneapi/clck/latest/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# alternatively use psxevars.[sh | csh] or setvars.sh (Intel oneAPI), or modulefiles to set up the environment
clck
You can then run
sbatch <script_name>
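A complete submission script might look like the following sketch (the job name, node count, time limit, and setvars.sh path are assumptions to adjust for your site):

```shell
#!/bin/bash
#SBATCH --job-name=clck_health
#SBATCH --nodes=4
#SBATCH --time=00:30:00

# Set up the environment inside the batch job, then launch clck directly;
# no srun is needed because clck handles remote data collection itself.
source /opt/intel/oneapi/setvars.sh
clck -F health_base
```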
In both of the above cases, Intel® Cluster Checker will generate a summary output, an in-depth clck_results.log, and a separate clck_execution_warnings.log file.

User-Specific Workflows

Intel® Cluster Checker uses what we call a ‘Framework Definition’ to specify what data is collected, how data is analyzed, and how that information is displayed. By default, Intel® Cluster Checker runs the ‘health_base’ Framework Definition, which provides a quick overall examination of the health of the cluster. Intel® Cluster Checker provides a wide variety of Framework Definitions. We describe here the highest level Framework Definitions for particular types of users; however, you can get a full list of available Framework Definitions by running
clck -X list
You can get further details of a specific Framework Definition with the option -X and its name, e.g.
clck -X cpu_base
clck -X select_solutions_sim_mod_user_plus_2021.0 | more
clck -X health_base | more
The rest of this page includes some of the more commonly used Framework Definitions that can be helpful depending on your role. You can also find a full list of Framework Definitions in the Reference section.
Admin:
For the privileged user, there are four different common-use Framework Definitions for cluster analysis. When first running as an administrator, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. These are preliminary checks that would work for either user or administrator. For a more comprehensive, administrator-specific run, next run
clck <options> -F health_admin
If you want to extend to further in-depth checking of your cluster’s uniformity, you can also include the Framework Definitions ‘lshw_hardware_uniformity’, which will find discrepancies in hardware or firmware between nodes, and ‘kernel_parameter_uniformity’, which will analyze the uniformity of the kernel setup, by using
clck <options> -F health_extended_admin
If the optional ‘syscfg’ system configuration utility has been installed, you can check that the system is configured uniformly across the nodes by running
clck <options> -F syscfg_settings_uniformity
You can run all of the above in a single run by running multiple framework definitions at once.
clck <options> -F health_extended_admin -F syscfg_settings_uniformity
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that some of the user-level Framework Definitions may not run well as root since they include running of an MPI parallel application.
Here is an overview of all the embedded tests the health_extended_admin framework definition contains. As you can see, health_extended_admin is a superset of health_admin, kernel_parameter_uniformity, and lshw_hardware_uniformity; these framework definitions may in turn have additional tests they perform:
health_extended_admin
|-- health_admin
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_shells
|   |-- cpu_admin
|   |-- dgemm_cpu_performance
|   |-- mpi_bios
|   |-- infiniband_admin
|   |-- kernel_version_uniformity
|   |-- local_disk_storage
|   |-- memory_uniformity_admin
|   |-- mpi_libfabric
|   |-- opa_admin
|   |-- perl_functionality
|   |-- privileged_user
|   |-- python_functionality
|   |-- rpm_uniformity
|   |-- services_status
|   `-- stream_memory_bandwidth_performance
|-- kernel_parameter_uniformity
`-- lshw_hardware_uniformity
Note:
Administrators and privileged users must be aware that the data they collect with privileges may contain information about the servers that should be protected, such as system MSR settings. It is highly recommended that the database a privileged user creates is protected and not shared with users who should not have access to that type of information.
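As one precaution, the database file can be restricted to its owner. A minimal sketch (the database path varies by installation and the example path is hypothetical; pass the path your site actually uses):

```shell
# protect_db <path-to-database> - remove group/other access and show the result.
protect_db() {
  chmod 600 "$1" || return 1
  # print the resulting permission bits for confirmation
  stat -c '%a %n' "$1" 2>/dev/null || ls -l "$1"
}
# e.g. protect_db "$HOME/.clck/clck.db"   # hypothetical path
```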
User:
For the non-privileged cluster user, there are three common-use Framework Definitions for cluster analysis. When first running, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. In the event that you desire more extended checking, including several lightweight performance checks (IMB, SGEMM, STREAM), you can next run
clck <options> -F health_user
To add more extensive performance checking (DGEMM, HPL) to the above, you can next run
clck <options> -F health_extended_user
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that not all tools are user-accessible, so some may report data missing.
Here is an overview showing how the health_extended_user framework definition is a package containing many different sets of tests, including other framework definitions that contain even more checks and tests, such as health_user and health_base:
health_extended_user
|-- health_user
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_internode_connectivity
|   |-- basic_shells
|   |-- file_system_uniformity
|   |-- imb_pingpong_fabric_performance
|   |-- kernel_version_uniformity
|   |-- memory_uniformity_user
|   |-- mpi_local_functionality
|   |-- mpi_multinode_functionality
|   |-- perl_functionality
|   |-- python_functionality
|   |-- sgemm_cpu_performance
|   `-- stream_memory_bandwidth_performance
|-- dgemm_cpu_performance
`-- hpl_cluster_performance

Intel® MPI Library Troubleshooting

Admin:
For the privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_admin
This Framework Definition helps debug BIOS, software, environment, and hardware issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.
User:
For the non-privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_user
This Framework Definition helps debug environment and software issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.