User Guide

  • 2021.6.0
  • 07/20/2022
  • Public Content

Configuring Intel® Cluster Checker

To run Intel® Cluster Checker, use the command clck, which performs data collection followed by immediate analysis. This is the primary suggested usage of Cluster Checker. There may be reasons to perform data collection and analysis separately by running clck-collect or clck-analyze. All of these commands can be configured using command line options and the configuration file.

Environment Variables

Many of the options for configuring Intel® Cluster Checker, both below and in other sections of this guide, have environment variable counterparts that can also be set. Examples of environment variables used by Intel® Cluster Checker are:
  • CLCK_ROOT, which is set when clckvars.sh/csh is run. This variable points to the tool's top-level directory.
  • CLCK_SHARED_TEMP_DIR, which is commonly set when running the tool with a higher privilege level. This variable sets the path to a temporary directory that can be accessed by all the nodes.
  • CLCK_COLLECT_NODE_DISCOVER_COMMAND, which can be used to configure how Cluster Checker discovers available nodes. If unset, the default command is:
    scontrol show hostnames
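As a sketch, these variables can be exported in the shell before launching the tool. The paths and values below are illustrative assumptions, not product defaults:

```shell
# Illustrative values -- substitute paths appropriate to your cluster.
export CLCK_SHARED_TEMP_DIR=/shared/tmp/clck   # must be reachable from every node
export CLCK_COLLECT_NODE_DISCOVER_COMMAND="scontrol show hostnames"
echo "Shared temp dir: $CLCK_SHARED_TEMP_DIR"
```

CLCK_ROOT itself is normally set for you when the clckvars.sh/csh script shipped with the product is run, rather than exported by hand.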

Command Line Options

There are three ways of running Intel® Cluster Checker from the command line.
  • Collection of data only: clck-collect
  • Analysis of existing data: clck-analyze
  • Combined collection and analysis: clck.
-c / --config=FILE
: Specifies the full path to the configuration file. The default configuration file is CLCK_ROOT/etc/clck.xml.
-C / --re-collect-data
: Deprecated; collects full data for a run.
-D / --db=FILE
: Uses the specified file name for the database file (multiple allowed).
-e / --environment-propagation-off
: Turns off environment variable propagation for select collector extensions.
-f / --nodefile=FILE
: Specifies a nodefile containing the list of nodes, one per line. See The Nodefile. If a nodefile is not specified for clck or clck-collect, a Slurm query will be used to determine the available nodes. If no nodefile is specified for clck-analyze, the nodes already present in the database will be used.
-F / --fwd=FILE
: Specifies a framework definition. See the framework definition section in the Reference for more details. If a framework definition is not specified, the health framework definition is used. This option can be used multiple times to specify multiple framework definitions. To see a list of available framework definitions, use the command line option -X list.
-g / --groupfile=FILE
: Specifies the XML file detailing specific node groups for clck-analyze.
-h / --help
: Displays the help message.
-i / --collect-info
: Prints the available default collect extensions that can be used.
-l / --log-level=LEVEL
: Specifies the output level. Recognized values are (in increasing order of verbosity): alert, critical, error, warning, notice, info, and debug. The default log level is error.
-n / --node-include=NODE
: Filters the analyzer output to show only the specified node's issues (multiple allowed).
-o / --logfile=FILE
: Specifies the path/name to write output to. The default is clck_results.log.
-p / --collect-method=FILE
: Specifies the path/name of the collect extension library to use.
-r / --permutations=NUMBER
: Specifies the number of permutations of nodes to use when running cluster data providers. By default, one permutation will run. This option only applies to data collection.
-S / --ignore-subclusters
: Ignores the subcluster annotations in the nodefile. This option only applies to data collection.
-t / --threshold-too-old=NUMBER
: Sets the minimum number of days since collection that will trigger a "data too old" error. This option only applies to data analysis. The default is 7.
-v / --version
: Returns version information.
-X / --FWD_description=FWD
: Prints a description of the given framework definition, if available. If the value passed is "list", it instead prints a list of all found framework definitions.
-z / --fail-level=LEVEL
: Specifies the lowest severity level at which found issues fail. Recognized values are (in increasing order of severity): informational, warning, and critical. The default level at which issues cause a failure is warning.
--sort-asc=FIELD
: Organizes the output in ascending order of the specified field. Recognized values are "id", "node", and "severity".
--sort-desc=FIELD
: Organizes the output in descending order of the specified field. Recognized values are "id", "node", and "severity".
The clck command line accepts multiple option inputs on a single command line, but it does not accept comma-separated input. To pass multiple values for an option, repeat the flag. For example, if the framework definition flag -F needs to be used multiple times, the command line would look like the following:
clck -F health_user -F opa_user -F mpi_prereq_user

The Configuration File

Intel® Cluster Checker provides a main configuration file in XML format to allow for more detailed configuration. Settings made on the fly with the command line options above can be set and saved in a config file, along with more complicated options such as:
  • suppressing certain types of output
  • setting output format overrides
  • setting network interfaces
  • various collect and analysis environment variables
The configuration file is located at /opt/intel/oneapi/clck/<version>/etc/clck.xml. Intel® Cluster Checker uses this file by default, or you can pass your own config file using the '-c' command line option.
Configuring the Database
You can specify a datastore configuration file in the main configuration file using the tags:
<datastore_extensions>
  <group path="datastore/intel64/">
    <entry config_file="default_sqlite.xml">libsqlite.so</entry>
  </group>
</datastore_extensions>
To use ODBC instead of SQLite, enter libodbc.so instead of libsqlite.so. Multiple entry tags allow you to specify multiple databases through multiple datastore configuration files.
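As a sketch, a group with two entries might look like the following; the second config file name (my_odbc.xml) is a hypothetical example, not a file shipped with the product:

```xml
<datastore_extensions>
  <group path="datastore/intel64/">
    <!-- default SQLite datastore -->
    <entry config_file="default_sqlite.xml">libsqlite.so</entry>
    <!-- hypothetical second datastore using ODBC -->
    <entry config_file="my_odbc.xml">libodbc.so</entry>
  </group>
</datastore_extensions>
```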
The datastore configuration file, by default, is located at /opt/intel/oneapi/clck/<version>/etc/datastore/default_sqlite.xml and takes the following format:
<configuration>
  <instance_name>clck_default</instance_name>
  <source_parameters>read_only=false|source=$HOME/.clck/<version>/clck.db</source_parameters>
  <type>sqlite3</type>
  <source_types>data</source_types>
</configuration>
The ‘instance_name’ tag defines a database source name. This value must be unique.
The ‘source_parameters’ tag determines whether or not to open the database in read-only mode and indicates which database to use.
The ‘type’ tag specifies what type of database to use. Currently, the only accepted value is ‘sqlite3’.
The ‘source_types’ tag specifies what source type to use. Currently, the only accepted value is ‘data’.
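Putting those tags together, a custom datastore configuration that opens an existing database read-only might look like the following sketch; the instance name and source path (/shared/clck/clck.db) are illustrative assumptions:

```xml
<configuration>
  <instance_name>clck_shared_readonly</instance_name>
  <!-- open the database read-only; the path is an illustrative assumption -->
  <source_parameters>read_only=true|source=/shared/clck/clck.db</source_parameters>
  <type>sqlite3</type>
  <source_types>data</source_types>
</configuration>
```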
Configuring the Default Framework Definition
You can specify a default framework definition in the main configuration file using the tags:
<framework_definitions>
  <framework_definition>clock</framework_definition>
  <framework_definition>health_base</framework_definition>
</framework_definitions>
The ‘framework_definition’ tag defines what framework(s) will be run by default. If this is not present, the default framework definition will be ‘health_base’. If the ‘-F’ option is used, it will override the default list of frameworks run.

Using Framework Definitions

Intel® Cluster Checker checks can be divided into categories. Related sets of providers and their analyzer counterparts are bundled in config files called ‘framework definitions’. A few examples of the areas these bundles can cover are:
  • hardware
  • software
  • networking and fabrics
  • performance
  • memory
These bundles can be run from the command line with the -F <framework name> option. More on framework definitions can be found in the framework definition section of the User Guide. To see a full list of available framework definitions, issue the following command:
clck -X list

The Nodefile

As demonstrated earlier in the Getting Started section, Intel® Cluster Checker can be pointed at a specific set of nodes in various ways. By default, a nodefile, passed with the -f <nodefile name> option, is required to run Intel® Cluster Checker. If launching Cluster Checker through Slurm, no nodefile is required; Slurm will provide Cluster Checker with the assigned node list.
The nodefile format is a single server name per line. The server name must be resolvable and should be the value returned by the hostname command on that node. This means the node is pingable and accessible by the given name; for example, ping node1 would correctly ping a server named node1.
Note:
The use of the name ‘localhost’ in the nodefile is not currently supported; instead, use the server’s resolvable hostname.
Nodefiles can also be used to define what role a node plays within a cluster (head, compute, login, etc.). This is not a requirement, but can be helpful if your cluster has differently configured servers. Providers can be configured to act differently on, or ignore, nodes with different roles. For example, on a cluster with a login or head node distinct from the compute nodes, marking that node with a non-compute role would exclude it from benchmarking that it is not set up to handle.
More on node roles can be found in the Data Collection section of the User Guide.
Similarly to node roles, nodes can also be part of a subcluster. This is set in the nodefile with a syntax similar to setting roles.
More on subclustering can be found in the subcluster section of the Data Collection.
Example nodefile:
head #role: head
login #role: login role: compute
node1 #role: compute subcluster: A
node2 #role: compute subcluster: A
node3 #role: enhanced subcluster: A
node4 #role: enhanced subcluster: A
node5 #role: compute subcluster: B
node6 #role: compute subcluster: B
node7 #role: enhanced subcluster: B
node8 #role: enhanced subcluster: B
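Because the role annotations follow a fixed "#role:" pattern, a nodefile in this format can also be filtered with standard shell tools. The snippet below is an illustrative sketch, not part of the product, and uses a smaller hypothetical nodefile:

```shell
# Write a small nodefile in the format shown above (illustrative).
cat > nodefile <<'EOF'
head #role: head
node1 #role: compute subcluster: A
node2 #role: compute subcluster: B
node3 #role: enhanced subcluster: B
EOF

# Print the names of nodes annotated with the compute role.
grep '#role: compute' nodefile | awk '{print $1}'
```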
Example running Intel® Cluster Checker with a nodefile:
clck -f ./nodefile -F health_extended_user -o cluster_health_extended_user.log

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.