User Guide

  • 2021.1
  • 01/08/2021
  • Public Content

Knowledge Base

Intel® Cluster Checker is an expert system. A classic definition of an expert system is “an intelligent computer program that uses knowledge and inference procedures to solve problems that are difficult enough to require significant human expertise for their solutions” (Edward A. Feigenbaum, “Knowledge Engineering in the 1980s”, Stanford University Computer Science Department, 1982). The problem that Intel® Cluster Checker solves is diagnosing system-level issues with Beowulf-style clusters.
Two main elements are called out in this definition: knowledge and inference procedures. Knowledge comes from two sources: observations about an actual system and if/then rules that encapsulate human expertise. Intel® Cluster Checker relies on data providers to make observations about the cluster and saves the result in a database (see Data Providers).
Expert systems differ from typical procedural programs in that there is not a fixed order of execution. The order is logically inferred, using one of several common schemes such as the Rete algorithm, by dynamically analyzing the interdependence of rules and facts.
One limitation of expert systems is that they are only as good as the knowledge (rules) they contain. To remain relevant, a knowledge base needs to continually grow and change as new and variant cases are uncovered. This chapter addresses how to express human expertise as knowledge base rules to extend the diagnostic capabilities of Intel® Cluster Checker. Please consider contributing extensions to the Intel® Cluster Checker team so that other users can also benefit.
The C Language Integrated Production System (CLIPS) was originally created in the 1980s at NASA’s Johnson Space Center (Joseph Giarratano and Gary Riley, Expert Systems: Principles and Programming, Thomson Course Technology, 2005). CLIPS is an expert system shell that combines an inference engine with a language for representing knowledge. Like many AI environments, the CLIPS language is very similar to LISP.
More recently, CLIPS added object-oriented capabilities. Intel® Cluster Checker is based on the CLIPS Object Oriented Language (COOL).
The CLIPS User’s Guide is an excellent introduction to CLIPS.
Knowledge Base Overview
Key Concepts
Signs are one of the core elements of the knowledge base. A sign is an objective observation of the system. For example, if one node in a system has an amount of memory differing from the rest of the system, a sign will appear indicating that the memory is not uniform. Signs, then, do not infer anything about the cluster. Rather, they indicate some observation of an issue based on the data collected. In the Intel® Cluster Checker output, signs generated from CLIPS are referred to as “observations”.
All the varieties of signs have a slot that represents a state diagram, where a sign is first initialized, then transitions to the observed state when a rule is run, and finally becomes diagnosed if the sign is used to make a diagnosis (see Diagnoses below).
slot contains a value that ranges from 0 to 100. These values map to one of three severity levels - informational, warning, and critical. Informational observations map to the severity of 0-24 and provide additional information or minor issues about the system. An informational observation indicates that the cluster is fully functional but may have minor performance issues or not conform to best practices. Warning observations map to a range of 25-74 and indicate that the cluster is essentially functional but has performance issues and/or a non-core capability has functionality issues or is missing. Critical observations map to a range of 75-100 and indicate that a core cluster capability is non-functional or missing. The most severe critical observations indicate that a cluster component may irreparably fail if not addressed immediately. The rule that sets the sign (that is, transitions it into the observed state) sets the severity, and the sign will appear as an observation in the output.
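The mapping from a numeric severity to one of the three levels can be sketched in Python. This is only an illustration of the ranges quoted above; the function name severity_level is hypothetical and is not part of Intel® Cluster Checker.

```python
def severity_level(severity: int) -> str:
    """Map a 0-100 severity value to the level named in the text."""
    if not 0 <= severity <= 100:
        raise ValueError(f"severity must be in 0..100, got {severity}")
    if severity <= 24:
        return "informational"  # 0-24
    if severity <= 74:
        return "warning"        # 25-74
    return "critical"           # 75-100


# Show the level at each boundary of the three ranges.
for value in (0, 24, 25, 74, 75, 100):
    print(value, severity_level(value))
```

A rule author could use the same boundaries as a quick check when choosing a severity value for a new sign.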
Every sign also has an id slot that corresponds to a message catalog key (
). The message catalog contains a string that describes the sign. Typically the string is a single sentence, but it may be longer. By convention, the
value should be the same as the name of the rule that created it. The
value is also used to look up the sign when making diagnoses.
Finally, the args slot contains variable values to be inserted into the message catalog string. Together, the
slots are roughly analogous to the C printf family of functions. The message catalog can be extended by simply adding new entries.
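The printf analogy can be made concrete with a small Python sketch. It assumes a hypothetical in-memory catalog (a plain dictionary) and made-up message strings; the real message catalog is a separate file whose format is not shown in this section.

```python
# Hypothetical catalog: message key -> printf-style template.
# The message texts below are invented for illustration.
CATALOG = {
    "duck-less-than-three-quacks": "The node rated only %d quack(s).",
    "duck-honking": "The duck is honking.",
}


def render(sign_id, args=()):
    """Look up the template by id and substitute the args, printf style."""
    template = CATALOG[sign_id]
    return template % tuple(args)


print(render("duck-less-than-three-quacks", [2]))
print(render("duck-honking"))
```

Extending the catalog in this sketch amounts to adding another key and template, which mirrors the statement that new entries can simply be added.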
Intel® Cluster Checker uses signs to make inferences about the cluster, resulting in diagnoses. Diagnoses are based on one or more signs. Using the non-uniform memory example from above, the non-uniform memory sign will result in a non-uniform hardware diagnosis.
Diagnoses are made based on the value of signs. A diagnosis is also defined by a rule. Whenever a diagnosis is made, the signs used to make the diagnosis should be transitioned to the diagnosed state. This is important because signs that are not used to make a diagnosis (that is, left in the observed state) will be printed out as undiagnosed signs. Undiagnosed signs indicate that an issue was found, but Intel® Cluster Checker was unable to infer anything based on it. Not all signs will result in diagnoses.
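The life cycle described above (initialized, observed, diagnosed) and the reporting of undiagnosed signs can be sketched as follows. This is a simplified Python model, not the CLIPS implementation; the Sign class and the sign ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sign:
    id: str
    node_id: str
    state: str = "initialized"  # initialized -> observed -> diagnosed


def undiagnosed(signs):
    """Signs left in the observed state are reported as undiagnosed."""
    return [s for s in signs if s.state == "observed"]


signs = [
    Sign("memory-not-uniform", "node01"),   # hypothetical id
    Sign("unexpected-kernel", "node02"),    # hypothetical id
]

# Rules fire and set both signs.
for sign in signs:
    sign.state = "observed"

# A diagnosis rule uses only the first sign and marks it as diagnosed.
signs[0].state = "diagnosed"

print([s.id for s in undiagnosed(signs)])
```

In this model, forgetting to transition a sign used by a diagnosis would leave it in the observed state, so it would be printed both as part of the diagnosis and as an undiagnosed sign.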
Similar to signs, diagnoses have
, and
slots. The severity slot will typically be a composite of the signs used to reach the diagnosis. For example, a diagnosis based on one sign reached with low confidence and another sign reached with high confidence should probably have a low to intermediate confidence value, depending on the particular case.
Remedies provide actionable steps to resolve an issue, such as changing the permissions on a file or rebooting a node. Remedies are specified using two optional sign slots,
. Similar to
corresponds to a message catalog key (
) and
contains variable values to be inserted into the message catalog string. If the
slot is empty, then no remedy is displayed.
Basic Implementation
CLIPS classes are roughly analogous to C structures or C++ classes. Slots are to classes as member variables are to C structures. A slot typically has some attributes, or defined facets, such as the type, default value, etc. See the CLIPS documentation for more information about facets. The slots are populated with information from the database through analyzer extensions.
The class definition for the Duck example follows and can also be found at
in the SDK Duck Sample* at
(defclass DUCK
  "This class corresponds to the 'duck' node rating tool."
  (is-a BASE_NODE BASE_TIMESTAMP DATABASE MULTISET)
  (role concrete)
  (pattern-match reactive)
  (slot count
    (type INTEGER)
    (default 1))
  (slot sound
    (type SYMBOL)
    (allowed-values honk quack)
    (default honk)))
In addition to the explicitly defined slots, the
class inherits slots from its base classes. For instance, the
slot, which corresponds to a unique node identifier, is inherited from
class. If the class represents a property of multiple nodes, such as the network performance between a pair of nodes, it would instead inherit from the
base classes (
For each class, there is typically a corresponding rule file. For instance, the DUCK class is defined in the file
, and the corresponding rules are defined in the file
in the SDK Duck Sample*. Based on the data contained in the instances and potentially other information such as the hardware configuration of a node, a rule creates one or more signs or diagnoses.
A CLIPS rule has a left-hand side (LHS) and a right-hand side (RHS), separated by the => token. The LHS is the set of conditions (the “if” part) that describe when the rule should fire. The RHS contains the action (the “then” part) that should be performed when the LHS conditions are met. Typically the action is to create a sign or diagnosis.
Several varieties of signs are provided (
    represents quantities that are either true or false. For example, a process is either in the zombie state or not.
    represents quantities that correspond to a count of something. For example, the number of network retries.
    represents a measure of performance that is either normal, substandard, or invalid. For example, the measured floating point performance meets expectations for the hardware configuration, does not meet expectations, or is an invalid value (such as negative or not a number).
    is a general sign that can be used if one of the preceding specialized sign classes is not appropriate.
Note that in the output, all signs will be referred to as observations.
Organization and Directory Structure
The knowledge base is divided into several sub-components.
sub-directory contains the core data structures and message handlers used by the rest of the knowledge base. These files should typically not be modified.
The diagnostic knowledge is split between the
subdirectories. Class definitions are part of
while the logic defining signs and diagnoses is contained in the
sub-directory contains lists of hardware components and their properties, as well as the catalog of messages.
Functions that extend the base CLIPS functionality can be put in the
Framework definitions
use the
tag to load CLIPS files. For example, the cpu framework definition uses this tag to load the file cpu.clp, which contains the relevant rules.
Each sub-directory has a file named
. This file loads the rest of the files in the same sub-directory. If, for example, a new rules file is added, then it needs to be added to its corresponding
level load file to be enabled. The user-defined
file, located at the kb directory level, can contain the following:
(batch* classes/duck.clp)
(batch* rules/duck/load.clp)
contains the following:
(batch* duck-honking.clp)
(batch* duck-less-than-three-quacks.clp)
(batch* duck-stopping.clp)
Automatically Created Objects
object is automatically created for each node being checked. Each
object contains slots for the node architecture, roles, and subcluster membership. These slots may be used to restrict a rule to a particular type of node.
A single instance of the
class named
is automatically created and contains the input configuration parameters. The instance name
is reserved for this purpose and no other instances should use this name. This instance may be used to make the behavior of a rule user configurable.
class contains all user configurable options and is defined in
. A single instance of this class always exists with this reserved name. This class can be extended by adding new slots.
The slots of the
class form a global namespace, so slot names should be chosen with that consideration.
A simplified definition of the
class is as follows:
(defclass CONFIG
  (is-a USER)
  (role concrete)
  (pattern-match reactive)

  ; clck-checks is a list of connector extensions to be
  ; performed.
  (multislot clck-checks
    (type SYMBOL)
    (default (create$ all_to_all cpu dgemm environment ...)))

  ; The maximum allowable age of a data point, in seconds,
  ; before a data point is considered "too old". The
  ; default is 1 week.
  (slot data-age-threshold
    (type NUMBER)
    (default 604800))
  ...
To use the
class, a corresponding rule would add a single condition to the left hand side:
(defrule duck-data-is-too-old
  "Identify instances where the most recent DUCK data should be considered too old."
  ; IF the 'duck' connector extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?)
          (data-age-threshold ?age-threshold))
  ...
The values of the
slots should always have defaults and are configurable in the Intel® Cluster Checker config file.
The following construct can be used to set values for single slot variables.
<configuration>
  <analyzer>
    <config>
      <ssf-layer>core</ssf-layer>
    </config>
  </analyzer>
</configuration>
The following construct can be used to set values for multislot variables.
<configuration>
  <analyzer>
    <config>
      <clck-checks>
        <entry>PATTERN1</entry>
        <entry>PATTERN2</entry>
      </clck-checks>
    </config>
  </analyzer>
</configuration>
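The difference between the two constructs can be illustrated with a short Python sketch that reads both forms from one configuration fragment. It assumes, for illustration only, that an element with entry children is treated as a multislot and any other element as a single slot; the actual parsing is done by Intel® Cluster Checker and may differ. The function name read_config_slots is hypothetical.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """
<configuration>
  <analyzer>
    <config>
      <ssf-layer>core</ssf-layer>
      <clck-checks>
        <entry>PATTERN1</entry>
        <entry>PATTERN2</entry>
      </clck-checks>
    </config>
  </analyzer>
</configuration>
"""


def read_config_slots(xml_text):
    """Return {slot name: value}, using a list for multislot values."""
    root = ET.fromstring(xml_text)
    config = root.find("./analyzer/config")
    slots = {}
    for element in config:
        entries = element.findall("entry")
        if entries:
            # Multislot: one value per <entry> child.
            slots[element.tag] = [entry.text.strip() for entry in entries]
        else:
            # Single slot: the element text is the value.
            slots[element.tag] = (element.text or "").strip()
    return slots


slots = read_config_slots(CONFIG_XML)
print(slots)
```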
This section steps through the complete
knowledge base example. The source files are provided online in the SDK Duck Sample, specifically in the folder
Class Definition
Recall that the
command rates nodes on a scale from 1 to 5 quacks, or if there is an error during the evaluation, honks instead of quacks. So the key data elements that need to be included in the knowledge base are a node identifier, the sound (quack or honk), and the number of times the sound is repeated. The following is an example CLIPS class definition that includes all of these elements. In an actual distribution, it would be added to the knowledge base as
(defclass DUCK
  "This class corresponds to the 'duck' node rating tool."
  (is-a BASE_NODE BASE_TIMESTAMP DATABASE MULTISET)
  (role concrete)
  (pattern-match reactive)
  (slot count
    (type INTEGER)
    (default 1))
  (slot sound
    (type SYMBOL)
    (allowed-values honk quack)
    (default honk)))
slot is inherited from the
class, the
slot is inherited from the
class, and the timestamp slot is inherited from the
class. The
inheritance will be described with the uniformity rule.
With the class defined, the analyzer extension can now create instances based on the content of the database. Rules can now be defined to check the output.
Rule 1: Missing good output
In this example, the first rule creates a sign whenever the number of quacks is less than 3. In an actual distribution, the rule would be added to knowledge base as
(defrule duck-less-than-three-quacks
  "Create a sign whenever the number of 'quacks' is less than 3."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'quack'
  ?o <- (object (is-a DUCK) (count ?c) (node_id ?n)
                (sound quack))
  ; AND the number of quacks is less than 3
  (test (< ?c 3))
  =>
  ; THEN create a sign
  (make-instance of COUNTER_SIGN
    (node_id ?n)
    (confidence 90)
    (severity 50)
    (source ?o)
    (state observed)
    (value ?c)
    (id "duck-less-than-three-quacks")
    (args (create$ ?c))))
The LHS of this rule steps through a series of conditions.
An instance of the
class with the name
must exist with the
slot containing duck. In other words, only fire this rule if the duck analyzer extension is enabled.
object must exist where the
slot contains either
. In other words, only fire this rule for compute / enhanced nodes. As a side effect, the ‘?n’ variable is populated with the id of the node.
object must exist where the sound is quack and the node_id slot is same as the ?n value found in the prior condition. In other words, only fire this rule for nodes with both a
object and a
object. As a side effect, the
variable is populated with the number of quacks.
The number of quacks,
, must be less than 3.
If all four of these conditions are met, the rule will fire and execute the action on the right hand side. The rule is automatically evaluated by the inference engine for all possible combinations of objects, so each node is checked by this single rule.
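The effect of the four conditions can be approximated in Python with explicit loops over plain dictionaries. This is only a behavioral sketch: the CLIPS inference engine does not iterate this way, and the sample node and duck data are invented for illustration.

```python
nodes = [
    {"node_id": "node01", "role": ["compute"]},
    {"node_id": "node02", "role": ["storage"]},
    {"node_id": "node03", "role": ["enhanced"]},
]
ducks = [
    {"node_id": "node01", "sound": "quack", "count": 2},
    {"node_id": "node02", "sound": "quack", "count": 1},
    {"node_id": "node03", "sound": "quack", "count": 4},
]


def duck_less_than_three_quacks(enabled_checks, nodes, ducks):
    """Return one sign per (node, duck) pair that meets all four conditions."""
    signs = []
    # Condition 1: the duck check must be enabled.
    if "duck" not in enabled_checks:
        return signs
    for node in nodes:
        # Condition 2: the node role contains compute or enhanced.
        if not {"compute", "enhanced"} & set(node["role"]):
            continue
        for duck in ducks:
            # Condition 3: a quacking duck exists for the same node.
            if duck["node_id"] != node["node_id"] or duck["sound"] != "quack":
                continue
            # Condition 4: fewer than 3 quacks.
            if duck["count"] < 3:
                signs.append({
                    "id": "duck-less-than-three-quacks",
                    "node_id": node["node_id"],
                    "value": duck["count"],
                    "confidence": 90,
                    "severity": 50,
                })
    return signs


signs = duck_less_than_three_quacks(["cpu", "duck"], nodes, ducks)
print(signs)
```

With this sample data, node02 is skipped because it is a storage node and node03 is skipped because it has 4 quacks, so only node01 produces a sign.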
The severity level is arbitrary, and a more sophisticated rule might scale it depending on the number of quacks. For example, 1 quack might have a severity level of 75 while 2 quacks has a severity level of 50. See the tables in
for guidance on setting the severity level.
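The scaling suggested above (1 quack maps to 75, 2 quacks map to 50) could look like the following Python sketch. The function name is hypothetical, and treating any count below 2 as severity 75 is an assumption made here for completeness.

```python
def scaled_severity(quacks):
    """Return a severity for a low quack count, or None if no sign applies."""
    if quacks >= 3:
        return None  # the rule only fires for fewer than 3 quacks
    if quacks <= 1:
        return 75    # critical range
    return 50        # warning range


for count in (1, 2, 3):
    print(count, scaled_severity(count))
```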
A message catalog entry with the key duck-less-than-three-quacks would be added to clck/<version>/kb/data/msg_en.xmc in an actual distribution. An example message catalog entry is provided online in the SDK Duck Sample*, located at src/kb/data/msg_en.xmc.
Rule 2: Error case
A second rule should be added for the case where the duck honks, indicating a serious error. The overall construction of the rule is similar to the previous rule.
(defrule duck-honking
  "If the duck honks like a goose, something serious has happened."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'honk'
  ?o <- (object (is-a DUCK) (node_id ?n) (sound honk))
  =>
  ; THEN create a sign
  (make-instance of BOOLEAN_SIGN
    (node_id ?n)
    (confidence 100)
    (severity 100)
    (source ?o)
    (state observed)
    (value TRUE)
    (id "duck-honking")))
As above, a message catalog entry with the key
should be added.
Rule 3: Uniformity
Finally, a rule might be added to verify that all nodes have the same quack rating.
Usually the question of uniformity can be sufficiently answered by determining what fraction of nodes have the same / different value as a particular node. This approach avoids the combinatorial explosion of comparing every node to every other node and also avoids the problems associated with choosing a “reference” node. The
class is provided for determining uniformity. A multiset is similar to a set except it is a key / value pair where the value is the number of elements with the same key. For example, the set
corresponds to the multiset
class inherits from the
class. The
message handler, roughly analogous to a C++ constructor, must be added to automatically insert the key / value pair into the multiset when each
instance is created:
(defmessage-handler DUCK init after ()
  "Add MULTISET key / value pairs. Skip non-quacks."
  (if (eq ?self:sound quack)
   then
    (send ?self add (send ?self multiset-key) ?self:count)))

(defmessage-handler DUCK multiset-key ()
  "Generate a distinct key for each node architecture, role, and subcluster combination."
  ; defaults
  (bind ?architecture x86_64)
  (bind ?role compute)
  (bind ?subcluster default)
  (bind ?ins (find-instance ((?n NODE))
                            (eq ?n:node_id ?self:node_id)))
  (if (= (length ?ins) 1)
   then
    (bind ?i (nth$ 1 ?ins))
    (bind ?architecture (send ?i get-architecture))
    (bind ?subcluster (send ?i get-subcluster))
    (if (member$ compute (send ?i get-role))
     then
      (bind ?role compute)
     else
      (if (member$ enhanced (send ?i get-role))
       then
        (bind ?role enhanced))))
  (bind ?key (sym-cat (class ?self) + ?subcluster + ?role + ?architecture))
  (return ?key))
message handler creates distinct keys for each subcluster, node architecture, and node role. This is done to avoid comparing fundamentally different nodes. For example, do not compare compute nodes and storage nodes.
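The idea of a keyed multiset and the fraction query can be sketched in Python with collections.Counter. The Multiset class below is a stand-in written for this illustration and is not the Intel® Cluster Checker class; the key format mirrors the sym-cat call in the handler above, and the node data is invented.

```python
from collections import Counter, defaultdict


class Multiset:
    """Per-key counts of values, plus a fraction query."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def add(self, key, value):
        self._counts[key][value] += 1

    def fraction(self, key, value):
        """Fraction of elements under key that have the given value."""
        total = sum(self._counts[key].values())
        return self._counts[key][value] / total if total else 0.0


def multiset_key(class_name, subcluster, role, architecture):
    # Mirrors (sym-cat (class ?self) + ?subcluster + ?role + ?architecture).
    return "+".join((class_name, subcluster, role, architecture))


multiset = Multiset()
compute_key = multiset_key("DUCK", "default", "compute", "x86_64")
enhanced_key = multiset_key("DUCK", "default", "enhanced", "x86_64")

# Nine compute nodes rate 4 quacks, one rates 2; one enhanced node rates 2.
for quacks in [4] * 9 + [2]:
    multiset.add(compute_key, quacks)
multiset.add(enhanced_key, 2)

print(compute_key)
print(multiset.fraction(compute_key, 4))
print(multiset.fraction(enhanced_key, 2))
```

Because the enhanced node has its own key, its rating of 2 quacks is not compared against the compute nodes.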
The uniformity rule is:
(defrule duck-quack-count-is-not-consistent
  "Create a sign whenever the number of 'quacks' is not consistent."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND an instance of the DUCK class exists for a node with
  ; the same node_id and with the sound 'quack'
  ?o <- (object (is-a DUCK) (node_id ?n) (count ?c)
                (multiset_control TRUE) (sound quack))
  ; AND the fraction of nodes with the same quack count is
  ; less than 0.9
  (test (< (send ?o fraction (send ?o multiset-key) ?c) 0.9))
  =>
  (bind ?key (send ?o multiset-key))
  (bind ?fraction (- 1 (send ?o fraction ?key ?c)))
  (make-instance of BOOLEAN_SIGN
    (node_id ?n)
    (confidence (* 100 ?fraction))
    (severity 80)
    (state observed)
    (source ?o)
    (value TRUE)
    (id "duck-quack-count-is-not-consistent")
    (args (create$ (* 100 (send ?o fraction ?key ?c)) ?c))))
condition appears in this rule to guarantee that all values have been added to the multiset before attempting to activate the rule. It should be used in all rules that rely on a multiset value.
The final LHS condition decides that if at least 90% of nodes have the same value, then that value is treated as correct. This is an arbitrary threshold intended to minimize the number of false positives that get reported.
The RHS creates a temporary variable
that corresponds to the fraction of nodes that have a different number of quacks.
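The threshold test and the confidence calculation in this rule can be restated in Python. The sketch takes the fraction of nodes with the same quack count as input and returns the values the rule would place in the sign; the function and dictionary key names are hypothetical.

```python
THRESHOLD = 0.9  # same arbitrary threshold as in the rule


def uniformity_sign(fraction_same):
    """Return sign values if the node is in the minority, else None."""
    if not fraction_same < THRESHOLD:
        return None  # at least 90% agree, so no sign is created
    fraction_different = 1 - fraction_same
    return {
        "confidence": 100 * fraction_different,
        "severity": 80,
        "percent_same": 100 * fraction_same,  # first message argument
    }


print(uniformity_sign(0.9))
print(uniformity_sign(0.25))
```

A node whose value is shared by only a quarter of its peers therefore gets a sign with confidence 75, while a node that matches 90% or more of its peers gets no sign.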
Rule 4: Diagnosis
diagnostic tool does not lend itself to diagnosis. The quack rating scale is unambiguous, but it is a closely held trade secret of Waterfowl Industries, and additional information, such as why a node rated 2 quacks instead of 3 or why the duck honked, is not provided.
Diagnoses are typically made by combining one or more signs. For example, consider the combination of the proverbial black swan sign and the duck-honking sign to produce the diagnosis that the duck is honking because it is actually a black swan:
(defrule duck-duck-swan
  "Diagnose the root cause of the honking duck."
  ; IF the 'duck' analyzer extension is enabled
  (object (is-a CONFIG) (name [config])
          (clck-checks $? duck $?))
  ; AND a node instance with the role 'compute' or 'enhanced'
  ; exists
  (object (is-a NODE) (node_id ?n)
          (role $?role&:(member$ compute ?role)|:(member$ enhanced ?role)))
  ; AND a "duck-honking" sign exists for a node with the
  ; same node_id
  ?s1 <- (object (is-a SIGN) (node_id ?n) (id "duck-honking"))
  ; AND a "black-swan" sign exists for a node with the same
  ; node_id
  ?s2 <- (object (is-a SIGN) (node_id ?n) (id "black-swan"))
  =>
  ; THEN create a DIAGNOSIS and mark the signs as diagnosed
  (send ?s1 put-state diagnosed)
  (send ?s2 put-state diagnosed)
  (make-instance of DIAGNOSIS
    (node_id ?n)
    (confidence 20)
    (severity 100)
    (source (create$ (send ?s1 get-source) (send ?s2 get-source)))
    (id "duck-duck-swan")
    (remedy "duck-duck-swan-remedy")))
Note that this rule is not part of the included sample files.
Custom Rules for Framework Definitions
Framework definitions accept native or custom rule sets as long as they are specified as follows:
<configuration>
  <framework_definition>
    <kb_mods>
      <mod>duck.clp</mod>
    </kb_mods>
  </framework_definition>
</configuration>
contains pointers to the duck class and rule file(s). The following CLIPS file is an example of what
might contain:
(batch* classes/duck.clp)
(batch* rules/duck/load.clp)
If multiple
are specified for loading, they have to be located in the same folder, as only one kb mod’s path can be specified per Framework Definition. If no path is given, the default location is assumed
. No duplicate classes or rules should be loaded. Duplicates can result in duplicate diagnoses.
Developing with CLIPS
Intel® Cluster Checker requires the following in custom rules:
  • The rule is defined with “defrule”.
  • The ID of the sign or diagnosis is defined either literally (with double quotation marks) or using a variable (beginning with the ? symbol and using the bind keyword).
  • If a rule contains the “not” keyword (such as checking that the same sign does not already exist), it must not be used in the line that defines the sign/diagnosis ID.
  • The arrow (=>) defining the LHS and RHS of the rule must be on its own line.
  • In a diagnosis that can be triggered by multiple signs:
    • Each potential sign ID must be on its own line.
    • In an OR rule (in which many signs may contribute to the same diagnosis), the “|” symbol must appear after the sign ID.
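Two of these requirements (the use of defrule and the arrow on its own line) are easy to check mechanically. The following Python sketch is a hypothetical helper written for this guide, not a tool shipped with Intel® Cluster Checker. It is deliberately naive: it does not parse CLIPS, so a => inside a comment or string would also be flagged.

```python
def check_rule_text(rule_text):
    """Return a list of problems found in the text of a single rule."""
    problems = []
    lines = [line.strip() for line in rule_text.splitlines()]
    if not any(line.startswith("(defrule ") for line in lines):
        problems.append("rule is not defined with defrule")
    arrow_lines = [line for line in lines if "=>" in line]
    if not arrow_lines:
        problems.append("no => arrow found")
    elif any(line != "=>" for line in arrow_lines):
        problems.append("=> must be on its own line")
    return problems


GOOD_RULE = """(defrule duck-honking
  ?o <- (object (is-a DUCK) (sound honk))
  =>
  (make-instance of BOOLEAN_SIGN (id "duck-honking")))"""

BAD_RULE = '(defrule duck-honking (object (is-a DUCK)) => (printout t "x" crlf))'

print(check_rule_text(GOOD_RULE))
print(check_rule_text(BAD_RULE))
```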
Other than these requirements, coding style is largely a matter of personal preference. The following additional style guidelines are recommended:
  • Do not exceed 80 characters per line.
  • Generally use alphabetical order for any list of items.
  • Use all lower case, except for class names.
  • Use dashes rather than underscores or CamelCase.
  • Document all classes, functions, message-handlers, rules, etc. using the CLIPS comment field rather than ‘;’ style comments.
  • Use the same value for the rule name and sign / diagnosis id slot.
Debugging and Profiling
CLIPS includes several techniques to help better understand what it is doing.
One of the most useful debugging techniques is the watch capability (see section 13.2.3 in the CLIPS Basic Programming Guide).
CLIPS also includes a good profiling capability (see section 13.16 in the CLIPS Basic Programming Guide).
Additional debug and/or profile statements may be included, in which case, additional output will be displayed when running an analysis.
*Note: The
sample is not available for Intel® Cluster Checker 2019 Beta but will be available for future releases.
