Autonomic systems [5, 6] represent an approach to managing complexity by making individual system nodes and components self-managing.
Such systems manage themselves by monitoring their operation, detecting any anomalies, and then adjusting accordingly to achieve normal
operation.
Autonomic Computing (AC), also referred to by other terms such as Organic Computing and Self-Managed Systems, is inspired by the human
autonomic nervous system, which handles the complexities and uncertainties of the human body without requiring conscious effort.
AC emerged as a new strategic and holistic approach to the design of complex distributed computer systems, aiming at realizing computing
systems and applications capable of managing themselves with minimum human intervention. In [5, 6] IBM researchers defined autonomic
managers as being responsible for monitoring a managed element, and creating and executing plans based upon analysis of the data and
knowledge they have acquired. Thus, autonomic systems are represented in closed-loop form consisting of four stages: monitor, analyze,
plan, and execute, as shown in Figure 1.

Figure 1: A typical abstract autonomic framework
The autonomic manager monitors the managed resources it controls and analyzes the data received. Based upon the data received and
analyzed, the autonomic manager constructs and executes plans to achieve the management goals in accordance with the policies and rules
in effect. It accumulates and uses knowledge (policies and rules) from observations, past experiences, and updates received from its
peers and other components of the management hierarchy, such as management consoles. The manageability interface between the autonomic
manager and the managed resources allows the autonomic manager to "sense" data from the managed resources and to "effect" the desired
actions.
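The closed loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the IBM reference architecture: the class name `AutonomicManager`, the metric names, the threshold-style policy, and the `reduce:<metric>` action strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AutonomicManager:
    """Hypothetical manager that runs one monitor-analyze-plan-execute cycle."""
    sense: Callable[[], Dict[str, float]]   # "sense" side of the manageability interface
    effect: Callable[[str], None]           # "effect" side of the manageability interface
    # Knowledge: a policy mapping each metric name to its maximum allowed value (assumed form).
    policy: Dict[str, float] = field(default_factory=dict)
    history: List[Dict[str, float]] = field(default_factory=list)

    def monitor(self) -> Dict[str, float]:
        reading = self.sense()
        self.history.append(reading)        # accumulate knowledge from observations
        return reading

    def analyze(self, reading: Dict[str, float]) -> List[str]:
        # Report every metric that exceeds its policy limit.
        return [m for m, limit in self.policy.items() if reading.get(m, 0.0) > limit]

    def plan(self, violations: List[str]) -> List[str]:
        # One corrective action per violated metric (hypothetical action names).
        return [f"reduce:{m}" for m in violations]

    def execute(self, actions: List[str]) -> None:
        for action in actions:
            self.effect(action)

    def run_once(self) -> List[str]:
        actions = self.plan(self.analyze(self.monitor()))
        self.execute(actions)
        return actions


# Usage with a fake managed resource that reports an over-temperature reading.
applied: List[str] = []
manager = AutonomicManager(
    sense=lambda: {"cpu_temp_c": 91.0, "fan_rpm": 2000.0},
    effect=applied.append,
    policy={"cpu_temp_c": 85.0},
)
actions = manager.run_once()
```

In this sketch the sensor and effector are passed in as plain callables, so the same loop can be pointed at a real resource or at a test stub.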
Self-management attributes
The following four underlying attributes were originally defined to constitute self-management.
- Self-configuring is a system's ability to change its configuration automatically in reaction to runtime changes or to assist in
  self-healing, self-optimization, and/or self-protection. For example, if a hard memory error is detected on a memory bank, a platform
  can isolate the particular memory bank allowing the system to continue functioning until the faulty part can be replaced.
- Self-healing is the ability of a platform to effectively recover when a fault occurs. Self-healing can be either reactive or
  proactive. A reactive self-healing platform attempts to correct or isolate a fault once it has occurred. A proactive self-healing
  platform attempts to predict whether a fault may occur and takes appropriate action to ensure the health of the system is maintained.
- Self-optimization is the ability of the system to optimize its operations based on a given operation profile. This involves
  monitoring its operation and optimizing accordingly, given a set of policies. It may also react to dynamic policy changes within the
  platform as indicated by a user.
- Self-protecting is the ability of a system to defend itself from accidental or malicious external attacks by being aware of
  potential threats and being able to handle those threats.
The above list of self-management attributes has grown steadily and now also covers features such as self-anticipating,
self-adapting, self-critical, self-destructing, self-diagnosing, and self-recovering. We expect that the list will continue to grow.
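The difference between reactive and proactive self-healing can be made concrete with the memory-bank example. The sketch below is hypothetical: the `MemoryBank` fields, the correctable-error counter, and the limit of 10 are assumptions chosen for illustration, not values from any real platform.

```python
from dataclasses import dataclass


@dataclass
class MemoryBank:
    bank_id: int
    correctable_errors: int = 0   # corrected soft errors seen so far (assumed counter)
    hard_error: bool = False      # an uncorrectable fault has been detected
    online: bool = True


def self_heal(banks, soft_error_limit=10):
    """Isolate banks reactively (hard error) or proactively (rising error count)."""
    log = []
    for bank in banks:
        if not bank.online:
            continue
        if bank.hard_error:
            # Reactive: the fault has already occurred, so isolate the bank.
            bank.online = False
            log.append((bank.bank_id, "reactive"))
        elif bank.correctable_errors >= soft_error_limit:
            # Proactive: many corrected errors suggest a fault may be coming.
            bank.online = False
            log.append((bank.bank_id, "proactive"))
    return log


banks = [MemoryBank(0), MemoryBank(1, hard_error=True), MemoryBank(2, correctable_errors=25)]
log = self_heal(banks)
online = [b.bank_id for b in banks if b.online]
```

Taking a bank offline is itself a configuration change, which is why self-healing and self-configuring are hard to separate in practice.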
For a system to meet the self-management objectives, it must be aware of its internal state (self-awareness) and current external
environment (environment-awareness) [7]. Self-awareness and environment-awareness are achieved through the ability of a platform to
collect raw internal (self-monitoring) and external data (environment monitoring) that is used for the following:
- Data aggregation. This automatically transforms raw data gathered over time into information upon which predictions, actions,
  and strategies are based.
- Data analysis. This is the analysis of raw and aggregated data that is used to aid in self-healing, self-protection, and
  self-optimization.
As a platform monitors its internal and external environment, changes may be detected that require the platform to adjust itself
accordingly (self-adjusting).
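As a rough illustration of aggregation and analysis working together, the following sketch keeps a sliding window of raw samples, summarizes it, and flags a new sample that deviates strongly from the window. The class name `MetricAggregator`, the window size, and the three-sigma rule are assumptions made for the example; a real platform would choose its own statistics.

```python
from collections import deque
from statistics import mean, pstdev


class MetricAggregator:
    """Hypothetical sliding-window aggregator with a simple deviation check."""

    def __init__(self, window=5, threshold=3.0):
        self.samples = deque(maxlen=window)   # raw data gathered over time
        self.threshold = threshold            # allowed deviation, in standard deviations

    def add(self, value):
        """Store a sample; return True if it deviates strongly from a full window."""
        anomalous = False
        if len(self.samples) == self.samples.maxlen:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.samples.append(value)
        return anomalous

    def summary(self):
        """Aggregate the current window into information a planner could use."""
        return {"mean": mean(self.samples), "min": min(self.samples), "max": max(self.samples)}


agg = MetricAggregator(window=5)
readings = [50, 51, 49, 50, 52, 90]           # made-up utilization samples
flags = [agg.add(r) for r in readings]
stats = agg.summary()
```

Only the last reading is flagged here, because the check is skipped until the window is full.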
The self-management features of an autonomic platform are dependent on one another. Using our memory example again, if memory fails in
the platform, the platform will need to take corrective action (self-healing and self-configuration) and then optimize itself in an
attempt to continue to meet its service-level objectives (SLOs) (self-optimization). To illustrate the dependencies between the
self-management features, a more representative
taxonomy of an autonomic system is shown in Figure 2. The figure shows the enabling technologies (e.g., virtualization and automation)
that are required to build a self-managing system (as represented by the disc) with capabilities (represented as petals) such as
self-optimizing, self-healing, and self-protection. The figure also illustrates the environment and resources that an autonomic system
needs to comprehend, such as the hardware, software, workload, and business requirements. Note that the petals protrude outside the
self-managed disc to indicate that systems cannot internally self-manage all possible scenarios: there will always be a need for outside
intervention, since the self-managed object may be part of a larger self-managed construct. Petals also overlap to indicate that a
change in one self-management function may impact one or more other self-management functions. For example, a self-healing action may
need to be followed by a self-optimizing action for best performance, as discussed in the memory example.

Figure 2: Taxonomy of an autonomic system
Autonomic computing environment
In an AC environment, there is often an inherent assumption that actions are purely based on predictive approaches. Although predictive
technologies are an important evolutionary step in building better autonomic platforms, it is possible to have an autonomic platform
that is reactive in nature. In other words, its autonomic behavior is driven by enforcing policies reacting to environmental monitoring
data already collected. This approach to AC is the most common form found today.
Reactive autonomic platforms are, however, limited in their ability to achieve a high level of self-awareness and therefore may not
fully achieve the desired AC operation. In other words, the given policies limit the effectiveness of the AC capabilities.
Proactive AC environments rely on Artificial Intelligence (AI) techniques such as machine learning to analyze collected data and
predict operational anomalies before they occur. This approach brings us closer to the full AC vision. However, it does not
come for free: predictive techniques require additional resources, resulting in increased complexity.
An ideal AC environment would incorporate a combination of both proactive and reactive approaches. The ability of a platform to take
proactive actions needs to be controlled by policies that describe not only what autonomic actions may or may not be taken, but also
what the desired normal state of the platform should be.
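One way to picture such a policy-controlled combination is sketched below. Everything in it is an assumption for illustration: the `Policy` fields, the "throttle" action, and especially the naive linear forecast, which stands in for whatever predictive technique a real platform would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    normal_range: tuple          # desired normal state, here (low, high) utilization
    allowed_actions: frozenset   # autonomic actions the platform may take


def linear_forecast(history, steps_ahead=1):
    """Naive stand-in predictor: extrapolate the most recent change."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + steps_ahead * (history[-1] - history[-2])


def decide(history, policy, action="throttle"):
    """Return (mode, action), or None when no action is needed or permitted."""
    if action not in policy.allowed_actions:
        return None                          # the policy forbids this action
    low, high = policy.normal_range
    if not low <= history[-1] <= high:
        return ("reactive", action)          # already outside the normal state
    if not low <= linear_forecast(history) <= high:
        return ("proactive", action)         # predicted to leave the normal state
    return None


policy = Policy(normal_range=(0.0, 0.8), allowed_actions=frozenset({"throttle"}))
reactive = decide([0.5, 0.9], policy)        # current value already too high
proactive = decide([0.6, 0.75], policy)      # trend points above 0.8
steady = decide([0.5, 0.5], policy)          # within range and stable
```

The reactive check runs first, so a platform that is already out of its normal state never waits on a prediction.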
When autonomic actions are based on local information and limited environmental information, the ability to make accurate decisions may
be limited. This may result in a high number of incorrectly identified conditions. In today's enterprise management solutions, for
example, information collected by each platform is consolidated in a management console, which can draw on data from multiple sources
to make a more informed and accurate decision about what has happened, or may have happened, on a particular platform.
Group interactions
When designing an autonomic platform, it is critical that it be able to interact with other platforms in the environment, thereby
increasing its own knowledge by using the collective intelligence of other autonomic platforms. The additional information aids in the
decision process allowing the platform to make decisions based upon local data and data collected from other platforms. Two examples
illustrate the importance of distributed intelligence.
Data centers rely on Network Intrusion Detection Systems (NIDS) to analyze network traffic for patterns and hints in data to determine
whether potential threats exist. NIDS is effective only after it examines a sufficient portion of the network traffic. A Host-based
Intrusion Detection System (HIDS), on the other hand, can only make decisions based on what it sees, thus relying on knowledge in the
form of signatures and heuristic patterns that it has been provided a priori. Extending a HIDS to leverage other HIDS instances on the
network increases its reliability. The first example is described in the paper "Towards Autonomic Enterprise Security: Self-Defending
Platforms, Distributed Detection, and Adaptive Feedback" [8] in this issue of the Intel® Technology Journal. The authors describe the
three building blocks for an end-to-end autonomic enterprise, namely detection, self-defense, and adaptive policy management. They
explain how Intel technologies such as Intel® AMT and Intel-led standards initiatives are used to create a self-defending platform and
how distributed intelligence can be used to defend the enterprise by corroborating the likelihood of infection.
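The idea of corroboration can be sketched with a simple quorum rule: a host confirms its own alert only if enough of its peers report the same indicator. This is not the method of the paper cited above, whose details are not given here; the function name, the peer-report format, and the 50% quorum are assumptions for illustration.

```python
def corroborate(local_alert, peer_alerts, quorum=0.5):
    """Confirm a local alert only if at least `quorum` of the peers agree.

    peer_alerts maps a peer name to True if that peer saw the same indicator.
    """
    if not local_alert:
        return False
    if not peer_alerts:
        # No peers to consult; this sketch stays conservative (an assumption).
        return False
    agreeing = sum(1 for seen in peer_alerts.values() if seen)
    return agreeing / len(peer_alerts) >= quorum


confirmed = corroborate(True, {"host-a": True, "host-b": True, "host-c": False})
rejected = corroborate(True, {"host-a": False, "host-b": False, "host-c": True})
```

A real deployment would also have to weigh how much each peer can be trusted, which a plain vote ignores.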
The second example deals with self-optimization. Autonomic systems maintain operations when system conditions vary by monitoring their
state and creating and executing plans based upon analysis of the data and on the knowledge they have acquired. A critical component of
autonomic systems is their ability to handle uncertainty in the perceived state and the effect of their actions. The article "Machine
Learning for Adaptive Power Management" [9], also in this issue of the Intel® Technology Journal, describes how Partially Observable
Markov Decision Processes (POMDPs) are used for modeling autonomic systems and how a specific POMDP model is used for an adaptive power
management system in a laptop.
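The way a POMDP handles uncertainty in the perceived state can be illustrated with the standard belief update, in which the new belief in state s' is proportional to O(o | s', a) times the sum over s of T(s' | s, a) b(s). The sketch below implements that update in plain Python. The two user states, the single action, and all probabilities are made up for the example and are not taken from the cited article.

```python
def belief_update(belief, transition, observation, action, obs):
    """One Bayes-filter step over a discrete POMDP belief.

    belief[s]                      : current probability of state s
    transition[action][s][s_next]  : T(s_next | s, action)
    observation[action][s_next][o] : O(o | s_next, action)
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(transition[action][s][s_next] * belief[s] for s in states)
        unnormalized[s_next] = observation[action][s_next][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical power-management model: is the user idle or active?
T = {"stay_on": {"idle": {"idle": 0.9, "active": 0.1},
                 "active": {"idle": 0.2, "active": 0.8}}}
O = {"stay_on": {"idle": {"no_input": 0.95, "input": 0.05},
                 "active": {"no_input": 0.3, "input": 0.7}}}
prior = {"idle": 0.5, "active": 0.5}
posterior = belief_update(prior, T, O, action="stay_on", obs="no_input")
```

After observing no input, the belief shifts toward the idle state, which is the kind of evidence a power manager could use before deciding to dim the display or suspend.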