|
So far we have experimented and described a direct approach to the problem of APM. We believe that significant performance improvements
might be possible by using stochastic process models of the domain. These models capture the temporal dynamics of user and system state
as well as the annoyance and power costs. Having learning and planning algorithms could potentially be beneficial to other domains as we
describe at the end of this section.
Model-Based Approach
The model-based approach decouples the decision-making process from the problem of learning and estimating the model of the environment.
Learning a model refers to the process of defining/discovering domain variables and how they relate to each other. Using the model
refers to the process of computing decisions given the model. Next we describe three models of increasing expressiveness as well as
increasing difficulty in learning them and using them.
Markov Decision Processes
The MDP model facilitates reasoning in domains where actions change the states stochastically and where there is usually a delayed
reward signal (e.g., winning in backgammon is attributed to possibly many moves along the play and not only to the final move). Figure 6
gives a graphical MDP representation.

Figure 6: This is a graphical representation of an MDP model. The left column represents what happens at time t and the right column
what happens at t+1. Arrows represent dependence.
In an MDP, there exists an underlying state that encompasses all the information needed to represent the world. This state is a summary
of the relevant systems and models. Given the state of the system the future is independent of the past. This means that the true state
captures all the information needed to describe the system. This last property is often referred to as the Markov property. The agent
can act in response to the observed state. The result of the action depends on the state of the world and may be uncertain.
The objective of the agent is to maximize his long-term cumulative reward (typically, the reward of interest is the discounted future
reward where the near future is more important than the more remote future). An immediate reward is received when exercising an action.
This immediate reward depends on the action made by the agent and the true state. The immediate reward that is received in a particular
state following a certain action may also be random.
Formally, MDPs are described as a 4-tuple: <S,A,T,R> as shown in Figure 6. S denotes a finite set of states and A denotes a finite set
of actions. T(s'|s,a) denotes the transition probability from state s to state s' under action a. R(s,a) is the reward for taking action
a in state s. In Figure 6 we see the dependencies between the different elements that define the MDP: the current state (S) and action
(A) determine the next state (S') according to the transition probability (T); the current state (S) and action (A) determine the reward
(R).
The main advantage of the MDP model is its simplicity. On the one hand, learning the parameters and computing an optimal policy can be
done efficiently by several different algorithms (e.g., [2]). On the other hand, the simplicity of the MDP model comes with a price: it
cannot capture unobservable dynamics, and its success hinges on the assumption that the state of the system can be estimated with no
errors.
Dynamic Bayesian Networks
A Dynamic Bayesian Network (DBN) describes a stochastic process where some of the variables are observed and some are not. Figure 7
shows a graphical representation of a special case of DBNs where there is a single variable called a Hidden Markov Model (HMM). In a DBN
the agent reasons about the state of the process indirectly though the observed variables. For example, the doctor in Figure 1 makes an
inference about the internal state of the patients through symptoms.
An HMM is described as a 4-tuple: <S,T,Z,O> as shown in Figure 7. S denotes a finite set of states, and Z denotes a finite set of
observations. T(s'|s) denotes the transition probability from state s to state s'. O(z|s) is the probability of observing the
observation z in state s. In the figure we see the dependencies between the different elements that define the HMM: the current state
(S) determines the next state (S') according to the transition probability (T); the observation (Z) depends on the current state (S)
through the observation probability (O).
The advantage of DBNs is their ability to capture complex dynamics that are not completely observable. The disadvantage of DBNs is their
lack of support for decision making. Estimating the DBN parameters can be done relatively efficiently using algorithms such as the
Expectation-Maximization algorithm.

Figure 7: This is a graphical representation of a simple dynamic Bayesian networks called a Hidden Markov Model
POMDPs
The Partially Observable Markov Decision Process (POMDP) model is the combination of HMMs/ DBNs and MDPs. The idea behind the POMDP
model is to combine the strengths of HMMs (capturing dynamics that depend on unobserved states) and MDPs (taking the decision aspect
into account) without significantly sacrificing computational efficiency. The model is depicted in Figure 8, where it can be seen that
the POMDP contains elements from both MDPs and HMMs. POMDPs are a natural conceptual framework for modeling sequential decision tasks as
illustrated in Figure 1. A POMDP model describes sequential decision tasks under uncertainty. A control policy in a POMDP computes an
action after every observation such that in the long-run (discounted or average) the expected utility is maximized. Uncertainty,
therefore, has two sources in POMDPs. First, the true state of the world is usually unknown (the doctor does not know what the patient
has exactly). Second, even if the state is known, some actions may have uncertain consequences (different treatments may work
differently for patients with the same condition).

Figure 8: This is a graphical representation of a POMDP model. It is created by combining the MDP and HMM models.
A POMDP policy computes actions at every time step. Due to the fact that the system state is not observed, the POMDP policy maps actions
to all possible probability distributions over the states, otherwise called belief states B. The belief state b(s) represents the
agent's current belief that the true state is s. The goal of the policy is to maximize the long-term reward. The POMDP policy can be
computed by solving the Bellman Equation [2]. Figure 9 illustrates the POMDP execution system. For the sake of simplicity we do not discuss how
belief update is done or how the Bellman Equation is solved. A good introduction to POMDPs can be found in [8].

Figure 9: This figure represents the POMDP execution system. The state estimator module takes the previous belief states, the action
taken and observation received, and produces a new belief state. The Policy module takes as input the belief states and outputs an
action.
click image for larger view
Some of the strengths and weaknesses when using the POMDP model are these:
-
Computing the policy. In general, computing the optimal policy is intractable [11],[10]. Still, reasonable approximations can be
made by following some simple heuristics. There are many algorithms for finding an adequate policy that may not necessarily be optimal.
-
Controlling the level of abstraction. The POMDP model allows choosing different levels of abstraction leading to models that have
more or less states depending on the POMDP designer preferences. This leads to a tradeoff between the amount of information that has to
be supplied on the model (a large number of states would require a lot of information on the model) and the potential accuracy of the
model (a large number of states would enable a higher accuracy).
-
Reasoning in terms of information. Reasoning in terms of information allows the designer to distinguish between actions that
maximize utility and actions that provide information on the state.
-
Handling model uncertainty. Lack of knowledge regarding the specifics of a model can be naturally incorporated into the POMDP model.
Similarly to DBNs, the model parameters can often be learned from data, using algorithms such as the Expectation-Maximization algorithm.
This allows for the building of a rough initial guess to start with and refining it later, based on the observed data.
The Model-Based Approach to Adaptive Power Management
Among the different models presented above, the POMDP model is the only one that is rich enough to capture the two main aspects of APM.
First, the world model includes a human user and a complex computer system that cannot be assumed to be perfectly observable. Second,
APM management involves making decisions on which components to turn on and off.
There are many possible ways to define a POMDP for APM. The suggested POMDP model is shown in Figure 10. Casting this model in the
language of POMDP we have this: the actions (A) are to turn on/off (for each component); the state space (S) is a combination of system
state and user state; the possible observations are the various sensors/features we described before; the transition probability (T) and
the observation probability (O) are both unknown a-priori and must be learned from data.

Figure 10: Our model has user, system, and sensor variables. The user variables are not directly observable and represent user context
or activity. The system state indicates whether the component is idle or active. The sensor variables are keyboard and mouse activity,
application running, CPU load, and network traffic. The reward function R represents power cost as well as cost incurred by annoying the
user.
click image for larger view
The main problem in constructing the POMDP for APM is how to construct the state space. This is a thorny problem at the core of applying
POMDP methodology to any real-world problem. It is obvious that a model that fully describes the system or the user context (i.e., the
way the user is interacting with the system) would be too complex to describe, and would certainly not lead to any useful computations.
We therefore have to balance the complexity of such a model with our need to be able to obtain a useful APM policy.
We emphasize that learning a good POMDP model in general involves answering questions such as "what is the user's context," "how does
that context relate to system state," "how does context change over time," and "what is the level of user annoyance." These questions
are not trivial in the field of artificial intelligence. Still, we believe we can obtain partial answers to these questions based, in
part, on heuristics or even arbitrary decisions. Once we have such answers we can offer a solution to the POMDP.
Our ongoing research focuses on finding an adequate model to represent a realistic system and then solving that model. We are currently
studying several aspects of the model. First, we are developing a more complex model for user context (other than past idleness). In
particular, we look at automatic context construction (that is generated by automatically clustering sequences). Second, we are looking
at different ways to measure annoyance. Instead of using an annoyance measure that is the same for all users, we consider learning the
annoyance from the user, based on individual feedback, and even prompt the user for explicit feedback. Third, we are studying the
statistics of the duration between changes of the values of the system and user variables. Fourth, we consider the initial period where
the system is initializing. The problem here is that when the system starts running the amount of information that is available is small
and the decisions are potentially sub-optimal. In this initial stage there might be a lot of value in exploring actions that can provide
a lot of information on the user. The problem of balancing between information gathering and using the information to perform optimally
is known as the exploration-exploitation dilemma. As was shown in [13], model-based approaches may be unsatisfactory in their
initialization phase.
Other domains
There are many real-world problems that can be framed as sequential decision-making tasks under uncertainty. Stochastic models such as
POMDPs offer a good framework for such tasks because they encompass the idea that there are underlying hidden states (user and system)
and the idea that the system needs to constantly take actions under uncertainty to satisfy some long-term goals and to maximize some
performance criterion.
Explicit reasoning about the user state is imperative in many systems that have significant interaction with the user. For example, in
the power management domain, if a user does not mind interruptions we can afford to have more aggressive and less accurate power
management policies, where accuracy is in terms of predicting whether a component can be turned off. The user state could be summarized
by variables that are related to user activity (physical and mental) and utility preferences. Learning mental user models is not an
inconceivable idea and has previously been explored in opponent modeling in games such as in [3] and [4].
The idea of user adaptive autonomic systems such as the APM systems could be extended to domains that go beyond platforms. Instead of an
isolated computer platform we could call it an "environment" which has both user states and task states. The environment produces
observations and utilities. The autonomic agents' task is to maximize their long-term utility as explained in Figure 1. Figure 11 and
Figure 12 show some examples of similar domains previously studied by researchers. In each figure, we describe the domain and how we
would frame it as a POMDP problem.

Figure 11: Example of a prompting scenario where an elder person boils pasta. The system monitors the person, diagnoses his/her state
and prompts him/her appropriately to complete the task. This is a non-platform autonomic system where the state variables capture task
and user states. This problem was studied by the authors in collaboration with the Intel Lablet in Seattle.
click image for larger view

Figure 12: Example of an autonomic system for diagnosing a failed disk on a server and repairing it. It was modeled as a POMDP in [9].
This is a classic platform autonomic system where the user is not explicitly considered.
click image for larger view
There are many other similar problems of current interest to Intel that can naturally fit the model-based decision-making framework.
Among these are the following:
-
Identity capable platforms. In this domain the idea is to develop trusted access across a variety of devices, networks, and
services. This can be achieved by identity-based policies where identity could be the user itself, or some mode of operation. This is
obviously a system that needs to explicitly reason about the user in order to achieve some long-term goals, such as ensuring a trusted
network or certain services.
-
Server power management. In this domain power management has to be done over an entire server farm. The system state is captured by
sensors, which monitor the work load, the temperature, and the different processes that run on each server. The actions manage the
system by assigning workloads dynamically to servers. The reward function defines the objective of the system by providing a positive
reward when the power (and heat) distributions are as balanced as possible and jobs get executed without delays; otherwise, the reward
is negative.
-
Virus detections on networks. In this domain an agent has to predict whether or not there is a worm attack based on local network
traffic patterns. The state here represents the network and traffic patterns. There are no actions here but rather the agent has to
decide whether or not the network is currently under attack. The reward is positive when attacks are correctly identified and negative
otherwise.
-
Memory power management. In this domain, the agent needs to predict whether blocks of RAM can be turned off without impacting the
performance that is perceived by the user. The system and user state here are similar to the APM domain. Actions are whether to turn on
or off blocks of RAM. The reward function represents user's annoyance and power costs similar to the APM problem.
-
Awareness CPU architecture. In this domain the agent needs to forecast the next upcoming CPU state based on usage patterns. This is
again a very similar problem to the OS power management problem except that it needs to be done at the millisecond level.
-
Multi-device transparent power sharing. In this domain machines need to automatically share power. Given potential usage, benefit,
and activity of each device, power should be divided according to how the users' tasks are accomplished and should not take into account
individual device power levels.
|