|
We have established a general methodology for developing a virtualization benchmark. The value of the
benchmark grows with how closely its component workloads map to the actual work and metrics of
the particular usage model. An industry-standard proxy benchmark is valuable because it
lets the community focus on a standard, repeatable test that facilitates comparisons between
configurations and platform technologies. Virtualization requires that some abstraction be developed on top
of a set of workloads, for example a throughput methodology (SPECthruput* and later SPECrate*)
[6] applied on top of the well-known SPEC* CPU benchmarks.
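As an illustration of what a throughput abstraction over an existing workload can look like, the following is a minimal sketch, loosely in the spirit of rate-style metrics. The function name `throughput_ratio`, the formula, and all timings are assumptions made for illustration; they are not taken from the SPEC run rules or from this paper.

```python
def throughput_ratio(copies: int, reference_seconds: float, elapsed_seconds: float) -> float:
    """Throughput-style ratio: concurrent copies scaled by speedup over a reference time.

    Illustrative assumption, not an official metric definition.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if reference_seconds <= 0 or elapsed_seconds <= 0:
        raise ValueError("times must be positive")
    return copies * reference_seconds / elapsed_seconds


# Hypothetical runs of one workload with a 1000 s reference time:
# (number of concurrent copies, elapsed seconds until all copies finish).
runs = [(1, 800.0), (2, 850.0), (4, 1000.0)]

for copies, elapsed in runs:
    ratio = throughput_ratio(copies, 1000.0, elapsed)
    print(f"{copies} copies: throughput ratio {ratio:.2f}")
```

The point of the abstraction is that the underlying workload is unchanged; only the way it is driven (several concurrent copies) and scored (work completed per unit time) differs, which is the kind of layering a virtualization benchmark would also need.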
Although virtualization technologies and implementations date back more than 30 years, no de facto or
industry-standard performance metric exists. As virtualization is exploited on commodity high-volume
platforms across a range of server workloads, the community needs standards to compare virtualized
performance (e.g., [7]). These are some of the motivations for an industry-standard benchmark:
- Motivate the industry to add a performance and optimization discipline to virtualization platforms,
  monitors, and tools.
- Help users considering virtualization to compare alternatives, particularly when they are unwilling or
  unable to develop, execute, and maintain their own benchmarks.
- Accelerate both the development and application of virtualization technologies.
What is a server virtualization performance benchmark and how could it be used? It is a performance
discipline that includes consolidated server workloads running in virtualized partitions. It is most likely
to be useful for comparing platforms and virtualization technologies. Some examples of how it could be used
follow:
- How does the performance, power efficiency, or price/performance of Platform A compare with Platform B?
- How does VM monitor A compare with VM monitor B (for example, an upgrade, a competitive comparison, or a
  comparison of software and hardware-accelerated implementations)?
- What is the effect of a configuration modification or a component substitution?
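The comparisons above reduce to simple ratios once each platform has a benchmark score. The sketch below shows that arithmetic under stated assumptions: the `PlatformResult` type, the `compare` helper, and every number are hypothetical, and "performance per dollar" is used here as the higher-is-better inverse of a price/performance quotient.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PlatformResult:
    name: str
    score: float      # benchmark score, higher is better (hypothetical unit)
    avg_watts: float  # average power draw during the run
    price_usd: float  # system price

    @property
    def score_per_watt(self) -> float:
        return self.score / self.avg_watts

    @property
    def score_per_dollar(self) -> float:
        return self.score / self.price_usd


def compare(a: PlatformResult, b: PlatformResult) -> Dict[str, float]:
    """Ratios of A to B; a value above 1.0 means A is better on that metric."""
    return {
        "performance": a.score / b.score,
        "power_efficiency": a.score_per_watt / b.score_per_watt,
        "performance_per_dollar": a.score_per_dollar / b.score_per_dollar,
    }


# Hypothetical results for two platforms running the same benchmark.
platform_a = PlatformResult("Platform A", score=120.0, avg_watts=400.0, price_usd=8000.0)
platform_b = PlatformResult("Platform B", score=100.0, avg_watts=250.0, price_usd=5000.0)

comparison = compare(platform_a, platform_b)
for metric, ratio in comparison.items():
    print(f"{metric}: A/B = {ratio:.2f}")
```

With these made-up numbers, Platform A wins on raw performance but loses on power efficiency and cost efficiency, which is why a benchmark result is more useful when it can be paired with power and price data.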
What is a good benchmark and how is this question modified by virtualization? We will first list what we
believe is required of benchmarks and then what is desirable.
First, required attributes of benchmarks:
- They should be relevant to some well-defined user constituency, such as Web servers, application servers,
  database servers, and e-mail servers, that is a likely candidate for consolidation.
- They should be timeless, in the sense that they do not specify any platforms, technologies, or usage
  models that will stop making sense within some reasonably long time.
- They should provide repeatable results from run to run and provide a means to ensure that valid results
  were obtained.
- They should be agnostic, to the extent possible, about platforms, virtualization monitors, and as many
  other parts of the stack as possible.
- They should make it possible to automate, repeat, scale, consolidate, and compare performance
  observations across a wide range of systems, times, and components.
- They should use a simple, single unit of measure for performance results (with associated component scores
  for those preferring more detail).
- They should be citable, which depends upon components that are freely accessible to the community.
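Two of the requirements above, repeatable and validated results and a single unit of measure with component scores, can be sketched together. This is a minimal illustration under stated assumptions, not a proposed specification: the `summarize` helper, the 5% variation threshold, the baseline values, and the use of a geometric mean as the aggregate are all choices made for this example.

```python
from math import prod
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(
    runs: Dict[str, List[float]],
    baseline: Dict[str, float],
    max_cv: float = 0.05,
) -> Tuple[float, Dict[str, float]]:
    """Return (overall score, component scores), or raise if runs are not repeatable."""
    if not runs:
        raise ValueError("need at least one workload")
    component_scores = {}
    for workload, samples in runs.items():
        if len(samples) < 2:
            raise ValueError(f"{workload}: need at least two runs to check repeatability")
        # Coefficient of variation: sample standard deviation relative to the mean.
        cv = stdev(samples) / mean(samples)
        if cv > max_cv:
            raise ValueError(f"{workload}: run-to-run variation {cv:.1%} exceeds {max_cv:.0%}")
        # Normalize against a baseline so workloads with different units are comparable.
        component_scores[workload] = mean(samples) / baseline[workload]
    # Geometric mean (an assumed choice) so no single workload dominates the aggregate.
    overall = prod(component_scores.values()) ** (1.0 / len(component_scores))
    return overall, component_scores


# Hypothetical throughput measurements from three runs per workload.
runs = {
    "web": [980.0, 1000.0, 1020.0],
    "database": [196.0, 200.0, 204.0],
    "mail": [390.0, 400.0, 410.0],
}
# Hypothetical reference-system throughput for each workload.
baseline = {"web": 500.0, "database": 400.0, "mail": 400.0}

overall, components = summarize(runs, baseline)
print(f"overall score: {overall:.3f}")
for workload, score in components.items():
    print(f"  {workload}: {score:.3f}")
```

Rejecting a result set outright when variation is too high is one way to "ensure that valid results were obtained"; a real specification would also need rules for warm-up, run length, and minimum quality of service per workload.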
Second, desirable attributes of benchmarks:
- They should not require infrastructure that prevents most organizations from assembling the system.
  This rules out workloads that need a huge number of disks, a load-generation infrastructure, or
  infrastructure that is otherwise encumbered.
- They should be easy to test. For example, when load generation is a minor consumer of system resources,
  it can run within the test environment (no external load generators needed).
- They should be easy to obtain, set up, and test. For example, they should use applications, tools, and
  utilities that are relatively low-cost and easy to access and set up.
There are many practical challenges in constructing such a specification. Here are some examples.
- If we start with existing workloads, some of the most visible contemporary industry-standard
  server workloads may prevent other components from participating in the benchmark or introduce complexities
  that cannot be easily managed across a range of systems and timeframes. One solution would be to develop
  new, fully portable server workloads, but these would be unproven and unknown.
- Many benchmarks artificially attempt to saturate the CPU and only marginally reflect best configuration
  practices.
- Virtualization technologies may have different feature sets. To maximize the number that could
  participate, we would have to limit the features that are exercised, for example migration of VMs.
- Perhaps the most difficult set of decisions is picking the constituent workloads and defining an
  aggregation process, since there is no single right set. This will require compromises before it can be
  accepted by the community.
Driving a standardized benchmark is the best way to create a performance discipline, achieve continuous
improvement at every layer of the platform stack, and ultimately develop an industry standard for measuring
and optimizing virtualization performance.
|