Introduction
This article will help you determine the scalability of your application and, from that, the maximum potential speedup you can expect given the number of cores available on your system.
Background
Amdahl’s Law states that the speedup of a parallel program is limited by the time spent in the sequential part of the program. Scalability is the ability to decrease runtime in inverse proportion to the number of processors. For example, perfectly scaling code has the maximum potential to run in 1/256th of the time on 256 cores that it takes on 1 core. I added the caveat regarding maximum potential because other factors can also dictate how fast your code will run: memory bandwidth, vectorization, etc.
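To make the "maximum potential" caveat concrete, the standard textbook form of Amdahl's Law (this is the general formula, not something specific to the qsort sample discussed below) is:

  speedup(n) = 1 / (s + (1 - s) / n)

where s is the fraction of the runtime that is serial and n is the number of cores. Perfectly scaling code has s = 0, so speedup(256) = 256. With even 5% serial time, speedup(256) = 1 / (0.05 + 0.95/256) ≈ 18.6, and no number of additional cores can push the speedup past 1/0.05 = 20.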
Problem Statement
For a moment consider the Quick Sort algorithm. I was recently testing the qsort sample code that ships with Intel® Cilk™ Plus, first on a Linux* system with a 32-core Sandy Bridge processor and then on an Intel® Xeon Phi™ coprocessor. I then used Intel® VTune™ Amplifier to measure just how well the Intel® Xeon Phi™ coprocessor version was performing.
Viewing the CPU Usage Histogram in VTune Amplifier:
This histogram tells us we have a target concurrency of 244 threads, because that is how many hardware threads the coprocessor makes available, but the average concurrency is only 80 threads, roughly a third of the target! If you look at the histogram you will see that the program spends much of its time with only a few threads running; there is also a relatively small portion of time, on the right-hand side of the histogram, where the processor is fully utilized. VTune Amplifier clearly rates the CPU Usage as poor, well into the red zone of the histogram.
Why Are We Not Utilizing All of the Cores of the Processor?
First, let's examine the algorithm. Quick Sort is a very well-known algorithm with an average serial running time of O(N lg N). The Quick Sort algorithm does not have good scalability, but there are other sorting algorithms that scale better, for example parallel merge sort or parallel sample sort. The remainder of this article shows how you can model the scalability of the Quick Sort application.
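The reason Quick Sort scales poorly is visible in the structure of the algorithm itself: at every level of the recursion the partition step runs serially, and the very first call has to partition the entire array on a single thread before a second thread has anything to do. A condensed sketch of the recursion in the Cilk Plus sample (not the verbatim shipped code, which may differ in its details) looks like this:

  #include <algorithm>    // std::partition, std::swap
  #include <cilk/cilk.h>  // cilk_spawn, cilk_sync

  // Condensed sketch of the parallel Quick Sort recursion.
  void sample_qsort(int* begin, int* end)
  {
      if (end - begin > 1) {
          --end;                      // use the last element as the pivot
          int pivot = *end;
          int* middle = std::partition(begin, end,
              [=](int x) { return x < pivot; });  // serial partition of the whole range
          std::swap(*end, *middle);   // move the pivot into its final position
          cilk_spawn sample_qsort(begin, middle); // sort the left half in parallel...
          sample_qsort(++middle, ++end);          // ...while this strand sorts the right half
          cilk_sync;                              // wait for the spawned half
      }
  }

Near the top of the recursion tree only one or two of these serial partitions are running at a time, which is exactly the low-concurrency region the VTune Amplifier histogram shows.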
Modeling Scalability Using Intel® Advisor
Intel Advisor is a powerful programming tool that can tell you:
- Where to add parallelism to your program.
- How well the parallelism you are adding will scale.
- Whether the parallelism you added introduces any potential threading errors.
This article focuses on the second bullet, as we already have a parallel quick sort application and we want to know how well it scales.
These are the steps to take to model the scalability of qsort.cpp using Intel Advisor:
- Copy qsort.cpp from the Cilk Plus samples.
- Since we are modeling how well a serial program will scale, edit qsort.cpp and remove the call to cilk_spawn. Also remove the call to cilk_sync.
- The application will now run serially.
- Intel Advisor models parallelism through annotations that you add to the code to tell Advisor XE where you want the code to execute in parallel.
- Include the annotation header:
  #include "advisor-annotate.h"
- Mark the two recursive calls as tasks (the commented-out line shows where the original cilk_spawn was):
  //cilk_spawn sample_qsort(begin, middle);
  ANNOTATE_TASK_BEGIN(one_task);
  sample_qsort(begin, middle);
  ANNOTATE_TASK_END(one_task);
  ANNOTATE_TASK_BEGIN(two_task);
  sample_qsort(++middle, ++end); // Exclude pivot and restore end
  ANNOTATE_TASK_END(two_task);
- Also add the following site annotations around the top-level call:
  ANNOTATE_SITE_BEGIN(one);
  sample_qsort(a, a + n);
  ANNOTATE_SITE_END(one);
- Recompile
- icc -I/opt/intel/advisor_xe/include -o qsort qsort.cpp
- Run the Advisor XE GUI
- advixe-gui
- Create a project to hold your data
- File->Create->Project
- Specify a project name
- Specify the binary you just created.
- Click on the Advisor XE workflow button
- This will give you the steps to take in the typical Advisor XE workflow.
- Click the "Collect Suitability Analysis" button
This will run your application and give you information regarding scaling.
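As an aside, if you would rather collect the suitability data from the command line (for example, on a remote system without a display), Advisor XE also ships a command-line driver, advixe-cl. The exact option spelling below is an assumption based on my installation, and the project directory name is just a placeholder, so check advixe-cl -help for your version before relying on it:

  advixe-cl -collect suitability -project-dir ./qsort_project -- ./qsort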
Intel Advisor allows you to select the number of target CPUs to model and reports the potential speedup. For example:
32 cores:
This dialog shows that the maximum program speedup we can expect is 14.12 times. The graph clearly shows that the initial scaling looks very good, but we hit the maximum speedup somewhere between 64 and 128 CPUs; adding more CPUs gives no additional speedup. The dialog also shows that although this particular site contributes a 19.1 times gain, the entire program sees a 14.12 times gain.
256 cores:
This diagram shows another example, this time modeling 256 cores. In this case the maximum gain is 16.14 times. Note that you would see the same gain for 128, 256, and 512 cores.
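Plugging these numbers back into the Amdahl's Law formula from the Background section gives a rough sanity check (this treats the plateau as if it were caused purely by serial work, ignoring overhead and load imbalance). A gain that levels off around 16x implies a serial fraction of roughly s ≈ 1/16 ≈ 6%; with s = 0.06 the formula predicts about 15.7x at 256 cores and an upper limit of about 1/0.06 ≈ 16.7x no matter how many cores are added, which is consistent with seeing essentially the same gain at 128, 256, and 512 cores.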
Summary
Intel Advisor is a powerful tool for finding opportunities to parallelize your code. It also has some very useful features that let you verify that you are making efficient use of your processor. Intel Advisor can save developers from spending time optimizing for the target microarchitecture when the problems are not microarchitecture specific!
The performance problems may be due to program scalability: insufficient work, overhead, load imbalance, etc. Focusing on the microarchitecture is useless in these cases.