Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.06

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 3 of 8  

Datacenter-on-Chip Architectures: Tera-scale Opportunities and Challenges

SCALABILITY CHALLENGES AND SOLUTIONS

In this section, we start by describing four classes of DoC usage models and then focus on one of them to highlight the potential of tera-scale architecture and describe the key challenges.

Datacenter-on-Chip Usage Models

Virtualization techniques make it possible to consolidate multiple server applications onto a single system. This usage model has been gaining momentum in enterprise datacenters because it improves resource sharing and usage, improves manageability, and reduces cost. We expect this trend to continue growing significantly in the coming years especially with the integration of more and more cores on the die. DoC essentially refers to the potential of multiple datacenter applications running simultaneously on a single-chip server platform. DoC usage scenarios can be classified into four broad categories based on (a) the types of applications being consolidated and (b) the level of communication and cooperation between the applications. Figure 1 illustrates the four DoC usage models that are explained further below.

  • Homogeneous/Non-Cooperating: In this type of consolidation, multiple server applications of the same type are consolidated onto a single platform. However, these applications are independent in nature and no significant communication is required between the applications. A good example is the consolidation of a farm of Web servers that are serving Web pages and are load balanced. For the most part, these different Web servers run on their own without having to communicate with each other.
  • Heterogeneous/Non-Cooperating: In this type of server consolidation, multiple different server applications are consolidated onto a single system. In enterprise datacenters, servers running a single application are often underutilized for a significant portion of the time. The main motivation for this type of consolidation is to maximize resource usage by placing multiple applications on the same platform. The applications being consolidated are still quite independent and do not need to communicate with each other. One example is consolidating an email server, a file and print server, and a user authentication server.
  • Homogeneous/Cooperating: This type of consolidation occurs when a clustered application (e.g., database cluster) is consolidated onto a single system. Clusters use some sort of message passing either on a regular network fabric like Ethernet or on a more specialized fabric to communicate with each other. This communication turns into inter-VM communication once consolidated onto the same platform.
  • Heterogeneous/Cooperating: This type of usage model occurs when multiple heterogeneous workloads that need to communicate with each other are consolidated. A good example of this type is where a multitiered application, like in TPC-W, is consolidated onto a single system. Here various tiers need to communicate with each other while servicing user requests. Hence inter-VM communication can be a significant factor, and handling this can be a challenge in virtualized environments, as we will see in the later part of this paper.
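
The two classification axes above can be expressed as a small lookup. The following is a minimal sketch, not part of the original article; the function name `classify` and its inputs (a list of application type names and a flag for significant inter-application communication) are assumptions made for illustration.

```python
def classify(app_types, communicates):
    """Return the DoC usage model for a set of consolidated applications.

    app_types: list of application type names, one per consolidated VM
               (hypothetical labels such as "web" or "database").
    communicates: True if the applications exchange significant traffic.
    """
    if not app_types:
        raise ValueError("at least one application is required")
    # Axis (a): are all consolidated applications of the same type?
    mix = "Homogeneous" if len(set(app_types)) == 1 else "Heterogeneous"
    # Axis (b): do the applications need to communicate and cooperate?
    coupling = "Cooperating" if communicates else "Non-Cooperating"
    return f"{mix}/{coupling}"


# The four examples given in the text.
web_farm = classify(["web"] * 4, communicates=False)
office_mix = classify(["email", "file-print", "auth"], communicates=False)
db_cluster = classify(["database"] * 4, communicates=True)
multitier = classify(["web", "application", "database"], communicates=True)

for name, model in [("web farm", web_farm), ("office mix", office_mix),
                    ("database cluster", db_cluster), ("multitier", multitier)]:
    print(f"{name}: {model}")
```

In practice the boundary between "significant" and insignificant communication is a judgment call, so the boolean flag is a simplification.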

Mapping to Tera-scale Architectures

The DoC usage models described above can take advantage of more and more cores on-die since they have many applications (potentially multi-threaded) running simultaneously on a single platform. As a result, a tera-scale architecture with several tens of physical cores and hundreds of hardware threads integrated on the die is highly suitable for DoC usage. To illustrate the potential of tera-scale and highlight the challenges, we now focus on a case study of an e-commerce environment based on the TPC-W benchmark.

TPC-W [29] is a benchmark representative of an e-Commerce datacenter environment defined by the Transaction Processing Performance Council (TPC). The performance metric reported by TPC-W is the number of Web interactions processed per second (WIPS). Multiple Web interactions are used to simulate the activity of a retail store, and each interaction is subject to a response time constraint. The TPC-W benchmark is now obsolete; however, the e-Commerce workload that it represents is very relevant and important. A typical TPC-W setup contains several different application components (as shown in Figure 2):

  • Web Servers process incoming HTTP requests and prepare responses to be sent to the clients.
  • Web Cache Servers cache static and dynamic content for fast access to data.
  • Image Servers serve static images that are part of the response Web pages.
  • Application Servers provide the e-Commerce functionality and are responsible for processing customer orders and payments for goods, among other things.
  • Database Servers hold product inventory, descriptions, availability, pricing, and other information.
  • Load Balancer and other Infrastructure Servers distribute processing load among different Web and image servers equally by directing incoming HTTP requests to the server with the least load.
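
The least-load dispatch policy described for the load balancer can be sketched as follows. This is an illustrative sketch only: it assumes that "load" is the number of requests already assigned to a server, and the names `pick_least_loaded`, `dispatch`, and the server labels are hypothetical. The original setup does not specify how load was measured.

```python
def pick_least_loaded(loads):
    """Return the server name with the smallest current load.

    Ties go to the server that appears first in the dictionary.
    """
    return min(loads, key=loads.get)


def dispatch(requests, servers):
    """Assign each request to the least-loaded server at that moment."""
    # Assumption: load is the count of requests assigned so far.
    loads = {name: 0 for name in servers}
    assignment = []
    for request in requests:
        target = pick_least_loaded(loads)
        loads[target] += 1
        assignment.append((request, target))
    return assignment, loads


requests = [f"GET /page{i}" for i in range(7)]
assignment, loads = dispatch(requests, ["web1", "web2", "web3"])
print(loads)
```

With equal-cost requests this policy degenerates to round-robin; it only differs when servers start with unequal load or requests complete at different rates.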



Figure 2: Mapping datacenter workloads to tera-scale

Table 1: Compute/cache/memory capacity data from TPC-W setup (example based on TPC-W publication [30])


Figure 2 shows how these different components are interconnected in a typical TPC-W setup of the past. TPC-W is a good example for understanding the requirements and behavior of consolidating multiple tiers (heterogeneous/cooperating) of a datacenter on a tera-scale CMP platform (bottom of Figure 2). Table 1 summarizes the number of systems, compute cores, cache, and memory in an example configuration roughly based on a high-performing 2002 TPC-W publication [30]. As shown in the figure as well as the table, the example TPC-W configuration employs 63 systems. Except for the database server, which employed four processors, all other systems had two processors (without multi-threading). As a result, the total number of processors in the configuration was about 130. In a tera-scale CMP platform, we expect that a single processor socket could contain 32 cores, each with 4 threads (SMT). As a result, the entire TPC-W example configuration could potentially be consolidated onto such a 32-core, 128-thread single-socket platform.
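
The processor count behind this consolidation argument can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph above; it assumes the configuration has exactly one database server, which the text implies ("the database server") but does not state outright.

```python
# Figures quoted in the text.
TOTAL_SYSTEMS = 63
PROCESSORS_PER_DB_SERVER = 4
PROCESSORS_PER_OTHER_SYSTEM = 2
CORES_PER_SOCKET = 32
THREADS_PER_CORE = 4

# Assumption: a single database server in the example configuration.
DB_SERVERS = 1

other_systems = TOTAL_SYSTEMS - DB_SERVERS
legacy_processors = (other_systems * PROCESSORS_PER_OTHER_SYSTEM
                     + DB_SERVERS * PROCESSORS_PER_DB_SERVER)
hardware_threads = CORES_PER_SOCKET * THREADS_PER_CORE

print(f"Processors in the TPC-W configuration: {legacy_processors}")
print(f"Hardware threads on one tera-scale socket: {hardware_threads}")
print(f"One hardware thread per legacy processor: "
      f"{hardware_threads >= legacy_processors}")
```

Under that assumption the count is 62 × 2 + 4 = 128, consistent with the "about 130" in the text and equal to the 128 hardware threads of the projected socket. A hardware thread on a small SMT core is not equivalent to a full legacy processor, so this is a capacity-count comparison rather than a performance claim.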

However, it is also critical that we take into account the amount of platform resources needed to support simultaneously running VMs of this nature. For example, Table 1 shows that the total cache capacity available in the TPC-W configuration adds up to 70MB. Given the area constraints and the fact that 32 cores will be integrated onto the die, our previous work [36] has shown that the amount of cache space available is likely to be less than 32MB. As a result, architectural techniques that enhance cache/memory scalability and performance need to be incorporated into the platform. We discuss these further in the next section.
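
To make the cache gap concrete, the following sketch computes the shortfall from the two capacities quoted above and the resulting upper bound on cache per hardware thread. It treats 32MB as a ceiling (the text says "less than 32MB") and assumes the 32-core, 4-thread-per-core socket described earlier in this section; an even split across threads is a simplification for illustration.

```python
# Figures quoted in the text.
LEGACY_CACHE_MB = 70        # total cache across the TPC-W configuration
ON_DIE_CACHE_LIMIT_MB = 32  # upper bound; actual capacity is likely lower
CORES = 32
THREADS_PER_CORE = 4

hardware_threads = CORES * THREADS_PER_CORE

# Minimum shortfall, since the on-die figure is an upper bound.
shortfall_mb = LEGACY_CACHE_MB - ON_DIE_CACHE_LIMIT_MB
# Largest fraction of the legacy cache capacity that could be retained.
retained_fraction = ON_DIE_CACHE_LIMIT_MB / LEGACY_CACHE_MB
# Upper bound on cache per hardware thread, assuming an even split.
per_thread_mb = ON_DIE_CACHE_LIMIT_MB / hardware_threads

print(f"Cache shortfall: at least {shortfall_mb} MB")
print(f"Retained cache capacity: at most {retained_fraction:.0%}")
print(f"Cache per hardware thread: at most {per_thread_mb:.2f} MB")
```

The consolidated platform therefore keeps less than half of the aggregate cache capacity of the original configuration, which is why the text calls for cache/memory scalability techniques.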

Another key challenge in running heterogeneous VMs of this nature on the same platform is that they will contend for platform resources and interfere with each other. Given that these VMs are likely to get very different utility benefits from platform resources, and that the VMs are likely to be different in importance to the overall performance of the datacenter, it is important that we incorporate adaptability techniques in the platform so that resource usage can be dynamically controlled to provide performance isolation or QoS for DoC platforms. In the following section, we describe adaptability challenges and solutions to address these in tera-scale architectures.

Last but not least, it is also important to consider the overheads of virtualization on DoC performance. In addition to the basic overhead of handling system calls, context switches, and interrupts for VMs, one primary concern in virtualized platforms is the overhead of I/O virtualization. For example, Figure 3 shows the overheads of virtualization for (a) transmitting network data to external platforms, (b) receiving network data from external platforms, and (c) inter-VM communication between VMs. The data shown in Figure 3 are based on measurements done on a recent dual-socket platform with 3GHz dual-core Intel® Xeon® processors [8] running the Xen hypervisor [3, 33]. The measurements show that (a) transmitting 1Gbps externally requires about 50% CPU utilization, (b) receiving and processing network data at 1Gbps requires 75% CPU utilization under virtualization, and (c) communicating 1Gbps between VMs on the same platform requires about 70% CPU utilization. Further, it should be noted that these compute cores are large out-of-order cores without multiple threads sharing the pipeline. As we design tera-scale processors, the use of smaller in-order cores with multiple threads sharing the pipeline may increase the associated processing overhead. However, since most of the cores in the example TPC-W configuration were underutilized (last column in Table 1), there is likely some headroom available to accommodate this extra I/O processing overhead. Extensions to techniques such as Intel's I/O Acceleration Technology [14, 22] are needed to minimize this overhead for a virtualized DoC environment. However, this is not covered in this paper.
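
As a rough illustration of how these measurements might be used for capacity planning, the sketch below estimates CPU utilization for a mix of traffic. It assumes that CPU cost scales linearly with bandwidth, which the measurements do not establish, and it expresses utilization relative to the same CPU capacity the quoted percentages refer to. The function name and the example traffic mix are hypothetical.

```python
# CPU utilization per 1Gbps, as quoted in the text for the measured platform.
CPU_PER_GBPS = {
    "transmit": 0.50,   # sending to external platforms
    "receive": 0.75,    # receiving from external platforms
    "inter_vm": 0.70,   # communication between VMs on the same platform
}


def estimate_io_cpu(traffic_gbps):
    """Estimate the CPU utilization consumed by virtualized network I/O.

    traffic_gbps: dict mapping a traffic class to its bandwidth in Gbps.
    Assumption: cost scales linearly with bandwidth (not measured).
    """
    return sum(CPU_PER_GBPS[kind] * gbps
               for kind, gbps in traffic_gbps.items())


# Hypothetical traffic mix totalling 1Gbps.
mix = {"transmit": 0.4, "receive": 0.4, "inter_vm": 0.2}
utilization = estimate_io_cpu(mix)
print(f"Estimated I/O virtualization overhead: {utilization:.0%} CPU")
```

An estimate like this is only a starting point: as the text notes, smaller in-order cores may raise the per-Gbps cost, so measured values for the target platform should replace the constants above.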



Figure 3: CPU overheads for network I/O virtualization
