Introduction
Power consumption is a significant concern for today’s data centers. Power is a monthly fixed cost that all data center providers must pass on to their customers. Competitive industry-wide pricing pressure requires that data center providers find intelligent and creative ways to keep power costs down. In addition, regulatory and operating expense (OpEx) factors aimed at reducing total energy consumption have created a demand for more energy-efficient computer platforms. Yet end-users still need the ability to use the peak performance of their assets to meet business objectives. Energy efficiency is not strictly measured by raw peak or idle power consumption. High-performance devices that operate at maximum performance for short durations, and then return to a low-power idle state, are typically the most energy-efficient configurations. Ethernet Power Management Technology with DMA Coalescing enables end-users to make a range of choices to determine which trade-offs are acceptable to meet their operational goals.

Power Management Technology
Power Management Technology (PMT) is a standards-based solution, leveraging existing ACPI* and PCI* standards, as well as existing platform power management capabilities of the CPU, chipset and operating system.

PMT provides solutions to common power management approaches by:

- Reducing idle power
- Reducing capacity and power as a function of demand
- Whenever possible, operating at maximum energy efficiency
- Enabling functionality only when needed

Reducing Idle Power
With Intel’s I350-based network controllers and adapters, integrated quad-port configurations consolidate and coordinate functionality between ports on the adapter, effectively increasing energy efficiency. The Intel® Ethernet Controller I350 also supports PCI power management states, which helps to reduce overall power consumption by reducing power when a device is in an idle state.

The Intel® Ethernet Controller I350 also incorporates a high-efficiency integrated switching-voltage regulator (SVR) that reduces overall BOM cost and design complexity. Its design also enables a more efficient power supply to the component.

Reducing Capacity and Power as a Function of Demand
PMT incorporates IEEE* 802.3az support (http://www.ieee802.org/3/az/index.html), also known as Energy Efficient Ethernet or EEE. This specification defines an optional Low Power Idle (LPI) mode for 1000BASE-T, 100BASE-TX and other interfaces. LPI enables power saving by switching off part of the I350 functionality when no data needs to be transmitted and/or received. When LPI support is enabled, the Intel® Ethernet Controller I350 shuts off RX circuitry and sends an inband RX LPI indication on detecting that the link partner’s TX has moved into the LPI state. The I350 PHY moves TX into the LPI state and powers down transmit circuitry when it receives an inband TX LPI request from the integrated LAN controller. In Sx states, LPI is supported only in 100 Mbps WOL-enabled mode while keeping the receive side active.

Studies indicate that the majority of platforms, both client and server, use only a fraction of the available bandwidth of the local link. Ethernet traffic typically occurs in bursts, leaving long periods of inactivity. IEEE 802.3az enables the network interface to enter a Low Power Idle (LPI) mode when the adapter detects that the network link is not being fully used. This enables link partners to save energy by cycling between active and LPI states.
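As a rough illustration of why bursty traffic benefits from LPI, the following sketch estimates average link power as a time-weighted mix of the active and LPI states. The function name and the power figures are hypothetical and chosen only for illustration; they are not I350 measurements, and the time spent entering and leaving LPI is ignored.

```python
def average_link_power(utilization, active_watts, lpi_watts):
    """Average link power when the link is active for `utilization` of the
    time and in Low Power Idle for the rest (transition time ignored)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return utilization * active_watts + (1.0 - utilization) * lpi_watts


# Hypothetical figures for illustration only, not measured I350 values.
ACTIVE_W = 0.5
LPI_W = 0.1

for load in (0.05, 0.25, 1.0):
    watts = average_link_power(load, ACTIVE_W, LPI_W)
    print(f"{load:.0%} load -> {watts:.2f} W")
```

Under this simple model the saving shrinks linearly as the link gets busier, which is consistent with EEE being most useful on lightly used links.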

**Operation at Maximum Efficiency**

PMT provides a new mode of operation called DMA Coalescing. It changes how frequently the LAN interface delivers packet data to the system by batching the delivery of packet data and device interrupts to the chipset, CPU and memory.

This behavior has the following effects:

- By batching and increasing the amount of data transferred to the system during any given time, the LAN device enables the rest of the system to enter low-power platform states (that is, PCIe enters ASPM L1, the CPUs activate Package Cx states, and main memory goes into self-refresh). DMA Coalescing enables these components to stay in these low-power platform states for longer periods.

- Intel's implementation of PMT attempts to make the DMA frequency predictable. This predictability enables the host CPU to pick a deeper low-power state than it might otherwise choose.

- When the CPU wakes to process network activity, the operating system is able to run at higher efficiency because software has more “work” to do for any given interrupt. The observable effect, with benchmarks, is that with increasing network I/O block sizes, CPU usage drops and I/O bandwidth increases.

Figure 1 shows that without DMA Coalescing, the platform is typically kept in higher power states. The vertical lines show the random nature of platform interrupts. Power consumption, represented by the top line, is higher overall because the processor, memory and other system components are brought out of lower power states to handle the incoming data.

In addition, system components are not allowed enough time to achieve deeper low-power states.

Figure 2 shows that with DMA Coalescing, the incoming data packets and interrupts associated with these DMA calls are intelligently batched to keep the system devices in lower power states. This enables the system to handle the packets and interrupts more efficiently. The technique also gives system components the opportunity to achieve deeper low-power states.
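The contrast between the two figures can be sketched numerically. The example below takes a list of interrupt timestamps and reports what fraction of the observed window falls in gaps long enough for a deep low-power state. The timestamps and the 200 µs threshold are hypothetical values picked for illustration; real break-even times depend on the platform.

```python
def deep_idle_fraction(event_times_us, min_gap_us):
    """Fraction of the observed span spent in gaps between consecutive
    events that are at least `min_gap_us` long."""
    times = sorted(event_times_us)
    if len(times) < 2:
        return 0.0
    span = times[-1] - times[0]
    if span == 0:
        return 0.0
    gaps = (later - earlier for earlier, later in zip(times, times[1:]))
    return sum(gap for gap in gaps if gap >= min_gap_us) / span


# Hypothetical interrupt timestamps (microseconds) over a 1000 us window.
scattered = [0, 90, 210, 300, 420, 500, 610, 700, 790, 900, 1000]
batched = [0, 10, 20, 30, 40, 500, 510, 520, 530, 540, 1000]

MIN_GAP_US = 200  # assumed break-even time for a deep low-power state

print("scattered:", deep_idle_fraction(scattered, MIN_GAP_US))
print("batched:  ", deep_idle_fraction(batched, MIN_GAP_US))
```

Both lists contain the same number of events; only their spacing differs. With the scattered pattern no gap reaches the threshold, while the batched pattern leaves most of the window available for deep idle.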

Note: One impact of delaying interrupts and DMA operations is an increase in latency. Most (not all) applications are quite tolerant of latency.

DMA Coalescing is accomplished by using the existing transmit and receive buffers on the LAN device to store packets rather than immediately transferring packet data to or from host memory (as current LAN solutions do). When either a given amount of network data has been buffered (called a watermark) or a configurable timer expires, the LAN device exits coalescing mode and bursts data accesses and interrupts to the platform. DMA Coalescing also enhances previously existing interrupt moderation behavior by throttling the observed device interrupt rate in conjunction with the configurable DMA Coalescing timer rate. The interrupt rate is governed by the Interrupt-Moderation-Rate (ITR).
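The watermark-or-timer rule can be illustrated with a small software model. This is a simplified sketch, not the device's actual logic: the function name is hypothetical, and it assumes the timer starts when the first packet enters an empty buffer, which the text above does not specify.

```python
def coalesce(packets, watermark_bytes, timer_us):
    """Group (arrival_time_us, size_bytes) packets into flushes.

    A flush happens when buffered bytes reach `watermark_bytes`, or when
    `timer_us` has elapsed since the first packet was buffered (assumed).
    Returns a list of (flush_time_us, packet_count, total_bytes).
    """
    flushes = []
    count = total = 0
    deadline = None  # None means the buffer is empty and no timer is running

    def flush(at):
        nonlocal count, total, deadline
        flushes.append((at, count, total))
        count = total = 0
        deadline = None

    for arrival, size in sorted(packets):
        # The timer expired before this packet arrived: flush what is buffered.
        if deadline is not None and arrival >= deadline:
            flush(deadline)
        if deadline is None:
            deadline = arrival + timer_us
        count += 1
        total += size
        if total >= watermark_bytes:
            flush(arrival)
    if count:
        flush(deadline)  # leftover packets go out when the timer expires
    return flushes


# Three back-to-back frames hit the watermark; a late frame waits for the timer.
print(coalesce([(0, 1514), (10, 1514), (20, 1514), (400, 1514)],
               watermark_bytes=4000, timer_us=250))
```

In the usage example the first three frames are delivered together as soon as the watermark is crossed, while the isolated fourth frame is held until its timer expires. That delay is the latency cost described in the note above.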

Enable Functionality Only When Needed

With PMT’s support of the ECMA-393 ProxZzzy specification, servers can move to low-power standby states (such as S3), maintain network presence, and be remotely activated via a variety of wakeup packet types.

Intel also supports Low-Power-Link-Up (LPLU). This facility reduces the link power usage in S3 by negotiating the lowest link-speed (where bandwidth isn’t required).

**DMA Coalescing Experiments & Testing**

Experiments were performed to evaluate the power saving benefits of PMT and the impact on network performance. Intel’s implementation scales to reduce power consumption over a wide range of network usage levels. (See Figure 3.)

At network usage below 5%, EEE (802.3az) was most effective, since there is more time to keep the link in a low-powered state. DMA Coalescing showed no significant benefit at such low usage rates since not much data is transferred at those rates.

DMA Coalescing is most effective in the 5% to 35% range, with maximum benefit at 25% usage. Above 35%, power saving benefits decrease. Industry studies report that most servers experience usage rates of 20–35%, with only 10-15% of a 1 Gbps link’s bandwidth used.

At higher usage, interrupt moderation directly reduces platform power by reducing overall CPU usage. This, combined with the Intel® Ethernet Controller I350’s low active power, provides the active system power benefit.

**Experiments**

- Experiments using an Intel Urbanna DP platform were run as follows:
  1. Vary the network load
  2. Vary Interrupt Moderation Rate
  3. Measure the platform power
  4. Enable DMA Coalescing and vary the DMA Coalescing watchdog time
  5. Fix the Interrupt Moderation Rate (ITR) value
  6. Measure the platform power
- Platform—Test setup
  - 2 x 2.93 GHz Quad-core Xeon® CPUs (X5570)
  - 12 GB (2048 x 6) DDR3 1333MHz memory
  - BIOS defaults—enhanced C-states, C6/Turbo/HT–enabled
  - I350 development—test adapter
  - Linux® 2.6.32 with the following features enabled: tickless, high_res_timers, hpet_timer, on-demand CPU governor, Powertop-timer_stats and PCI-ASPM.
  - Manually force ASPM L1 on the network adapter port.
  - Network connection at 1 Gbps.
  - Set one port as Receive with smartbits=1514 byte continuous UDP packet stream from another port.
- Results & Observations
  - Throttling interrupts by itself improves power efficiency.

![Figure 3](image-url)

• Adding DMA Coalescing creates further power savings. Figure 4 shows how moderating interrupts improves power efficiency and the addition of DMA Coalescing further increases power savings.

• Peak benefit reached at expected throughput of ~250 Mbps (25%).

• Beyond optimal throughput, power savings begin to decrease. Figure 4 shows the power savings of a single port using interrupt moderation and DMA Coalescing within the context of network usage.

• DMA moderation benefits increase as more time is allowed for coalescing, for example, 250 μs to 5 μs. However, as more coalescing time is allowed, response-time latency increases proportionally if the network data is not sufficient to exceed the device watermark.

• Asynchronous activity between two discrete controllers (2x dual-port vs 1x quad-port) interferes with CPU lower power state entry and duration, reducing DMA Coalescing power effectiveness.
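To put the test traffic in perspective, the arithmetic below estimates how many full-size frames arrive during one coalescing window. It assumes the 1514-byte frame size excludes the 4-byte FCS, that each frame also occupies 8 bytes of preamble and a 12-byte inter-frame gap on the wire, and that the usage percentage refers to line rate including this overhead. The function names are illustrative.

```python
WIRE_OVERHEAD_BYTES = 4 + 8 + 12  # FCS + preamble/SFD + minimum inter-frame gap


def frames_per_second(link_bps, utilization, frame_bytes):
    """Frames per second at a given fraction of line rate, counting
    per-frame Ethernet wire overhead."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps * utilization / bits_on_wire


def frames_per_window(link_bps, utilization, frame_bytes, window_us):
    """Average number of frames arriving within one coalescing window."""
    return frames_per_second(link_bps, utilization, frame_bytes) * window_us / 1e6


rate = frames_per_second(1e9, 0.25, 1514)
per_window = frames_per_window(1e9, 0.25, 1514, 250)
print(f"{rate:.0f} frames/s, about {per_window:.1f} frames per 250 us window")
```

At 25% usage of a 1 Gbps link this works out to roughly 20,000 frames per second, or about five frames per 250 µs window under these assumptions.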

Intel® Ethernet Controller I350

• Integrated Quad Port Silicon

• Intel has achieved DMA Coalescing in an integrated quad-port part today!

• Intel synchronizes DMA activity across all four ports of its quad-port controllers, beginning with the I350

DMA Coalescing Across Multiple Intel Quad Port Adapters

• Through software emulation, Intel is able to synchronize DMA Coalescing between two Intel adapters

• Typical platform power savings of 15W to 20W per server with DMA Coalescing enabled on a single four port LAN device
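For scale, the per-server figure above can be converted into annual energy. The sketch assumes the saving applies continuously for a full year, which is an upper bound, since the benefit varies with network usage.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts_saved, servers=1):
    """Energy saved per year, in kWh, for a constant power reduction."""
    return watts_saved * HOURS_PER_YEAR * servers / 1000.0


for watts in (15, 20):
    print(f"{watts} W saved -> {annual_kwh(watts):.1f} kWh per server per year")
```

A constant 15 W to 20 W reduction corresponds to roughly 131 to 175 kWh per server per year.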

Additional Configuration Information

The following platform-level configurations and settings dramatically improve the power efficiency of a system using PMT.

Platform Considerations

Overall, minimize the use of USB* devices. The USB bus is a polled bus; transactions are initiated by the host and not the USB device. Because of this, USB devices contribute more interrupts to the system and make it difficult to control Power Management. USB 2.0 does support a

Software Operating System Tuning

When using Windows* Server 2008 R2:

1. Disable core parking if needed.
2. Install all chipset-specific and device-specific device drivers (such as the Intel Chipset INF updater, as well as vendor-specific graphics drivers).

Contact your local Intel Field representative to obtain the “SelfTest” tool from http://www.intel.com/cd/edesign/library/asma-na/eng/434688.htm. The tool verifies the platform BIOS/OS configuration.

Linux* versions 2.6.33 and later support the required power management hooks to optimize DMA Coalescing. Customizations of the kernel enhance the effect:

1. Enable “tickless” feature with Tick=1000 and preemption

3. Enable Enhanced Intel Speedstep® Technology (EIST).

Enhanced Intel Speedstep Technology enables the system to dynamically adjust processor voltage and core frequency. This can result in decreased average power consumption and decreased average heat production.

4. Enable ASPM L1 if possible for additional PCIe power savings.
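On a running Linux system, some of these settings can be inspected through sysfs. The sketch below reads the kernel's PCIe ASPM policy and the CPU frequency governor for CPU 0. The two paths are the usual locations on mainline kernels, but they are an assumption here and may be absent depending on kernel version and configuration, in which case the corresponding entry is reported as `None`.

```python
from pathlib import Path

# Usual sysfs locations on mainline kernels (assumed; may not exist everywhere).
ASPM_POLICY = Path("/sys/module/pcie_aspm/parameters/policy")
GOVERNOR = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def selected_option(text):
    """Return the bracketed entry from a sysfs choice list such as
    'default performance [powersave]', or None if nothing is bracketed."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text(path):
    """Read a sysfs file, returning None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report():
    """Collect the current ASPM policy and CPU 0 frequency governor."""
    policy = read_text(ASPM_POLICY)
    return {
        "aspm_policy": selected_option(policy) if policy else None,
        "cpu_governor": read_text(GOVERNOR),
    }


if __name__ == "__main__":
    print(report())
```

The script only reads settings; it does not change them.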

![Single Port GbE Test](image-url)

PCIe Bus Analyzer Method

The PCIe bus analyzer method requires instrumenting the LAN adapter with a PCIe interposer and capturing PCIe traffic while the device is being used. Although a full description of the process is beyond the scope of this document, the general process is to capture the traces and then use the analyzer-specific software to visualize the bus usage based on the PCIe transactions. An example graph generated by LeCroy PETracer® appears in Figure 6.

Wall-power Method

The raw wall-power measurement of the system is relatively straightforward with the correct wall-power measurement equipment, such as a Watts Up®, Kill a Watt®, or an equivalent power measurement tool. In general, follow the platform measurement guidelines published in the EnergyStar® standards. In particular, allow a settling time after the system boots so that startup processes can complete and go idle.
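When power readings are logged over time, the settling period can be excluded before averaging. The following is a minimal sketch; the sample format, the function name and the 300-second settling time are assumptions for illustration and are not taken from the EnergyStar guidelines.

```python
from statistics import mean


def steady_state_watts(samples, settle_seconds):
    """Mean of (timestamp_s, watts) samples taken at or after `settle_seconds`."""
    kept = [watts for timestamp, watts in samples if timestamp >= settle_seconds]
    if not kept:
        raise ValueError("no samples after the settling period")
    return mean(kept)


# Hypothetical readings: power is higher while startup processes are running.
readings = [(0, 180.0), (60, 150.0), (300, 120.0), (360, 122.0), (420, 118.0)]
print(steady_state_watts(readings, settle_seconds=300))
```

Comparing this steady-state average with and without DMA Coalescing enabled gives the wall-power delta for a given workload.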

Package Cx State Residency Method

The Package Cx state residency method requires special software, as well as a basic background on “C states” on CPUs. “C states” in ACPI correspond to various CPU functional states, similar to D-states for I/O devices and S-states for platforms. For example, C0 means the CPU is fully operational and executing instructions. The higher C-states, such as C1, C2, and C3, correspond to progressively lower power states with progressively longer resume times.

The OS typically requests entry into one of these C states based on its own internal heuristics, as well as an exit latency table provided by the BIOS to the OS. The BIOS maps processor-specific C-states to the OS-exposed C-states. For example, if the OS calls the ACPI “C1” state, it would likely be mapped to the Intel C1E power state (where the CPU, on exit, resumes execution at the lowest-frequency available, if Enhanced Intel Speedstep® (EIST) is also enabled). If the OS invokes “C3,” the BIOS could activate either C3 or C6, depending upon how the BIOS is configured.
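The selection step can be illustrated with a simplified model. This is not any particular operating system's algorithm: it assumes the OS has a predicted idle duration and a latency limit, and it adds a "target residency" column (the minimum idle time for which a state pays off), which the text above does not mention. All table values are hypothetical.

```python
from typing import NamedTuple


class CState(NamedTuple):
    name: str
    exit_latency_us: int
    target_residency_us: int  # assumed extra column, not from the text above


# Hypothetical table, ordered from shallowest to deepest state.
TABLE = [
    CState("C1", 2, 2),
    CState("C3", 20, 80),
    CState("C6", 200, 800),
]


def pick_state(table, predicted_idle_us, latency_limit_us):
    """Return the name of the deepest state whose exit latency is within the
    limit and whose target residency fits the predicted idle time.
    None means no state qualifies and the CPU stays in C0."""
    choice = None
    for state in table:
        if (state.exit_latency_us <= latency_limit_us
                and state.target_residency_us <= predicted_idle_us):
            choice = state
    return choice.name if choice else None


print(pick_state(TABLE, predicted_idle_us=1000, latency_limit_us=500))
print(pick_state(TABLE, predicted_idle_us=100, latency_limit_us=500))
```

In this model a longer and more predictable idle period lets the selection reach a deeper state, which is the effect DMA Coalescing aims for.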

Lastly, although the BIOS may request that the CPU go into “C6,” the CPU may auto-demote, or select a different, more shallow C state such as C3, based on device access and interrupt delivery patterns.

Methods to Verify Behavior

There are three methods to verify behavior: at the PCIe Bus Analyzer level, via Package Cx state residency counters, and by raw wall-power measurements of the system.

enable the entire package to transition to a Package C state. The dramatic power savings occur whenever the entire package enters these Package C3 or Package C6 states. However, I/O activity such as graphics DMA, disk DMA, or LAN DMA, even when the platform appears idle, prevents the package from entering these deep power states. As such, much of the time spent on platform tuning is used to identify the source of this activity. An example of this is running a copy of Windows that hasn’t been activated, which causes a small amount of disk activity that isn’t noticeable by looking at CPU usage alone.

To determine the current Package C state residency, special software must be used. On Linux, the powertop tool (versions 2.0 and later) supports reporting these metrics, as does the Linux “turbostat” tool. For Windows, there are Intel tools. The Intel Battery Life Analyzer tool (BLA) reports the Package C states (as well as the actual cause of the I/O activity in many cases) on Intel client platforms. On server platforms, a special perfmon DLL must be installed to read the processor-specific performance counters.

Ideally, a well-tuned platform at idle should show Package C6 residency of 85% or greater. At this point, various networking benchmarks can be started to evaluate the benefits of DMA Coalescing and other Intel PMT features.
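Residency percentages of this kind are typically derived from two snapshots of a residency counter and a reference time counter. The sketch below shows the arithmetic only. It assumes both counters tick at the same rate and do not wrap between snapshots, and that the raw values are obtained by one of the tools mentioned above; the function names are illustrative.

```python
def residency_percent(state_start, state_end, ref_start, ref_end):
    """Percentage of the interval spent in a state, from two snapshots of a
    residency counter and a reference counter ticking at the same rate."""
    elapsed = ref_end - ref_start
    if elapsed <= 0:
        raise ValueError("reference counter did not advance")
    return 100.0 * (state_end - state_start) / elapsed


def meets_idle_target(percent, target=85.0):
    """Check a residency figure against the idle tuning target."""
    return percent >= target


# Hypothetical counter snapshots for illustration.
pc6 = residency_percent(state_start=0, state_end=900, ref_start=0, ref_end=1000)
print(pc6, meets_idle_target(pc6))
```

A platform that falls short of the target at idle is a signal to look for stray I/O activity before benchmarking.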

Performance testing and Latency
Power saving technology can add latency to a system. In most conditions, the added latency has little to no effect on system performance. However, if an application is sensitive to latency, or when performing system throughput testing, these power saving features may need to be disabled or tuned to minimize their impact.

References
Designing Power-Friendly Devices (Intel Whitepaper)
Energy-Efficient Platforms/Green Hill Software (Intel Whitepaper)
Intel I350 Quad-/Dual-Port GbE LAN Controller Datasheet