For most applications, we think about performance in terms of throughput. What matters is how much work an application can do in a certain amount of time. That’s why hardware is usually designed with throughput in mind, and popular software optimization techniques aim to increase it.
However, there are some applications where latency is more important, such as High Frequency Trading (HFT), search engines and telecommunications. Latency is the time it takes to perform a single operation, such as delivering a single packet. Latency and throughput are closely related, but the distinction is important. You can sometimes increase throughput by adding more compute capacity; for example, you can double the number of servers to do twice the work in the same amount of time. But you can’t deliver a particular message any quicker without optimizing the way the messages are handled within each server.
Some optimizations improve both latency and throughput, but there is usually a trade-off. Throughput solutions tend to store packets in a buffer and process them in batches, but low latency solutions require every packet to be processed immediately.
Consistency is also important. In HFT, huge profits and losses can be made on global events. When news breaks around elections or other significant events, there can be bursts of trading activity with significant price moves. Having an outlier (a relatively high latency trade) at this busy time could result in significant losses.
Latency tuning is a complex topic requiring a wide and deep understanding of networking, kernel organization, CPU and platform performance, and thread synchronization. In this article, I’ll outline some of the most useful techniques, based on my work with companies in telecommunications and HFT.
Understanding the Challenge of Latency Optimization
Here’s an analogy to illustrate the challenge of latency optimization. Imagine a group of people working in an office, who communicate by passing paper messages. Each message contains the data of a sender, recipient and an action request. Messages are stored on tables in the office. Some people receive messages from the outside world and store them on the table. Others take messages from the table and deliver them to one of the decision makers. Each decision maker only cares about certain types of messages.
The decision makers read the messages and decide whether the action request is to be fulfilled, postponed or cancelled. The requests that will be fulfilled are stored on another table. Messengers take these requests and deliver them to the people who will carry out the actions. That might involve sending the messages to the outside world, and sending confirmations to the original message sender.
To complicate things even more, there is a certain topology of message-passing routes. For example, the office building might have a complicated layout of rooms and corridors and people may need access to some of the rooms. Under normal conditions the system may function reasonably well in handling, let’s say, two hundred messages a day with an average message turnaround of five minutes.
Now, the goal is to dramatically reduce the turnaround time. At the same time, you want to make sure the turnaround time for a message is never more than twice the average. In other words, you want to be able to handle the bursts in activity without causing any latency outliers.
So, how can you improve office efficiency? You could hire more people to move messages around (increasing throughput), and hire faster people (reducing latency). I can imagine you might reduce latency from five minutes to two minutes (maybe even slightly less if you manage to hire Usain Bolt). But you will eventually hit a wall. There is no one faster than Bolt, right? Comparing this approach to computer systems, the people represent processes and this is about executing more threads or processes (to increase throughput) and buying faster computers (to cut latency).
Perhaps the office layout is not the best for the job. It’s important that everyone has enough space to do their job efficiently. Are corridors too narrow so people get stuck there? Make them wider. Are rooms tiny, so people have to queue to get in? Make them bigger. This is like buying a computer with more cores, larger caches and higher memory and I/O bandwidth.
Next, you could use express delivery services, rather than the normal postal service, for messages coming into and out of the office. In a computer system, this is about the choice of network equipment (adapters and switches) and their tuning. As in the office, the fastest delivery option might be the right choice, but is also probably the most expensive.
So now the latency is down to one minute. You can also instruct people and train them to communicate and execute more quickly. This is like tuning software to execute faster. I’ll take 15 percent off the latency for that. We are at 51 seconds.
The next step is to avoid people bumping into each other, or getting in each other’s way. We would like to enable all the people taking messages from the table and putting messages on it to do so at the same time, with no delay. We may want to keep messages sorted in some way (in separate boxes on the table) to streamline the process. There may also be messages of different priorities. In software, this is about improving thread synchronization. Threads should have as parallel and as quick access to the message queue as possible. Fixing bottlenecks increases throughput dramatically, and should also have some effect on latency. Now we can handle bursts of activity, although we do still have the risk of outliers.
People might stop for a chat sometimes or a door may stick in a locked position. There are a lot of little things that could cause delay. The highest priority is to ensure the following: that there are never more people than could fit into a particular space, there are no restrictions on people’s speed, there are no activities unrelated to the current job, and there is no interference from other people. For a computer application, this means we need to ensure that it never runs out of CPU cores, power states are set to maximum performance, and kernel (operating system) or middleware activities are isolated so they do not evict application thread activities.
Now let’s consider whether the office environment is conducive to our goal. Can people open doors easily? Are the floors slippery, so people have to walk with greater care and less speed? The office environment is like the kernel of an operating system. If the office environment can’t be made good enough, perhaps we can avoid part of it. Instead of going through the door, the most dexterous could pass a message through a window. It might be inconvenient, but it’s fast. This is like using kernel bypass solutions for networking.
Instead of relying on a kernel network stack, kernel bypass solutions implement user space networking. This avoids unnecessary memory copies (kernel space to user space) and the scheduler delay when placing the receiver thread for execution. In kernel bypass, the receiver thread typically uses busy-waiting. Rather than waiting on a lock, it continuously checks the lock variable until it flags: “Go!”
On top of that there may be different methods of exchanging messages through windows. You would likely start with delivering hand to hand. This sounds reliable, but it’s not the fastest. That’s how the Transmission Control Protocol (TCP) works. Moving to User Datagram Protocol (UDP) would mean just throwing messages into the receiver’s window. You don’t need to wait for a person’s readiness to get a message from your hand. Looking for further improvement? How about throwing messages through the window so they land right on the table in the message queue? In the networking world, such an approach is called remote direct memory access (RDMA). I believe the latency has been cut to about 35 seconds now.
What about an office purpose-built, according to your design? You can make sure the messengers are able to move freely and their paths are optimized. That could get the latency down to 30 seconds, perhaps. Redesigning the office is like using a field programmable gate array (FPGA). An FPGA is a compute device that can be programmed specifically for a particular application. CPUs are hardcoded, which means they can only execute a particular instruction set with a data flow design that is also fixed. Unlike CPUs, FPGAs are not hardcoded for any particular instruction set so programming them makes them able to run a particular task and only that task. Data flow is also programmed for a particular application. As with a custom-designed office, it’s not easy to create an FPGA or to modify it later. It might deliver the lowest latency, but if anything changes in the workflow, it might not be suitable any more. An FPGA is also a type of office where thousands of people can stroll around (lots of parallelism), but there’s no running allowed (low frequency). I’d recommend using an FPGA only after considering the other options above.
To go further, you’ll need to use performance analysis tools. In part two of this article, I’ll show you how Intel® VTune™ Amplifier and Intel® Processor Trace technology can be used to identify optimization opportunities.
Making the Right Hardware Choices
Before we look at tuning the hardware, we should consider the different hardware options available.
One of the most important decisions is whether to use a standard CPU or an FPGA.
The most extreme low latency solutions are developed and deployed on FPGAs. Despite the fact that FPGAs are not particularly fast in terms of frequency, they are nearly unlimited in terms of parallelism, because the device can be designed to satisfy the demands of the task at hand. This only makes a difference if the algorithm is highly parallel. There are two ways that parallelism helps. First, it can handle a huge number of packets simultaneously, so it handles bursts very well with a stable latency. As soon as there are more packets than cores in a CPU, there will be a delay. This has more of an impact on throughput than on latency. The second way that parallelism helps is at the instruction level. A CPU can only carry out four instructions per cycle. An FPGA can carry out a nearly unlimited number of instructions simultaneously. For example, it can parse all the fields of an incoming packet concurrently. This is why it delivers lower latency despite its lower frequency.
In low latency applications, the FPGA usually receives a network signal through a PHY chip and does a full parsing of the network packets. It takes roughly half the time, compared to parsing and delivering packets from a network adapter to a CPU (even using the best kernel bypass solutions). In HFT, Ethernet is typically used because exchanges provide Ethernet connectivity. FPGA vendors provide Ethernet building blocks for various needs.
Some low latency solutions are designed to work across CPUs and FPGAs. Currently a typical connection is by PCI-e, but Intel has announced a development module using Intel® Xeon® processors together with FPGAs, where connectivity is by Intel® QuickPath Interconnect (Intel® QPI) link. This reduces connection latency significantly and also increases throughput.
When using CPU-based solutions, the CPU frequency is obviously the most important parameter for most low latency applications. The typical hardware choice is a trade-off between frequency and the number of cores. For particularly critical workloads, it’s not uncommon for server CPUs and other components to be overclocked. Overclocking memory usually has less impact: for a typical trading platform, memory accounts for about 10 percent of latency (though your mileage may vary), so the gains are limited and in most cases it isn’t worth trying. Be aware that having more DIMMs may cause a drop in memory speed.
Single-socket servers operating independently are generally better suited for latency because they eliminate some of the complications and delay associated with ensuring consistent inter-socket communication.
The lowest latencies and the highest throughputs are achieved by high-performance computing (HPC) specialized interconnect solutions, which are widely used in HPC clusters. For InfiniBand* interconnect, the half-roundtrip latency could be as low as 700 nanoseconds. (The half-roundtrip latency is measured from the moment a packet arrives at the network port of a server, until the moment the response has been sent from the server’s network port.)
In HFT and telco, long range networking is usually based on Ethernet. To ensure the lowest possible latency when using Ethernet, two critical components must be used: a low latency network adapter and kernel bypass software. The fastest half-roundtrip latency you can get with kernel bypass is about 1.1 microseconds for UDP and slightly slower with TCP. Kernel bypass software implements the network stack in user space and eliminates bottlenecks in the kernel (superfluous data copies and context switches).
Another high throughput and low latency option for Ethernet networking is the Data Plane Development Kit (DPDK). DPDK dedicates certain CPU cores to be the packet receiver threads and uses a permanent polling mode in the driver to ensure the quickest possible response to arriving packets. For more information, see http://dpdk.org/.
When we consider low latency applications, storage is rarely on a low latency path. When we do consider it, the best solution is a solid state drive (SSD). With access latencies of dozens of microseconds, SSDs are much faster than hard drives. There are PCI-e-based NVMe SSDs that provide the lowest latencies and the highest bandwidths.
Intel has announced the 3D XPoint™ technology, and released the first SSD based on it. These disks bring latency down to several microseconds. This makes the 3D XPoint technology ideal for high performance SSD storage, delivering up to ten times the performance of NAND across a PCIe NVMe interface. An even better alternative in the future will be non-volatile memory based on 3D XPoint technology.
Tuning the Hardware for Low Latency
The default hardware settings are usually optimized for the highest throughput and reasonably low power consumption. When we’re chasing latency, that’s not what we are looking for. This section provides a checklist for tuning the hardware for latency.
In addition to these suggestions, check for any specific guidance from OEMs on latency tuning for their servers.
Ensure that Turbo is on.
Disable lower CPU power states. Settings vary among different vendors, so after turning C-states off, you should check whether there are extra settings like C1E, and memory and PCI-e power saving states, which should also be disabled.
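After changing the firmware settings, it can help to check which idle states the operating system still sees. The following is a minimal sketch, assuming a Linux system that exposes the cpuidle interface under /sys/devices/system/cpu; the function name list_cstates and its optional directory argument are illustrative, not part of any tool.

```shell
# List the idle states the kernel exposes for core 0, one per line,
# as "<state name> <exit latency in microseconds>".
# $1: optional CPU sysfs directory (hypothetical parameter, added so the
#     function can be pointed at a test directory)
list_cstates() {
    cpu_dir=${1:-/sys/devices/system/cpu}
    for d in "$cpu_dir"/cpu0/cpuidle/state[0-9]*; do
        [ -r "$d/name" ] || continue      # cpuidle may not be exposed at all
        echo "$(cat "$d/name") $(cat "$d/latency")"
    done
}

# Example: list_cstates
```

If deep states are still listed after the firmware change, one of the vendor-specific settings mentioned above may remain enabled.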
Check for other settings that might influence performance. This varies greatly by OEM, but should include anything power related, such as fan speed settings.
Disable hyper-threading to reduce variations in latency (jitter).
Disable any virtualization options.
Disable any monitoring options.
Disable Hardware Power Management, introduced in the Intel® Xeon® processor E5-2600 v4 product family. It provides more control over power management, but it can cause jitter and so is not recommended for latency-sensitive applications.
Ensure that the network adapter is inserted into a PCI-e slot attached to the socket where the receiver thread is running. That shaves off inter-socket communication latency and allows Intel® Data Direct I/O Technology to place data directly into the last level cache (LLC) of the same socket.
Bind network interrupts to a core on the same socket as the receiver thread. Find the IRQ number N of each adapter queue in /proc/interrupts, and then set its affinity by writing a hexadecimal CPU bitmask (for example, 4 for core 2):
echo <mask> > /proc/irq/N/smp_affinity
Disable interrupt coalescing. Usually the default mode is adaptive, which is much better than any fixed setting, but it is still several microseconds slower than disabling it. The recommended setting is:
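One way to check which socket an adapter is attached to is to read its NUMA node from sysfs. This is a minimal sketch, assuming a Linux system; the function name nic_numa_node, the interface name eth0 and the optional sysfs root argument are placeholders for illustration.

```shell
# Print the NUMA node (socket) that a network interface is attached to.
# A value of -1 means the platform does not report a node.
# $1: interface name (eth0 below is a placeholder)
# $2: optional sysfs root (hypothetical parameter, defaults to /sys)
nic_numa_node() {
    iface=$1
    sysfs=${2:-/sys}
    cat "$sysfs/class/net/$iface/device/numa_node"
}

# Example:
#   nic_numa_node eth0
#   lscpu | grep "NUMA node"    # shows which cores belong to each node
```

The receiver thread should then run on one of the cores that lscpu lists for the reported node.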
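The value written to smp_affinity is a hexadecimal bitmask in which bit k stands for core k. The following is a minimal sketch of deriving that value; the helper name core_to_mask, IRQ number 45 and interface name eth0 are assumptions for illustration, and the helper only covers cores 0 to 31.

```shell
# Convert a core number (0-31) into the hexadecimal bitmask format
# used by /proc/irq/N/smp_affinity (bit k set = core k allowed).
core_to_mask() {
    printf '%x\n' $((1 << $1))
}

# Usage as root (IRQ 45, core 2 and eth0 are placeholders):
#   grep eth0 /proc/interrupts                   # find the adapter's IRQ numbers
#   core_to_mask 2 > /proc/irq/45/smp_affinity   # writes the mask for core 2
#   cat /proc/irq/45/smp_affinity                # read back to confirm
```

Kernels that provide /proc/irq/N/smp_affinity_list accept a plain core number there, which avoids the conversion.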
ethtool -C <interface> rx-usecs 0 rx-frames 0 tx-usecs 0 tx-frames 0 pkt-rate-low 0 pkt-rate-high 0
Kernel bypass solutions usually come tuned for latency, but there still may be some useful options to try out such as polling settings.
Set the correct power mode. Edit /boot/grub/grub.conf and add:
nosoftlockup intel_idle.max_cstate=0 processor.max_cstate=0 mce=ignore_ce idle=poll
to the kernel line. For more information, see www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
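After rebooting, the running kernel's command line can be compared against the intended parameters. This is a minimal sketch; the function name missing_params is hypothetical, and it only checks that each parameter appears as a whole word, not that the kernel honored it.

```shell
# Print every requested parameter that is absent from a kernel command line.
# $1: command line string (normally the contents of /proc/cmdline)
# remaining arguments: the parameters to look for
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;          # present as a whole word: nothing to report
            *) echo "$p" ;;       # absent: print it
        esac
    done
}

# Example:
#   missing_params "$(cat /proc/cmdline)" nosoftlockup intel_idle.max_cstate=0 \
#       processor.max_cstate=0 mce=ignore_ce idle=poll
```

An empty result means all of the listed parameters reached the kernel.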
Turn off the cpuspeed service.
Disable unnecessary kernel services to avoid jitter.
Turn off the IRQ Balance service if interrupt affinity has been set.
Try tuning IPv4 parameters. Although this is more important for throughput, it can help to handle bursts of network activity.
Disable the TCP timestamps option for better CPU utilization:
sysctl -w net.ipv4.tcp_timestamps=0
Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
Increase the maximum and default socket buffer sizes (the maximum values cap what an application can request using setsockopt()):
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
Raise the limits used by Linux* auto-tuning of TCP buffers. The three values are the minimum, default, and maximum number of bytes:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
Enable low latency mode for TCP (on recent kernels this is a legacy option that no longer has any effect):
sysctl -w net.ipv4.tcp_low_latency=1
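Settings made with sysctl -w can be read back from /proc/sys, where each dot in the key becomes a directory separator. The following is a minimal sketch for reviewing a few of the values above; the function names sysctl_path and show_settings are illustrative, and the list of keys is only a sample.

```shell
# Map a sysctl key to its file under /proc/sys (dots become slashes).
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Print "key = value" for a sample of the settings in this checklist.
show_settings() {
    for key in net.ipv4.tcp_timestamps net.ipv4.tcp_sack \
               net.core.netdev_max_backlog net.core.rmem_max net.core.wmem_max; do
        f=$(sysctl_path "$key")
        if [ -r "$f" ]; then
            echo "$key = $(cat "$f")"
        else
            echo "$key = (not available on this kernel)"
        fi
    done
}

# Example: show_settings
```

Values set with sysctl -w are lost at reboot; to keep them, the same key = value pairs can be added to /etc/sysctl.conf.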
A good alternative for tuning the network stack is to apply the network-latency profile of the tuned service:
tuned-adm profile network-latency
Set the scaling governor to “performance” mode for each core used by a process:
for ((i=0; i<num_of_cores; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
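To confirm that the change took effect, the governors can be read back. This is a minimal sketch, assuming the cpufreq interface is exposed in sysfs; the function name non_performance_cores and its optional directory argument are hypothetical.

```shell
# List cores whose scaling governor is not "performance".
# $1: optional CPU sysfs directory (defaults to /sys/devices/system/cpu)
non_performance_cores() {
    cpu_dir=${1:-/sys/devices/system/cpu}
    for f in "$cpu_dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue           # cpufreq may not be exposed
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            core=${f#"$cpu_dir"/}         # strip the directory prefix
            echo "${core%%/*}: $gov"      # keep only the cpuN component
        fi
    done
}

# Example: non_performance_cores
```

No output means every core that exposes a governor is already in performance mode.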
Configure the kernel as preemptive to help reduce the number of outliers.
Use a tickless kernel to help eliminate any regular timer interrupts causing outliers.
Finally, use the isolcpus parameter to isolate the cores allocated to an application from OS processes.
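As an illustration, isolating cores 2 and 3 means adding isolcpus=2,3 to the kernel line and then pinning the application to those cores explicitly. The sketch below expands a kernel CPU list into individual core numbers; the function name expand_cpu_list, the core numbers and the ./receiver binary are placeholders, and the /sys/devices/system/cpu/isolated file is only present on kernels that expose it.

```shell
# Expand a kernel CPU list such as "2-3,6" into one core number per line.
expand_cpu_list() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue         # skip empty entries
        seq "$first" "${last:-$first}"      # single cores expand to themselves
    done
}

# Example:
#   expand_cpu_list "$(cat /sys/devices/system/cpu/isolated)"
#   taskset -c 2 ./receiver    # run a (placeholder) application on isolated core 2
```

The scheduler does not place ordinary tasks on isolated cores, so the application has to be pinned to them with taskset or an equivalent affinity call.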
This article provides an introduction to the challenge of latency tuning, the hardware choices available, and a checklist for configuring the hardware and operating system for low latency. In the second article in this series, we look at application tuning, including a working example.