Introducing Intel® Advanced Performance Extensions (Intel® APX)

ID 784404
Updated 3/5/2024
Version Latest
Public

Intel® architecture powers datacenters and personal computers around the world. Since its introduction by Intel® in 1978, the architecture has continuously evolved to take advantage of emerging workloads and the relentless pace of Moore’s law. The original instruction set defined only eight 16-bit general-purpose registers, which over the years were doubled in number and quadrupled in size. A large set of vector registers was added, and most recently Intel® AMX introduced two-dimensional matrix registers, providing a big jump in AI performance.1

Today, we are introducing the next major step in the evolution of Intel® architecture. Intel® Advanced Performance Extensions (Intel® APX) expands the entire x86 instruction set with access to more registers and adds various new features that improve general-purpose performance. The extensions are designed to provide efficient performance gains across a variety of workloads – without significantly increasing silicon area or power consumption of the core.

Intel® APX doubles the number of general-purpose registers (GPRs) from 16 to 32. This allows the compiler to keep more values in registers; as a result, APX-compiled code contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel® 64 baseline.2 Register accesses are not only faster, but they also consume significantly less dynamic power than complex load and store operations.
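
As a hedged illustration (the function and constants below are invented for this sketch, not taken from the APX material), the following C code shows the kind of loop where register pressure produces spill loads and stores that additional GPRs can eliminate:

    #include <stddef.h>
    #include <stdint.h>

    /* Many simultaneously live accumulators, plus the pointer and loop
     * counter, push a 16-register target toward spilling values to the
     * stack; with 32 GPRs they can all remain in registers.  Whether a
     * particular compiler spills here depends on its register allocator
     * and options. */
    uint64_t checksum8(const uint64_t *data, size_t n)
    {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        uint64_t s4 = 0, s5 = 0, s6 = 0, s7 = 0;

        for (size_t i = 0; i + 8 <= n; i += 8) {
            s0 += data[i + 0] * 3;
            s1 += data[i + 1] * 5;
            s2 += data[i + 2] * 7;
            s3 += data[i + 3] * 11;
            s4 += data[i + 4] * 13;
            s5 += data[i + 5] * 17;
            s6 += data[i + 6] * 19;
            s7 += data[i + 7] * 23;
        }
        return s0 ^ s1 ^ s2 ^ s3 ^ s4 ^ s5 ^ s6 ^ s7;
    }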

Compiler enabling is straightforward – a new REX2 prefix provides uniform access to the new registers across the legacy integer instruction set. Intel® AVX instructions gain access via new bits defined in the existing EVEX prefix. In addition, legacy integer instructions can now also use EVEX to encode a dedicated destination register operand – turning them into three-operand instructions and reducing the need for extra register-move instructions. While the new prefixes increase average instruction length, APX-compiled code contains 10% fewer instructions2, so overall code density stays roughly the same.
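
To see why a separate destination operand matters, here is a small C function together with a sketch, in comments, of the two-operand versus three-operand lowering. The assembly syntax is illustrative only; the exact APX mnemonics and encodings are defined in the Intel® APX specification:

    /* For d = a + b where a and b are still needed afterwards, a
     * destructive two-operand add forces an extra register-to-register
     * move:
     *
     *     mov  rcx, rax      ; copy a so it is not overwritten
     *     add  rcx, rbx      ; rcx = a + b
     *
     * A three-operand, EVEX-encoded form can instead name a separate
     * destination directly (illustrative syntax only):
     *
     *     add  rcx, rax, rbx ; rcx = a + b, sources untouched
     */
    long add_and_keep(long a, long b, long *sum)
    {
        *sum = a + b;   /* the addition */
        return a - b;   /* a and b both stay live, so neither source
                         * register can simply be overwritten */
    }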

The new GPRs are XSAVE-enabled, which means that they can be automatically saved and restored by XSAVE/XRSTOR sequences during context switches. They do not change the size or layout of the XSAVE area, because they occupy the space left behind by the deprecated Intel® MPX registers.

We propose to define the new GPRs as caller-saved (volatile) state in application binary interfaces (ABIs), facilitating interoperability with legacy binaries. Optimized calling conventions can be introduced where legacy compatibility requirements are relaxed. Generally, more register state will need to be managed at function boundaries. To reduce the associated overhead, we are adding PUSH2/POP2 instructions that transfer two register values within a single memory operation. The processor tracks these new instructions internally and fast-forwards register data between matching PUSH2 and POP2 instructions without going through memory.
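
The sketch below shows where paired saves and restores arise and how PUSH2/POP2 would collapse them. The assembly in the comment is illustrative syntax only; the real operand order and constraints are defined in the Intel® APX specification:

    /* Values that stay live across a call must sit in callee-saved
     * registers, so the callee's prologue and epilogue save and restore
     * them.  Conceptually, adjacent saves such as
     *
     *     push rbp                  push2 rbp, rbx
     *     push rbx       become
     *     ...                       ...
     *     pop  rbx                  pop2  rbx, rbp
     *     pop  rbp
     *
     * (illustrative syntax; one paired memory operation each way). */
    extern long expensive(long x);

    long call_twice(long a, long b)
    {
        long r1 = expensive(a);   /* a and b stay live across the calls,
                                   * so they occupy callee-saved registers
                                   * that must be saved and restored */
        long r2 = expensive(b);
        return r1 + r2 + a + b;
    }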

The performance features introduced so far will have limited impact in workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads. Branch-predictor improvements can mitigate this only to a limited extent, as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which today consists of the SET instructions introduced with the Intel® 386 and the CMOV instructions introduced with the Intel® Pentium® Pro. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).
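
As a hedged C example of what if-conversion looks like in practice (whether a given compiler emits CMOV for the branchless form is an assumption about typical code generation, not a guarantee):

    #include <stddef.h>

    /* Assumes n >= 1.  With random input, the branch below mispredicts
     * roughly half the time. */
    long max_branchy(const long *v, size_t n)
    {
        long m = v[0];
        for (size_t i = 1; i < n; i++) {
            if (v[i] > m)          /* data-dependent, hard to predict */
                m = v[i];
        }
        return m;
    }

    /* The equivalent branchless form is typically lowered to a
     * CMOV-style conditional select instead of a branch. */
    long max_branchless(const long *v, size_t n)
    {
        long m = v[0];
        for (size_t i = 1; i < n; i++) {
            long x = v[i];
            m = (x > m) ? x : m;
        }
        return m;
    }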

Intel® APX adds conditional forms of load, store, and compare/test instructions, and it also adds an option for the compiler to suppress the status-flag writes of common instructions. These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties. All of these conditional ISA improvements are implemented via EVEX prefix extensions of existing legacy instructions.
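
The following C sketch shows the class of code these extensions target: the guarded load cannot be hoisted and selected with a plain CMOV, because executing it when the guard is false could fault, which is exactly where a conditional load form helps. Whether a particular compiler if-converts this pattern once APX is enabled is an assumption made here for illustration:

    /* The load is only safe when have_table is non-zero, so classic
     * if-conversion cannot execute it unconditionally and select the
     * result with CMOV.  A conditional load removes the branch while
     * keeping the memory access guarded. */
    long lookup_or_default(const long *table, int have_table, long dflt)
    {
        if (have_table)           /* guard protects a possibly invalid pointer */
            return table[0];      /* must not execute when the guard is false */
        return dflt;
    }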

Application developers can take advantage of Intel® APX by simple recompilation – source code changes are not expected to be needed. Workloads written in dynamic languages will automatically benefit as soon as the underlying runtime system has been enabled.

Intel® APX demonstrates the advantage of the variable-length instruction encodings of x86 – new features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware. This flexibility has allowed Intel® architecture to adapt and flourish over four decades of rapid advances in computing – and it enables the innovations that will keep it thriving into the future.

References

Footnotes

  1. intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
  2. This projection is based on a prototype simulation of the SPEC CPU® 2017 Integer benchmark suite.