Intel® Integrated Performance Primitives Cryptography Acceleration on 3rd Generation Intel® Xeon® Processor Scalable and 10th Gen Intel® Core™ Processors

Published: 11/30/2020  

Last Updated: 04/15/2021

By Abhinav Singh, Sergey Kirillov

Introduction

Intel® Integrated Performance Primitives (Intel® IPP) Cryptography is a software library that provides a comprehensive set of highly optimized, application domain-specific functions. It is a secure, fast, and lightweight library of building blocks for cryptography, highly optimized for various Intel® CPUs. This can provide substantial development and maintenance savings: you can write programs with one optimized execution path instead of maintaining multiple paths (Intel® Streaming Single Instruction Multiple Data (SIMD) Extensions 2, Supplemental Streaming SIMD Extensions 3, Intel® Advanced Vector Extensions, etc.) to achieve optimal performance across multiple generations of processors.

The goal of the Intel® IPP Cryptography software is to provide algorithmic building blocks with

  • a simple "primitive" C interface and data structures to enhance usability and portability
  • faster time-to-market
  • scalability with Intel® hardware

The Intel® IPP Cryptography library is available as part of the Intel® oneAPI Base Toolkit.

The Intel® IPP Cryptography library is also open source. For details about the open source version, please refer to this link.


History of Cryptography Instruction Set

Bulk encryption/decryption, hash functions, and public key algorithms constitute the basis of classic cryptography. Until 2010, these algorithms were implemented in software that used only the basic 32-bit and/or 64-bit instruction set or similar. As a result, the implementations spent quite a few CPU cycles on execution. In addition, making implementations of cryptographic algorithms resistant to side-channel attacks only increased their execution time.

In 2010, Intel launched microprocessors based on the Westmere microarchitecture, which expanded the Instruction Set Architecture (ISA) with the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and the carry-less multiplication (CLMUL) instruction. The purpose of Intel® AES-NI is to improve the speed of AES-based implementations of standard modes, as well as their resistance to side-channel attacks. Together with the CLMUL instruction, they form the basis for the AES Galois/Counter Mode (AES-GCM), which provides confidentiality and authentication simultaneously.

In 2013, hardware acceleration of the Secure Hash Algorithm (SHA) was introduced, first implemented in the low-power Intel Atom® processor Goldmont microarchitecture. This extension, named SHA-NI, supports the SHA-1 and SHA-256 algorithms.

In 2014, the ADX extension was implemented in the Broadwell microarchitecture. This extension consists of the ADCX and ADOX instructions, which, together with the earlier MULX instruction, are used in multi-precision arithmetic implementations. For example, the best OpenSSL* public key implementations to date are based on MULX accompanied by ADCX and ADOX.


Cryptography-related ISA Extension

The client and server processors covered in this article inherit all the cryptographic extensions mentioned above and add further ISA extensions: VAES and VCLMUL instructions, Galois Field New Instructions (GFNI), and IFMA instructions.

VAES and VCLMUL are extensions of the AES-NI and CLMUL instructions, respectively. They extend the existing instructions to 2x128 and 4x128 vector variants. The VAES instructions perform one round of AES encryption/decryption on each 128-bit lane, using the same or different round key values. The VAES extension makes the parallelizable AES modes much more efficient to implement than with legacy AES-NI. The 2x128 and 4x128 vector variants of CLMUL improve the performance of AES-GCM mode.
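To make "carry-less multiplication" concrete, here is a minimal Python sketch of the scalar arithmetic that a CLMUL-style instruction performs on a pair of operands. The function name `clmul` is hypothetical and chosen only for illustration; the sketch models the mathematics, not the instruction encoding or the library's implementation.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two non-negative integers.

    The operands are treated as polynomials over GF(2): partial products
    are combined with XOR instead of addition, so no carries propagate.
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR in the shifted partial product
        b >>= 1
        shift += 1
    return result


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), because the two middle terms cancel.
print(bin(clmul(0b11, 0b11)))  # prints 0b101
```

In AES-GCM, products of this kind (followed by a polynomial reduction) are what the authentication tag computation is built on, which is why wider vector variants help that mode.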

Galois Field New Instructions (GFNI) comprise three instructions: GF2P8AFFINEQB, GF2P8AFFINEINVQB, and GF2P8MULB. GF2P8AFFINEQB and GF2P8AFFINEINVQB compute an affine transformation in GF(2^8). The first applies the affine transformation to an element x of GF(2^8), and the second applies it to the inverse 1/x of x. The last instruction, GF2P8MULB, computes the product of two elements x and y of GF(2^8). All three use the GF(2^8) generated by the polynomial g(x) = x^8 + x^4 + x^3 + x + 1, which matches the AES algorithm. Because all representations of GF(2^8) are isomorphic, these instructions help implement algorithms involving affine transformations and multiplication over any GF(2^8). In particular, they help in implementing the SM4 algorithm.
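The field operations described above can be sketched on single bytes in plain Python. This is an illustrative model under stated assumptions, not the instructions' definition or the library's code: the helper names (`gf256_mul`, `gf256_inv`, `aes_affine`) are hypothetical, and the affine step shown is the specific AES S-box transformation, whereas the GFNI affine instructions accept a general 8x8 bit matrix and constant.

```python
AES_POLY = 0x11B  # g(x) = x^8 + x^4 + x^3 + x + 1


def gf256_mul(x: int, y: int) -> int:
    """Multiply two bytes as elements of GF(2^8) modulo AES_POLY."""
    product = 0
    for _ in range(8):
        if y & 1:
            product ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:      # degree reached 8: reduce by g(x)
            x ^= AES_POLY
    return product


def gf256_inv(x: int) -> int:
    """Inverse via x^254 (since x^255 = 1 for x != 0); maps 0 to 0."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, x)
    return result


def rotl8(b: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def aes_affine(b: int) -> int:
    """The AES S-box affine transformation (one particular matrix/constant)."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


# Affine transformation of the inverse reproduces the AES S-box.
print(hex(gf256_mul(0x57, 0x83)))           # prints 0xc1
print(hex(aes_affine(gf256_inv(0x53))))     # prints 0xed
```

The second printed value is the AES S-box entry for 0x53, which shows how "affine transformation of 1/x" maps directly onto a building block of AES.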

The IFMA extension consists of two instructions, VPMADD52LUQ and VPMADD52HUQ, which multiply packed unsigned 52-bit integers and add the low or high 52 bits of each product to a 64-bit accumulator. These instructions are supported in three forms: 2x64, 4x64, and 8x64. The target for this extension is multi-precision arithmetic, primarily multiplicative operations. Using this extension helps to implement public key cryptography algorithms efficiently (RSA and elliptic curve based encryption and signing operations).
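As a rough illustration of how the two multiply-accumulate operations fit multi-precision arithmetic, the following Python sketch models a single 64-bit lane and uses it in a schoolbook multiplication with 52-bit limbs. The function names (`madd52lo`, `madd52hi`, `to_limbs`, `mul_radix52`) are hypothetical, and the loop structure is an assumption made for clarity rather than a description of how Intel® IPP Crypto or Crypto MB organizes the computation.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc: int, a: int, b: int) -> int:
    """One lane: acc + low 52 bits of (a * b), wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)   # up to 104 bits
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc: int, a: int, b: int) -> int:
    """One lane: acc + high 52 bits of (a * b), wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


def to_limbs(n: int, count: int) -> list:
    """Split n into `count` limbs of 52 bits, least significant first."""
    return [(n >> (52 * i)) & MASK52 for i in range(count)]


def mul_radix52(x: int, y: int, limbs: int) -> int:
    """Schoolbook product of two integers that each fit in `limbs` limbs."""
    a, b = to_limbs(x, limbs), to_limbs(y, limbs)
    acc = [0] * (2 * limbs)
    for i in range(limbs):
        for j in range(limbs):
            acc[i + j] = madd52lo(acc[i + j], a[i], b[j])
            acc[i + j + 1] = madd52hi(acc[i + j + 1], a[i], b[j])
    # Accumulators may hold more than 52 bits; the weighted sum resolves
    # the pending carries. The 12 spare bits per 64-bit lane are enough
    # for the small limb counts used here.
    return sum(value << (52 * k) for k, value in enumerate(acc))


x = 0x1234_5678_9ABC_DEF0_1234_5678_9ABC_DEF0
y = 0x0FED_CBA9_8765_4321_0FED_CBA9_8765_4321
print(mul_radix52(x, y, 3) == x * y)  # prints True
```

Keeping limbs at 52 bits inside 64-bit lanes leaves 12 bits of headroom, so many partial products can be accumulated before carries have to be normalized.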

Intel® IPP Crypto Library

The Intel® IPP Crypto Library focuses on efficient implementation and optimization of basic cryptography algorithms. Enabling new Intel ISA extensions in cryptography improves performance and is considered an important part of Intel® IPP Crypto development. In the Intel® IPP Crypto 2020 Update 3 release, new ISA support was implemented and enabled in all three areas: bulk encryption, hashes (SHA-1 and SHA-256), and RSA encryption. The performance differences between Intel® IPP Crypto with and without the new ISA enabled are presented in Table 1 through Table 3 later in this article. In both cases, the benchmark was run on the hardware listed in the Enabling Results and Conclusion section.

Since 2020, another cryptography library, called Crypto Multi-Buffer (MB), has been delivered together with the Intel® IPP Crypto library. Unlike Intel® IPP Crypto, Crypto MB focuses on parallel processing of 8 independent cryptographic requests and is aimed at server and cloud applications. It can be used as a standalone library or together with the Intel® Quick Assist Technology (Intel® QAT) Engine. By itself, the "multi-buffer" approach provides advantages that complement new ISA enabling. The performance differences between OpenSSL* 1.1.1 and Crypto MB are presented in Table 4 through Table 6 later in this article. In both cases, the benchmark was run on the client and server microprocessors.

 

Enabling Results and Conclusion

The following computer platforms and library versions were used for the measurements:

  • Intel® Core™ i7-1065G7 CPU @ 1.30GHz, L1d=192KiB, L1i=128KiB, L2=2MiB, L3=8MiB running with Ubuntu* 20.04.1
  • Intel® Xeon® Platinum 8368 CPU @ 2.40GHz, L1d=48K, L1i=32K, L2=1280K, L3=58368K running with RedHat* 8.1
  • Intel® IPP Cryptography 2020 Update 3
  • OpenSSL* 1.1.1

The Intel® IPP Crypto performance results for the AES128, SM4, SHA-1, and SHA-256 algorithms are presented in CPU cycles/byte. The RSA performance results are presented in CPU cycles/operation.

Length, Bytes    w/o New ISA Enabled    New ISA Enabled    Unit

AES128-DEC-CBC
1024             0.326                  0.148              cycles/byte
2048             0.321                  0.152              cycles/byte
4096             0.318                  0.154              cycles/byte

AES128-CTR
1024             0.416                  0.206              cycles/byte
2048             0.401                  0.182              cycles/byte
4096             0.394                  0.169              cycles/byte

AES128-GCM
1024             1.35                   0.344              cycles/byte
2048             1.31                   0.291              cycles/byte
4096             1.26                   0.262              cycles/byte

SM4-DEC-CBC
1024             5.24                   0.92               cycles/byte
2048             4.59                   0.918              cycles/byte
4096             4.59                   0.928              cycles/byte

SM4-CTR
1024             5.3                    1.1                cycles/byte
2048             5.25                   1.08               cycles/byte
4096             5.23                   1.06               cycles/byte

Table 1. Performance of AES128 and SM4 block ciphers with Intel® IPP Crypto with and without new ISA.

Length, Bytes    w/o New ISA Enabled    New ISA Enabled    Unit

SHA-1
1024             4.12                   2.3                cycles/byte
2048             3.98                   2.18               cycles/byte
4096             3.9                    2.12               cycles/byte

SHA-256
1024             8.71                   2.88               cycles/byte
2048             8.44                   2.73               cycles/byte
4096             83                     2.66               cycles/byte

Table 2. Performance of SHA-1 and SHA-256 hash functions with Intel® IPP Crypto with and without new ISA.

Operation              w/o New ISA Enabled    New ISA Enabled    Unit

RSA-2048
private exp (crt)      1978124                1064056            cycles/op
public exp, e=65537    51806                  32822              cycles/op

RSA-3072
private exp (crt)      6596848                4934034            cycles/op
public exp, e=65537    111736                 52076              cycles/op

RSA-4096
private exp (crt)      14880000               8086920            cycles/op
public exp, e=65537    196042                 70724              cycles/op

Table 3. Performance of public and private key RSA-2048/3072/4096 operations with Intel® IPP Crypto with and without new ISA.

For Crypto MB, the performance comparison with OpenSSL* 1.1.1 is presented below. Again, both OpenSSL* and Crypto MB were benchmarked on the client and the server. Because OpenSSL* shows no difference between the client and server runs, only one number is given in the OpenSSL* column. In contrast, the performance of Crypto MB varies with the target CPU, even though the same code runs in both cases.

The results below are presented in CPU cycles/operation for the public key algorithms (RSA and EC). It is important to note that OpenSSL* performs a single RSA or EC operation, whereas Crypto MB performs 8 similar operations. So, for a fair comparison with OpenSSL*, the Crypto MB figures should be divided by 8.
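The division by 8 can be sketched as a small calculation. The figures are the RSA private key (crt) rows as reported in Table 4; the helper names `per_operation` and `speedup` are hypothetical and exist only for this illustration.

```python
BUFFERS = 8  # Crypto MB processes 8 operations per measured call

# (OpenSSL*, Crypto MB client, Crypto MB server), cycles as listed in Table 4
rsa_private_crt = {
    "RSA-2048": (2067784, 3592084, 2636810),
    "RSA-3072": (6294468, 14866835, 14485652),
    "RSA-4096": (14168714, 29022226, 23385406),
}


def per_operation(batch_cycles: float) -> float:
    """Cycles attributable to one of the 8 operations in a Crypto MB call."""
    return batch_cycles / BUFFERS


def speedup(openssl_cycles: float, mb_batch_cycles: float) -> float:
    """How many times cheaper one Crypto MB operation is than one OpenSSL* one."""
    return openssl_cycles / per_operation(mb_batch_cycles)


for name, (openssl, client, server) in rsa_private_crt.items():
    print(f"{name}: client {speedup(openssl, client):.2f}x, "
          f"server {speedup(openssl, server):.2f}x")
```

Read this way, a Crypto MB figure that looks larger than the OpenSSL* figure in the raw table can still correspond to a lower cost per operation.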

Operation                   OpenSSL*    Crypto MB (Client)    Crypto MB (Server)    Unit

RSA-2048, public e=65537    60090       123188                109932                cycles/op
RSA-3072, public e=65537    127410      267623                244976                cycles/op
RSA-4096, public e=65537    219198      463649                434724                cycles/op
RSA-2048, private (crt)     2067784     3592084               2636810               cycles/op
RSA-3072, private (crt)     6294468     14866835              14485652              cycles/op
RSA-4096, private (crt)     14168714    29022226              23385406              cycles/op

Table 4. Performance of public and private key RSA-2048/3072/4096 operations in OpenSSL* 1.1.1 and Crypto MB on client and server.

Operation           OpenSSL*    Crypto MB (Client)    Crypto MB (Server)    Unit

ECDH P256           177496      368513                340726                cycles/op
ECDH P384           2938032     1326145               1482418               cycles/op
ECDH P521           1107801     2041656               1654938               cycles/op
ECDH X25519         118727      190977                140426                cycles/op
ECDSA sign P256     75207       142040                136626                cycles/op
ECDSA sign P384     3110865     511742                564720                cycles/op
ECDSA sign P521     951580      897758                787000                cycles/op

Table 5. Performance of ECDH and ECDSA sign over different EC in OpenSSL* 1.1.1 and Crypto MB on client and server.

Length, Bytes    OpenSSL*    Crypto MB (Client)

64               36.7        2.8
1024             15.7        1.25
8192             13.9        1.25

Table 6. Performance of the SM3 implementation in OpenSSL* 1.1.1 and Crypto MB on client.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.