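Intel® Integrated Performance Primitives Cryptography Acceleration on 3rd Generation Intel® Xeon® Processor Scalable and 10th Gen Intel® Core™ Processors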

Abhinav Singh, Sergey Kirillov

Introduction

Intel® Integrated Performance Primitives (Intel® IPP) Cryptography is a software library that provides a comprehensive set of highly optimized, application domain-specific functions. It is a secure, fast, and lightweight library of building blocks for cryptography, optimized for various Intel® CPUs. This can provide substantial development and maintenance savings: you can write programs with one optimized execution path instead of maintaining multiple paths (Intel® Streaming Single Instruction Multiple Data (SIMD) Extensions 2, Supplemental Streaming SIMD Extensions 3, Intel® Advanced Vector Extensions, etc.) to achieve optimal performance across multiple generations of processors.

The goal of the Intel® IPP Cryptography software is to provide algorithmic building blocks with:

a simple "primitive" C interface and data structures to enhance usability and portability
faster time-to-market
scalability with Intel® hardware

The Intel® IPP Cryptography library is available as part of the Intel® oneAPI Base Toolkit.

The Intel® IPP Cryptography library is also open source. For details about the open source version, please refer to this link.

History of Cryptography Instruction Set

Bulk encryption/decryption, hash functions, and public key algorithms constitute the basis of classic cryptography. Until 2010, these algorithms were implemented in software that used only the basic 32-bit and/or 64-bit x86 instruction set or similar. As a result, the implementations spent many CPU cycles on execution. In addition, hardening implementations of cryptographic algorithms against side-channel attacks further increased their execution time.

In 2010, Intel launched microprocessors based on the Westmere microarchitecture, which expanded the Instruction Set Architecture (ISA) with the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and the carry-less multiplication (CLMUL) instruction. The purpose of Intel® AES-NI is to improve the speed, as well as the resistance to side-channel attacks, of AES-based implementations of standard modes. Together with the CLMUL instruction, they form the basis for the AES Galois/Counter Mode (AES-GCM), which provides confidentiality and authentication simultaneously.

In 2013, Intel introduced hardware acceleration of the Secure Hash Algorithm (SHA), which was first implemented in the low-power Intel Atom® processor Goldmont microarchitecture. This extension, named SHA-NI, supports the SHA-1 and SHA-256 algorithms.

In 2014, the ADX extension was implemented on the Broadwell microarchitecture. This extension consists of the ADCX and ADOX instructions, which, together with the MULX instruction implemented earlier, are used in multi-precision arithmetic implementations. For example, the best OpenSSL* public key implementations to date are based on MULX accompanied by the ADCX and ADOX instructions.

Cryptography-related ISA Extension

The client and server configurations of the microprocessor inherit all the cryptographic extensions mentioned above and add further ISA extensions. These include the VAES and VCLMUL instructions, the Galois Field New Instructions (GFNI), and the IFMA instructions.

VAES and VCLMUL are extensions of the AES-NI and CLMUL instructions, respectively. They extend the existing instructions to 2x128 and 4x128 vector variants. The VAES instructions perform one round of AES encryption/decryption using the same or different round key values. The VAES extension makes the parallelizable AES modes much more efficient to implement than with legacy AES-NI. The 2x128 and 4x128 vector variants of CLMUL improve the performance of the AES-GCM mode.

Galois Field New Instructions (GFNI) comprise three instructions: GF2P8AFFINEQB, GF2P8AFFINEINVQB, and GF2P8MULB. GF2P8AFFINEQB and GF2P8AFFINEINVQB compute an affine transformation in GF(2^8): the first applies the affine transformation to an element x of GF(2^8), and the second applies it to the inverse 1/x of x. GF2P8MULB computes the product of two elements x and y of GF(2^8). All three use the GF(2^8) generated by the polynomial g(x) = x^8 + x^4 + x^3 + x + 1, which matches the AES algorithm. Because all representations of GF(2^8) are isomorphic, these instructions help implement algorithms involving affine transformations and multiplication over any GF(2^8). In particular, they help in implementing the SM4 algorithm.
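To make the field arithmetic concrete, the following is a minimal scalar sketch of what the GFNI operations compute on a single byte. It is an illustration only, not Intel® IPP Cryptography code: the function names are hypothetical, and the sketch does not reproduce the instructions' packed operands or the bit-matrix encoding used by GF2P8AFFINEQB and GF2P8AFFINEINVQB. As a familiar instance of "affine transformation of the inverse", it uses the AES S-box.

```python
# Scalar model of GF(2^8) arithmetic with the AES polynomial
# g(x) = x^8 + x^4 + x^3 + x + 1 (0x11B). All names are hypothetical.
AES_POLY = 0x11B


def gf2p8_mul(x, y):
    """Multiply two field elements, as GF2P8MULB does for each byte pair."""
    result = 0
    for _ in range(8):
        if y & 1:
            result ^= x          # addition in GF(2^8) is XOR
        y >>= 1
        x <<= 1
        if x & 0x100:            # degree reached 8: reduce modulo g(x)
            x ^= AES_POLY
    return result


def gf2p8_inv(x):
    """Return 1/x computed as x^254; zero is mapped to zero."""
    result, power, exponent = 1, x, 254
    while exponent:
        if exponent & 1:
            result = gf2p8_mul(result, power)
        power = gf2p8_mul(power, power)
        exponent >>= 1
    return result if x else 0


def rotl8(byte, count):
    """Rotate an 8-bit value left by count bits."""
    return ((byte << count) | (byte >> (8 - count))) & 0xFF


def aes_sbox(x):
    """AES S-box: the AES affine transformation applied to 1/x."""
    inv = gf2p8_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

For example, `gf2p8_mul(0x57, 0x83)` reproduces the multiplication example from the AES specification, and `aes_sbox` can be used to regenerate the S-box table one byte at a time. The hardware instructions perform the same kind of computation on every byte of a vector register at once.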

The IFMA extension consists of two instructions, VPMADD52LUQ and VPMADD52HUQ, which multiply packed unsigned 52-bit integers and add the low or high 52 bits of the 104-bit product to a 64-bit accumulator. These instructions are supported in three forms: 2x64, 4x64, and 8x64. The extension targets multi-precision arithmetic, primarily multiplicative operations. It helps implement public key cryptography algorithms efficiently (RSA and elliptic curve based encryption and signing operations).
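The sketch below models one 64-bit lane of each IFMA instruction in plain Python and uses the pair for a schoolbook multi-precision multiplication in radix 2^52. It is a minimal illustration under stated assumptions: the helper names (`madd52lo`, `madd52hi`, `mul_radix52`) are hypothetical, and the loop structure is not taken from Intel® IPP Cryptography or Crypto MB.

```python
# Scalar model of one 64-bit lane of VPMADD52LUQ / VPMADD52HUQ.
# All function names are hypothetical.
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """acc + low 52 bits of the 104-bit product a*b, wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """acc + high 52 bits of the 104-bit product a*b, wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


def to_limbs(value, count):
    """Split a non-negative integer into count limbs of 52 bits each."""
    return [(value >> (52 * i)) & MASK52 for i in range(count)]


def from_limbs(limbs):
    """Recombine limbs; limbs may exceed 52 bits (unpropagated carries)."""
    return sum(limb << (52 * i) for i, limb in enumerate(limbs))


def mul_radix52(a, b, count):
    """Schoolbook product of two integers of at most 52*count bits."""
    a_limbs, b_limbs = to_limbs(a, count), to_limbs(b, count)
    acc = [0] * (2 * count)
    for i in range(count):
        for j in range(count):
            # Low half goes to column i+j, high half to column i+j+1.
            acc[i + j] = madd52lo(acc[i + j], a_limbs[i], b_limbs[j])
            acc[i + j + 1] = madd52hi(acc[i + j + 1], a_limbs[i], b_limbs[j])
    return from_limbs(acc)
```

The point of the 52-bit radix is the 12 spare bits in each 64-bit accumulator: a column can absorb many partial products before any carry has to be propagated. With 40 limbs (enough for a 2048-bit operand), each column receives at most 80 terms below 2^52, so no accumulator overflows.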

Intel® IPP Crypto Library

The Intel® IPP Crypto Library focuses on efficient implementation and optimization of basic cryptography algorithms. Enabling new Intel ISA in cryptography improves performance and is considered an important activity in Intel® IPP Crypto development. In the Intel® IPP Crypto 2020 Update 3 release, the new ISA has been implemented and enabled in all three areas: bulk encryption, hashes (SHA-1 and SHA-256), and RSA encryption. The performance difference between Intel® IPP Crypto with and without the new ISA enabled is presented in Table 1 through Table 3 later in this article. In both cases the benchmark was run on the microprocessor.

Since 2020, another cryptography library, called Crypto Multi-Buffer (MB), has been delivered together with the Intel® IPP Crypto library. Unlike Intel® IPP Crypto, Crypto MB focuses on parallel processing of 8 independent cryptographic requests and is aimed at server and cloud applications. It can be used as a standalone library or together with the Intel® Quick Assist Technology (Intel® QAT) Engine. By itself, the "multi-buffer" approach provides advantages that complement ISA enabling. The performance difference between OpenSSL* 1.1.1 and Crypto MB is presented in Table 4 through Table 6 later in this article. In both cases the benchmark was run on the client and server microprocessors.

Enabling Results and Conclusion

The following computer platforms and library versions were used for the measurements:

Intel® Core™ i7-1065G7 CPU @ 1.30GHz, L1d=192KiB, L1i=128KiB, L2=2MiB, L3=8MiB running with Ubuntu* 20.04.1
Intel® Xeon® Platinum 8368 CPU @ 2.40GHz, L1d=48K, L1i=32K, L2=1280K, L3=58368K running with RedHat* 8.1
Intel® IPP Cryptography 2020 Update 3
OpenSSL* 1.1.1

Intel® IPP Crypto performance results for the AES128, SM4, SHA-1, and SHA-256 algorithms are presented in CPU cycles/byte. Performance results for RSA are presented in CPU cycles/operation.

Length, Bytes	w/o New ISA Enabled	New ISA Enabled	Units
	AES128-DEC-CBC
1024	0.326	0.148	cycles/byte
2048	0.321	0.152	cycles/byte
4096	0.318	0.154	cycles/byte
	AES128-CTR
1024	0.416	0.206	cycles/byte
2048	0.401	0.182	cycles/byte
4096	0.394	0.169	cycles/byte
	AES128-GCM
1024	1.35	0.344	cycles/byte
2048	1.31	0.291	cycles/byte
4096	1.26	0.262	cycles/byte
	SM4-DEC-CBC
1024	5.24	0.92	cycles/byte
2048	4.59	0.918	cycles/byte
4096	4.59	0.928	cycles/byte
	SM4-CTR
1024	5.3	1.1	cycles/byte
2048	5.25	1.08	cycles/byte
4096	5.23	1.06	cycles/byte

Table 1. Performance of AES128 and SM4 block ciphers with Intel® IPP Crypto with and without new ISA.

Length, Bytes	w/o New ISA Enabled	New ISA Enabled	Units
	SHA-1
1024	4.12	2.3	cycles/byte
2048	3.98	2.18	cycles/byte
4096	3.9	2.12	cycles/byte
	SHA-256
1024	8.71	2.88	cycles/byte
2048	8.44	2.73	cycles/byte
4096	8.3	2.66	cycles/byte

Table 2. Performance of SHA-1 and SHA-256 hash functions with Intel® IPP Crypto with and without new ISA.

Operation	w/o New ISA Enabled	New ISA Enabled	Units
	RSA-2048
private exp (crt)	1978124	1064056	cycles/op
public exp, e=65537	51806	32822	cycles/op
	RSA-3072
private exp (crt)	6596848	4934034	cycles/op
public exp, e=65537	111736	52076	cycles/op
	RSA-4096
private exp (crt)	14880000	8086920	cycles/op
public exp, e=65537	196042	70724	cycles/op

Table 3. Performance of RSA-2048/3072/4096 public and private key operations with Intel® IPP Crypto with and without new ISA.

For Crypto MB, the performance comparison with OpenSSL* 1.1.1 is presented below. Again, both OpenSSL* and Crypto MB were benchmarked on client and server. Because OpenSSL* shows no difference between runs on client and server, only one number is given in the OpenSSL* column. In contrast, the performance of Crypto MB varies with the target CPU even though the same code runs in both cases.

For public key algorithms (RSA and EC), the results below are presented in CPU cycles/operation. It is important to note that OpenSSL* performs a single RSA or EC operation, whereas Crypto MB performs 8 similar operations. For a fair comparison with OpenSSL*, the Crypto MB numbers should therefore be divided by 8.

	OpenSSL*	Crypto MB
		Client	Server
RSA-2048, public e=65537	60090	123188	109932	cycles/op
RSA-3072, public e=65537	127410	267623	244976	cycles/op
RSA-4096, public e=65537	219198	463649	434724	cycles/op
RSA-2048, private (crt)	2067784	3592084	2636810	cycles/op
RSA-3072, private (crt)	6294468	14866835	14485652	cycles/op
RSA-4096, private (crt)	14168714	29022226	23385406	cycles/op

Table 4. Performance of public and private keys RSA-2048/3072/4096 operations in OpenSSL* 1.1.1 and Crypto MB on client and server.
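As a worked example of the divide-by-8 normalization described above, the short sketch below converts the Crypto MB figures from Table 4 into a per-operation cost and the corresponding ratio against OpenSSL*. The data are copied from Table 4; the helper names are hypothetical and the script is only an illustration of the arithmetic.

```python
# Per-operation normalization of Crypto MB results from Table 4.
# Helper names are hypothetical; the cycle counts are copied from Table 4.
LANES = 8  # Crypto MB processes 8 operations per call

# operation: (OpenSSL*, Crypto MB client, Crypto MB server), cycles/op
TABLE4 = {
    "RSA-2048, public e=65537": (60090, 123188, 109932),
    "RSA-3072, public e=65537": (127410, 267623, 244976),
    "RSA-4096, public e=65537": (219198, 463649, 434724),
    "RSA-2048, private (crt)": (2067784, 3592084, 2636810),
    "RSA-3072, private (crt)": (6294468, 14866835, 14485652),
    "RSA-4096, private (crt)": (14168714, 29022226, 23385406),
}


def per_operation(crypto_mb_cycles, lanes=LANES):
    """Cycles attributable to one of the operations in a multi-buffer call."""
    return crypto_mb_cycles / lanes


def speedup(openssl_cycles, crypto_mb_cycles, lanes=LANES):
    """Ratio of OpenSSL* cycles to normalized Crypto MB cycles."""
    return openssl_cycles / per_operation(crypto_mb_cycles, lanes)


if __name__ == "__main__":
    for name, (openssl, client, server) in TABLE4.items():
        print(f"{name}: client {speedup(openssl, client):.2f}x, "
              f"server {speedup(openssl, server):.2f}x")
```

For instance, the server figure for RSA-2048 private key operations, 2636810 cycles for 8 operations, corresponds to 329601.25 cycles per operation, compared with 2067784 cycles for the single OpenSSL* operation.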

		OpenSSL*	Crypto MB
	EC		Client	Server
DH	P256	177496	368513	340726	cycles/op
	P384	2938032	1326145	1482418	cycles/op
	P521	1107801	2041656	1654938	cycles/op
	X25519	118727	190977	140426	cycles/op
DSA, sign	P256	75207	142040	136626	cycles/op
	P384	3110865	511742	564720	cycles/op
	P521	951580	897758	787000	cycles/op

Table 5. Performance of ECDH and ECDSA sign over different EC in OpenSSL* 1.1.1 and Crypto MB on client and server.

Length, Bytes	OpenSSL*	Crypto MB, Client
64	36.7	2.8
1024	15.7	1.25
8192	13.9	1.25

Table 6. Performance of SM3 implementation in OpenSSL 1.1.1 and Crypto MB on client.
