IPP Crypto acceleration Ice Lake

Abhinav Singh, Sergey Kirillov

Introduction

Intel® Integrated Performance Primitives (Intel® IPP) Cryptography is a software library that provides a comprehensive set of highly optimized, application domain-specific functions. It is a secure, fast, and lightweight library of cryptographic building blocks, highly optimized for various Intel® CPUs. Using it can save considerable development and maintenance effort: you write programs with one optimized execution path instead of maintaining multiple paths (Intel® Streaming Single Instruction Multiple Data (SIMD) Extensions 2, Supplemental Streaming SIMD Extensions 3, Intel® Advanced Vector Extensions, etc.) to achieve optimal performance across multiple generations of processors.

The goal of the Intel® IPP Cryptography software is to provide algorithmic building blocks with:

a simple "primitive" C interface and data structures to enhance usability and portability
faster time-to-market
scalability with Intel® hardware

Intel® IPP Cryptography library is available as part of the Intel® oneAPI Base Toolkit.

Intel® IPP Cryptography library is also open sourced. For details about the open source version, please refer to this link.

History of Cryptography Instruction Set

Bulk encryption/decryption, hash functions, and public key algorithms constitute the basis of classic cryptography. Until 2010, these algorithms were implemented in software that used only the basic 32-bit and/or 64-bit x86 instruction set or similar. As a result, the implementations spent quite a few CPU cycles on execution. In addition, hardening the implementations against side-channel attacks only increased their execution time.

In 2010, Intel launched microprocessors based on the Westmere microarchitecture, which expanded the Instruction Set Architecture (ISA) with the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and the carry-less multiplication (CLMUL) instruction. The purpose of Intel® AES-NI is to improve the speed, as well as the resistance to side-channel attacks, of AES-based implementations of standard modes. Together with the CLMUL instruction, they form the basis for the AES Galois/Counter Mode (AES-GCM), which provides confidentiality and authentication simultaneously.

In 2013, hardware acceleration of the Secure Hash Algorithm (SHA) was introduced, first implemented in the low-power Intel Atom® processor Goldmont microarchitecture. This extension, named SHA-NI, supports the SHA-1 and SHA-256 algorithms.

In 2014, the ADX extension was implemented on the Broadwell microarchitecture. This extension consists of the ADCX and ADOX instructions, which, together with the earlier MULX instruction, are used in multi-precision arithmetic implementations. For example, to date the best OpenSSL* public key implementations are based on MULX accompanied by ADCX and ADOX.

Cryptography-related ISA Extension

The client and server configurations of the Ice Lake microprocessor inherit all the cryptographic extensions mentioned above and add further ISA extensions: the VAES and VCLMUL instructions, Galois Field New Instructions (GFNI), and the IFMA instructions.
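As a rough illustration of how software can check for these extensions at run time, the sketch below decodes the relevant feature bits of CPUID leaf 7, sub-leaf 0. The struct and function names (`crypto_isa`, `decode_leaf7`, `query_crypto_isa`) are hypothetical and are not part of Intel® IPP Cryptography, which performs its own dispatching. The sketch assumes a GCC- or Clang-compatible compiler for `<cpuid.h>`, and it does not check whether the operating system has enabled the AVX-512 register state.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#define HAVE_CPUID_H 1
#endif

/* Hypothetical container for the feature flags discussed in this article. */
typedef struct {
    int adx;          /* CPUID.(7,0):EBX bit 19, ADCX/ADOX           */
    int avx512_ifma;  /* CPUID.(7,0):EBX bit 21, VPMADD52LUQ/HUQ     */
    int sha_ni;       /* CPUID.(7,0):EBX bit 29, SHA-1/SHA-256       */
    int gfni;         /* CPUID.(7,0):ECX bit 8,  GF2P8* instructions */
    int vaes;         /* CPUID.(7,0):ECX bit 9,  vector AES          */
    int vpclmulqdq;   /* CPUID.(7,0):ECX bit 10, vector CLMUL        */
} crypto_isa;

/* Pure decoding step: turn raw EBX/ECX values into flags. */
crypto_isa decode_leaf7(uint32_t ebx, uint32_t ecx)
{
    crypto_isa f;
    f.adx         = (int)((ebx >> 19) & 1u);
    f.avx512_ifma = (int)((ebx >> 21) & 1u);
    f.sha_ni      = (int)((ebx >> 29) & 1u);
    f.gfni        = (int)((ecx >> 8)  & 1u);
    f.vaes        = (int)((ecx >> 9)  & 1u);
    f.vpclmulqdq  = (int)((ecx >> 10) & 1u);
    return f;
}

/* Query the running CPU. Returns 1 on success, 0 if CPUID leaf 7 is
 * unavailable or the build target is not x86 with GCC/Clang. */
int query_crypto_isa(crypto_isa *out)
{
#ifdef HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    *out = decode_leaf7(ebx, ecx);
    return 1;
#else
    (void)out;
    return 0;
#endif
}
```

Keeping the bit decoding separate from the CPUID query makes the decoding testable on any machine, including ones without these extensions.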

VAES and VCLMUL are extensions of the AES-NI and CLMUL instructions, respectively. They extend the existing instructions to 2x128 and 4x128 vector variants. The VAES instructions perform one round of AES encryption/decryption on each 128-bit lane, using the same or different round key values. The VAES extension makes the parallelizable AES modes much more efficient than with legacy AES-NI. The 2x128 and 4x128 vector variants of CLMUL improve the performance of AES-GCM mode.

Galois Field New Instructions (GFNI) comprise three instructions: GF2P8AFFINEQB, GF2P8AFFINEINVQB, and GF2P8MULB. GF2P8AFFINEQB and GF2P8AFFINEINVQB compute an affine transformation in GF(2^8): the first applies it to an element x of GF(2^8), and the second applies it to the inverse 1/x of x. GF2P8MULB multiplies two elements x and y of GF(2^8). All three use the GF(2^8) generated by the polynomial g(x) = x^8 + x^4 + x^3 + x + 1, which matches the AES algorithm. Because all representations of GF(2^8) are isomorphic, these instructions help implement algorithms that involve affine transformations and multiplication over any GF(2^8). In particular, they help in implementing the SM4 algorithm.
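To make the field arithmetic concrete, here is a minimal scalar sketch in portable C of multiplication in GF(2^8) modulo g(x) = x^8 + x^4 + x^3 + x + 1, the operation that GF2P8MULB performs on each byte of a vector. It is a software reference model under that assumption, not the instruction itself, and the function names `gf2p8_mul` and `gf2p8_inv` are hypothetical. The inverse is computed as x^254, which is one simple way to obtain 1/x in this field.

```c
#include <stdint.h>

/* Multiply x and y in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B),
 * using shift-and-add with reduction after every doubling step. */
uint8_t gf2p8_mul(uint8_t x, uint8_t y)
{
    uint8_t acc = 0;
    for (int i = 0; i < 8; i++) {
        if (y & 1u)
            acc ^= x;                 /* add (XOR) the current multiple of x */
        uint8_t carry = (uint8_t)(x & 0x80u);
        x = (uint8_t)(x << 1);        /* multiply x by the polynomial "x"    */
        if (carry)
            x ^= 0x1Bu;               /* reduce modulo g(x)                  */
        y >>= 1;
    }
    return acc;
}

/* Multiplicative inverse as x^254 (the nonzero elements form a group of
 * order 255). By construction, 0 maps to 0. */
uint8_t gf2p8_inv(uint8_t x)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf2p8_mul(r, x);
    return r;
}
```

For example, `gf2p8_mul(0x57, 0x83)` gives `0xC1`, the worked example from the AES specification, and `0x53` and `0xCA` are inverses of each other.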

The IFMA extension consists of two instructions, VPMADD52LUQ and VPMADD52HUQ, which multiply packed unsigned 52-bit integers and add the low or high 52 bits of the product to a 64-bit accumulator. These instructions are supported in three forms: 2x64, 4x64, and 8x64. The extension targets multi-precision arithmetic, primarily multiplicative operations. It helps implement public key cryptography algorithms efficiently (RSA and elliptic-curve-based encryption and signing).
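The per-lane behavior of the two IFMA instructions can be modeled in scalar C as follows. This is a sketch of the arithmetic only, not the vector intrinsics: the names `madd52lo` and `madd52hi` are hypothetical, and the code assumes a compiler that provides `unsigned __int128` (GCC or Clang on a 64-bit target).

```c
#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* Model of one 64-bit lane of VPMADD52LUQ: multiply the low 52 bits of b
 * and c, then add the low 52 bits of the 104-bit product to acc. */
uint64_t madd52lo(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + ((uint64_t)p & MASK52);
}

/* Model of one 64-bit lane of VPMADD52HUQ: same product, but add its
 * high 52 bits (bits 103..52) to acc. */
uint64_t madd52hi(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + (uint64_t)(p >> 52);
}
```

Storing big numbers as 52-bit limbs in 64-bit lanes leaves 12 spare bits per lane, so several partial products can be accumulated before carries have to be propagated. That headroom is what makes this representation attractive for RSA-sized multiplications.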

Intel® IPP Crypto Library

The Intel® IPP Crypto Library focuses on efficient implementation and optimization of basic cryptography algorithms. Enabling new Intel ISA in cryptography improves performance and is considered an important part of Intel® IPP Crypto development. In the Intel® IPP Crypto 2020 Update 3 release, the new ISA has been enabled in all three areas: bulk encryption, hashes (SHA-1 and SHA-256), and RSA encryption. The performance difference between Intel® IPP Crypto with and without the new ISA enabled is presented in Tables 1 to 3 later in this article. In both cases, the benchmark was run on the same microprocessor.

Since 2020, another cryptography library, Crypto Multi-Buffer (Crypto MB), has been delivered together with the Intel® IPP Crypto library. Unlike Intel® IPP Crypto, Crypto MB focuses on parallel processing of 8 independent cryptographic requests and is aimed at server and cloud applications. It can be used as a standalone library or together with the Intel® QuickAssist Technology (Intel® QAT) Engine. By itself, the “multi-buffer” approach provides advantages that complement ISA enabling. The performance difference between OpenSSL* 1.1.1 and Crypto MB is presented in Tables 4 to 6 later in this article. In both cases, the benchmark was run on the client and server microprocessors.

Enabling Results and Conclusion

The following computer platforms and library versions were used for the measurements:

Intel® Core™ i7-1065G7 CPU @ 1.30GHz, L1d=192KiB, L1i=128KiB, L2=2MiB, L3=8MiB running with Ubuntu* 20.04.1
Intel® Xeon® CPU @ 2.2GHz, Ice Lake Server, L1d=48K, L1i=32K, L2=1280K, L3=36864K running with RedHat* 8.1
Intel® IPP Cryptography 2020 Update 3
OpenSSL* 1.1.1

Intel® IPP Crypto performance results for the AES128, SM4, SHA-1, and SHA-256 algorithms are presented in CPU cycles/byte. Performance results for RSA-2048 are presented in CPU cycles/operation.

Length, bytes	Without new ISA	With new ISA	Unit
	AES128-DEC-CBC
1024	0.125	0.0586	cycle/byte
2048	0.124	0.0601	cycle/byte
4096	0.122	0.0605	cycle/byte
	AES128-CTR
1024	0.159	0.0811	cycle/byte
2048	0.154	0.0713	cycle/byte
4096	0.152	0.068	cycle/byte
	AES128-GCM
1024	0.505	0.136	cycle/byte
2048	0.479	0.119	cycle/byte
4096	0.465	0.109	cycle/byte
	AES128-XTS
1024	0.174	0.0996	cycle/byte
2048	0.166	0.0806	cycle/byte
4096	0.158	0.071	cycle/byte
	SM4-DEC-CBC
1024	2.15	0.375	cycle/byte
2048	1.9	0.369	cycle/byte
4096	1.83	0.367	cycle/byte
	SM4-CTR
1024	2.22	0.434	cycle/byte
2048	2.19	0.42	cycle/byte
4096	2.18	0.414	cycle/byte

Table 1. Performance of AES128 and SM4 block ciphers with Intel® IPP Crypto with and without new ISA.

Length, bytes	Without new ISA	With new ISA	Unit
	SHA-1
1024	1.59	0.896	cycle/byte
2048	1.52	0.857	cycle/byte
4096	1.49	0.838	cycle/byte
	SHA-256
1024	3.33	1.11	cycle/byte
2048	3.22	1.08	cycle/byte
4096	3.17	1.05	cycle/byte

Table 2. Performance of SHA-1 and SHA-256 hash functions with Intel® IPP Crypto with and without new ISA.

Operation	Without new ISA	With new ISA	Unit
	RSA-2048
private exp (crt)	760563	404080	cycles/op
public exp, e=65537	20168	12266	cycles/op

Table 3. Performance of public and private key RSA-2048 operations with Intel® IPP Crypto with and without new ISA.

For Crypto MB, the performance comparison with OpenSSL* 1.1.1 is presented below. Again, both OpenSSL* and Crypto MB were benchmarked on client and server. Because OpenSSL* shows no difference between the client and server runs, only one number is given in the OpenSSL* column. In contrast, Crypto MB performance varies with the target CPU even though the same code runs in both cases.
For the public key algorithms (RSA and EC), the results below are presented in CPU cycles/operation. Note that OpenSSL* performs a single RSA or EC operation, whereas Crypto MB performs 8 similar operations. For a fair comparison with OpenSSL*, the Crypto MB numbers should therefore be divided by 8.
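The divide-by-8 normalization can be written out as a small calculation. The helper names `per_op_cycles` and `speedup_vs_single` are hypothetical; the figures used in the usage note are taken from the RSA-2048 private (crt) row of Table 4.

```c
/* Cycles per single operation when a batch of `batch_size` operations
 * takes `batch_cycles` cycles in total. */
double per_op_cycles(double batch_cycles, int batch_size)
{
    return batch_cycles / (double)batch_size;
}

/* Ratio of a single-operation cost to the normalized multi-buffer cost.
 * A value above 1.0 means the multi-buffer path is cheaper per operation. */
double speedup_vs_single(double single_op_cycles,
                         double batch_cycles, int batch_size)
{
    return single_op_cycles / per_op_cycles(batch_cycles, batch_size);
}
```

With the server figures for RSA-2048 private (crt), `per_op_cycles(2051226.0, 8)` is 256403.25 cycles per operation, and `speedup_vs_single(2027312.0, 2051226.0, 8)` comes to roughly 7.9.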

Operation	OpenSSL*	Crypto MB, client	Crypto MB, server	Unit
RSA-2048, public e=65537	59491	123188	87542	cycles/op
RSA-3072, public e=65537	125746	267623	186446	cycles/op
RSA-4096, public e=65537	216909	463649	321544	cycles/op
RSA-2048, private (crt)	2027312	3592084	2051226	cycles/op
RSA-3072, private (crt)	6282572	14866835	8521718	cycles/op
RSA-4096, private (crt)	14066625	29022226	18969184	cycles/op

Table 4. Performance of public and private key RSA-2048/3072/4096 operations in OpenSSL* 1.1.1 and Crypto MB on client and server.

Algorithm	EC	OpenSSL*	Crypto MB, client	Crypto MB, server	Unit
DH	P256	173376	368513	328036	cycles/op
	P384	2981591	1326145	1457200	cycles/op
	P521	1033005	2041656	1602712	cycles/op
	X25519	122354	190977	137273	cycles/op
DSA, sign	P256	73031	142040	131474	cycles/op
	P384	3129416	511742	547092	cycles/op
	P521	895759	897758	747956	cycles/op

Table 5. Performance of ECDH and ECDSA sign over different EC in OpenSSL* 1.1.1 and Crypto MB on client and server.

Length, bytes	OpenSSL*	Crypto MB, client
64	36.7	2.8
1024	15.7	1.25
8192	13.9	1.25

Table 6. Performance of SM3 implementation in OpenSSL* 1.1.1 and Crypto MB on client.
