Building Software Acceleration Features in the Intel® Quick Assist Technology (Intel® QAT) Engine for OpenSSL* 1.1.1
Published: 11/12/2020
Last Updated: 03/29/2021
Updated 5/6/2021 with performance data for the Intel Xeon Scalable processor family.
Updated 3/29/2021 for release 0.6.5 of the Intel® Quick Assist Technology Engine for OpenSSL
Intel® Quick Assist Technology (Intel® QAT) has been expanded to provide software-based acceleration of cryptographic operations through instructions in the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) family. This software-based acceleration has been incorporated into the Intel QAT Engine for OpenSSL*, a dynamically loadable module that uses the OpenSSL ENGINE framework, allowing administrators to add this capability to OpenSSL without having to rebuild or replace their existing OpenSSL libraries.
Software acceleration is provided for the following algorithms:
- RSA with 2048, 3072, and 4096 bit keys
- ECDH for the Montgomery Curve X25519 and NIST Prime Curves P-256 and P-384
- ECDSA for the NIST Prime Curves P-256 and P-384
- AES-GCM with 128, 192, and 256 bit keys
About This Guide
This guide steps you through the process of building the Intel QAT Engine for OpenSSL on the following Linux distributions, but it can be adapted to others.
- CentOS* Linux* 8.2
- Ubuntu* Server 20.04 LTS
Two build procedures are provided: one that uses the distribution-provided build of OpenSSL, and one that creates a customized installation using OpenSSL built from source. Each methodology has its pros and cons, and you should choose the procedure that works best for your environment. Using the distribution-provided OpenSSL means less complexity as you are running OpenSSL out of its standard system path, but it ties you to a specific version that is integrated with the OS. Building OpenSSL from source lets you control the version that you deploy independent of the distribution-provided build, and that makes it possible to perform version updates as needed without disrupting system operations. This added flexibility comes at a cost, however, as you'll need to add the OpenSSL binary directory to your PATH and update LD_LIBRARY_PATH to include the shared library directories for OpenSSL and its dependency libraries.
Testing the Engine
Once the engine is in place, you can proceed with functionality tests.
The first test is to ensure the engine loads.
$ openssl engine -v qatengine
(qatengine) Reference implementation of QAT crypto engine(qat_sw) v0.6.5
ENABLE_EXTERNAL_POLLING, POLL, ENABLE_HEURISTIC_POLLING,
GET_NUM_REQUESTS_IN_FLIGHT, INIT_ENGINE
If the command instead returns errors such as the following, the engine could not load one of its dependency libraries:
139667965596992:error:25066067:DSO support routines:dlfcn_load:could not load the shared library:../crypto/dso/dso_dlfcn.c:118:filename(/usr/lib/x86_64-linux-gnu/engines-1.1/qatengine.so): libcrypto_mb.so: cannot open shared object file: No such file or directory
139667965596992:error:25070067:DSO support routines:DSO_load:could not load the shared library:../crypto/dso/dso_lib.c:162:
139667965596992:error:260B6084:engine routines:dynamic_load:dso not found:../crypto/engine/eng_dyn.c:414:
139667965596992:error:2606A074:engine routines:ENGINE_by_id:no such engine:../crypto/engine/eng_list.c:334:id=qatengine
If you are using the distribution-provided OpenSSL
- Make sure the Intel® Multi-Buffer Crypto for IPsec Library and the Intel IPSec Library are both installed into /usr/lib. If you did not set a prefix for the former, it will install into /usr/local and you’ll need to set LD_LIBRARY_PATH in your environment.
- If you're running on CentOS 8.2, make sure you have run ldconfig.
If you built OpenSSL from source
- Make sure LD_LIBRARY_PATH is set to the paths for OpenSSL, the Intel Multi-Buffer Crypto for IPsec Library, and the Intel IPSec Library. All three paths must be present.
- Verify your installation paths and make sure each component is installed under a directory of the form /opt/tool/version.
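One way to confirm that every required library directory is present in LD_LIBRARY_PATH is a small shell helper. The following is a minimal sketch, not part of the engine or of OpenSSL: the function name missing_from_path is hypothetical, and the directories you pass to it are whatever prefixes you chose during the build.

```shell
# Print each directory (arguments 2..n) that is absent from the
# colon-separated search path given as argument 1.
# The comparison is a plain string match, so "/opt/x/lib" and
# "/opt/x/lib/" are treated as different entries.
missing_from_path() {
    search_path=$1
    shift
    for dir in "$@"; do
        case ":$search_path:" in
            *":$dir:"*) ;;                 # present, nothing to report
            *) printf '%s\n' "$dir" ;;     # absent, report it
        esac
    done
}
```

For example, calling missing_from_path "$LD_LIBRARY_PATH" followed by the three library directories from your own build prints only the directories that still need to be added. No output means all three are present.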
Assuming the engine loads correctly, you can test the software acceleration for each of the enabled algorithms. To do that, we'll run “openssl speed” on the individual algorithms and compare the engine performance to the baseline.
On Intel Xeon Scalable processor families, you must use processor affinity (also known as CPU pinning) to bind these processes to a core. The multibuffer implementations make use of AVX-512 features that produce internal power transitions, and if the CPU scheduler moves these jobs to other cores during execution, multiple power transitions will occur, countering performance gains. Most server applications support CPU affinity masks in some form, but for “openssl speed” we must rely on the taskset command to do this for us.
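The test commands in this guide pass taskset the mask 0x1, which selects core 0. As a minimal sketch, the helper below computes the equivalent hexadecimal mask for any single core number. The name core_mask is hypothetical and is not part of taskset or OpenSSL.

```shell
# Print the taskset affinity mask that selects exactly one core.
# Bit N of the mask corresponds to core N, so core 0 is 0x1,
# core 1 is 0x2, core 2 is 0x4, and so on.
core_mask() {
    printf '0x%x\n' "$((1 << $1))"
}

core_mask 0   # mask for core 0
core_mask 3   # mask for core 3
```

If you prefer to name the core directly rather than build a mask, taskset also accepts a CPU list through its -c option, for example taskset -c 3.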
RSA
The RSA acceleration makes use of an asynchronous scheduling algorithm that Intel calls multi-buffer, in which multiple operations are processed in parallel. To test the accelerator performance, you must supply the -async_jobs parameter to “openssl speed”. On current Intel architectures, 8 asynchronous jobs deliver optimal performance.
While both sign and verify operations are accelerated, the largest gains are in signing. This translates to performance gains for servers processing TLS handshakes with RSA certificates.
2048-bit keys
Baseline | taskset 0x1 openssl speed rsa2048 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 rsa2048 |
3072-bit keys
Baseline | taskset 0x1 openssl speed rsa3072 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 rsa3072 |
4096-bit keys
Baseline | taskset 0x1 openssl speed rsa4096 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 rsa4096 |
ECDH
Like RSA, the ECDH acceleration makes use of an asynchronous scheduling algorithm.
Montgomery EC Curve X25519
Baseline | taskset 0x1 openssl speed ecdhx25519 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 ecdhx25519 |
NIST Curve P-256
Baseline | taskset 0x1 openssl speed ecdhp256 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 ecdhp256 |
NIST Curve P-384
Baseline | taskset 0x1 openssl speed ecdhp384 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 ecdhp384 |
ECDSA
The ECDSA algorithms also make use of asynchronous scheduling.
NIST Curve P-256
Baseline | taskset 0x1 openssl speed ecdsap256 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 ecdsap256 |
NIST Curve P-384
Baseline | taskset 0x1 openssl speed ecdsap384 |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -async_jobs 8 ecdsap384 |
AES GCM Encryption
The AES GCM encryption acceleration is a purely vectorized implementation of the respective EVP ciphers. Key sizes of 128, 192, and 256 bits are supported but 192-bit encryption is almost never used in practice.
The “openssl speed” command reports performance for several block sizes, but real-world applications tend to use buffers that are 8k or larger for increased efficiency and performance. This is also where the largest gains are seen.
128-bit keys
Baseline | taskset 0x1 openssl speed -evp aes-128-gcm |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -evp aes-128-gcm |
256-bit keys
Baseline | taskset 0x1 openssl speed -evp aes-256-gcm |
---|---|
Intel QAT Engine for OpenSSL | taskset 0x1 openssl speed -engine qatengine -evp aes-256-gcm |
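The baseline and engine commands listed above follow two patterns: the public-key algorithms add -async_jobs 8 when the engine is used, while the AES-GCM ciphers are selected with -evp and take no -async_jobs parameter. As a minimal sketch, the script below prints the full set of command pairs so they can be reviewed or saved to a file. The function name speed_cmds is hypothetical, and the script only prints the commands, it does not run them.

```shell
# Print the baseline command and the engine command for one algorithm.
# Pass "evp" as the second argument for the AES-GCM ciphers, which are
# selected with -evp and do not take -async_jobs.
speed_cmds() {
    alg=$1
    if [ "${2:-}" = "evp" ]; then
        echo "taskset 0x1 openssl speed -evp $alg"
        echo "taskset 0x1 openssl speed -engine qatengine -evp $alg"
    else
        echo "taskset 0x1 openssl speed $alg"
        echo "taskset 0x1 openssl speed -engine qatengine -async_jobs 8 $alg"
    fi
}

# Public-key algorithms covered in this guide.
for alg in rsa2048 rsa3072 rsa4096 \
           ecdhx25519 ecdhp256 ecdhp384 \
           ecdsap256 ecdsap384; do
    speed_cmds "$alg"
done

# AES-GCM ciphers covered in this guide.
for alg in aes-128-gcm aes-256-gcm; do
    speed_cmds "$alg" evp
done
```

Printing rather than executing keeps the sketch safe to run on a machine without the engine installed. Each run of “openssl speed” takes a while, so you may prefer to copy individual lines from the output.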
Typical Performance Gains
The performance of each algorithm, as measured by “openssl speed”, will vary based on your hardware, system configuration, and BIOS settings. There will also be variations between runs due to fluctuations in normal system activity.
Client System Performance
The results given below are typical values, obtained from the system described in Table 1.
System | Dell Inc.* XPS* 13 7390 2-in-1 Laptop |
---|---|
CPU | Intel® Core™ i7-1065G7 (4 cores, 8 threads) @ 1.30 GHz |
CPU FEATURES | Intel® Hyper-Threading Technology enabled |
Memory | 16 GB (2x 8GB) LPDDR4 SDRAM 3733 MT/s |
Storage | 512 GB M.2 NVMe SSD |
OS | Ubuntu 20.04 LTS |
Note that Intel® Turbo Boost Technology 2.0 was disabled for these runs so that the performance gains shown came solely from the Intel QAT Engine for OpenSSL, and not from variations in clock speed.
This is the complete output from OpenSSL for RSA using 2048-bit keys, without the Intel QAT Engine:
$ openssl speed rsa2048
Doing 2048 bits private rsa's for 10s: 5799 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 202497 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1k 25 Mar 2021
built on: Fri Mar 26 22:09:23 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.001724s 0.000049s    579.9  20249.7
This is the output with the Intel QAT Engine for OpenSSL:
$ openssl speed -engine qatengine -async_jobs 8 rsa2048
engine "qatengine" set.
Doing 2048 bits private rsa's for 10s: 26640 2048 bits private RSA's in 9.99s
Doing 2048 bits public rsa's for 10s: 577880 2048 bits public RSA's in 9.37s
OpenSSL 1.1.1k 25 Mar 2021
built on: Fri Mar 26 22:09:23 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000375s 0.000016s   2666.7  61673.4
The number of sign operations per second jumps from 579.9 to 2666.7, showing a roughly 4.6x performance gain when using the Intel QAT Engine for OpenSSL.
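The speedup figure is simply the engine's sign/s value divided by the baseline's. As a minimal sketch, the two helpers below pull the sign/s column out of the RSA summary line and compute the ratio, using the two summary lines shown above as input. The function names signs_per_sec and speedup are hypothetical, and the parsing assumes the summary line layout shown in this guide.

```shell
# Read "openssl speed" RSA output on stdin and print the sign/s value,
# which is the second-to-last field of the "rsa NNNN bits ..." line.
signs_per_sec() {
    awk '/^rsa +[0-9]+ bits/ { print $(NF-1) }'
}

# Print engine/baseline as a ratio with two decimal places.
# $1 is the baseline rate, $2 is the engine rate.
speedup() {
    awk -v base="$1" -v eng="$2" 'BEGIN { printf "%.2f\n", eng / base }'
}

# Summary lines copied from the two runs shown above.
base=$(printf '%s\n' 'rsa 2048 bits 0.001724s 0.000049s 579.9 20249.7' | signs_per_sec)
eng=$(printf '%s\n' 'rsa 2048 bits 0.000375s 0.000016s 2666.7 61673.4' | signs_per_sec)

speedup "$base" "$eng"   # prints 4.60
```

In practice you would pipe the saved output of each “openssl speed” run into signs_per_sec instead of the literal lines used here.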
The results for all the enabled algorithms are provided in Table 2.
Algorithm | Operation | Baseline | Intel QAT Engine for OpenSSL | Measure | Speedup |
---|---|---|---|---|---|
RSA 2048 | sign | 579.9 | 2666.7 | signs/sec | 4.60x |
RSA 3072 | sign | 194 | 673.3 | signs/sec | 3.47x |
RSA 4096 | sign | 88 | 348.5 | signs/sec | 3.96x |
ECDH X25519 | n/a | 10796 | 49200 | ops/sec | 4.56x |
ECDH P-256 | n/a | 7191 | 22783 | ops/sec | 3.17x |
ECDH P-384 | n/a | 417 | 5939.4 | ops/sec | 14.24x |
ECDSA P-256 | sign | 17059 | 44524.4 | signs/sec | 2.61x |
ECDSA P-384 | sign | 394 | 13905.2 | signs/sec | 35.29x |
AES-128-GCM | encrypt (8k blocks) | 2368791 | 4723040.3 | kB/sec | 1.99x |
AES-256-GCM | encrypt (8k blocks) | 2013831 | 4093460.5 | kB/sec | 2.03x |
On the target system, the performance gains in RSA signing range from 3.5x to 4.6x. The gains in ECDH X25519 operations exceed 4x. AES-GCM shows a 2x gain for both 128- and 256-bit keys. The NIST Curve P-384 algorithms show much larger gains (14x for ECDH and 35x for ECDSA) than the NIST Curve P-256 algorithms, which improve by only 2.6x to 3.2x. This is because the starting points for P-384 were general-purpose software implementations with no other code optimizations. All of the other algorithms began with AVX2 code paths, so there was significantly more room for improvement in the P-384 code.
Server System Performance
Certain SKUs of 3rd generation Intel Xeon Scalable processors contain two Fused Multiply-Add (FMA) units instead of one, and this translates to better performance. The results given below are typical values for the system described in Table 3, which has two FMA units.
System | Intel® Server Board M50CYP Family |
---|---|
CPU | 2x Intel® Xeon® Platinum 8368 CPU @ 2.40GHz |
CPU FEATURES | Intel® Hyper-Threading Technology enabled |
Memory | 64 GB (4x 16GB) DDR4 Registered SDRAM 3200 MT/s |
Storage | 960 GB M.2 NVMe SSD |
OS | Ubuntu 20.04 LTS |
Algorithm | Operation | Baseline | Intel QAT Engine for OpenSSL | Measure | Speedup |
---|---|---|---|---|---|
RSA 2048 | sign | 1134.6 | 6908.3 | signs/sec | 6.09x |
RSA 3072 | sign | 371.2 | 1294.6 | signs/sec | 3.49x |
RSA 4096 | sign | 167.6 | 802.1 | signs/sec | 4.79x |
ECDH X25519 | n/a | 20158.6 | 120003.8 | ops/sec | 5.95x |
ECDH P-256 | n/a | 13490.5 | 44741.4 | ops/sec | 3.32x |
ECDH P-384 | n/a | 803.6 | 12882.6 | ops/sec | 16.03x |
ECDSA P-256 | sign | 32103.9 | 81653.3 | signs/sec | 2.54x |
ECDSA P-384 | sign | 779.4 | 29880.2 | signs/sec | 38.34x |
AES-128-GCM | encrypt (8k blocks) | 4374242 | 9467142 | kB/sec | 2.16x |
AES-256-GCM | encrypt (8k blocks) | 3721579 | 8326487 | kB/sec | 2.24x |
Note that “openssl speed” measures the raw performance of the algorithm itself. The performance of real-world applications will vary depending on the workload and other factors. Configuring an application to use the QAT engine is an application-specific procedure, and at minimum, it requires that the application support OpenSSL’s asynchronous interface. Consult your application’s documentation for guidance.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.