Real-World Challenge of Post-Quantum Cryptography: When Theory Meets Implementation

Posted March 26, 2026


Author:
Anas Hlayhel, INT31

Quantum computing is getting close enough to reality that it threatens to break the cryptography we rely on today. To prepare for that moment, the National Institute of Standards and Technology (NIST) has published two new lattice-based post-quantum standards, FIPS 203, the Module-Lattice-Based Key-Encapsulation Mechanism Standard (ML-KEM), and FIPS 204, the Module-Lattice-Based Digital Signature Standard (ML-DSA), that promise strong mathematical security. But as every security engineer knows, an algorithm is only as strong as its implementation.

This article walks through the real-world challenges of implementing these schemes securely and efficiently. We start with the basics: why input validation matters and how simple oversights can expose secrets. Then we dig into the importance of strong, hedged randomness, followed by the often-overlooked risks of data-dependent timing and side-channel leaks. Next, we explore how to zeroize sensitive data correctly, including some easy-to-miss corner cases. Finally, we look at some performance optimizations that help keep implementations efficient without sacrificing security.

By the end, readers will not just understand what these algorithms do but will also develop an appreciation of what it actually takes to implement post-quantum cryptography safely and efficiently in the messy, unpredictable world of real software.

Input Validation: The First Line of Defense

Be careful who you allow into your own house! Input validation might seem basic, but it's critical for post-quantum cryptography. Beyond checking for null pointers, you need to ensure that all parameters are within their specified ranges to prevent unexpected behavior. You must also check critical input/output buffer sizes. For example, check the sizes of encapsulation keys, decapsulation keys, and ciphertexts for ML-KEM. Likewise, check the sizes of public keys, private keys, and signatures for ML-DSA. Use Table 1 below (synthesized from FIPS 203 and FIPS 204) to obtain the expected sizes:

ML-KEM Parameter Sets   Encapsulation Key   Decapsulation Key   Ciphertext   Shared Secret Key
ML-KEM-512              800                 1632                768          32
ML-KEM-768              1184                2400                1088         32
ML-KEM-1024             1568                3168                1568         32

ML-DSA Parameter Sets   Public Key          Private Key         Signature
ML-DSA-44               1312                2560                2420
ML-DSA-65               1952                4032                3309
ML-DSA-87               2592                4896                4627

Table 1: Sizes (in bytes) of keys and other buffers for ML-KEM & ML-DSA

Here's a counterexample of what can happen if input validation is omitted. The function ByteDecode in ML-KEM (FIPS 203, Algorithm 6) decodes messages by converting a byte array into an array of integers. When ByteDecode is called by the Encrypt algorithm (FIPS 203, Algorithm 14), the expected input message is 32 bytes. If an attacker has access to this function interface, they can call it with a message shorter than 32 bytes. ByteDecode will still read 32 bytes (as the standard expects), causing a buffer overread and potentially information disclosure if sensitive data resides in adjacent memory. Adding a size check to ByteDecode mitigates this potential attack.
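In code, such checks are a straightforward gate at the top of the public entry point. The sketch below validates decapsulation inputs for ML-KEM-768 using the sizes from Table 1; the function name and error codes are illustrative, not from any real library API:

```c
#include <stddef.h>
#include <stdint.h>

/* Expected buffer sizes for ML-KEM-768, taken from Table 1. */
#define MLKEM768_DK_BYTES 2400
#define MLKEM768_CT_BYTES 1088

typedef enum {
    CRYPTO_OK       =  0,
    CRYPTO_ERR_NULL = -1,
    CRYPTO_ERR_SIZE = -2
} crypto_status;

/* Validate inputs before touching any cryptographic state. */
crypto_status validate_decaps_inputs(const uint8_t *dk, size_t dk_len,
                                     const uint8_t *ct, size_t ct_len)
{
    if (dk == NULL || ct == NULL)
        return CRYPTO_ERR_NULL;
    if (dk_len != MLKEM768_DK_BYTES || ct_len != MLKEM768_CT_BYTES)
        return CRYPTO_ERR_SIZE;
    return CRYPTO_OK;
}
```

Rejecting bad sizes up front means no later step ever reads past the end of an attacker-supplied buffer.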
 

Use Strong Hedged Randomness

All cryptographic keys must originate from a source of randomness. Hence, it's essential to understand the path of randomness, from the moment the master seed is generated, through the derived seeds, all the way to the actual secrets used directly in cryptographic operations. In Figure 1, I show the path of randomness for ML-DSA in bold red arrows (synthesized from FIPS 204, Algorithms 1, 2, 6, and 7). It should be easy to see that critical secrets like the private key s1 and the one-time secret y can be traced back to a master seed ξ.

Pay attention to that master seed, since the security of the entire system rests on it. Make sure to extract the master seed ξ from a strong entropy source, which means picking an approved SP 800-90A Random Bit Generator (RBG). You'll find that the security strength of the RBG directly affects the security strength of the entire ML-DSA scheme. For example, selecting an RBG with 128 bits of security for ML-DSA-44 yields the lowest ML-DSA security strength (Category 1 in FIPS 204 terminology), while selecting an RBG with 256 bits of security for ML-DSA-87 yields the highest security strength (Category 5 in FIPS 204).

Figure 1: Randomness paths in ML-DSA shown in bold arrows

Another thing you can do to boost the security of your ML-DSA implementation is to pay close attention to the per-message secret y. Vector y plays a crucial role in disguising the secret s1 in the response z = y + c·s1. If you look at the randomness path leading to y, you'll notice two types of randomness: K (precomputed during key generation) and rnd (fresh randomness at signing time). To harden ML-DSA against side-channel attacks, always use the "hedged" option in the signing procedure. Hedged in this context means ensuring that both randomness sources, K and rnd, are indeed used and that the second source, rnd, is also drawn from a strong RBG.

Eliminate Data-Dependent Timing

Here's where implementation gets truly treacherous. Your algorithm might be mathematically perfect, but if it takes slightly longer to process certain inputs, an attacker with precise timing measurements can potentially extract your secret keys. This is the variable-time execution problem, and it's particularly nasty for post-quantum cryptography.

Modular Reduction

One of the main culprits of data-dependent timing is modular reduction, a fundamental operation that happens throughout these algorithms. Consider the Number Theoretic Transform (NTT) algorithms (FIPS 203 Algorithms 9 & 10 and FIPS 204 Algorithms 41 & 42), where we perform modular reductions after multiplication, subtraction, and addition in consecutive steps. If your compiler implements the modulo operation using trial division (repeatedly adjusting until it finds the right quotient), the execution time can depend on the operand size. If the operand happens to be a secret, suddenly your secret is leaking through timing variations measured in nanoseconds.

It's better to use one of the safer methods that avoid division altogether and rely on bitwise shifts (divisions by a power of two) instead. You can use known and well-tested techniques like Barrett or Montgomery reduction. Barrett can be faster for a single standalone modular operation, while Montgomery yields better performance for a sequence of modular operations. The reason: you need to convert operands to Montgomery form (by multiplying by 2^32 mod q), an expensive operation, before starting Montgomery reductions, and then extract the 2^32 factor from the final results. So there had better be several modular reductions to make the conversion worthwhile.

For ML-KEM, FIPS 203 does not favor one method over the other, so we opt for the simpler Barrett reduction. To compute the Barrett reduction, we need to know the modulus q and its length in bits k. For ML-KEM, the modulus is q = 3329, which has a bit-length k = ⌈log2(q)⌉ = 12 bits. Below, Listing 1 shows the steps to apply Barrett reduction to x = 1,234,567.
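These steps can be sketched in C as follows (the 64-bit intermediate avoids overflow in the multiply; the factor m = ⌊2^24 / q⌋ = 5039 is precomputed once):

```c
#include <stdint.h>

#define Q 3329          /* ML-KEM modulus */
#define K_BITS 12       /* bit-length of Q */
#define BARRETT_M 5039  /* precomputed m = floor(2^(2k) / q) = floor(2^24 / 3329) */

/* Barrett reduction: computes x mod Q with a multiply and a shift
   instead of a division. */
uint32_t barrett_reduce(uint32_t x)
{
    /* step 1: approximate the quotient as (x * m) >> 2k */
    uint32_t quot = (uint32_t)(((uint64_t)x * BARRETT_M) >> (2 * K_BITS));
    /* step 2: subtract quotient * modulus to obtain the remainder */
    return x - quot * Q;
}
```

For x = 1,234,567 the approximate quotient comes out to 370, and the result is 1,234,567 − 370 × 3329 = 2837.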

Notice above that the computed quotient is an approximation: it gives correct results only while x stays under roughly twice the bit-length of the modulus (24 bits for ML-KEM). Beyond that, the quotient can fall short by 1, meaning the result overflows by one modulus q, so one extra reduction is needed at the end. Consider, for example, the highest possible result for ML-KEM. This occurs when we multiply two coefficients that are each at their maximum, q – 1 = 3328. The value x that we want to reduce would be 3328 × 3328 = 11,075,584, which is 24 bits. To reduce 11,075,584 mod 3329, we follow the same steps as in Listing 1 above, with the additional end reduction step shown in Listing 2 below:
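The same reduction with the extra end step can be sketched as:

```c
#include <stdint.h>

#define Q 3329
#define BARRETT_M 5039  /* floor(2^24 / 3329) */

/* Barrett reduction with the end correction, valid up to the largest
   coefficient product (q - 1)^2 = 11,075,584 (24 bits). */
uint32_t barrett_reduce_full(uint32_t x)
{
    uint32_t quot = (uint32_t)(((uint64_t)x * BARRETT_M) >> 24);
    uint32_t r = x - quot * Q;
    /* the approximate quotient may fall short by 1, leaving r in [q, 2q):
       one extra subtraction completes the reduction */
    if (r >= Q)
        r -= Q;
    return r;
}
```

For x = 11,075,584 the approximate quotient is 3326, giving an intermediate 3330; the end step subtracts one more q to yield 1, which is the expected answer since 3328 ≡ −1 (mod 3329) and (−1)² = 1.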

For ML-DSA, FIPS 204 specifies Montgomery as the reduction method of choice for modular multiplication. Though it has been employed widely in classical cryptography (RSA exponentiation and ECC field multiplication), Montgomery reduction is particularly elegant for ML-DSA. FIPS 204 provides a Montgomery reduction algorithm (Algorithm 49, captured in Figure 2 below) customized for ML-DSA's modulus q = 8380417, whose 23-bit size fits into a 32-bit radix (the reason you see 2^32 used throughout the algorithm).

Figure 2: ML-DSA Montgomery Reduction Algorithm

If you have done Montgomery reduction in classical cryptography, you'll see the elegance of this algorithm. Instead of the iterative word-by-word Montgomery reduction used in classical cryptography, post-quantum values are small enough that the reduction can be done in one step. This algorithm can be easily implemented in C: an int32 type cast performs the mod 2^32, and a 32-bit shift performs the division by 2^32, as you see below:
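A C version along these lines follows. The constant QINV = q^(-1) mod 2^32 = 58728449 is precomputed, and the arithmetic right shift assumes two's-complement behavior, which mainstream compilers provide:

```c
#include <stdint.h>

#define Q    8380417   /* ML-DSA modulus */
#define QINV 58728449  /* q^(-1) mod 2^32, precomputed */

/* Montgomery reduction in the spirit of FIPS 204 Algorithm 49: for
   |a| < 2^31 * q, returns r with r ≡ a * 2^(-32) (mod q) and |r| < q. */
int32_t montgomery_reduce(int64_t a)
{
    /* the int32 truncation implements "mod 2^32" */
    int32_t t = (int32_t)((uint64_t)a * QINV);
    /* (a - t*q) is divisible by 2^32; the shift implements the division */
    return (int32_t)((a - (int64_t)t * Q) >> 32);
}
```

As a sanity check, a = 2^32 reduces to 1, since 2^32 · 2^(−32) ≡ 1 (mod q).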

One last remark regarding these reductions is that they may overflow by one q and hence need correction. One pitfall is to add a conditional statement like "if result >= q, then subtract q." This conditional statement violates the constant-time property you were trying so hard to keep. Instead, correct your result using a mask. An example is shown here:
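A masked correction for unsigned 32-bit values can look like this (the arithmetic shift spreads the borrow bit into a full mask):

```c
#include <stdint.h>

/* Constant-time correction: given r in [0, 2q), return r mod q.
   The same instructions execute whether r is above or below q. */
static inline uint32_t cond_sub_q(uint32_t r, uint32_t q)
{
    uint32_t t = r - q;                           /* wraps around if r < q */
    uint32_t mask = (uint32_t)((int32_t)t >> 31); /* all-ones if r < q, else 0 */
    return t + (mask & q);                        /* add q back only if r < q */
}
```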

As you can see, the exact same code will be executed whether our result is above or below the modulus, hence it’s constant time. 

Early Function Exit

Another source of constant time violations is early function exits. Consider the Decapsulation algorithm inside ML-KEM (FIPS 203, Algorithm 18) captured in Figure 3 below:

Figure 3: ML-KEM Decapsulation Algorithm with Implicit Rejection

Implicit rejection at line 10 means that an attacker must never learn whether the key was accepted or rejected. We achieve this by returning a fake key K̄ when the ciphertexts don't match (i.e., the rejection case). However, some implementations might exit early at line 5 if the call to K-PKE.Decrypt fails. This early exit defeats the purpose of implicit rejection since it occurs before the key is returned. Even if you insert a fake key return at the early exit, an attacker with the right equipment might still deduce from the early exit that a rejection must have happened.
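One branch-free way to implement the ciphertext comparison and key replacement is a constant-time comparison that produces a mask, followed by a masked byte-wise select. The helper names below are mine, not from the standard:

```c
#include <stddef.h>
#include <stdint.h>

/* Constant-time comparison: examines every byte with no early exit.
   Returns 0x00 if the buffers are equal, 0xFF otherwise. */
static uint8_t ct_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= (uint32_t)(a[i] ^ b[i]);
    /* fold any set bit into the top position, then spread to a full mask */
    return (uint8_t)(0u - ((acc | (0u - acc)) >> 31));
}

/* Constant-time select: out = good if mask == 0x00, bad if mask == 0xFF. */
static void ct_select(uint8_t *out, const uint8_t *good, const uint8_t *bad,
                      uint8_t mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((good[i] & (uint8_t)~mask) | (bad[i] & mask));
}
```

In decapsulation, the mask would come from comparing the received ciphertext with the re-encrypted one, and the select would pick between the real shared secret and the implicit-rejection key K̄, with no data-dependent branch anywhere on the path.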

Sometimes, constant-time violations seem to exist in the standard itself. Consider the rejection sampling loop in the ML-DSA internal Sign (FIPS 204, Algorithm 7). I constructed Figure 4 below from a snapshot of the relevant code (right) and a corresponding state diagram (left) to help visualize where the validity checks occur and the resulting rejection and acceptance paths.

Figure 4: ML-DSA Internal Sign Rejection Sampling Loop

You can see from the figure that this loop has three different paths depending on four validity checks. This can offer an attacker a wealth of timing information through early exits, variations in loop counts, and the signature success rate. In the standard's defense, one might argue that the values at risk are not highly sensitive. However, if you follow the overall defense-in-depth tone that these standards promote, you'd want to prevent leakage of even slightly sensitive data, since it may reveal information about highly sensitive data. Even if this is unlikely, we should always be wary of time-dependent logic, even when it is safe, because it can mask other, unsafe timing violations. Here, the best practice is to strengthen your constant-time validation to make sure no real timing violation escapes your radar.

Zeroize Everything Properly

One aspect that often gets overlooked is proper destruction of secrets at the right time. This is even more critical for PQC algorithms since they generate a sizeable volume of intermediate values during computation - temporary vectors, random values, partial results. Each of these could potentially leak information if not properly destroyed after use.

The FIPS 204 standard is refreshingly explicit about this: "implementations of ML-DSA shall ensure that any potentially sensitive intermediate data is destroyed as soon as it is no longer needed." This applies even to verification algorithms, which usually contain only public data, because in some applications the verification inputs themselves need to remain confidential.

For ML-KEM, this includes intermediate computation values like e, e1, and e2; randomness generated by the RBG like d and z; and private keys like s and dkPKE. The most secure method is explicit zeroization: writing zeros over the secrets in memory.

There are optimization opportunities even when you destroy secrets. Consider the signer's commitment vector y and the signer's response vector z in ML-DSA. Since they both have the same size, you can use the same memory allocation for both. First store y. The last time you need y is when you start computing z = y + c·s1. So, you can simply store z on top of y, destroying y in the process.
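A sketch of that in-place overwrite for a single flattened polynomial (modular reduction and the full vector structure are omitted for brevity; the names are illustrative):

```c
#include <stdint.h>

#define N 256  /* coefficients per polynomial */

/* Overwrite y in place with z = y + c*s1, coefficient by coefficient.
   The buffer that held y becomes z, so y is destroyed as a side effect
   and needs no separate zeroization pass. */
void response_in_place(int32_t y_then_z[N], const int32_t c_times_s1[N])
{
    for (int i = 0; i < N; i++)
        y_then_z[i] += c_times_s1[i];  /* y[i] is gone once z[i] lands */
}
```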

One common pitfall is attempting to zeroize secrets at the end of code without memory barriers. Many compilers might optimize the zeroization away, leaving secrets in memory. It's part of secure development to intentionally and explicitly insert a compiler barrier to ensure sensitive data is actually erased. One easy way to do this without calling any libraries is to use inline assembly: 

asm volatile("" ::: "memory");
 

Rejection Sampling and Maximum Iterations

Here is a corner case where it is easy to miss destroying secrets. ML-DSA includes rejection sampling loops that could theoretically run indefinitely. FIPS 204 Appendix C provides a maximum-iteration-count option. If this option is used and the maximum number of iterations is exceeded, your implementation must destroy all intermediate results: even when no valid signature was produced, vectors z, h, w, s1, s2, and t0 must be zeroized.

Stack vs. Heap Memory Management

Secrets can reside in either heap or stack memory, and both need careful zeroization. Some implementations skip zeroizing stack memory, thinking it will be destroyed at function exit. This is incorrect. At function exit, the stack pointer moves but doesn't wipe the data. The next function might access residual data and retrieve previous secrets. Make sure to zeroize data structures residing on stack as soon as they’re no longer needed.

Boost Performance through Cache Locality

Memory layout decisions have profound performance implications. When mapping a two-dimensional structure like matrix A (FIPS 203 Algorithm 13 and FIPS 204 Algorithm 6) onto one-dimensional memory, you have two choices: row-wise layout or column-wise layout. The row-wise layout generally provides better cache locality, but there are cases in ML-KEM where the transpose of matrix A is used (FIPS 203 Algorithm 14, line 19). In such cases, the column-wise layout makes the transpose operation cache-friendly. Taking ML-KEM-512 as an example, A is a 2x2 matrix (its rank k grows to 3 for ML-KEM-768 and 4 for ML-KEM-1024). Below, Figure 5 shows the row-wise and column-wise memory layouts, respectively. Keep in mind that modules (matrices and vectors) in both ML-KEM and ML-DSA are made of polynomials (pij), each with 256 coefficients, and each coefficient is bounded by the modulus q.

Figure 5: ML-KEM-512 matrix with its two possible memory layouts
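The two layouts boil down to two indexing formulas. A sketch for a K x K matrix of polynomials (K = 4 shown, as in ML-KEM-1024):

```c
#include <stdint.h>

#define K 4     /* module rank: A is a K x K matrix (4 for ML-KEM-1024) */
#define N 256   /* coefficients per polynomial */

typedef struct { int16_t coeffs[N]; } poly;

/* Row-wise layout: A[i][j] lives at flat index i*K + j, so each row of A
   is contiguous and row-by-row traversal walks memory sequentially. */
static inline poly *at_row_major(poly *a, int i, int j)
{
    return &a[i * K + j];
}

/* Column-wise layout: A[i][j] lives at j*K + i, so traversing the
   transpose of A row by row is what walks memory sequentially. */
static inline poly *at_col_major(poly *a, int i, int j)
{
    return &a[j * K + i];
}
```

The choice is made once, at allocation time; every consumer then uses the matching accessor, and whichever traversal order dominates your workload determines which layout wins.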

Another optimization is to notice which variables are used together and place them all into one contiguous memory allocation. Take for example Step 21 in FIPS 203 Algorithm 14: 

v ← NTT⁻¹(t̂ᵀ ∘ ŷ) + e2 + μ

Here you see the variables t̂, ŷ, e2, and μ used together in one operation. If you allocate one memory region for all these variables, it will improve cache locality and performance.
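One way to sketch this in C is a single struct that keeps the operands adjacent in memory. The structure and the rank K = 3 (ML-KEM-768) are illustrative choices, not a mandated layout:

```c
#include <stdint.h>

#define K 3     /* module rank for ML-KEM-768, chosen for illustration */
#define N 256   /* coefficients per polynomial */

typedef struct { int16_t coeffs[N]; } poly;

/* Keep the operands of v <- NTT^-1(t^T o y) + e2 + mu adjacent in one
   allocation so the combining step enjoys good cache locality. */
typedef struct {
    poly t_hat[K];  /* NTT-domain public vector */
    poly y_hat[K];  /* NTT-domain ephemeral vector */
    poly e2;        /* error polynomial */
    poly mu;        /* decompressed message polynomial */
} encrypt_operands;
```

A single allocation of this struct replaces four scattered ones, so the combining loop streams through one contiguous region instead of chasing four separate pointers.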

Looking Forward: Build PQC That Actually Holds Up

As we head into the quantum era, one thing is clear: strong math isn't enough. Secure post-quantum cryptography depends on disciplined engineering. That means locking down input validation, sourcing strong randomness, and eliminating data-dependent timing wherever it hides. It means treating secrets with respect through proper zeroization and squeezing out safe performance with smart, cache-aware memory layouts, among other things.

Get these five areas right, and you've set ML-KEM and ML-DSA on the right path to deliver the post-quantum security and efficiency they promise. Miss any of them, and even the best algorithms can crumble in real-world software. The future is coming fast, so let's make sure our implementations are ready for it.

Finally, this article only scratched the surface of performance optimization. In the future, we will dig deeper into optimization methods and techniques that would really boost your performance. Stay tuned. 
 

Share Your Feedback

We want to hear from you. Send comments, questions, and feedback to the INT31 team.

About the Author

Anas Hlayhel is a Cryptographer and a Principal Engineer in the INT31 Cryptography Team within the Intel Product Assurance and Security (IPAS) group at Intel. His research interests include post-quantum cryptography, low-level cryptographic implementation and optimization, high-value asset protection, and remote attestation. His prior experience includes mask design, pre- and post-silicon validation, and red team security research. Outside of work, he enjoys nature and reading non-fiction.