Firmware underpins platform trust by initializing hardware, setting security boundaries, and bridging silicon with software. A single firmware defect can compromise the security guarantees built on top of it. The INT31 team approaches firmware security assurance from multiple angles, employing offensive research within the Security Development Lifecycle (SDL) and engaging collaboratively with industry partners—because no single approach is sufficient.
To further strengthen these assurance practices, we pursue formal verification, which provides mathematical evidence to rigorously reason about firmware behavior across all inputs and states. However, applying formal verification to real-world firmware remains a significant research challenge, not a solved problem. In addition to active internal research, we are also tackling this challenge through academic collaboration, open-source tooling, and community engagement. This post describes that effort: how we are driving formal verification for firmware security, demonstrated on the open-source Intel® Trust Domain Extensions (Intel® TDX) Module, and how the broader research community can contribute.
Intel TDX Module: Security-Critical and a Verification Challenge
Intel TDX is a confidential computing technology that provides hardware-enforced VM-level isolation through Trust Domains (TDs), whose memory and CPU state are protected from the host VMM and other privileged software. At the heart of this architecture sits the Intel TDX Module, a runtime firmware component running in a privileged CPU execution mode (Secure Arbitration Mode, or SEAM).
The Intel TDX Module orchestrates interactions between the untrusted VMM and isolated TDs through a dedicated instruction interface. These interface functions constitute the module's attack surface: they carry untrusted inputs from outside the trust boundary, and the Intel TDX Module is designed to validate them and enforce the platform's security policies.
Intel open-sourced the Intel TDX Module, enabling external research and collaboration. However, open source alone does not make verification tractable. Firmware presents distinct challenges that standard software verification tools were not designed for, especially hardware dependency (specialized instructions that break tool abstraction layers) and stateful execution (behavior depends on registers, control structures, and internal state, not just function arguments). These challenges are not unique to Intel TDX, but they are especially acute here, making the Intel TDX Module an architecturally significant target for firmware verification research.
In the following sections, we describe three papers, to be presented at TACAS1, FSE2, and HOST3 in 2026, that tackle these challenges from complementary angles. Working with LMU Munich and National Taiwan University (NTU), we conducted the first large-scale empirical study of firmware verification and built open tools and benchmarks for the broader community. In parallel, a collaboration with Princeton University developed a scalable flow-level verification methodology that uses assume-guarantee reasoning to verify properties across Intel TDX lifecycles. The following sections trace this progression: characterizing the problem, building the infrastructure, and scaling the methodology.
A Large-Scale Firmware Verification Study
Partnering with LMU Munich and NTU, we conducted what we believe is the first large-scale empirical study of applying automatic software verifiers to commodity firmware. We selected 22 interface functions from the Intel TDX Module, covering various Intel TDX lifecycles for both Host and Guest, and derived 418 verification tasks from the unmodified code base. These are tasks extracted directly from shipping firmware, not simplified models.
We adapted established code-level model checking techniques with three firmware-specific extensions: (1) a CEGAR-like hardware abstraction that overapproximates firmware-hardware interactions and refines only on false alarms; (2) systematic proof-harness engineering for negative-space verification, where each harness sets up symbolic state, encodes preconditions, and checks for proper error-handling; and (3) modular type-based constraints serving as reusable invariants over firmware data structures. Together, these techniques transform firmware verification into a form that existing software-analysis tools can attempt.
We evaluated six state-of-the-art tools spanning the major verification paradigms: CBMC and ESBMC (bounded model checking), KLEE (symbolic execution), MOPSA (abstract interpretation), Ultimate Automizer (automata-based analysis), and CPAchecker (configurable software verification). 59% of tasks were solved by at least one tool out of the box—a rate reflecting both the maturity of current tools and the inherent difficulty of firmware code. Many of the unsolved tasks stemmed not from weaknesses in the tools' core analysis techniques but from firmware-specific code patterns that triggered frontend parsing failures.
The most significant finding concerns why tools fail on firmware. Three code patterns appeared repeatedly as persistent obstacles:
- Nested type punning: Firmware interfaces with hardware registers through deeply nested, often anonymous unions and interchangeable pointer-type casting (Figure 2). This is how firmware shares a finite set of hardware registers across operations, yet most software verification tools cannot parse these constructs or silently misinterpret them.
- Compiler memory-layout attributes: Firmware often uses compiler attributes such as __attribute__((aligned(n))) and __packed__ for hardware-precise memory layouts, which several tools mishandle, producing incorrect results.
- Reimplemented standard libraries: With no OS underneath, firmware reimplements memcpy, memset, and similar functions using inline assembly that verification tools cannot recognize as semantic equivalents, preventing the tools from applying their built-in models of these functions.
These are not corner cases; they are pervasive firmware idioms. The obstacles are well-characterized and have been reported upstream, with several tool maintainers already addressing them. The study was selected as an ETAPS Distinguished Paper. All verification tasks, proof harnesses, and hardware abstractions are publicly released. For the full methodology and results, we refer readers to the paper1 and its supplementary materials.
Tooling, Benchmarks, and Community Engagement
Our case study with LMU Munich and NTU shows that the gap between industrial firmware and academic verification tools is structural: it stems from missing language support and parsing failures, not just parameter tuning. We approached this gap from multiple directions: building tooling to automate the verification workflow, releasing benchmarks to the community, establishing standardized primitives, and catalyzing a new competition track.
On the tooling side, a key component is HarnessForge, an open-source tool that automates the extraction of self-contained verification tasks from large, multi-file C projects. Off-the-shelf verifiers expect a single C file with a main entry point, but industrial firmware spans many source and header files with complex build systems.
HarnessForge automates the pipeline: it intercepts the project's build command to derive the configuration, assembles relevant source files into a single compilation unit, and applies static program slicing to strip away unreachable definitions. It is analyzer-agnostic and operates without modifying the original source code, ensuring that any counterexample or diagnostic produced by the verifier maps directly back to the production codebase.
Beyond preprocessing, HarnessForge directly improves verification outcomes. In our Intel TDX Module tasks, its slicing capability reduced effective code size by 60% on average and increased the number of correct verifier results by removing irrelevant code that may contain unsupported firmware-specific constructs. More importantly, HarnessForge is not limited to Intel TDX. It has been applied to other codebases such as AWS C Common Library and GNU Coreutils, demonstrating its applicability across diverse programming styles and architectures. The tool paper2 details its implementation.
Besides HarnessForge, we contributed the Intel TDX Module verification tasks to the International Competition on Software Verification (SV-COMP), leading to the creation of a dedicated firmware verification track—a first in the competition's history. These tasks also exposed that tool-specific havocking primitives significantly outperform generic approaches for initializing complex data structures, motivating the adoption of a new standardized primitive (__VERIFIER_nondet_memory()) for this purpose in SV-COMP. In addition, we introduced litmus testing tasks designed to catch quiet failures, where verifiers silently produce incorrect results on firmware-specific constructs rather than crashing or reporting an error.
Several top-performing software verifiers are actively addressing the issues our benchmarks exposed: fixing anonymous union handling, improving nested type support, and investigating failures on complex firmware structures. The firmware track provides tool developers with real code from a real security-critical product, with properties that matter in practice. The benchmarks are public, HarnessForge is open-source, and the firmware verification track is live.
Scaling Verification to Flow-Level Properties
Verifying individual interface functions is necessary but not sufficient for firmware security. A VMM accomplishes meaningful tasks, e.g., creating a Trust Domain, adding memory pages, or tearing down a TD, by issuing sequences of SEAMCALL functions in a precise order. We call these sequences flows. Consider setting up a new TD: the VMM must call TDH.MNG.CREATE, then TDH.MNG.KEY.CONFIG, then TDH.MNG.ADDCX multiple times, and finally TDH.MNG.INIT. Skipping a step or reversing the order results in invalid TD state. Function-level verification alone cannot catch these issues; reasoning about the composition of functions and the system state they collectively produce is required.
Collaborating with Princeton University, we tackled this problem on 8 flows that cover major Intel TDX lifecycles, such as Intel TDX Bootup, TD Setup, and TD Destruction. Together these span 25 distinct SEAMCALL functions. The key bottleneck is complexity: naively inlining all function bodies into a single verification query does not scale for any of the leading verification tools.
Our approach is a two-step modular verification methodology. First, we verify each function individually using assume-guarantee reasoning, producing a verified function summary that abstracts the function body into a compact pre/postcondition contract. Second, we compose these summaries in the flow order to construct flow verification tasks.
The key mechanism is the function summary. Once a SEAMCALL function is proven with respect to its function contract, its body is replaced with a non-deterministic assignment constrained by postconditions. The model checker no longer re-analyzes function internals when checking the flow; it reasons only about guaranteed state transitions. The summaries also enable a directional shift in assume-guarantee reasoning: preconditions assumed during function verification become assertions in the flow context, catching cases where a preceding function left the system in an unexpected state.
The results demonstrate the viability of the approach: verification of all 8 selected flows converged successfully. The modular methodology is what makes this tractable; without function summaries, every flow exceeds a 2-hour timeout. The paper3 details the technical analysis of the assume-guarantee construction and scaling behavior.
Our Collaborators, What's Next, and Where to Find Us
Each of these efforts—the empirical study, the tooling and benchmarks, and the flow-level methodology—reflects sustained collaboration across institutions. Prof. Dirk Beyer's group at LMU Munich's SoSy Lab, including Po-Chun Chien, Thomas Lemberger, and Nian-Ze Lee (now leading the ForMACE Lab at NTU), brought world-class expertise in verification infrastructure and competition-driven research as the organizers behind SV-COMP. Their understanding of what makes software verification tools succeed at scale shaped the benchmarking foundation underpinning our approach. On the formal methods front, Prof. Sharad Malik and Prof. Aarti Gupta at Princeton, working with Sophia Zhang, advanced the modular verification methodology and hardware-software co-verification techniques. Their work on decomposing complex firmware flows into tractable verification subproblems opened new scalability directions for firmware verification.
We will present the Intel TDX case study as an ETAPS Distinguished Paper at TACAS in April 20261, the flow-level verification methodology at HOST in May 20263, and a demo of HarnessForge at the FSE tool demo track in July 20262. We welcome engagement from researchers working on verification, firmware security, or formal methods more broadly.
Building on these results, we are extending the methodology to concurrency and multi-core verification scenarios, expanding flow coverage across additional firmware modules, and integrating property-directed slicing directly into the toolchain. We are also exploring how AI-assisted harness generation can lower the barrier to entry for verification engineers, and we will continue growing the open firmware verification benchmark to provide the research community with a shared foundation.
Formal verification is one component of Intel's broader approach to firmware security, complementing industry partnerships, offensive security research, and the SDL practices integrated throughout our development process. The tools, benchmarks, and methodologies described here are publicly available. We invite tool developers, firmware engineers, and formal methods researchers to build on these resources—and to connect with us at the conferences coming up.
References
- "A Case Study in Firmware Verification: Applying Formal Methods to Intel® TDX Module," International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2026
- "HarnessForge: Automated Extraction of Verification Tasks from Industry-Scale Software Projects," International Conference on the Foundations of Software Engineering (FSE), 2026
- "Formal Firmware Verification of an At-Scale VM-level TEE Architecture," International Symposium on Hardware Oriented Security and Trust (HOST), 2026
Share Your Feedback
We want to hear from you. Send comments, questions, and feedback to the INT31 team.
About the Author
Bo-Yuan Huang is an offensive security research scientist in Intel's INT31 team, where he focuses on formal methods for hardware and firmware security assurance. His work centers on automated verification, scalable assurance, and security architecture hardening, with particular expertise in confidential computing and graphics accelerators. His research emphasizes rigorous analysis of real-world SoCs and firmware and has been published widely in security and verification communities. Before joining Intel, he earned his Ph.D. in Electrical and Computer Engineering from Princeton University.