From Reading Specs to Mining Them: How AI Automates Security Gap Detection in Specifications

Author:
Priyam Biswas, INT31

What Is Specification Mining and Why Does It Matter?

Every engineered system starts with a specification. Whether it describes a processor interface, a communication protocol, or a firmware API, the specification is the authoritative document that defines how the system should behave. The fundamental challenge is that specifications are written in natural language. Engineers must capture complex architectural intentions, including state machines, security boundaries, error conditions, and component interactions, using text, tables, and diagrams. However, natural language is inherently ambiguous and context-dependent. What seems clear to the specification author may be interpreted differently by the engineers who must translate that intent into implementations across multiple downstream phases of development.

This translation gap matters because specifications inform every subsequent phase. Engineers derive security properties from them. Test engineers build test plans based on them. Verification engineers write assertions based on them. Designers implement the architecture described in them. Each phase requires extracting structured, actionable knowledge from natural language, and each phase risks losing or misinterpreting the original architectural intent. The further downstream, the harder it becomes to recover what the specification author meant.

Specification mining, the automated extraction of structured, actionable knowledge from natural language specifications, addresses this challenge. The problem space is broad: specification mining can support threat modeling by identifying trust boundaries and attack surfaces, test case generation by surfacing untested scenarios, security property extraction by formalizing implicit constraints, compliance verification by mapping specification content against regulatory requirements, and security gap detection by identifying what specifications leave unsaid. These are all facets of the same underlying problem: transforming unstructured specification text into structured, actionable outputs that downstream engineering phases can consume reliably.

In practice, this work is manual. A domain expert reads through the specification, mentally reasons about the implications of each feature, checks whether the specification addresses edge cases and failure modes, and documents findings as requirements, test cases, or security rules. For a 500-page specification, this can take days to weeks. The process is thorough when done well, but it is slow, it depends on expert availability, and it is difficult to scale across many specifications.

We explored whether generative AI could automate significant portions of this workflow. While the underlying pipeline and techniques apply to the broader specification mining problem, this post focuses primarily on one facet: identifying security gaps in specifications, places where architectural intent has been incompletely captured. A previous blog post in this series discussed how formal verification can be used to verify such gaps once they are identified. This post focuses on the step before that:

How do we identify security gaps in specifications systematically and at scale?
Does AI-driven specification mining remove the need for human security architects, or does it augment them?

We describe the approach we developed, the key design decisions, and what we learned.

An Illustrative Example

To ground the discussion, consider the following excerpt from the NVMe Base Specification, Revision 2.3 (NVM Express, 2025), Section 3.2.1, describing namespace identifier types and their relationship to controllers:

"Active NSIDs for a controller refer to namespaces that are attached to that controller. Allocated NSIDs that are inactive for a controller refer to namespaces that are not attached to that controller.

An allocated NSID may be an active NSID for some controllers and an inactive NSID for other controllers in the same NVM subsystem if the namespace that the NSID refers to is attached to some controllers, but not all controllers, in the NVM subsystem.

Unless otherwise noted, specifying an inactive NSID in a command that uses the Namespace Identifier (NSID) field shall cause the controller to abort the command with a status code of Invalid Field in Command. Specifying an invalid NSID in a command that uses the NSID field shall cause the controller to abort the command with a status code of Invalid Namespace or Format."

This is clear, well-written specification language. It defines NSID classification, the multi-controller sharing model, and the error behavior for inactive and invalid NSIDs. It also references other sections for further detail: the "unless otherwise noted" qualifier implies exceptions defined elsewhere, and Section 8.1.16 covers namespace management operations in detail. The full picture of namespace behavior is distributed across multiple sections of a 784-page document. This fragmentation is itself part of the challenge: a reader must aggregate cross-referenced sections into a coherent set of requirements before they can reason about what is missing.

Now consider the security questions the excerpt does not answer:

The spec returns different error codes for inactive NSIDs ("Invalid Field in Command") versus invalid NSIDs ("Invalid Namespace or Format"). Can an attacker distinguish allocated-but-inactive namespaces from nonexistent ones by probing NSIDs and observing which error code comes back?
The spec says a namespace can be active on some controllers and inactive on others. What prevents a controller from issuing commands against an NSID that is active on a different controller, bypassing namespace isolation?
The spec describes the steady state (active vs. inactive) but not the transition. What happens if a namespace is being detached from a controller while I/O commands referencing that NSID are already in flight?
The spec qualifies the abort behavior with "unless otherwise noted." Which commands are the exceptions, and do those exceptions introduce paths where an inactive NSID is silently accepted?

This gap between what is explicitly specified and what is left unspecified is called the negative space. It includes undefined error-handling behavior, implicit assumptions about input validity, missing constraints on state transitions, and unaddressed interactions between components. Negative space is where many real-world vulnerabilities originate, not because the specification is poorly written, but because it is impossible to enumerate every scenario in natural language. The specification author may have clear architectural intent around namespace isolation, but the text does not explicitly address error-code side channels, concurrent attach/detach transitions, or the scope of the "unless otherwise noted" qualifier. The firmware developer implements the expected execution flow. The verification engineer tests the documented scenarios. The driver developer assumes the controller handles edge cases. Each phase operates on incomplete architectural intent, and security gaps emerge in the spaces between these assumptions.

A security architect reading this passage would mentally enumerate these scenarios and check whether they are addressed elsewhere in the document. If they are not, the architect would document them as negative-space rules for the design and verification teams to address.

Multiply this across hundreds of chapters, each containing register definitions, state machines, and command set descriptions, and the scale of the problem becomes apparent.
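The first question above can be made concrete with a small experiment. The following is a minimal sketch, not an NVMe implementation: `MockController`, `probe`, and `errors_are_distinguishable` are hypothetical names, and the assumption that NSIDs 1 through `max_nsid` are valid is a simplification made only for illustration. It models the two abort rules quoted from the excerpt and checks whether a prober can tell the two failure cases apart.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "Successful Completion"
    INVALID_FIELD = "Invalid Field in Command"         # quoted behavior for inactive NSIDs
    INVALID_NAMESPACE = "Invalid Namespace or Format"  # quoted behavior for invalid NSIDs


class MockController:
    """Toy model of the quoted abort rules; not a real NVMe controller."""

    def __init__(self, max_nsid, attached):
        self.max_nsid = max_nsid       # assumption: NSIDs 1..max_nsid count as valid
        self.attached = set(attached)  # NSIDs that are active for this controller

    def submit(self, nsid):
        """Return the status a command using this NSID would complete with."""
        if not 1 <= nsid <= self.max_nsid:
            return Status.INVALID_NAMESPACE
        if nsid not in self.attached:
            return Status.INVALID_FIELD
        return Status.SUCCESS


def probe(controller, nsids):
    """Record the status returned for each probed NSID."""
    return {nsid: controller.submit(nsid) for nsid in nsids}


def errors_are_distinguishable(observations):
    """True if the probe saw more than one kind of error status."""
    errors = {s for s in observations.values() if s is not Status.SUCCESS}
    return len(errors) > 1
```

Calling `probe(MockController(max_nsid=4, attached=[1]), range(1, 8))` yields a different error for NSIDs 2 through 4 than for NSIDs 5 through 7, so the prober learns where the valid range ends without ever being attached to those namespaces. Whether that matters in a real deployment is exactly the kind of question the specification leaves open.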

Why a Simple LLM Prompt Is Not Enough

The obvious starting point is to give a large language model (LLM) the specification text and ask it to identify security gaps. We tried this. It produces useful results, but several characteristics of specification documents and LLM behavior limit its effectiveness:

Specifications are multimodal. Much of the critical information lives not in the text but in state-machine diagrams, register-map tables, and block-level architecture figures. A text-only approach misses these entirely.

Context spans the full document and beyond. A constraint introduced in one chapter may modify the meaning of behavior described in another. Definitions in one specification may govern behavior described in a companion document. For example, the NVMe Base Specification defines namespace types, but the NVMe Management Interface Specification and individual I/O Command Set specifications define additional operations on those namespaces. An expert must cross-reference multiple documents to understand the complete picture. An LLM call operating on a single chapter in isolation cannot see these dependencies.

The task is generative, not retrieval. The system must reason about what is absent, not find what is present. This requires domain knowledge about what types of gaps to look for. Without that guidance, LLMs tend to produce generic observations ("ensure proper error handling") rather than specific, actionable rules.

LLMs produce plausible answers, not necessarily correct ones. A single-prompt approach generates rules that "seem right" on first reading but are not grounded in the actual specification text. The model may hallucinate constraints that do not exist, misattribute behavior to the wrong component, or conflate concepts from different sections. Without a structured quality gate that checks whether each generated rule traces back to specific specification evidence, the output contains a mix of valid insights and confident-sounding fabrications that an engineer cannot distinguish without re-reading the specification themselves.

These limitations led us to a multi-stage architecture that combines several AI techniques.

Approach: A Multi-Stage Pipeline

The following diagram shows the overall workflow:

[Figure: overall pipeline workflow, Stages 1-4]

Stages 1-2: Multimodal Ingestion and Knowledge Graph Construction

The first two stages extract the full information content of the specification and organize it into a queryable knowledge graph.

Text and tables are extracted directly from the document. For figures, the system uses an AI vision model in an interactive mode. Rather than sending a full-page image to the model in one shot, the model is given the ability to request zoomed-in views of specific diagram regions. This matters because state-machine transition labels and register bit-field annotations are often too small to read reliably at page resolution. The model iteratively focuses on the regions that matter, extracting structured data: states, signals, transitions, and register definitions. This interactive approach consistently outperforms single-pass image captioning, especially for dense technical diagrams common in hardware and protocol specifications.
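The zoom loop described above can be sketched as follows. This is an illustrative outline under stated assumptions: the `Reply` structure, the `extract_diagram` function, and the `stub_model` stand-in are all hypothetical, and a real system would call an actual vision model on cropped images rather than reasoning over bounding boxes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Reply:
    facts: list = field(default_factory=list)          # structured items read from this view
    zoom_requests: list = field(default_factory=list)  # sub-regions the model wants enlarged


def extract_diagram(model, full_page, max_depth=3, max_views=20):
    """Start at the full page and follow the model's zoom requests breadth-first."""
    queue = deque([(full_page, 0)])
    facts, views = [], 0
    while queue and views < max_views:   # max_views bounds total model calls
        region, depth = queue.popleft()
        reply = model(region)
        views += 1
        facts.extend(reply.facts)
        if depth < max_depth:            # max_depth bounds how far the model may zoom
            queue.extend((r, depth + 1) for r in reply.zoom_requests)
    return facts


def stub_model(region):
    """Stand-in for a vision model: wide views are unreadable, narrow ones are not."""
    x0, y0, x1, y1 = region
    if (x1 - x0) > 500:
        mid = (x0 + x1) // 2
        return Reply(zoom_requests=[(x0, y0, mid, y1), (mid, y0, x1, y1)])
    return Reply(facts=[f"label read in region {region}"])
```

The two limits are the main design choice in this sketch: without them, a model that keeps requesting zooms would never terminate.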

The extracted text, tables, and visual elements feed into a knowledge graph construction pipeline. A language model identifies entities such as components, registers, state machine nodes, and interface signals, along with typed relationships between them: which register controls which state machine, which signal triggers which transition, which command modifies which resource. A graph clustering algorithm then partitions the graph into functionally related subgraphs (for example, namespace management, power control, error reporting), which serve as the unit of analysis in later stages.

This graph serves as a structured representation of the specification. When the system analyzes a specific chapter, it queries the graph to retrieve relevant context from across the entire document. This is what enables cross-chapter reasoning. The graph captures structural relationships (this register controls that state machine, this signal triggers that transition) that a flat text representation loses.
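To illustrate why a graph query recovers context that a single chapter cannot, here is a minimal sketch of typed triples with a neighborhood lookup. The `SpecGraph` class and the entity and relation names are hypothetical examples, not the schema used in the pipeline; only the section numbers 3.2.1 and 8.1.16 come from the excerpt discussed earlier.

```python
from collections import defaultdict


class SpecGraph:
    """Typed relationships between specification entities, tagged with their source section."""

    def __init__(self):
        self.triples = []                   # (source, relation, target, section)
        self.by_entity = defaultdict(list)  # entity -> indices into self.triples

    def add(self, source, relation, target, section):
        idx = len(self.triples)
        self.triples.append((source, relation, target, section))
        self.by_entity[source].append(idx)
        self.by_entity[target].append(idx)

    def context(self, entity, hops=1):
        """Return every triple within `hops` edges of `entity`, whatever section it came from."""
        seen_nodes, found, frontier = {entity}, [], [entity]
        for _ in range(hops):
            next_frontier = []
            for node in frontier:
                for idx in self.by_entity.get(node, []):
                    if idx not in found:
                        found.append(idx)
                    source, _, target, _ = self.triples[idx]
                    for other in (source, target):
                        if other not in seen_nodes:
                            seen_nodes.add(other)
                            next_frontier.append(other)
            frontier = next_frontier
        return [self.triples[i] for i in found]


graph = SpecGraph()
graph.add("namespace attachment", "modifies", "NSID state", "8.1.16")
graph.add("NSID state", "selects", "abort status code", "3.2.1")
graph.add("power state", "controls", "link behavior", "other")
```

Querying `graph.context("NSID state")` returns facts from both Section 3.2.1 and Section 8.1.16 and omits the unrelated power-state triple, which is the cross-chapter behavior that flat chunking tends to lose.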

Stage 3: Multi-Agent Rule Generation

The core reasoning stage uses multiple AI agents working together:

A Router maps each chapter to the most relevant security domain categories. This is done by an LLM, not by keyword matching, so it handles chapters that discuss security-relevant topics without using expected terminology.

Skill Selection retrieves the matching domain skills and extracts their threat scenarios, analysis questions, and gap patterns. These become part of the prompt context for rule generation.

A Rule Generator, also an LLM, produces candidate negative-space rules using the chapter text, knowledge graph context, and the domain-specific threat scenarios from the selected skills as input.

A Quality Gate (Critic) evaluates each candidate rule on three criteria: Is it grounded in evidence from the specification? Is it specific enough to be useful (not just "ensure proper error handling")? Can an engineer write a test or assertion for it? Rules that do not meet the threshold are rejected.

A Visual agent cross-references the accepted rules against data extracted from diagrams and generates additional rules for gaps visible only in state machines or register maps.

Returning to the NVMe namespace example from earlier, the system would generate rules such as:

Rule: "The NVMe Base Specification returns distinct error codes for inactive NSIDs (Invalid Field in Command) and invalid NSIDs (Invalid Namespace or Format). The specification does not address whether this difference in error response is observable by an attacker who can probe sequential NSID values across controllers in a multi-controller NVM subsystem."
Category: Access Control / Side Channel
Potential Impact: An attacker sharing an NVM subsystem could determine the valid NSID range by probing sequential values and observing which error code is returned, confirming which identifiers fall within the namespace address space even without access to the Identify Controller data structure.
The critic would accept this rule because it is grounded (the spec defines distinct error codes for inactive vs. invalid NSIDs but does not address the observability of this distinction), specific (it names the exact mechanism, NSID probing, and the multi-controller context), and actionable (an engineer can write a test that probes sequential NSIDs and checks whether error responses leak allocation state).
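The accept-or-reject decision can be sketched as a simple threshold over the three criteria. This is a hedged illustration: the `CandidateRule` fields, the 0-to-1 score scale, and the 0.7 threshold are assumptions made for the example. In the pipeline the scores would come from the critic agent; here they are filled in by hand.

```python
from dataclasses import dataclass


@dataclass
class CandidateRule:
    text: str
    category: str
    evidence: list   # specification passages the rule cites
    grounded: float  # critic scores, assumed to lie in [0, 1]
    specific: float
    testable: float


def quality_gate(rules, threshold=0.7):
    """Accept a rule only if it cites evidence and every criterion meets the threshold."""
    accepted, rejected = [], []
    for rule in rules:
        weakest = min(rule.grounded, rule.specific, rule.testable)
        passes = bool(rule.evidence) and weakest >= threshold
        (accepted if passes else rejected).append(rule)
    return accepted, rejected


candidates = [
    CandidateRule(
        text="Distinct error codes for inactive and invalid NSIDs may be observable by a prober.",
        category="Access Control / Side Channel",
        evidence=["Section 3.2.1: abort status for inactive vs. invalid NSIDs"],
        grounded=0.9, specific=0.8, testable=0.9,
    ),
    CandidateRule(
        text="Ensure proper error handling.",
        category="Input Validation",
        evidence=[],
        grounded=0.2, specific=0.1, testable=0.3,
    ),
]
```

Using the minimum rather than the average is a deliberate choice in this sketch: a rule that is well grounded but untestable should not pass on the strength of one high score.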

Domain Skills: Structured Expertise for AI

One key design decision was to build a library of domain-specific security skills. Each skill captures structured threat scenarios and analysis guidance for a particular security area, such as access control, input validation, buffer management, side-channel attacks, or fault injection.

These skills are not hardcoded rules. They are structured representations of how a domain expert would approach the problem: what threat categories to consider, what types of missing constraints to look for, and what questions to ask about each architectural feature. An experienced security architect reviewing a namespace management section would instinctively check for isolation boundaries, error-handling completeness, and race conditions in state transitions. The skills encode that instinct as explicit guidance. The model reasons freely over the specification text but does so with the same orientation a senior architect would bring to the reading.

The skill system is designed to be extensible. Adding a new domain, for instance cryptographic protocol analysis or automotive functional safety, requires authoring a new skill that captures how an expert in that domain would read a specification. The core pipeline does not change.
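One way to picture a skill is as a small structured record that is rendered into prompt context. The sketch below is an assumption about shape only: the `Skill` fields mirror the three elements named earlier (threat scenarios, analysis questions, gap patterns), while the registry, function names, and example content are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    domain: str
    threat_scenarios: tuple
    analysis_questions: tuple
    gap_patterns: tuple


SKILLS = {}  # hypothetical in-memory registry keyed by domain name


def register(skill):
    SKILLS[skill.domain] = skill


def build_prompt_context(domains):
    """Render the selected skills as plain-text guidance for the rule generator."""
    lines = []
    for domain in domains:
        skill = SKILLS[domain]
        lines.append(f"Domain: {skill.domain}")
        lines += [f"  Threat scenario: {t}" for t in skill.threat_scenarios]
        lines += [f"  Ask: {q}" for q in skill.analysis_questions]
        lines += [f"  Gap pattern: {g}" for g in skill.gap_patterns]
    return "\n".join(lines)


register(Skill(
    domain="access control",
    threat_scenarios=("A tenant issues commands against a resource owned by another tenant",),
    analysis_questions=("What enforces the isolation boundary, and where is that stated?",),
    gap_patterns=("Steady state is defined but the transition between states is not",),
))
```

Adding a new domain in this sketch means calling `register` with another `Skill`; `build_prompt_context` and everything downstream stay unchanged, which matches the extensibility goal described above.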

Stage 4: Coverage Analysis

After rules are generated, the system assesses how well the specification already addresses each one. A fast keyword-based scan classifies the clear cases. Uncertain cases are batched and sent to an LLM for deeper analysis. Each rule receives a tag: Covered, Partially Covered, or Not Covered, with supporting evidence from the specification text. This coverage map is the primary deliverable for security architects.
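The two-tier classification can be sketched as a cheap keyword score with an LLM fallback for the middle band. Everything here is illustrative: the `low` and `high` cutoffs, the per-rule term lists, and the `llm_judge` callable are assumptions, and a real keyword scan would need to be considerably more careful than substring matching.

```python
def keyword_score(terms, spec_text):
    """Fraction of a rule's key terms that appear verbatim in the specification text."""
    if not terms:
        return 0.0
    text = spec_text.lower()
    return sum(term.lower() in text for term in terms) / len(terms)


def coverage_map(rules, spec_text, llm_judge, low=0.2, high=0.8):
    """Tag clear cases by keyword score; batch the uncertain ones into one LLM call."""
    tags, uncertain = {}, []
    for rule_id, terms in rules.items():
        score = keyword_score(terms, spec_text)
        if score >= high:
            tags[rule_id] = "Covered"
        elif score <= low:
            tags[rule_id] = "Not Covered"
        else:
            uncertain.append(rule_id)
    if uncertain:
        tags.update(llm_judge(uncertain))  # llm_judge returns {rule_id: tag}
    return tags


def stub_judge(rule_ids):
    """Stand-in for the LLM: marks every uncertain rule as partially covered."""
    return {rule_id: "Partially Covered" for rule_id in rule_ids}


spec_text = "Specifying an inactive NSID shall cause the controller to abort the command."
rules = {
    "abort-inactive": ["inactive NSID", "abort"],
    "detach-race": ["detach", "in flight"],
    "error-observability": ["inactive NSID", "attacker"],
}
```

With these inputs, only `error-observability` falls in the uncertain band and reaches the judge, which is the point of the first tier: the expensive model sees a fraction of the rules.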

What We Learned

Graph-based retrieval outperforms flat chunking. Early versions used standard retrieval-augmented generation (RAG), where the specification text is split into chunks and retrieved by similarity search. Replacing this with a knowledge graph that captures entities and relationships produced significantly more relevant context for rule generation.

Quality gates are essential. Without a critic agent, many generated rules are too generic or insufficiently grounded in the specification. Automated scoring based on traceability, specificity, and actionability criteria filters out these substandard rules before human validation.

Cross-chapter memory reduces duplication. When chapters are processed in parallel, the same rule can be generated from multiple chapters that discuss the same component. Tracking previously generated rules and instructing the models to find new gaps reduces this duplication significantly.
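A minimal sketch of such a memory is shown below. It uses token-overlap (Jaccard) similarity as a stand-in for whatever semantic comparison a production system would use; the `RuleMemory` class, the 0.6 threshold, and the prompt wording are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two rule texts, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class RuleMemory:
    """Remembers accepted rules so later chapters can avoid repeating them."""

    def __init__(self, threshold=0.6):
        self.rules = []
        self.threshold = threshold

    def add_if_new(self, rule):
        """Store the rule unless it closely matches one already seen."""
        if any(jaccard(rule, seen) >= self.threshold for seen in self.rules):
            return False
        self.rules.append(rule)
        return True

    def prompt_hint(self):
        """Text that can be appended to a generation prompt to steer toward new gaps."""
        listed = "\n".join(f"- {r}" for r in self.rules)
        return f"Already identified; find different gaps:\n{listed}"
```

The memory serves two purposes in this sketch: `add_if_new` filters near-duplicates after generation, and `prompt_hint` feeds the existing rules back into the prompt so fewer duplicates are generated in the first place.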

Interactive visual extraction matters. Single-pass AI inference on full-page images misses fine-grained details, such as state transition predicates, register field encodings, and interconnect signal labels, that are critical for accurate diagram extraction. The iterative zoom approach makes extraction reliable where a single pass is not.

Output consistency requires deliberate engineering. Generative models are inherently non-deterministic, and the same specification chapter can produce different rules on successive runs. The pipeline addresses this through low-temperature inference, content-hash caching of LLM responses, structured output schemas, and rule consolidation that merges semantically similar findings. These controls do not eliminate variation entirely, but they reduce it to the point where successive runs on the same document produce substantially overlapping rule sets.
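Of these controls, content-hash caching is the easiest to show in isolation. The sketch below is an assumption about how such a cache might look, not the pipeline's implementation: `CachedLLM` and `call_model` are hypothetical names, and the cache key here covers only the prompt and temperature, whereas a real key would likely include the model identifier and output schema as well.

```python
import hashlib
import json


class CachedLLM:
    """Wraps a model call so identical requests return the stored response."""

    def __init__(self, call_model):
        self.call_model = call_model  # hypothetical callable: (prompt, temperature) -> str
        self.cache = {}
        self.model_calls = 0

    @staticmethod
    def _key(prompt, temperature):
        # Hash a canonical JSON encoding so the key depends only on request content.
        payload = json.dumps({"prompt": prompt, "temperature": temperature}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, prompt, temperature=0.0):
        key = self._key(prompt, temperature)
        if key not in self.cache:
            self.model_calls += 1
            self.cache[key] = self.call_model(prompt, temperature)
        return self.cache[key]
```

A cache of this kind makes reruns on unchanged chapters reproducible by construction; it does nothing for chapters whose text has changed, which is where low temperature, structured output, and consolidation carry the load.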

Results

In pilot deployments on hardware specifications:

The approach reduced the time spent on initial gap identification from days of manual expert effort to hours of automated analysis, with expert review time on top.
The system surfaced negative-space scenarios that manual reviewers had missed, demonstrating the value of systematic analysis guided by domain skills.
Expert validation of generated rules showed high accuracy after the quality gate, with most accepted rules confirmed as valid and actionable.
The pipeline handled documents exceeding a thousand pages through adaptive context management.

This approach does not replace security architects; it augments them. It automates the most time-consuming portion of their workflow, the exhaustive reading and scenario enumeration, so that architects can focus on validation, prioritization, and strategic analysis.

What Is Next

The rules generated today are exported as structured files. The next step is integration with downstream verification environments: feeding negative-space rules into formal property generation, assertion synthesis, or test-plan creation. The connection between gap identification (this post) and formal verification of those gaps (previous post) is a natural one, and closing this loop is an active area of work.

We also see value in community-contributed scenario repositories where practitioners share domain-specific gap patterns. The guidance that steers rule generation today is curated from expert knowledge. Opening this to the broader community would increase coverage across domains and products.

Conclusion

Specification mining requires natural language understanding, domain expertise, and adversarial thinking. It has historically been an exclusively manual, labor-intensive process because it demands all three capabilities simultaneously. Large language models, combined with structured knowledge retrieval and multi-agent orchestration, can automate the exhaustive reading and systematic enumeration that consume most of the analytical effort.

The approach is not limited to gap identification or hardware security. Any domain where structured documents govern system behavior, and where security, reliability, and compliance depend on identifying gaps in architectural intent, is a candidate. When gap identification becomes routine and comprehensive, it transforms specification review from a weeks-long bottleneck into a scalable, repeatable process.

Acknowledgements

I would like to thank Jason M. Fung, Bo-Yuan Huang, Jonny Valamehr, Brent M. Sherman, Setareh Sharifian, and Thais Moreira Hamasaki for their valuable feedback on this blog post, which helped improve both the technical clarity and presentation of the work.

The specification mining research presented here is the result of close collaboration with a dedicated team of researchers. I extend my sincere gratitude to my principal collaborators Shriti Tanya, Sayak Ray, Ben Gras, Mehdi Mohtashemi, Jason M. Fung, and Vivek Tiwari for their valuable feedback on the technical work and their contributions to conceptualizing and advancing the AI-powered specification analysis methodology discussed herein.

Share Your Feedback

We want to hear from you. Send comments, questions, and feedback to the INT31 team.

About the Author

Priyam Biswas is an Offensive Security Research Scientist at IPAS, where she specializes in secure system development and vulnerability identification. Her current work drives the integration of Generative AI into security assurance and the development of next-generation security mechanisms. Priyam earned her PhD in Computer Science from Purdue University, with a research foundation in compiler-based security and applied cryptography.
