SpecTrap: How Compliance Pressure Degrades AI-Generated Formal Specifications
Rahul Kumar
AI models are increasingly used to generate formal specifications and property-based tests for verification pipelines. We show that production-style prompts ("generate at least 8 properties, do not refuse") cause specification soundness to collapse: GPT-4o drops from 60% to 13% file-level correctness (p = 1.76×10⁻⁴, 252 generations, 3 models). The one-line remediation that fixes factual fabrication does not transfer to specification generation -- a novel negative finding. We release SpecTrap, a pip-installable tool that adversarially tests AI spec generators using Hypothesis validation and Z3 cross-checking, with all data and code open source.
Specification synthesis typically focuses on two primary criteria: soundness and the ability to filter out incorrect programs. This work ensures soundness by combining the Hypothesis PBT library with the Z3 SMT solver. These tools complement one another to provide a robust verification framework. In contrast, specification strength is evaluated based on the diversity of generated specifications; results indicate that LLMs struggle in this area. Enhancing strength, potentially through iterative refinement, will be key to improve the quality of output.
Cite this work
@misc {
title={
(HckPrj) SpecTrap: How Compliance Pressure Degrades AI-Generated Formal Specifications
},
author={
Rahul Kumar
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


