May 25, 2026
SpecCheck: When LLMs Formalize, Who Checks the Spec?
Bhagyesh Kumar
Proof checkers guarantee that code satisfies a specification, but not that the specification captures what the user intended. When large language models formalize natural language requirements, they can silently weaken postconditions, drop edge cases, or resolve ambiguous terms in ways the user did not expect: a failure mode we call intent drift. We present SpecCheck, a pipeline that detects intent drift through two complementary techniques: ambiguity elicitation, which surfaces implicit assumptions before formalization, and round-robin cross-validation, where two independent models each formalize the same requirement and informalize the other's output. Across Lean~4 theorems (VERINA) and Python Hypothesis property suites, we find that the value of cross-model validation is formalism-dependent: on Lean~4, it exposes under-formalization that single-model assessment systematically misses (20-percentage-point gap), while on property-based tests, single-model assessment is already well-calibrated. The full pipeline achieves a 97% confirmation rate on correct specifications and 100% detection on deliberately weakened ones. Ambiguity count at elicitation time predicts which requirements will cause trouble downstream, flagging difficult requirements before formalization.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) SpecCheck: When LLMs Formalize, Who Checks the Spec?
},
author={
Bhagyesh Kumar
},
date={
5/25/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


