Cross-Model Spec Comparison: Finding Disagreement in Candidate Lean 4 Specifications Generated by OpenAI Models
Henry Ward
We study whether candidate formal specifications produced by different OpenAI models agree on the intended behavior of small systems. For three specification tasks, we generate three Lean 4-oriented candidates per task, normalize them into a shared canonical format, and compare assumptions, preconditions, postconditions, invariants, and semantic boundary choices. We then search for concrete counterexamples using executable validators and property-based style witnesses. In the live OpenAI-backed run, the models converged on the sortedness and token-bucket tasks, but diverged sharply on token expiry: every pairwise comparison for that task produced a counterexample at the exact expiry boundary. The main takeaway is that cross-model disagreement is a practical signal for underspecification when the prompt exposes a crisp semantic boundary, and that a lightweight normalization plus witness-generation pipeline can turn that disagreement into reportable evidence under a very small budget.
A clean, honest proof of concept, and the token-expiry result is the most convincing of the three: every model pair splits at exactly the inclusive/exclusive boundary, the kind of ambiguity that can undermine a specification. The disposition is right throughout: the two converging tasks are reported as such, and the dual-use risk is raised without prompting. The evidence is thinner than the claim, though. One of three tasks produced any disagreement at all, and its witness was hand-seeded for precisely the boundary it found, so the result partly shows that a check finds the case it was written to find. That also leaves the two 0/9 convergences uninterpretable: a settled spec, or a prompt that was simply not adversarial enough? A benchmark of boundaries that were not pre-seeded, together with a measured false-positive rate for the disagreement classifier the authors themselves call noisy, is what would carry this from a demonstration to a method.
Good work. This is a cool project that is filled with good ideas, albeit its evaluation scope is fairly small.
Cite this work
@misc {
title={
(HckPrj) Cross-Model Spec Comparison: Finding Disagreement in Candidate Lean 4 Specifications Generated by OpenAI Models
},
author={
Henry Ward
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


