May 24, 2026
Cross-Model Spec Comparison: Finding Disagreement in Candidate Lean 4 Specifications Generated by OpenAI Models
Henry Ward
We study whether candidate formal specifications produced by different OpenAI models agree on the intended behavior of small systems. For three specification tasks, we generate three Lean 4-oriented candidates per task, normalize them into a shared canonical format, and compare assumptions, preconditions, postconditions, invariants, and semantic boundary choices. We then search for concrete counterexamples using executable validators and property-based style witnesses. In the live OpenAI-backed run, the models converged on the sortedness and token-bucket tasks, but diverged sharply on token expiry: every pairwise comparison for that task produced a counterexample at the exact expiry boundary. The main takeaway is that cross-model disagreement is a practical signal for underspecification when the prompt exposes a crisp semantic boundary, and that a lightweight normalization plus witness-generation pipeline can turn that disagreement into reportable evidence under a very small budget.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Cross-Model Spec Comparison: Finding Disagreement in Candidate Lean 4 Specifications Generated by OpenAI Models
},
author={
Henry Ward
},
date={
5/24/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


