May 24, 2026
Vibe-Coding Specs: Eliciting, Editing, and Verifying Specifications for AI Coding Agents
Linh Le, David Williams-King
Specifications for real systems do not exist as one-shot artifacts: the user's intent emerges as they discover edge cases, rewrite drafts, and react to failing tests. We present an iterative pipeline that takes this seriously --- every spec artifact (the Lean predicate, the Python reference oracle, the generated code, the LLM-emitted concrete test cases) is independently editable and individually re-verifiable. Each round of the brief regenerates only the artifacts the reviewer asks for; the rest are preserved. This operationalises Dodds' position that "specifications don't exist" in the complete-and-coherent form formal-verification tools expect --- the right tooling shape is iterate-and-export, not generate-and-ship. Concrete contributions: (i) the iterative pipeline itself --- editable Lean, code, oracle, and test cases; (ii) LLM-emitted concrete test cases derived from the spec, run row-by-row against the agent's implementation; (iii) a behavioral + structural invariant DSL; (iv) mutation testing for specifications; (v) Lean 4 generation of a structural Diff->Prop block and an algorithmic predicate over the function's arguments, both type-checked by lake build; (vi) Hypothesis property-based testing of the agent's code against the (editable) reference oracle. On the 60-pair evaluation derived from prior work, our structured validator reaches 98.3% accuracy, 2.2% false-accept rate, and Cohen's K = 0.957 while matching the strongest LLM judge at ~300x lower wall-clock cost. The mutation harness finds 39.7% of perturbations load-bearing with zero brittle mutations. Cross-dataset experiments on five Python benchmarks spanning 2021-2025 (MBPP, HumanEval, BigCodeBench, HumanEval Pro, LiveCodeBench) show 100% Lean type-check across 25 problems, and a codegen-validation rate that improves from 40% to 92% after three targeted fixes. We trace the iterative pipeline end-to-end on a deliberately under-specified word-wrap brief whose intent stabilised only after four rounds of refinement, and use the simpler is_not_prime task to show the same pipeline handling a clean one-shot intent. All code, the Lean prelude, the evaluation harness, and a browser demo are released.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Vibe-Coding Specs: Eliciting, Editing, and Verifying Specifications for AI Coding Agents
},
author={
Linh Le, David Williams-King
},
date={
5/24/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


