Verified But Wrong
Philip Nilsson
Verified But Wrong studies a target-validity failure mode in vericoding: formally verified code can satisfy a supplied specification while violating the natural-language intent that specification was meant to capture. We audit a public Dafny vericoding benchmark, validate 11 externally sourced demonstrations including 3 direct-Dafny cases, and show in a controlled benchmark that incomplete specs can select wrong implementations. The core recommendation is simple: before using a formal spec as a candidate-selection target, vericoding pipelines should audit whether the spec is actually the right target.
Aligning natural language with formal specifications is a challenging but important task. This work introduces an additional step to identify discrepancies between the natural language instructions and the generated specifications. However, the methodology for detecting these gaps and ensuring the reliability of this check is insufficiently explained. The verification process appears to rely heavily on model-based evaluation; incorporating a simple, deterministic symbolic check would significantly enhance the system's reliability.
Cite this work
@misc {
title={
(HckPrj) Verified But Wrong
},
author={
Philip Nilsson
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


