SpecSaboteur
Raj Taware
SpecSaboteur validates formal specifications by generating adversarial "malicious compliance" implementations that satisfy every spec constraint while violating intended behavior — the dual of CEGIS applied to specification refinement. Tested on 14 Dafny and software specs across 3 strength tiers, it detects 12 gap patterns (including security-critical reentrancy and auth gaps), achieves convergence in ≤2 refinement iterations, and produces a gap taxonomy for spec-repair training — all at zero API cost using open-source models.
The inversion at the heart of this, generating a verified implementation that satisfies the spec yet violates intent and treating that as a concrete gap, is a genuinely original idea, and grounding Layer 1 in Dafny makes the gaps it finds real. The weak/medium/strong gradient is honest evidence the method discriminates. The execution does not yet match the concept. One of the quantitative pillars is broken: the sampling script recorded zero gaps for every trial, and the detection table is reconstructed from console logs. Layer 2's LLM judge inverts, with strong specs yielding more gaps than weak, so the trustworthy results are the six Dafny specs, and the generalization-to-real-software claim should be softened to match. The strong-tier false positive is evidence that soundness rests on an LLM intent oracle, not the verifier. The duality exposition restates itself and could be cut, freeing room for the random-code baseline the paper admits is missing.
This is a very exciting idea and the hackathon submission was a good proof-of-concept. I would be very surprised if there was not a significant research opportunity in this direction and there is applicability to adversarial training.
The pipeline is well designed and structured. The execution suffered somewhat from implementation issues and it is not clear that 'convergence' implies some epistemic value rather than lack of model creativity One of the tables was poorly formatted but otherwise the write-up was clear and understandable.
The limitations highlighted were sensible caveats; I would have liked the authors to broaden assessment beyond only Qwen2.5-Coder-32B. The benchmark as presented was pretty small and maybe overly simple to be able to make reasonable judgement on scalability. Illustration, even if brief, of how this technique performs with harder problems or stronger models would have been a strong value-add to the results.
Cite this work
@misc {
title={
(HckPrj) SpecSaboteur
},
author={
Raj Taware
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


