Counterexample-Guided Validation & Repair of LLM-Generated Safety Specifications
Yifei Lu
LLMs can turn natural-language safety requirements into formal specifications, but a fluent specification can still be the wrong one. It might quietly drop a guard, block access the policy should allow, or respond inconsistently when a safety-relevant input changes. We built a pipeline that checks whether an LLM-generated specification matches the intended policy, using biosecurity access control as a testbed. Each requirement is broken into traceable safety obligations such as clearance, training, approval validity, expiration, external anonymization, and emergency logging. Candidate specifications are written in a restricted DSL, and four oracles judge each one together: obligation coverage, sampled behavioral agreement, metamorphic relations, and SMT-guided counterexample search over Z3. Every candidate ends up with a risk-weighted review card. Across 30 simulated LLM candidates and 60 controlled mutants, small manual test suites caught only 63.3% of injected faults and 55.3% of the high-severity ones, while systematic validation reached 98.9% overall and 100% on high-severity faults. Feeding counterexamples back as repair signals fixed 29 of 30 candidates, raising the mean score from 70.1 to 95.4 and clearing every high-severity under-constraint, with one regression. In a live run, specifications from Gemini 3.5 Flash were parsed and validated end-to-end, and obligation-guided prompting reached full obligation coverage. Together, these results show that LLM-generated safety specifications can be validated systematically, with each failure traced back to a specific missing guard.
This project attacks an core part of the deployment of AI to SPS. The particular idea here is not especially new, and the active focus of ongoing research, but the execution is excellent.
The use of large amounts of external ground truth make it a compelling contribution to the field. The experimental design is well-considered and the authors displayed considerable honesty and epistemic hygiene in their analysis and discussion. The write-up is convincing in its gathering of relevant evidence and discussion of what the work does and does not show.
It suffers, slightly, from relying on a non-adversarial threat model. I expect that high-value deployment of this and similar techniques will be in a situation where that assumption does not hold (e.g. with concerns of secret loyalties or misalignment). The arguments for applicability are also a little too dependent on confidence in gold validity.
The author has good instincts for this problem. The observation that all three systematic strategies converge on the same detection rate and that the formal oracles add zero raw detection over sampling, thus their value is diagnostic, not detective is sharp.
The execution is sparse relative to those instincts, and the useful future work is to build the harder experiments the author's own thinking is pointing at.
Cite this work
@misc {
title={
(HckPrj) Counterexample-Guided Validation & Repair of LLM-Generated Safety Specifications
},
author={
Yifei Lu
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


