May 24, 2026

Vibe-Coding Specs: Eliciting, Editing, and Verifying Specifications for AI Coding Agents

Linh Le, David Williams-King

Specifications for real systems do not exist as one-shot artifacts: the user's intent emerges as they discover edge cases, rewrite drafts, and react to failing tests. We present an iterative pipeline that takes this seriously: every spec artifact (the Lean predicate, the Python reference oracle, the generated code, the LLM-emitted concrete test cases) is independently editable and individually re-verifiable. Each round of the brief regenerates only the artifacts the reviewer asks for; the rest are preserved. This operationalises Dodds' position that "specifications don't exist" in the complete-and-coherent form that formal-verification tools expect; the right tooling shape is iterate-and-export, not generate-and-ship. Our concrete contributions are: (i) the iterative pipeline itself, with editable Lean, code, oracle, and test cases; (ii) LLM-emitted concrete test cases derived from the spec, run row-by-row against the agent's implementation; (iii) a behavioral + structural invariant DSL; (iv) mutation testing for specifications; (v) Lean 4 generation of a structural Diff->Prop block and an algorithmic predicate over the function's arguments, both type-checked by lake build; (vi) Hypothesis property-based testing of the agent's code against the (editable) reference oracle. On the 60-pair evaluation derived from prior work, our structured validator reaches 98.3% accuracy, a 2.2% false-accept rate, and Cohen's κ = 0.957, while matching the strongest LLM judge at roughly 300x lower wall-clock cost. The mutation harness finds 39.7% of perturbations load-bearing, with zero brittle mutations. Cross-dataset experiments on five Python benchmarks spanning 2021-2025 (MBPP, HumanEval, BigCodeBench, HumanEval Pro, LiveCodeBench) show a 100% Lean type-check rate across 25 problems and a codegen-validation rate that improves from 40% to 92% after three targeted fixes.
We trace the iterative pipeline end-to-end on a deliberately under-specified word-wrap brief whose intent stabilised only after four rounds of refinement, and use the simpler is_not_prime task to show the same pipeline handling a clean one-shot intent. All code, the Lean prelude, the evaluation harness, and a browser demo are released.
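
As a rough illustration of checking an agent's code against a reference oracle, the sketch below compares two candidate implementations of is_not_prime with a slow trial-division oracle. It is not the authors' harness: the abstract says they use Hypothesis, whereas this sketch swaps in a standard-library bounded sweep plus seeded random sampling so that it runs without extra dependencies. All function names here (oracle_is_not_prime, candidate_is_not_prime, buggy_is_not_prime, find_counterexamples) are hypothetical.

```python
import random


def oracle_is_not_prime(n: int) -> bool:
    """Hypothetical reference oracle: slow but obviously correct trial division."""
    if n < 2:
        return True
    return any(n % d == 0 for d in range(2, n))


def candidate_is_not_prime(n: int) -> bool:
    """Stand-in for agent-generated code: trial division up to sqrt(n)."""
    if n < 2:
        return True
    d = 2
    while d * d <= n:
        if n % d == 0:
            return True
        d += 1
    return False


def buggy_is_not_prime(n: int) -> bool:
    """Deliberately wrong variant: forgets that 0 and 1 are not prime."""
    return any(n % d == 0 for d in range(2, n))


def find_counterexamples(impl, oracle, inputs):
    """Return every input on which the implementation disagrees with the oracle."""
    return [n for n in inputs if impl(n) != oracle(n)]


# Bounded sweep of small inputs plus a seeded random sample of larger ones.
# (An assumption for this sketch; Hypothesis would generate and shrink inputs itself.)
rng = random.Random(0)
inputs = list(range(0, 200)) + [rng.randrange(200, 5000) for _ in range(50)]

good_failures = find_counterexamples(candidate_is_not_prime, oracle_is_not_prime, inputs)
bad_failures = find_counterexamples(buggy_is_not_prime, oracle_is_not_prime, inputs)
print("candidate disagreements:", good_failures)
print("buggy disagreements:", bad_failures)
```

Because the oracle is itself an editable artifact in the pipeline described above, a disagreement does not by itself say which side is wrong; it only flags inputs that a reviewer should look at.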

Reviewer's Comments

A notably complete submission: a working NL→Lean→code→validate loop with real Lean that type-checks under lake build, released code, and a genuine ablation; the execution and presentation are thorough and well-polished. Two reservations weigh on the impact score. The headline numbers, 98.3% accuracy and κ=0.957, are measured against labels the authors assigned on a corpus they built, which is agreement with their own intent rather than validator correctness; a handful of third-party-labeled pairs would settle it. And the 40→92% codegen jump is harness and parsing work, not evidence the spec layer improved, so the paper would read more honestly leading with that distinction. The structural invariants are also behavior-agnostic by design: scope-clean but semantically-wrong candidates would be needed before the validator can be trusted against an LLM judge.

The authors were transparent about the circularity problem; a good next step would be to measure it and define its boundary. Future work could also address semantic adequacy, for example by proving that the emitted predicate is equivalent to the oracle on a bounded domain. Good work overall; keep this going!
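
For readers unfamiliar with the agreement statistic discussed above, here is a minimal sketch of how accuracy and Cohen's κ are computed from two label sequences. The labels below are made up purely for illustration and are not the submission's 60-pair data; the function name cohen_kappa is likewise hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: "validator" verdicts versus "reference" labels on 10 items.
reference = ["accept"] * 5 + ["reject"] * 5
validator = ["accept"] * 4 + ["reject"] * 6  # disagrees on exactly one item

accuracy = sum(r == v for r, v in zip(reference, validator)) / len(reference)
kappa = cohen_kappa(reference, validator)
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

Note that κ only measures agreement between the two label sets. If the reference labels come from the same authors who built the validator, a high κ shows consistency with their intent, which is the distinction the reviewer draws.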

Cite this work

@misc{
  title={(HckPrj) Vibe-Coding Specs: Eliciting, Editing, and Verifying Specifications for AI Coding Agents},
  author={Linh Le, David Williams-King},
  date={5/24/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026