Jun 22, 2026

HAZE: Adversarial Multi-Agent Scrutiny for Vulnerability Detection

ALEX DANIEL NOVELO FUENTES

HAZE (Hallucination-Aware Zero-sum Examination) is an AI safety project that tackles a failure mode in AI-assisted security auditing: single-agent LLM auditors tend to manufacture findings, flagging safe code as vulnerable. HAZE reframes the task from "is there a bug?" to "did this accusation survive refutation?" using a multi-agent debate protocol—two attacker and two defender models argue over a code artifact while an isolated judge rules only on what withstands cross-examination.
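The shape of that protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the agents here are hand-written stub functions standing in for LLM calls, the names `Claim`, `run_debate`, and `judge` are hypothetical, and the judging rule (an accusation survives only if no defender rebuts it) is a deliberately crude stand-in for an LLM judge that weighs the arguments. The verdict labels VULNERABLE/SEGURO follow the Phase 1 labels mentioned in the reviewer comments.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Claim:
    """One accusation from an attacker, plus the rebuttals it drew."""
    text: str
    rebuttals: List[str] = field(default_factory=list)


# Hypothetical agent interfaces (stand-ins for LLM calls).
Attacker = Callable[[str], List[str]]            # code -> accusations
Defender = Callable[[str, str], Optional[str]]   # (code, accusation) -> rebuttal or None


def run_debate(code: str, attackers: List[Attacker], defenders: List[Defender]) -> List[Claim]:
    """Collect every accusation, then let every defender try to refute it."""
    claims = [Claim(text) for attack in attackers for text in attack(code)]
    for claim in claims:
        for defend in defenders:
            rebuttal = defend(code, claim.text)
            if rebuttal is not None:
                claim.rebuttals.append(rebuttal)
    return claims


def judge(transcript: List[Claim]) -> str:
    """Toy ruling: flag the code only if some accusation went unrefuted.

    The judge receives the transcript alone, never the agents or their prompts.
    """
    survived = [c for c in transcript if not c.rebuttals]
    return "VULNERABLE" if survived else "SEGURO"


# --- Stub agents, invented for illustration only ---
def memcpy_attacker(code: str) -> List[str]:
    return ["unchecked length in memcpy"] if "memcpy" in code else []


def overeager_attacker(code: str) -> List[str]:
    return ["possible injection"]  # always accuses, like an over-flagging auditor


def bounds_defender(code: str, claim: str) -> Optional[str]:
    if "memcpy" in claim and "len <= sizeof(buf)" in code:
        return "length is checked before the copy"
    return None


def injection_defender(code: str, claim: str) -> Optional[str]:
    if "injection" in claim and "system(" not in code:
        return "no command sink is reachable"
    return None


attackers = [memcpy_attacker, overeager_attacker]
defenders = [bounds_defender, injection_defender]

vulnerable = "memcpy(buf, src, len);"
patched = "if (len <= sizeof(buf)) memcpy(buf, src, len);"

verdict_vulnerable = judge(run_debate(vulnerable, attackers, defenders))
verdict_patched = judge(run_debate(patched, attackers, defenders))
```

In this toy run the manufactured "possible injection" accusation is refuted on both inputs, so only the unchecked `memcpy` in the unpatched snippet produces a VULNERABLE verdict.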

Grounded in AI safety principles (debate as scalable oversight and AI-control-style distrust of any single LLM pass), HAZE was evaluated on a 10-CVE benchmark. A single-agent baseline flagged 80% of patched code as vulnerable; HAZE cut that false-positive rate to 20% and kept 80% recall, a 4× reduction in hallucinated findings at roughly USD 0.02 per verdict. Built and reproducible, it shows how adversarial multi-agent scrutiny can make AI-driven security pipelines more trustworthy.
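The arithmetic behind those percentages is easy to check. The sketch below assumes the 10-CVE benchmark yields 10 vulnerable and 10 patched samples (consistent with the n=20 cited in the reviewer comments, but not stated here) and uses the 100% baseline recall that the reviewers report; the confusion-matrix counts are derived from those figures rather than taken from the paper's tables.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Return (recall, false-positive rate) from confusion-matrix counts."""
    recall = tp / (tp + fn)   # share of vulnerable samples that were flagged
    fpr = fp / (fp + tn)      # share of patched samples wrongly flagged
    return recall, fpr


# Assumed split: 10 vulnerable + 10 patched samples.
baseline_recall, baseline_fpr = rates(tp=10, fn=0, fp=8, tn=2)
haze_recall, haze_fpr = rates(tp=8, fn=2, fp=2, tn=8)

# Ratio of false-positive rates: the "4x reduction" in the abstract.
fpr_reduction = baseline_fpr / haze_fpr

print(f"baseline: recall={baseline_recall:.0%}, FPR={baseline_fpr:.0%}")
print(f"HAZE:     recall={haze_recall:.0%}, FPR={haze_fpr:.0%}")
print(f"FPR reduction factor: {fpr_reduction:.1f}")
```

Under this split, the gain is six fewer false alarms on patched code, paid for with two missed vulnerabilities.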

Reviewer's Comments

HAZE addresses an important problem in LLM-based code auditing by reducing false positives through a structured 2v2 prosecution/defense debate with an independent judge. The benchmark-based evaluation on labeled CVEs, along with comparisons against a single-agent baseline, provides a solid foundation. The reduction in false positives (80% to 20% at 80% recall), thoughtful error analysis, statistical reporting, and transparent discussion of limitations make this a credible and well-executed study. I also appreciated the open-source evaluation harness and the authors' restraint in not overstating statistically underpowered results.

The primary limitation is the small evaluation set, which makes the observed improvement promising but not yet conclusive. Expanding the benchmark would significantly strengthen the statistical claims. The system would also benefit from multi-file context or retrieval support to address cross-file false negatives, and it would be useful to better characterize the vulnerabilities missed when recall drops from 100% to 80%, since false negatives can have a high security impact.

Overall, this is a thoughtful, well-measured paper with a practical approach to improving LLM vulnerability detection, and it provides a strong foundation for future work.

This is the most rigorous submission of the set. The framing — single-agent auditors over-flag ("100% recall, 80% FPR") so reframe the question as "did the accusation survive refutation?" — is sharp, and the work is unusually honest about its own limits. Main issues:

The headline result is underpowered and the team knows it. McNemar p=0.34 at n=20 means the 80%→20% FPR drop is directional, not significant. They state this plainly (good), but the abstract and README still lead with "4× reduction" as a headline. The substance is fine; the framing slightly oversells what the statistics support.

Two different framings/runs are blended. Phase 1 (n=20, VULNERABLE/SEGURO) is the primary, but the error taxonomy and round-ablation come from the web run (n=120, "malware" framing), which the paper itself marks "superseded." Conclusions drawn from a superseded framing shouldn't carry into the headline narrative without a caveat. Re-derive the taxonomy on Phase 1 data.

Figure inconsistencies. Figure 1/Figure 2 show FPR=30 for HAZE, but Table 1 and the abstract report 20%. The figures appear to plot the web run while the text reports Phase 1 — a reader can't tell which number is canonical. Reconcile.

n=20 with the baseline as the comparator is thin. The baseline (100% recall via flagging nearly everything) is almost a strawman — a single agent prompted toward calibration, not just detection, would be a fairer control. Worth noting the baseline is weak by construction.

Judge is Claude Sonnet 4.5 while one debater is also a strong model — possible model-identity effects on judging aren't probed. A judge-swap ablation would strengthen the isolation claim.
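To make the underpowered-result point concrete, here is a small sketch of an exact two-sided McNemar test, which depends only on the discordant pairs (cases where exactly one of the two systems is correct). The counts used, 7 cases where only HAZE is right and 3 where only the baseline is right, are an assumption chosen to be compatible with the reported rates at n=20; the page does not give the actual discordant counts or say whether the exact or chi-square variant was used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b, c: counts of discordant pairs (only system A correct, only system B correct).
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, not taken from the paper.
p_value = mcnemar_exact(7, 3)
print(f"p = {p_value:.3f}")
```

With only ten discordant cases, even a 7-to-3 split in HAZE's favor is well within what chance would produce, which is why a larger benchmark matters more than any further tuning at n=20.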

The submission does not address real-world application at scale, in particular how to set up HAZE with multiple LLMs.

I like this one because it goes after a real and annoying problem: ask one model to find a bug and it will invent one to please you. Turning the judge's question into "did the accusation survive a rebuttal" is a clever fix. And you actually measured it on labeled CVEs instead of just shipping a demo, with different models in each seat and a separate judge, which matters. You were straight about the stats too: the p-value is not significant and you said so. Two things hold it back. With only 20 cases it is underpowered, so getting to the 60-plus you mention would tell you if the false-positive drop is real. And the single-agent baseline is kind of a punching bag: it over-flags by design, so beating it is not saying much yet. Keep an eye on the recall you are giving up. Good, honest build.

Cite this work

@misc{
  title={(HckPrj) HAZE: Adversarial Multi-Agent Scrutiny for Vulnerability Detection},
  author={ALEX DANIEL NOVELO FUENTES},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026