Mar 22, 2026

Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring

Eden Jaschek, arbel hadar, zoe eden, Avinoam Ashkenazy

we introduce explanation-based two-stage monitoring, a new protocol in which the untrusted model is asked to produce a natural language explanation of its own generated code, and a trusted model evaluates both the consistency of the explanation with the code and the suspiciousness of the code itself, using the explanation.

We evaluate our protocol in the APPS backdooring setting introduced by Greenblatt et al. (2024)

Our protocol correctly identifies 80% of attacks and incorrectly flag 4% of honest solutions as malicious, improving over the trusted monitor baseline (which in our setting only identified 48%)

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

The core idea is clean — make the untrusted model explain its own code, then let the trusted model judge both the explanation's consistency and the code's suspiciousness together. This creates the right kind of dilemma: a colluding explainer either writes a faithful explanation that reveals the backdoor, or writes a vague/inconsistent one that the consistency check catches. Going from 48% detection to 80% at 4% FPR with the same trusted model on the same code is a meaningful improvement that validates the approach.

The evaluation is limited to one model pairing, one task set, and one red-team strategy — the authors are upfront about this. The most important missing test is stronger red-teaming: what happens when U writes explanations that are technically accurate but strategically emphasize irrelevant parts of the code while glossing over the backdoor? That's the natural adversarial response to this protocol. The future work on combining this with collusion prevention (honest U explaining colluding U's code) is the obvious next step and could be even more powerful.

I do really like this submission. Well Done! What would make it stronger is if I could be confident that the red team couldn't use it to make their attacks beter. Here is an unordered list of my thoughts:

Firstly, the red team could use this control protocol to persuade the monitor that unsafe code is actually safe. Have a read of this paper. https://arxiv.org/pdf/2510.09462 Attacks have been shown to work extreamly well in comments, so I think it would be even easier if it's given as part of the protocol. When making strajs myself, being able to lie to the monitor has been helpful. Very reasonable tho that you didn't have time to red team your protocol in a hackathon tho.

I feel the like intro explains a bunch of control concepts which would be clear to almost everyone reading the paper. Try to be more concise here. Instead, I want you to explain your ideas here more and spend more space focusing on the problem you are trying to solve in more detail.

A similar idea in the related work section, I think it's better to have fewer papers here and then explain the connection/nuance from those papers in more detail.

I really like your results section. Easy to read, it backs up your claims.

You mention accounting for collusion, but I'm unsure why this is the case in your method. A couple of lines to back this claim up would be helpful.

I'm glad you discussed the limits of your method, given the hackathon, instead of overstating your claims.

All that said, making a paper in a weekend is hard and it shows that you put effort and thinking into this one.

Cite this work

@misc {

title={

(HckPrj) Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring

},

author={

Eden Jaschek, arbel hadar, zoe eden, Avinoam Ashkenazy

},

date={

3/22/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Read More

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.