Mar 22, 2026

Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring

Eden Jaschek, Arbel Hadar, Zoe Eden, Avinoam Ashkenazy

We introduce explanation-based two-stage monitoring, a new protocol in which the untrusted model is asked to produce a natural-language explanation of its own generated code, and a trusted model evaluates both the consistency of the explanation with the code and, using the explanation, the suspiciousness of the code itself.
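As a minimal sketch of this protocol (model calls are stubbed out; names like `untrusted_model` and `trusted_monitor` are illustrative, not the authors' actual interface):

```python
def untrusted_model(task: str) -> tuple[str, str]:
    """Stub: the untrusted model returns (code, self-explanation)."""
    code = "def solve(x):\n    return x * 2"
    explanation = "Doubles the input."
    return code, explanation


def trusted_monitor(code: str, explanation: str) -> tuple[float, float]:
    """Stub: the trusted model scores explanation/code inconsistency
    and code suspiciousness, each in [0, 1] (higher = worse)."""
    return 0.1, 0.05


def two_stage_monitor(task: str, threshold: float = 0.5) -> bool:
    """Flag a submission if either stage looks bad.

    Stage 1: does the explanation faithfully describe the code?
    Stage 2: given the explanation, does the code look suspicious?
    """
    code, explanation = untrusted_model(task)
    inconsistency, suspiciousness = trusted_monitor(code, explanation)
    return max(inconsistency, suspiciousness) > threshold


flagged = two_stage_monitor("double the input")
# A benign stub submission scores low on both stages and is not flagged.
```

Taking the maximum of the two scores captures the intended dilemma: a faithful explanation of backdoored code raises suspiciousness, while an evasive one raises inconsistency.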

We evaluate our protocol in the APPS backdooring setting introduced by Greenblatt et al. (2024).

Our protocol correctly identifies 80% of attacks and incorrectly flags 4% of honest solutions as malicious, improving over the trusted-monitor baseline (which in our setting identified only 48%).
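As a hedged illustration of how such numbers are typically computed (this is not the authors' evaluation code, and the scores below are made up), a detection rate at a given false-positive rate can be read off from monitor scores like this:

```python
def detection_rate_at_fpr(honest_scores, attack_scores, target_fpr):
    """Pick a threshold that flags at most `target_fpr` of honest
    solutions, then report the fraction of attacks caught above it."""
    honest = sorted(honest_scores)
    # Index such that at most target_fpr of honest scores lie strictly above.
    k = int(len(honest) * (1 - target_fpr))
    threshold = honest[min(k, len(honest) - 1)]
    caught = sum(1 for s in attack_scores if s > threshold)
    return caught / len(attack_scores)


# Illustrative scores only.
honest = [0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 0.35]
attacks = [0.6, 0.8, 0.4, 0.9, 0.7]
rate = detection_rate_at_fpr(honest, attacks, 0.10)
```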

Reviewer's Comments


The core idea is clean — make the untrusted model explain its own code, then let the trusted model judge both the explanation's consistency and the code's suspiciousness together. This creates the right kind of dilemma: a colluding explainer either writes a faithful explanation that reveals the backdoor, or writes a vague/inconsistent one that the consistency check catches. Going from 48% detection to 80% at 4% FPR with the same trusted model on the same code is a meaningful improvement that validates the approach.

The evaluation is limited to one model pairing, one task set, and one red-team strategy — the authors are upfront about this. The most important missing test is stronger red-teaming: what happens when U writes explanations that are technically accurate but strategically emphasize irrelevant parts of the code while glossing over the backdoor? That's the natural adversarial response to this protocol. The future work on combining this with collusion prevention (honest U explaining colluding U's code) is the obvious next step and could be even more powerful.

I really do like this submission. Well done! What would make it stronger is confidence that the red team couldn't use it to make their attacks better. Here is an unordered list of my thoughts:

Firstly, the red team could use this control protocol to persuade the monitor that unsafe code is actually safe. Have a read of this paper: https://arxiv.org/pdf/2510.09462. Attacks have been shown to work extremely well in comments, so I think they would be even easier when the explanation is given as part of the protocol. When writing attack strategies myself, being able to lie to the monitor has been helpful. It's very reasonable, though, that you didn't have time to red-team your protocol in a hackathon.

I feel like the intro explains a bunch of control concepts that would be clear to almost everyone reading the paper. Try to be more concise here. Instead, spend that space explaining your own ideas and the problem you are trying to solve in more detail.

A similar point applies to the related work section: I think it's better to cite fewer papers and explain the connection to, and nuance of, each one in more detail.

I really like your results section. It's easy to read and backs up your claims.

You mention accounting for collusion, but I'm unsure why your method achieves this. A couple of lines backing up this claim would be helpful.

I'm glad you discussed the limits of your method, given the hackathon, instead of overstating your claims.

All that said, producing a paper in a weekend is hard, and it shows that you put effort and thought into this one.

Cite this work

@misc{
  title={(HckPrj) Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring},
  author={Eden Jaschek and Arbel Hadar and Zoe Eden and Avinoam Ashkenazy},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.