Mar 22, 2026
Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring
Eden Jaschek, arbel hadar, zoe eden, Avinoam Ashkenazy
we introduce explanation-based two-stage monitoring, a new protocol in which the untrusted model is asked to produce a natural language explanation of its own generated code, and a trusted model evaluates both the consistency of the explanation with the code and the suspiciousness of the code itself, using the explanation.
We evaluate our protocol in the APPS backdooring setting introduced by Greenblatt et al. (2024)
Our protocol correctly identifies 80% of attacks and incorrectly flag 4% of honest solutions as malicious, improving over the trusted monitor baseline (which in our setting only identified 48%)
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Gaslight Detector - Untrusted Explanation, Two-Stage Monitoring
},
author={
Eden Jaschek, arbel hadar, zoe eden, Avinoam Ashkenazy
},
date={
3/22/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


