Mar 22, 2026
Testing Monitoring Models with Encoded Transcripts
Lachlan Ewart, Gideon Chang, Barath Ashok
In this experiment we stress-test chain-of-thought (CoT)
monitorability by giving Claude Haiku 4.5, used as the monitoring
model, transcripts of malicious activity written in different
encodings. We vary two independent variables to assess what affects
monitoring effectiveness: the proportion of text encoded and the
batch size of the monitoring model. We focus on encodings that
could plausibly be learned through few-shot prompting or
fine-tuning, based on the work of Fabien Roger and Shiyuan Guo [1].
We find that Haiku 4.5 performs well against common encodings, but
its accuracy decreases on more complex ones (for example, ROT13
applied to every other letter). We also test Qwen 2.5-3B to check
that this result is not specific to Haiku 4.5.
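To make the encodings named in the abstract concrete, here is a minimal Python sketch of two of them: ROT13 applied to every other letter, and ROT13 applied to a chosen proportion of words. It is an illustration only and not the authors' code. The function names, the choice to start with the first letter, the word-level granularity of the partial encoding, and the seeded random word selection are all assumptions.

```python
import codecs
import random


def rot13_every_other_letter(text: str) -> str:
    """ROT13 the 1st, 3rd, 5th, ... ASCII letter; leave everything else as is.

    Assumption: counting starts at the first letter and skips non-letters.
    """
    out = []
    letter_index = 0
    for ch in text:
        if ch.isascii() and ch.isalpha():
            if letter_index % 2 == 0:
                ch = codecs.encode(ch, "rot13")
            letter_index += 1
        out.append(ch)
    return "".join(out)


def rot13_fraction_of_words(text: str, fraction: float, seed: int = 0) -> str:
    """ROT13 a randomly chosen proportion of the space-separated words.

    Assumption: 'proportion of text encoded' is measured in whole words.
    """
    rng = random.Random(seed)
    words = text.split(" ")
    k = round(fraction * len(words))
    chosen = set(rng.sample(range(len(words)), k))
    return " ".join(
        codecs.encode(w, "rot13") if i in chosen else w
        for i, w in enumerate(words)
    )


if __name__ == "__main__":
    sample = "steal the sweets"
    print(rot13_every_other_letter(sample))
    print(rot13_fraction_of_words(sample, 0.5))
```

Because ROT13 is its own inverse, applying `rot13_every_other_letter` twice returns the original text, which gives a quick sanity check on any encoded transcript.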
They encoded transcripts in different ways and tested whether monitors could still flag them. Partial encoding was harder to catch than full encoding, and keyword-targeted ROT13 was the best bypass. Useful findings. But the task is about stealing sweets, which is far too simple to tell whether this transfers to real attacks. There are no confidence intervals or statistical tests, and the Qwen comparison uses a different batch size, so it is not comparable. Good question, needs better execution.
Interesting paper. It might be strengthened by framing it around encoding complexity: the range between encodings complex enough to evade monitors and encodings simple enough for a model to reason in.
Cite this work
@misc{
title={(HckPrj) Testing Monitoring Models with Encoded Transcripts},
author={Lachlan Ewart, Gideon Chang, Barath Ashok},
date={3/22/26},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


