Mar 23, 2026
Frankenstein ducks the linear probes
Abhinav Chand, Balaji R
Recent work has shown that linear probes are highly effective at
detecting deceptive behavior in large language models, both
before generation (intent) and after generation (harmful content in
the LLM’s response). We investigate fine-tuning methods that
suppress detection by deception probes. Our results show that our
fine-tuning method reduces the detection rate of linear probe
monitors.
Cite this work
@misc{
  title={(HckPrj) Frankenstein ducks the linear probes},
  author={Abhinav Chand and Balaji R},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


