Jul 1, 2024
DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING
Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia
Uses representation engineering to detect and control deception in LLMs, with a focus on deceptive sandbagging.
Very well written and explained, with excellent visualisations. More contextual analysis of where the various methods fail would also be informative. I would be curious to see you look into why the Positive Deceptive Control might be increasing the amount of deceptive output in samples where the initial prompt was designed to elicit truthfulness.
The experiments are well executed, though the results could be presented somewhat more clearly. It would be better, however, to make clearer what probes and similar techniques can and cannot do, and how they compare to baseline techniques like finetuning and RLHF.
Cite this work
@misc{
  title={DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING},
  author={Avyay M Casheekar and Kaushik Sanjay Prabhakar and Kanishk Rath and Sienka Dounia},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


