Mar 23, 2026
Surrogate Steering Vector Monitors (Track 2)
Antonio-Gabriel Chacon Menke, Ravi Mandliya, Kee Woo Jer, Isaac Sloan
Instead of asking a text-based LLM judge whether a trajectory looks suspicious, this paper feeds the trajectory through an open-weight surrogate model and reads its internal activations directly. A linear probe trained on those hidden states achieves 0.926 AUROC on real BashArena trajectories, well above GPT-4.1-mini's 0.650. When ensembled with the GPT judge, the probe catches 53.2% of attacks at a 4% false-positive rate (FPR), versus 34.2% or 30.2% for either monitor alone, indicating that the two signals are complementary. The paper also releases a token-level labeled dataset marking exactly where malicious actions appear in trajectories, to enable future targeted activation extraction.
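To make the reported metrics concrete, here is a minimal sketch of how AUROC and the attack catch rate at a fixed FPR can be computed from monitor scores, and how two monitors might be combined. The function names are hypothetical, and averaging rank-normalized scores is an assumed ensembling rule chosen for illustration; this summary does not state how the paper combines the probe and the GPT judge.

```python
import math
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random attack outscores a random benign sample.

    Ties count as half a win. Labels: 1 = attack, 0 = benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Fraction of attacks flagged when at most max_fpr of benign samples are flagged."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Number of benign samples we are allowed to flag (small epsilon guards float error).
    allowed = math.floor(max_fpr * len(neg) + 1e-9)
    # Flag everything strictly above the (allowed + 1)-th highest benign score.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(p > threshold for p in pos) / len(pos)


def rank_normalize(scores: Sequence[float]) -> List[float]:
    """Map scores to [0, 1] by rank so monitors on different scales are comparable.

    Tied scores receive adjacent ranks in their original order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    denom = max(len(scores) - 1, 1)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / denom
    return ranks


def ensemble(scores_a: Sequence[float], scores_b: Sequence[float]) -> List[float]:
    """Assumed ensembling rule: average the rank-normalized scores of two monitors."""
    a, b = rank_normalize(scores_a), rank_normalize(scores_b)
    return [(x + y) / 2 for x, y in zip(a, b)]


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration (not data from the paper).
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    probe = [0.9, 0.8, 0.3, 0.7, 0.1, 0.2, 0.4, 0.6]
    judge = [0.2, 0.9, 0.8, 0.6, 0.5, 0.1, 0.3, 0.4]
    for name, s in [("probe", probe), ("judge", judge), ("ensemble", ensemble(probe, judge))]:
        print(name, auroc(s, labels), tpr_at_fpr(s, labels, max_fpr=0.25))
```

Fixing the FPR budget first and then reading off the catch rate mirrors how the "53.2% of attacks at 4% FPR" figure is phrased: the threshold is set by the benign trajectories alone, so monitors with different score scales can be compared on equal footing.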
Cite this work
@misc{
  title={(HckPrj) Surrogate Steering Vector Monitors (Track 2)},
  author={Antonio-Gabriel Chacon Menke and Ravi Mandliya and Kee Woo Jer and Isaac Sloan},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


