Reading Intent, Not Output: Cheap Activation Safety Monitors That Transfer Across Languages and Models
Antonio-Gabriel Chacon Menke
Text-based safety monitors inherit the weaknesses of the text they read: they are brittle across languages, blind to intent hidden behind a boilerplate refusal, and always one step late. We show that a model's internal activations offer a cheap, complementary channel that sidesteps all three problems at once. A single linear probe, trained only on English reasoning traces, detects fine-grained safety behaviours in Spanish and Vietnamese with virtually no loss (mean AUC 0.93 and 0.92, against 0.89 on English cross-validation), with no target-language labels required. The same probe identifies which harm category underlies a near-identical refusal (reasoning AUC 0.83 vs. text 0.72), recovers a covertly instructed goal the answer never states, and flags a request's harm class from the very first generated token. The monitor is also portable: mapped from one model family to another through a single learned affine bridge, it loses only 0.004 AUC, and a borrowed natural-language explainer can then turn any flagged activation into a plain-English description without retraining. In a triage setting the probe alone catches 95% of harm at a 7.4% false-alarm rate on benign requests. Against a fair multilingual encoder the accuracy gap is modest; the advantage lies in cost, latency, and the cases where the text simply does not carry the safety-relevant information.
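The two mechanisms named in the abstract, a linear probe on activations and an affine bridge between model families, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussian stand-ins rather than real model states, the probe is a plain logistic regression fit by gradient descent, the bridge is fit by ordinary least squares on paired activations, and names such as `fit_probe` and `fit_affine_bridge` are hypothetical. The abstract does not state which layer, training procedure, or fitting method the work actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(scores, labels):
    """Rank-based AUC: chance that a positive example outscores a negative one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_probe(X, y, steps=500, lr=0.1):
    """Logistic-regression probe trained by plain gradient descent (assumed method)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def fit_affine_bridge(src, dst):
    """Least-squares affine map so that src @ W + c approximates dst."""
    src1 = np.hstack([src, np.ones((len(src), 1))])
    sol, *_ = np.linalg.lstsq(src1, dst, rcond=None)
    return sol[:-1], sol[-1]

# Synthetic stand-ins for activations (hypothetical, not real model data).
d_a, d_b, n = 32, 48, 2000
y = rng.integers(0, 2, size=n)
direction = 3.0 * rng.normal(size=d_a) / np.sqrt(d_a)
acts_a = np.outer(y - 0.5, direction) + rng.normal(size=(n, d_a))
# "Model B" is simulated as a noisy linear image of model A's space.
mix = rng.normal(size=(d_a, d_b)) / np.sqrt(d_a)
acts_b = acts_a @ mix + 0.1 * rng.normal(size=(n, d_b))

train, test = slice(0, 1000), slice(1000, None)
w, b = fit_probe(acts_a[train], y[train])               # probe trained on model A only
W, c = fit_affine_bridge(acts_b[train], acts_a[train])  # bridge maps B into A's space

auc_native = auc(acts_a[test] @ w + b, y[test])
auc_bridged = auc((acts_b[test] @ W + c) @ w + b, y[test])
print(f"native AUC {auc_native:.3f}, bridged AUC {auc_bridged:.3f}")
```

In this toy setup the probe is never retrained for the second model; only the bridge is fit, which mirrors the portability claim in spirit. The numbers it prints say nothing about the reported results.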
Cite this work
@misc{
  title={(HckPrj) Reading Intent, Not Output: Cheap Activation Safety Monitors That Transfer Across Languages and Models},
  author={Antonio-Gabriel Chacon Menke},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


