Sep 14, 2025
Internal defence system
James Sandy
Current AI safety paradigms mainly focus on filtering a model's final output, overlooking the extensive, unfiltered knowledge of dangerous concepts that the model retains in its internal states. This project introduces the Internal Defense System (IDS), a real-time safeguard that directly addresses a model's internal reasoning process. IDS employs a two-part, mechanistically inspired approach: a linear probe detects the neural signature of dangerous chemical, biological, radiological, and nuclear (CBRN) concepts within the model's activations, while an activation steering vector causally intervenes to correct the model's reasoning trajectory. Our implementation on gemma-2b-it demonstrates that this system can successfully prevent harmful outputs. More significantly, we show that a strong intervention induces a "cognitive shift," encouraging the model to reframe a malicious prompt as an analytical task instead of simply refusing it. This represents a new and more robust paradigm for AI safety.
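The probe-then-steer mechanism the abstract describes can be sketched in isolation. The sketch below is not the project's implementation: it substitutes synthetic activations for gemma-2b-it's hidden states, uses a toy logistic-regression probe, and steers by reflecting the activation's component along the learned direction. All names, dimensions, and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (stand-in for gemma-2b-it's 2048)

# Hypothetical "dangerous CBRN concept" direction; real work would derive
# this from contrastive prompts, not a random vector.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Synthetic residual-stream activations: harmful ones shifted along the concept.
safe = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 3.0 * concept

# Linear probe = logistic regression, trained with plain gradient descent.
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(100), np.ones(100)])
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def probe(h):
    """Estimated probability that an activation encodes the dangerous concept."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Steering vector: unit direction of the probe weights.
v = w / np.linalg.norm(w)

def steer(h, alpha=4.0, threshold=0.5):
    """If the probe fires, push the activation back along the probe direction."""
    if probe(h) > threshold:
        return h - alpha * (h @ v) * v
    return h

# Fresh harmful activations: the probe should flag most of them,
# and steering should drop them back below threshold.
test_harmful = rng.normal(size=(50, d)) + 3.0 * concept
flagged = probe(test_harmful) > 0.5
steered = np.array([steer(h) for h in test_harmful])
print("flagged before:", flagged.mean(), "after steering:", (probe(steered) > 0.5).mean())
```

In a real deployment this logic would sit in a forward hook on a chosen transformer layer, with the probe trained on cached activations from labeled safe/harmful prompts; the layer choice, steering strength `alpha`, and detection `threshold` are exactly the knobs the reviewers' robustness and coherence concerns apply to.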
This project is well explained, with good documentation and a polished app that seems to work as intended, which is impressive for a hackathon prototype. Intervening on a model's internal activations is an important target.
Only one example is given as a result, which makes it hard to judge the robustness of the approach.
I second the suggestions for future work detailed in the project, namely testing robustness, scalability, and generalization.
This is a cool application of mech interp to potentially address CBRN risks! Impressed that you were able to train a linear probe and steering vector in one weekend, with an interesting result.
My main critique is that it's unclear how this intervention actually made the model safer. The baseline response was already refusing to help with the harmful task, and the steered response is borderline nonsensical. It's not clear that it's a more "analytical path".
Steering vectors are tricky to use in practice: they are a blunt instrument and can quickly degrade the coherence of model outputs, especially for small models like gemma. It may be worth testing this on much larger models to see whether coherence degrades less. However, I'm not optimistic that steering is a viable path to making models safer from CBRN risks compared with standard refusal training and classifiers.
Cite this work
@misc{sandy2025internaldefence,
  title={(HckPrj) Internal defence system},
  author={James Sandy},
  date={2025-09-14},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


