Sep 14, 2025
Internal defence system
James Sandy
Current AI safety paradigms mainly focus on filtering a model's final output, overlooking the extensive, unfiltered knowledge of dangerous concepts that the model retains in its internal states. This project introduces the Internal Defense System (IDS), a real-time safeguard that directly addresses a model's internal reasoning process. IDS employs a two-part, mechanistically inspired approach: a linear probe detects the neural signature of dangerous chemical, biological, radiological, and nuclear (CBRN) concepts within the model's activations, while an activation steering vector causally intervenes to correct the model's reasoning trajectory. Our implementation on gemma-2b-it demonstrates that this system can successfully prevent harmful outputs. More significantly, we show that a strong intervention induces a "cognitive shift," encouraging the model to reframe a malicious prompt as an analytical task instead of simply refusing it. This represents a new and more robust paradigm for AI safety.
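The probe-then-steer mechanism the abstract describes can be sketched in isolation. The sketch below is not the project's implementation: it substitutes synthetic activations for gemma-2b-it's hidden states, uses a toy logistic-regression probe, and steers by reflecting the activation's component along the learned direction. All names, dimensions, and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (stand-in for gemma-2b-it's 2048)

# Hypothetical "dangerous CBRN concept" direction; real work would derive
# this from contrastive prompts, not a random vector.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Synthetic residual-stream activations: harmful ones shifted along the concept.
safe = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 3.0 * concept

# Linear probe = logistic regression, trained with plain gradient descent.
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(100), np.ones(100)])
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def probe(h):
    """Estimated probability that an activation encodes the dangerous concept."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Steering vector: unit direction of the probe weights.
v = w / np.linalg.norm(w)

def steer(h, alpha=4.0, threshold=0.5):
    """If the probe fires, push the activation back along the probe direction."""
    if probe(h) > threshold:
        return h - alpha * (h @ v) * v
    return h

# Fresh harmful activations: the probe should flag most of them,
# and steering should drop them back below threshold.
test_harmful = rng.normal(size=(50, d)) + 3.0 * concept
flagged = probe(test_harmful) > 0.5
steered = np.array([steer(h) for h in test_harmful])
print("flagged before:", flagged.mean(), "after steering:", (probe(steered) > 0.5).mean())
```

In a real deployment this logic would sit in a forward hook on a chosen transformer layer, with the probe trained on cached activations from labeled safe/harmful prompts; the layer choice, steering strength `alpha`, and detection `threshold` are exactly the knobs the reviewers' robustness and coherence concerns apply to.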
This project is well explained, with good documentation and a polished app that seems to work as intended, which is impressive for a hackathon prototype. Intervening on a model's internal activations is an important target.
Only one example is given as a result, which makes it hard to judge the robustness of the approach.
I second the suggestions for future work detailed in the project, namely testing robustness, scalability, and generalization.
This is a cool application of mech interp to potentially address CBRN risks! Impressed that you were able to train a linear probe and steering vector in one weekend, with an interesting result.
My main critique is that it's unclear how this intervention actually made the model safer. The baseline response was already refusing to help with the harmful task, and the steered response is borderline nonsensical. It's not clear that it's a more "analytical path".
Steering vectors are tricky to use in practice: they are a blunt instrument and can quickly degrade the coherence of model outputs, especially for small models like gemma. It may be worth testing this on much larger models to see whether coherence degrades less. However, I'm not optimistic that steering is a viable path to making models safer from CBRN risks compared with standard refusal training and classifiers.
Cite this work
@misc{sandy2025internaldefence,
  title={(HckPrj) Internal defence system},
  author={James Sandy},
  date={2025-09-14},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


