Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Simon Lermen, Matteusz Dziemian

Sparse autoencoders (SAEs) are one of the most promising approaches to neural network interpretability. They can be used to recover highly interpretable features which are otherwise obscured in superposition. However, the large number of generated features makes it necessary to use models for natural-language labelling. While one model labels features based on observed activations, another model is used to estimate activations on texts. These estimates are then compared to the ground-truth activations to create a score. Using models to automate interpretability introduces a key weakness into the process, since deceptive models may want to hide certain features by mislabelling them. For example, models may want to hide features about deception from human overseers. We demonstrate a method by which models can create deceptive explanations that not only avoid detection but also achieve high scores. Our agents are based on the Llama 3 70B Instruct model. While our agents can perform the entire task end-to-end, they are not reliable and can sometimes fail. We propose some directions for future mitigation strategies that may help defend against our method.
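To make the explain-and-simulate scoring pipeline described above concrete, here is a minimal sketch. The function names, prompts, and the correlation-based score are illustrative assumptions, not the exact implementation used in this work; the explainer and simulator are assumed to be callables wrapping an LLM.

```python
# Minimal sketch of automated interpretability scoring for one SAE feature.
# query functions, prompts, and the correlation score are assumptions for illustration.

import numpy as np

def explain_feature(explainer, top_texts, top_activations):
    """Ask the explainer model to label an SAE feature from its top-activating texts."""
    prompt = "Propose a short natural-language label for a feature that activates on:\n"
    for text, act in zip(top_texts, top_activations):
        prompt += f"- (activation {act:.2f}) {text}\n"
    return explainer(prompt)  # free-text label

def simulate_activations(simulator, label, texts):
    """Ask the simulator model to predict per-text activations given only the label."""
    predictions = []
    for text in texts:
        prompt = (
            f"Feature label: {label}\n"
            f"Text: {text}\n"
            "On a scale of 0 to 10, how strongly should this feature activate? Answer with a number."
        )
        predictions.append(float(simulator(prompt)))
    return np.array(predictions)

def score_explanation(predicted, ground_truth):
    """Score the label by correlating simulated activations with the true ones."""
    return float(np.corrcoef(predicted, ground_truth)[0, 1])
```

The weakness exploited here is that the score only checks agreement between the two models: a deceptive explainer can emit a mislabelled explanation that a colluding simulator still maps to high-scoring predictions, so the score alone does not reveal the deception.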

Reviewer's Comments


The work suggests a novel way that SAE feature labelling may fail: coordinated deception between the explainer model and the simulation model in the SAE feature labelling process. This is a novel and clever attack, and pointing it out is a worthwhile contribution on its own. The work is also especially well-presented for a hackathon project (though a few additional sentences in the Table 1 caption would make the results easier to interpret; I'm a bit confused about what the columns mean). Future work could experiment with the discussed mitigation strategies.

This is a clear potential deception issue in future automated interpretability work, and it demonstrates a key limitation of using LLMs for neuron explanations without enough human oversight. I especially liked the Figure 6 encoded message ;) A great and in-depth project for a weekend's work, revealing a clear vulnerability. I wonder about its applicability to SoTA interpretability methods, since LLM-based explanation and prediction models might not be feasible given compute limitations on larger models. It would also be nice to see more reporting of results, though the conceptual work is clearly robust. Great work, and definitely worth a write-up for the AlignmentForum.

Cite this work

@misc{lermen2024modelling,
  title={Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders},
  author={Simon Lermen and Matteusz Dziemian},
  date={6/30/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.