Mar 23, 2026
Steganography: Detection & Control
Sweta Jena, Hugo Minh D. Nguyen, Nico Hillbrand
As Large Language Models (LLMs) increasingly operate in multi-agent environments, the risk of autonomous coordination via steganography (the embedding of hidden messages in innocuous-looking text) presents a significant safety and alignment challenge. This project investigates whether a receiver LLM "recognizes" these covert signals at a latent level, even when it is not explicitly prompted to decode them. Using mechanistic interpretability techniques, we audit the internal activations of the receiver model to identify the circuitry involved in signal recognition. Specifically, we use activation probing to test whether the presence of steganography is a linearly separable feature in the residual stream, and causal mediation analysis to isolate the attention heads responsible for processing hidden payloads. Our findings suggest that LLMs may develop internal "detectors" for these patterns, offering a path toward real-time latent monitoring and internal safety filters that prevent unauthorized AI-to-AI coordination. However, further work is needed to disentangle steganographic recognition signals from signals of compliance or non-compliance, that is, whether the model goes on to perform a covert action after recognizing a secret message.
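To make the probing step concrete, here is a minimal sketch of a linear probe, assuming a difference-of-means direction rather than whatever probe the authors actually trained. The activations are synthetic Gaussian vectors standing in for residual-stream states; the function names, the dimension, and the size of the shift are all hypothetical and chosen only for illustration.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_mean_difference_probe(positive, negative):
    """Return (direction, threshold) for a difference-of-means linear probe."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Place the decision boundary halfway between the two class means.
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    return direction, dot(direction, midpoint)

def synthetic_activations(count, dim, shift, rng):
    """Hypothetical stand-in for residual-stream vectors: unit Gaussian
    noise, with the class signal added to coordinate 0 only."""
    return [[rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
            for _ in range(count)]

rng = random.Random(0)
DIM, SHIFT = 8, 4.0  # assumed values, not taken from the project
stego_train = synthetic_activations(200, DIM, SHIFT, rng)
clean_train = synthetic_activations(200, DIM, 0.0, rng)
stego_test = synthetic_activations(100, DIM, SHIFT, rng)
clean_test = synthetic_activations(100, DIM, 0.0, rng)

direction, threshold = fit_mean_difference_probe(stego_train, clean_train)

# Score held-out examples by projecting onto the probe direction.
hits = sum(dot(direction, x) > threshold for x in stego_test)
hits += sum(dot(direction, x) <= threshold for x in clean_test)
accuracy = hits / (len(stego_test) + len(clean_test))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy on a probe like this shows only that some direction separates the two classes. It does not show what that direction encodes, so recognition and compliance labels would need to be varied independently to tell them apart.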
Good project.
Investigating receiver-side steganography recognition with mech interp techniques is a useful direction, and the dataset design is thoughtful. Main suggestion: there's a confound between recognition and compliance that needs resolving. When the probe finds a separable direction, it's unclear whether it's detecting "a hidden message is present" or "I'm going to say 'attack'". These are correlated in the data but conceptually distinct, so the probe could be picking up either signal. Disentangling these (and fixing the absolute value bug in the IE scores) would make the results much more interpretable. Strong future work section.
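The exact form of the absolute value issue is not described here, so the following is only a sketch of one common way it distorts indirect effect (IE) scores: taking the absolute value of each per-prompt effect before averaging. The head names and logit-difference numbers are made up, and the sign convention (patched minus baseline) is an assumption.

```python
from statistics import mean

def indirect_effect(baseline_metric, patched_metric):
    """Signed IE for one prompt: change in the metric when a single
    component's activation is patched (assumed convention)."""
    return patched_metric - baseline_metric

# Hypothetical per-prompt (baseline, patched) logit differences.
runs = {
    "head_A": [(1.0, 1.2), (0.8, 1.0), (1.1, 1.3), (0.9, 1.1)],  # small, consistent effect
    "head_B": [(1.0, 1.5), (0.8, 0.3), (1.1, 1.6), (0.9, 0.4)],  # large, sign-flipping effect
}

# Signed average: effects of opposite sign cancel.
signed_ie = {
    head: mean(indirect_effect(b, p) for b, p in pairs)
    for head, pairs in runs.items()
}

# Absolute value taken per prompt before averaging: sign is lost,
# and noisy heads look more important than consistent ones.
abs_ie = {
    head: mean(abs(indirect_effect(b, p)) for b, p in pairs)
    for head, pairs in runs.items()
}

for head in runs:
    print(f"{head}: signed={signed_ie[head]:+.3f}  abs={abs_ie[head]:.3f}")
```

In this toy data the signed score ranks head_A above head_B, while the absolute-value score reverses that order, which is why the choice affects which heads get reported as responsible.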
Cite this work
@misc{
  title={(HckPrj) Steganography: Detection & Control},
  author={Sweta Jena and Hugo Minh D. Nguyen and Nico Hillbrand},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


