Mar 23, 2026
Steganography: Detection & Control
Sweta Jena, Hugo Minh D. Nguyen, Nico Hillbrand
As Large Language Models (LLMs) increasingly operate in multi-agent environments, the risk of autonomous coordination via steganography (the embedding of hidden messages in innocuous-looking text) presents a significant safety and alignment challenge. This project investigates whether a receiver LLM "recognizes" these covert signals at a latent level, even when it is not explicitly prompted to decode them. Using mechanistic interpretability techniques, we audit the internal activations of the receiver model to identify the circuitry involved in signal recognition. Specifically, we use activation probing to test whether the presence of steganography is a linearly separable feature in the residual stream, and causal mediation analysis to isolate the attention heads responsible for processing hidden payloads. Our findings suggest that LLMs may develop internal "detectors" for these patterns, offering a path toward real-time latent monitoring and internal safety filters that prevent unauthorized AI-to-AI coordination. However, further work is needed to disentangle steganographic recognition signals from signals that reflect whether the model complies with, or refuses, a covert action prompted by recognizing a secret message.
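To make the idea of "linearly separable in the residual stream" concrete, here is a minimal sketch of a linear probe. It is not the project's code: it uses a simple difference-of-means probe (rather than any particular classifier the authors may have trained), and the activations are synthetic Gaussian vectors standing in for real residual-stream activations. The dimension, sample counts, planted shift, and all function names are assumptions chosen for illustration.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(clean, stego):
    """Fit a linear probe whose direction is the difference of class means.

    Returns (direction, bias) such that score = direction . x + bias is
    positive on the stego side of the midpoint between the two class means.
    """
    mu_clean = mean_vector(clean)
    mu_stego = mean_vector(stego)
    direction = [s - c for s, c in zip(mu_stego, mu_clean)]
    midpoint = [(s + c) / 2 for s, c in zip(mu_stego, mu_clean)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_predict(direction, bias, activation):
    """Return 1 (steganography present) or 0 (clean) for one activation."""
    score = sum(d * a for d, a in zip(direction, activation)) + bias
    return 1 if score > 0 else 0


def synthetic_activations(n, dim, shift, rng):
    """ASSUMPTION: stand-in for residual-stream activations.

    Each vector is standard Gaussian noise; `shift` is added to coordinate 0
    to plant a linearly readable "steganography present" feature.
    """
    return [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]


def probe_accuracy(direction, bias, clean, stego):
    """Fraction of held-out activations the probe classifies correctly."""
    correct = sum(probe_predict(direction, bias, x) == 0 for x in clean)
    correct += sum(probe_predict(direction, bias, x) == 1 for x in stego)
    return correct / (len(clean) + len(stego))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 16                # hypothetical activation width, far smaller than a real model

train_clean = synthetic_activations(200, DIM, 0.0, rng)
train_stego = synthetic_activations(200, DIM, 6.0, rng)
test_clean = synthetic_activations(100, DIM, 0.0, rng)
test_stego = synthetic_activations(100, DIM, 6.0, rng)

direction, bias = fit_mean_difference_probe(train_clean, train_stego)
accuracy = probe_accuracy(direction, bias, test_clean, test_stego)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

With real data, the synthetic vectors would be replaced by activations cached at a chosen layer and token position for clean and steganographic inputs. High held-out accuracy would indicate that the feature is linearly readable at that layer, but not, on its own, that the model causally uses it.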
Cite this work
@misc{
  title={(HckPrj) Steganography: Detection \& Control},
  author={Sweta Jena and Hugo Minh D. Nguyen and Nico Hillbrand},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


