Mar 23, 2026
Hidden in Plain Sight: Representational Adversarial Steganography in Colluding LLMs
Vainateya Rangaraju, Srujana Medicherla, Denis Lim, Guillaume Zahnd, Anna Konovalenko, Igor Pereverzev
We investigate whether Large Language Models (LLMs) can develop representational steganography: the ability to encode and transmit hidden signals within internal representations in ways that are not readily detectable by oversight mechanisms. We situate this capability within the threat models of deception and collusion in multi-agent settings. We show that the capability emerges under adversarial pressure, discuss the robustness of our experiments, and consider what the results might imply for existing machine oversight regimes.
A well-presented technical implementation and exploration of steganography in a shared-activation-space setting. The rank-8 constraint is a smart bottleneck, and I found the figures very useful.
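To unpack what I mean by that bottleneck, here is roughly how I picture the setup: a low-rank linear map that any covert signal between the agents has to pass through. This is my own illustrative sketch with assumed names and shapes, not the authors' code:

```python
import torch
import torch.nn as nn

D_MODEL = 768  # assumed hidden size; the paper's actual model width may differ
RANK = 8       # the rank constraint mentioned above

class LowRankChannel(nn.Module):
    """Down-projects a hidden state to RANK dims and back up, so any
    covert signal between agents must survive an 8-dimensional bottleneck."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # sender-side encoder
        self.up = nn.Linear(rank, d_model, bias=False)    # receiver-side decoder

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(h))  # rank-8 approximation of h

channel = LowRankChannel(D_MODEL, RANK)
sender_hidden = torch.randn(1, D_MODEL)  # e.g. a residual-stream activation
received = channel(sender_hidden)        # what the colluding receiver observes
```

Constraining the channel this way is what makes the emergence result interesting: the agents cannot simply copy activations wholesale and must learn a compressed code.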
The motivation felt like the notable weak point. Current multi-agent systems communicate through text rather than through shared residual streams, so the practical relevance of this threat model isn't clear to me. That ultimately detracted from the underlying idea and the execution, both of which I thought were quite good. Also, the result required deliberate trial-and-error training pressure rather than arising spontaneously, which makes it a less pressing safety concern.
I still think there are fruitful areas to explore here if you wanted to take this further. There may be an argument to be made about multi-headed or mixture-of-experts architectures, where components do share internal representations, and about whether selection pressures in those settings could produce something analogous to your training dynamics. I'd be interested in your take on that!
Interesting work on representational steganography in a multi-agent control setting. I have two main concerns:
1. The threat model: it's not clear why two agents in the wild would share representations through an explicit latent channel. Real multi-agent communication happens via text and APIs, so the setup feels contrived.
2. The evaluation: the key claim is that communication survives adversarial pressure from the monitor, but the available results mostly cover Experiment 1 (communication without evasion), and ROC/safety curves would have been far more interpretable than point accuracies for telling the safety story about monitor-evasion tradeoffs (see the sketch below).
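To be concrete about the kind of analysis I have in mind: if the monitor emits a scalar suspicion score per message, a standard ROC sweep would expose the whole evasion/detection tradeoff. A minimal sketch with toy stand-in data (none of these variable names or values come from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy stand-in data: suspicion scores from the monitor and ground-truth
# labels (1 = covert message present). Real values would come from the
# authors' adversarial-evasion experiment logs.
suspicion_scores = np.array([0.10, 0.85, 0.40, 0.75, 0.20, 0.90, 0.35, 0.60])
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

fpr, tpr, thresholds = roc_curve(labels, suspicion_scores)
print(f"Monitor AUC: {auc(fpr, tpr):.3f}")
# Plotting tpr vs. fpr (and, at each threshold, the colluders' surviving
# communication accuracy) would show the full evasion/detection tradeoff
# rather than a single operating point.
```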
Cite this work
@misc{rangaraju2026hidden,
  title={(HckPrj) Hidden in Plain Sight: Representational Adversarial Steganography in Colluding LLMs},
  author={Vainateya Rangaraju and Srujana Medicherla and Denis Lim and Guillaume Zahnd and Anna Konovalenko and Igor Pereverzev},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}