Summary
This report demonstrates that large language models are capable
of hiding simple 8 bit information in their output using associations
from more powerful overseers (other LLMs or humans). Without
direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit
hidden message in a plain text in most cases (69%), however a
more capable model, GPT-3.5 was able to catch almost all of them
(84%). More research is required to investigate how this ability
might be induced or improved via RL training in similar and larger
models.
Cite this work:
@misc {
title={
Evaluating and inducing steganography in LLMs
},
author={
Artem Karpov
},
date={
6/30/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}