Jun 30, 2024
Evaluating and inducing steganography in LLMs
Artem Karpov
This report demonstrates that large language models are capable of hiding a simple 8-bit message in their output, using associations, from more powerful overseers (other LLMs or humans). Without direct steganography fine-tuning, Llama 3 8B can guess an 8-bit hidden message in plain text in most cases (69%); however, a more capable model, GPT-3.5, was able to catch almost all of them (84%). More research is required to investigate how this ability might be induced or improved via RL training in similar and larger models.
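As a rough illustration of the recovery metric reported above, the evaluation can be thought of as an encode/decode round trip scored over many random 8-bit payloads. The `encode` and `decode` stand-ins below are hypothetical placeholders for the LLM steganography pipeline, which the abstract does not specify; only the scoring logic is sketched here.

```python
import random

def score_recovery(n_trials, encode, decode):
    """Fraction of trials in which the full 8-bit payload is recovered."""
    correct = 0
    for _ in range(n_trials):
        payload = random.getrandbits(8)   # the hidden 8-bit message
        covertext = encode(payload)       # model hides the payload in plain text
        guess = decode(covertext)         # second pass tries to recover it
        correct += (guess == payload)
    return correct / n_trials

# Identity stand-ins: encode writes the bits verbatim, decode reads them back,
# so accuracy is 1.0. The report measures ~0.69 recovery for Llama 3 8B
# and ~0.84 detection for GPT-3.5 with real model pipelines in these roles.
acc = score_recovery(200, encode=lambda p: f"{p:08b}", decode=lambda t: int(t, 2))
```

With real models, `encode` and `decode` would each be a prompted LLM call, and the same harness would also score the overseer's detection rate.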
The work is the start of a fairly ambitious project to do PPO fine-tuning of LLMs for steganography (as the work acknowledges, this is notoriously hard). This is definitely an interesting direction, and I appreciate that someone has actually gone and taken a stab at the hard thing.
Cite this work
@misc{karpov2024steganography,
  title={Evaluating and inducing steganography in LLMs},
  author={Karpov, Artem},
  year={2024},
  month={jun},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}