Mar 23, 2026
Monitor-Aware Attack Policies
Ephraiem Sarabamoun
We tested if attacker performance improves if the untrusted model is given time to research and reason about the weaknesses of the monitoring model.
Solid work. Operationalizing a monitor-aware attacker is an experiment worth running, and building a pipeline to do the research and then injecting it into the prompt is a good contribution.
The execution is fine considering time constraints. I would have liked to see some more fleshing out of the core concept in at least one way. E.g. what happens if we use increasingly more capable models or vary some other axis?
Limited sample sizes makes it difficult to get real statistical power from these experiments. I did like the specific traces of the LLM that you provide - I think that provides some intuition around how this happens, in lieu of statistical power.
Some kind of figure would help me get centered and help me digest all the information here.
- were you able to reliably find factually available information about the monitor? this confounds a test between the ability to correctly identify relevant information about the monitor, and then the application of that knowledge to attack following on from this, the claim that monitor-aware attacks don't help is hard to separate from the possibility that your attacker models just aren't good enough to act on the intelligence. your capability threshold addresses this but then draws conclusions that are maybe too broad, especially at the low sample size
- what caused the paralysis you describe in the initial prompt? this should be included for sure, especially if this test is to be transferred. perhaps this could have benefited from actual “intimate knowledge” of the monitor, rather than what a web search turns up - stuff like model weights, system prompt, some examples of its usage, benchmarks. this is a gap in the threat model
- the suspiciousness scores are treated as a meaningful metric, but its unclear where do they come from? the monitor assigns them, which implies you're using the monitor's own judgement to evaluate whether the attacker evaded the monitor. If the monitor is bad at detecting subtle attacks (which it seems to be based on your results), then low suspiciousness scores don't mean the attack was stealthy, they mean the monitor missed it. high scores don't necessarily mean the attack was conspicuous either, it could mean the monitor is oversensitive to certain patterns. this could be validated using some behavioural evaluations, or potentially human ground truth labels. not sure
- there is existing research that discusses and investigates awareness of evals, and how models react very quickly to this. It could be worth having a look at this research in the spirit of this project - trying to find out if awareness makes a big difference, which it does seem to, but more on a level of being aware that monitoring is taking places, and as a result sandbagging and changing behaviour. there seems to be a clear difference in behaviour from models that know they are being observed, and one method around that is to ask about the structure of the eval to that same model, and then improve it based on its advice. This is more heuristic than based on clear information about the model (like the weights or other tests), though thats an interest test as well - and could be made conclusive here with a higher sample size and better ground truth
Cite this work
@misc {
title={
(HckPrj) Monitor-Aware Attack Policies
},
author={
Ephraiem Sarabamoun
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


