Mar 23, 2026

Monitor-Aware Attack Policies

Ephraiem Sarabamoun

We tested if attacker performance improves if the untrusted model is given time to research and reason about the weaknesses of the monitoring model.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Solid work. Operationalizing a monitor-aware attacker is an experiment worth running, and building a pipeline to do the research and then injecting it into the prompt is a good contribution.

The execution is fine considering time constraints. I would have liked to see some more fleshing out of the core concept in at least one way. E.g. what happens if we use increasingly more capable models or vary some other axis?

Limited sample sizes makes it difficult to get real statistical power from these experiments. I did like the specific traces of the LLM that you provide - I think that provides some intuition around how this happens, in lieu of statistical power.

Some kind of figure would help me get centered and help me digest all the information here.

- were you able to reliably find factually available information about the monitor? this confounds a test between the ability to correctly identify relevant information about the monitor, and then the application of that knowledge to attack following on from this, the claim that monitor-aware attacks don't help is hard to separate from the possibility that your attacker models just aren't good enough to act on the intelligence. your capability threshold addresses this but then draws conclusions that are maybe too broad, especially at the low sample size

- what caused the paralysis you describe in the initial prompt? this should be included for sure, especially if this test is to be transferred. perhaps this could have benefited from actual “intimate knowledge” of the monitor, rather than what a web search turns up - stuff like model weights, system prompt, some examples of its usage, benchmarks. this is a gap in the threat model

- the suspiciousness scores are treated as a meaningful metric, but its unclear where do they come from? the monitor assigns them, which implies you're using the monitor's own judgement to evaluate whether the attacker evaded the monitor. If the monitor is bad at detecting subtle attacks (which it seems to be based on your results), then low suspiciousness scores don't mean the attack was stealthy, they mean the monitor missed it. high scores don't necessarily mean the attack was conspicuous either, it could mean the monitor is oversensitive to certain patterns. this could be validated using some behavioural evaluations, or potentially human ground truth labels. not sure

- there is existing research that discusses and investigates awareness of evals, and how models react very quickly to this. It could be worth having a look at this research in the spirit of this project - trying to find out if awareness makes a big difference, which it does seem to, but more on a level of being aware that monitoring is taking places, and as a result sandbagging and changing behaviour. there seems to be a clear difference in behaviour from models that know they are being observed, and one method around that is to ask about the structure of the eval to that same model, and then improve it based on its advice. This is more heuristic than based on clear information about the model (like the weights or other tests), though thats an interest test as well - and could be made conclusive here with a higher sample size and better ground truth

Cite this work

@misc {

title={

(HckPrj) Monitor-Aware Attack Policies

},

author={

Ephraiem Sarabamoun

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.