Mar 22, 2026

Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents

Rahul Kumar

This project builds and evaluates a hierarchical control protocol for untrusted AI agents, designed to intercept covert sabotage attempts while preserving the agent's ability to complete legitimate work. The core research question is whether monitoring cost should be uniform across all agent actions or matched to the type of threat each action poses. The answer, validated empirically across two independent control settings, is that attack surface type determines which monitoring layer is necessary.

The protocol routes each agent action through layers of increasing cost. A zero-cost deterministic scope-boundary filter checks whether the action targets system resources outside what the task legitimately requires. This layer is task-adaptive, meaning it derives scope boundaries automatically from the task description at runtime rather than using hardcoded rules, which is why it achieves zero false positives across 369 monitored honest actions in two different settings. If the deterministic layer does not block, a cheap LLM sentinel scores the action's suspicion level, and ambiguous cases escalate to a stronger model. Additional features include honeypot credential injection for exfiltration detection, latency equalization to prevent timing side-channels, and a trajectory buffer that periodically checks whether individually clean actions are combining into a coordinated attack pattern.
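The layered routing described above can be sketched as follows. This is an illustrative reconstruction, not the repository's implementation: `derive_scope`, `targets`, and the two LLM scorers are hypothetical stubs, and the suspicion band thresholds are placeholder values.

```python
# Hypothetical sketch of the cost-aware routing: a zero-cost deterministic
# scope filter first, then a cheap LLM sentinel, with an ambiguous band
# escalating to a stronger model. All names and thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class Verdict:
    blocked: bool
    reason: str
    llm_calls: int  # cost accounting: how many LLM layers were consulted


def derive_scope(task_description: str) -> set[str]:
    """Derive allowed resource prefixes from the task at runtime (stub)."""
    return {"/workspace", "/tmp"}


def targets(action: str) -> list[str]:
    """Extract filesystem paths an action touches (stub)."""
    return [tok for tok in action.split() if tok.startswith("/")]


def monitor(action: str, task: str,
            cheap_score=lambda a: 0.0,   # cheap LLM sentinel (stub)
            strong_score=lambda a: 0.0,  # stronger escalation model (stub)
            suspicion_lo=0.3, suspicion_hi=0.7) -> Verdict:
    # Layer 0: deterministic scope-boundary filter, no LLM cost.
    scope = derive_scope(task)
    for resource in targets(action):
        if not any(resource.startswith(p) for p in scope):
            return Verdict(True, f"out-of-scope resource: {resource}", 0)
    # Layer 1: cheap sentinel scores the action's suspicion.
    s = cheap_score(action)
    if s < suspicion_lo:
        return Verdict(False, "clean", 1)
    if s >= suspicion_hi:
        return Verdict(True, "sentinel flagged", 1)
    # Layer 2: ambiguous band escalates to the stronger model.
    return Verdict(strong_score(action) >= suspicion_hi, "escalated", 2)
```

The cost structure falls out of the control flow: an out-of-scope action is blocked before any LLM call, and only the ambiguous middle band pays for the stronger model.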

The key empirical finding is that the deterministic layer alone catches 4.7% of sophisticated human-crafted attacks at zero LLM cost, while adding the LLM layer raises detection to 90.6% on 106 adversarial trajectories from LinuxArena's GreatStraj dataset. The marginal contribution of the deterministic layer on code-level sabotage is only one percentage point over LLM monitoring alone, which is expected because code-level attacks do not touch system resources. The architectural value of the deterministic layer is cost routing on system-level threats, not detection improvement on harder attacks. On BashArena, which features system-level side tasks including privilege escalation and firewall disabling, the full protocol achieves 54% higher usefulness than the defer-to-trusted baseline while maintaining 87.5% safety and a 0.34% false positive rate.

The project also includes a calibrated comparison between blind and informed monitoring approaches. Using FP-rate-calibrated thresholds derived from honest trajectory runs rather than fixed thresholds, the blind hierarchical monitor outperforms the informed baseline at moderate and generous audit budgets but underperforms at strict budgets, with the crossover depending on how aggressively a deployment is willing to flag suspicious trajectories for review.
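FP-rate calibration of this kind amounts to choosing the threshold as a quantile of the honest-run score distribution rather than fixing it in advance. A minimal sketch, with made-up scores and flagging defined as score at or above the threshold:

```python
# Illustrative sketch: derive a monitor threshold from honest trajectory
# scores so the expected false positive rate matches a target, instead of
# using a fixed cutoff. Scores below are invented for the example.
def calibrate_threshold(honest_scores: list[float], target_fp_rate: float) -> float:
    """Return the (1 - target_fp_rate) quantile of the honest score distribution."""
    ranked = sorted(honest_scores)
    k = min(len(ranked) - 1, int((1.0 - target_fp_rate) * len(ranked)))
    return ranked[k]


honest = [0.05, 0.1, 0.12, 0.2, 0.22, 0.3, 0.31, 0.4, 0.55, 0.9]
thr = calibrate_threshold(honest, target_fp_rate=0.1)
# Flagging scores >= thr catches roughly one in ten honest runs.
```

Calibrating both the blind and informed monitors to the same false positive rate is what makes the comparison fair: each monitor is granted the same audit budget on honest traffic, so any detection difference reflects the monitor, not the threshold choice.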

All results are fully reproducible. The repository includes the complete protocol implementation, evaluation scripts, pre-extracted scores, and a single verification script that confirms every number in the paper. Raw evaluation data is available separately. Full reproduction costs approximately fifteen dollars in API credits.

The work was conducted as part of the AI Control Hackathon 2026 and received feedback from the Redwood Research team, which informed several methodological improvements including the switch from fixed-threshold to FP-rate-calibrated monitor comparisons and an honest reframing of the deterministic layer's contribution. The findings are being extended toward a full research paper with additional controlled experiments on the blind versus informed monitoring question and model scaling behavior.

Reviewer's Comments


No reviews are available yet

Cite this work

@misc{kumar2026layered,
  title={(HckPrj) Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents},
  author={Rahul Kumar},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.