Mar 22, 2026
Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents
Rahul Kumar
This project builds and evaluates a hierarchical control protocol for untrusted AI agents, designed to intercept covert sabotage attempts while preserving the agent's ability to complete legitimate work. The core research question is whether monitoring cost should be uniform across all agent actions or matched to the type of threat each action poses. The answer, validated empirically across two independent control settings, is that attack surface type determines which monitoring layer is necessary.
The protocol routes each agent action through layers of increasing cost. A zero-cost deterministic scope-boundary filter checks whether the action targets system resources outside what the task legitimately requires. This layer is task-adaptive, meaning it derives scope boundaries automatically from the task description at runtime rather than using hardcoded rules, which is why it achieves zero false positives across 369 monitored honest actions in two different settings. If the deterministic layer does not block, a cheap LLM sentinel scores the action's suspicion level, and ambiguous cases escalate to a stronger model. Additional features include honeypot credential injection for exfiltration detection, latency equalization to prevent timing side-channels, and a trajectory buffer that periodically checks whether individually clean actions are combining into a coordinated attack pattern.
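The layered routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function and parameter names (`derive_scope`, `route_action`, the suspicion cutoffs) are hypothetical, and the scorers are stubbed callables standing in for the cheap sentinel and the stronger escalation model.

```python
def derive_scope(task_description):
    """Hypothetical task-adaptive scope derivation: extract the set of
    resources a task legitimately requires from its description. A trivial
    keyword rule stands in for the real runtime logic."""
    scope = {"workdir"}
    if "network" in task_description:
        scope.add("network")
    return scope

def route_action(action, task_description, cheap_score, strong_score,
                 suspicion_low=0.2, suspicion_high=0.8):
    """Route one agent action through layers of increasing cost."""
    # Layer 0: zero-cost deterministic scope-boundary filter.
    allowed = derive_scope(task_description)
    if action["resource"] not in allowed:
        return "block"  # out-of-scope resource: blocked with no LLM call
    # Layer 1: cheap LLM sentinel scores the action's suspicion level.
    score = cheap_score(action)
    if score < suspicion_low:
        return "allow"
    if score >= suspicion_high:
        return "flag"
    # Layer 2: ambiguous cases escalate to a stronger model.
    return "flag" if strong_score(action) >= suspicion_high else "allow"

# Usage with stubbed scorers: an in-scope but ambiguous action escalates.
action = {"resource": "network", "cmd": "curl http://example.com"}
verdict = route_action(action, "task requiring network access",
                       cheap_score=lambda a: 0.5,
                       strong_score=lambda a: 0.9)
print(verdict)  # -> "flag"
```

The cost structure falls out of the ordering: the deterministic filter handles system-level scope violations for free, so the LLM layers are only invoked for actions that are already within scope.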
The key empirical finding is that the deterministic layer alone catches 4.7% of sophisticated human-crafted attacks at zero LLM cost, while adding the LLM layer raises detection to 90.6% on 106 adversarial trajectories from LinuxArena's GreatStraj dataset. The marginal contribution of the deterministic layer on code-level sabotage is only one percentage point over LLM monitoring alone, which is expected because code-level attacks do not touch system resources. The architectural value of the deterministic layer is cost routing on system-level threats, not detection improvement on harder attacks. On BashArena, which features system-level side tasks including privilege escalation and firewall disabling, the full protocol achieves 54% higher usefulness than the defer-to-trusted baseline while maintaining 87.5% safety and a 0.34% false positive rate.
The project also includes a calibrated comparison between blind and informed monitoring approaches. Using FP-rate-calibrated thresholds derived from honest trajectory runs rather than fixed thresholds, the blind hierarchical monitor outperforms the informed baseline at moderate and generous audit budgets but underperforms at strict budgets, with the crossover depending on how aggressively a deployment is willing to flag suspicious trajectories for review.
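FP-rate calibration of the kind described above can be illustrated with a simple quantile rule: given suspicion scores collected from honest trajectory runs, pick the threshold such that the desired fraction of honest actions scores above it. This percentile sketch is a standard technique and an assumption on my part, not necessarily the exact procedure used in the project.

```python
import math

def calibrate_threshold(honest_scores, target_fp_rate):
    """Choose the smallest honest score such that the fraction of honest
    scores strictly above it is at most target_fp_rate (the (1 - fp)
    empirical quantile of honest suspicion scores)."""
    s = sorted(honest_scores)
    k = max(0, math.ceil(len(s) * (1.0 - target_fp_rate)) - 1)
    return s[k]

# Illustrative honest-run suspicion scores (made up for the example).
honest = [0.05, 0.1, 0.12, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
thr = calibrate_threshold(honest, target_fp_rate=0.1)
flagged = sum(score > thr for score in honest)
print(thr, flagged)  # -> 0.6 1  (exactly 10% of honest actions exceed it)
```

Calibrating both monitors to the same honest-run FP rate, rather than fixing a shared raw threshold, is what makes the blind-versus-informed comparison fair: each monitor's score scale is normalized against its own behavior on honest work.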
All results are fully reproducible. The repository includes the complete protocol implementation, evaluation scripts, pre-extracted scores, and a single verification script that confirms every number in the paper. Raw evaluation data is available separately. Full reproduction costs approximately fifteen dollars in API credits.
The work was conducted as part of the AI Control Hackathon 2026 and received feedback from the Redwood Research team, which informed several methodological improvements including the switch from fixed-threshold to FP-rate-calibrated monitor comparisons and an honest reframing of the deterministic layer's contribution. The findings are being extended toward a full research paper with additional controlled experiments on the blind versus informed monitoring question and model scaling behavior.
Cite this work
@misc{kumar2026layered,
  title={(HckPrj) Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents},
  author={Rahul Kumar},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


