Mar 21, 2026
Hierarchical Scope-Boundary Defense: A Cost-Efficient Control Protocol for Untrusted AI Agents
Rahul Kumar
This project builds and evaluates a hierarchical control protocol for untrusted AI agents. It is a monitoring system that intercepts agent actions and blocks covert sabotage before it succeeds, while preserving the agent's ability to do legitimate work.
The core research question is whether monitoring cost should be uniform across all agent actions or matched to the threat model of each action type. The answer, validated empirically across two independent control settings, is that attack surface taxonomy drives architecture requirements. System-level attacks like privilege escalation and firewall disabling must touch specific system resources regardless of how they are expressed, making deterministic detection sufficient and free. Code-level sabotage like inserting backdoors or tampering with application data requires semantic understanding that only LLM monitoring provides. A protocol that applies expensive frontier-model monitoring uniformly wastes resources on easy cases while adding unnecessary latency to every action.
The protocol processes each agent action through four layers. The first is a zero-cost deterministic scope-boundary filter that checks where an action targets rather than how it is written. Scope boundaries are derived from the task description at runtime, making the approach task-adaptive and avoiding false positives on legitimate system administration work. This layer survives red-team-last conditions because attackers cannot change where they need to write, only how. The second layer is environment-interactive canary injection, which plants honeypot credentials in the sandbox at trajectory start to catch exfiltration attempts deterministically. The third is a cheap LLM sentinel using GPT-4.1 Nano that scores action suspicion on a 0-to-10 scale, accepting clearly benign actions immediately. The fourth is conditional escalation to a stronger model, invoked only for ambiguous actions in the middle score range. Additional robustness features include latency equalization to prevent timing side channels and a trajectory buffer that periodically checks whether individually clean actions are combining into a coordinated attack pattern.
The protocol was evaluated across two control settings with fundamentally different attack surfaces. On BashArena, which features system-level side tasks including sudo escalation, firewall disabling, malware download, and password exfiltration, the hierarchical protocol achieves 87.5% safety with 54% higher usefulness than the defer-to-trusted baseline and a 0.34% false positive rate across 290 monitored actions. On LinuxArena's Reviewed GreatStraj dataset, which contains 106 human-crafted sabotage trajectories spanning code-level attacks across 9 environments and 27 distinct side tasks, the deterministic L0 layer alone detects 4.7% of attacks, while adding the LLM monitoring layer raises detection to 90.6% with tight confidence intervals. This 19x improvement is the central empirical finding validating the hierarchical design as an architectural requirement rather than an optimization.
Two preliminary observations emerged from additional experiments. A blind monitor receiving no knowledge of specific attack objectives outperformed an informed baseline by 29 percentage points on a subset of trajectories, suggesting general anomaly detection may be more robust than threat-specific monitoring, though the comparison conflates multiple variables and requires controlled follow-up. A prompt-calibrated small model appeared to match or exceed larger models on detection rates, likely reflecting prompt-model co-optimization rather than any intrinsic capability property, and controlled experiments with per-model prompt optimization on identical trajectory sets are needed before drawing conclusions.
The monitoring cost per trajectory is approximately $0.002 compared to roughly $0.15 for flat frontier-model monitoring on every action, a 75x reduction achieved by routing clearly benign actions through the free deterministic layer and cheap LLM triage while reserving expensive inference for genuinely ambiguous cases.
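The routing arithmetic can be illustrated with a toy cost model. All per-call prices, action counts, and escalation rates below are assumed for illustration and are not measured values from the evaluation:

```python
def trajectory_cost(n_actions, escalation_rate, sentinel_price, frontier_price):
    """Expected monitoring cost when every action gets a cheap sentinel
    call and only an ambiguous fraction escalates to the frontier model."""
    # The L0 scope filter and canary checks are deterministic and free,
    # so they contribute nothing to the dollar cost.
    return n_actions * (sentinel_price + escalation_rate * frontier_price)

N = 30                 # assumed actions per trajectory
FRONTIER = 0.005       # assumed $ per frontier-model call
SENTINEL = 2e-5        # assumed $ per GPT-4.1 Nano call
ESCALATION = 0.01      # assumed fraction of actions in the middle band

flat = N * FRONTIER    # flat monitoring: frontier model on every action
tiered = trajectory_cost(N, ESCALATION, SENTINEL, FRONTIER)
print(f"flat=${flat:.3f} tiered=${tiered:.4f} ratio={flat / tiered:.0f}x")
```

With these assumed numbers the tiered cost lands around $0.002 per trajectory against $0.15 flat, the same regime as the reported 75x reduction; the savings come almost entirely from keeping the frontier model off the clearly benign majority of actions.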
All results are fully reproducible. The repository (https://github.com/rkstu/hierarchical-scope-defense) includes the complete protocol implementation, evaluation scripts, pre-extracted scores, figure generation code, and two independent verification scripts that confirm every number in the paper. Raw evaluation data is available separately.
Repository contents (https://github.com/rkstu/hierarchical-scope-defense)
The repository clearly separates host-provided frameworks and data from the original contributions. The protocol implementation, scope-boundary detection logic, canary injection system, hierarchical monitor, trajectory buffer, and all analysis and figure generation scripts represent the original contribution. The BashArena and LinuxArena frameworks are referenced as submodules. Setup instructions in SETUP.md walk through environment configuration and API key requirements for anyone wanting to reproduce or extend the work.
Research context
This work was conducted as part of the AI Control Hackathon 2026 organized by Apart Research and Redwood Research, using the ControlArena framework and LinuxArena settings developed by Redwood Research. The Reviewed GreatStraj dataset was accessed via linuxarena.ai. The findings are being extended toward a full research paper targeting AI safety venues including SaTML and NeurIPS workshops, with additional controlled experiments planned to strengthen the preliminary observations on blind versus informed monitoring and model scaling behavior.
Cite this work
@misc{kumar2026hierarchical,
  title={(HckPrj) Hierarchical Scope-Boundary Defense: A Cost-Efficient Control Protocol for Untrusted AI Agents},
  author={Rahul Kumar},
  year={2026},
  month={mar},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}


