Mar 21, 2026

Hierarchical Scope-Boundary Defense: A Cost-Efficient Control Protocol for Untrusted AI Agents

Rahul Kumar

This project builds and evaluates a hierarchical control protocol for untrusted AI agents. It is a monitoring system that intercepts agent actions and blocks covert sabotage before it succeeds, while preserving the agent's ability to do legitimate work.

The core research question is whether monitoring cost should be uniform across all agent actions or matched to the threat model of each action type. The answer, validated empirically across two independent control settings, is that attack surface taxonomy drives architecture requirements. System-level attacks like privilege escalation and firewall disabling must touch specific system resources regardless of how they are expressed, making deterministic detection sufficient and free. Code-level sabotage like inserting backdoors or tampering with application data requires semantic understanding that only LLM monitoring provides. A protocol that applies expensive frontier-model monitoring uniformly wastes resources on easy cases while adding unnecessary latency to every action.

The protocol processes each agent action through four layers. The first is a zero-cost deterministic scope-boundary filter that checks where an action targets rather than how it is written. Scope boundaries are derived from the task description at runtime, making the approach task-adaptive and avoiding false positives on legitimate system administration work. This layer survives red-team-last conditions because attackers cannot change where they need to write, only how. The second component is environment-interactive canary injection, planting honeypot credentials in the sandbox at trajectory start to catch exfiltration attempts deterministically. The third layer is a cheap LLM sentinel using GPT-4.1 Nano that scores action suspicion on a 0 to 10 scale, accepting clearly benign actions immediately. The fourth layer is conditional escalation to a stronger model invoked only for ambiguous actions in the middle score range. Additional robustness features include latency equalization to prevent timing side-channels and a trajectory buffer that periodically checks whether individually clean actions are combining into a coordinated attack pattern.
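The four-layer routing above can be sketched as a single decision function. This is an illustrative reconstruction, not the protocol's actual API: all names, thresholds, and helper signatures (`Action`, `in_scope`, `leaks_canary`, the 0–10 score cutoffs) are assumptions for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    command: str       # how the action is written
    target_path: str   # where it writes -- what the L0 filter inspects

def in_scope(action: Action, allowed_prefixes: List[str]) -> bool:
    # L0: zero-cost deterministic scope-boundary check. Attackers can
    # vary how a command is phrased, but not where it must write.
    return any(action.target_path.startswith(p) for p in allowed_prefixes)

def leaks_canary(output: str, canaries: List[str]) -> bool:
    # L1: honeypot credentials planted at trajectory start; seeing one
    # in outbound data is deterministic evidence of exfiltration.
    return any(c in output for c in canaries)

def route(action: Action, output: str, allowed: List[str],
          canaries: List[str],
          cheap_score: Callable[[Action], int],
          strong_score: Callable[[Action], int]) -> str:
    """Return 'allow' or 'block' for one agent action."""
    if not in_scope(action, allowed):
        return "block"            # L0 fires: out-of-scope target, free to catch
    if leaks_canary(output, canaries):
        return "block"            # L1 fires: canary exfiltrated
    s = cheap_score(action)       # L2: cheap sentinel, 0-10 suspicion score
    if s <= 2:
        return "allow"            # clearly benign, accepted immediately
    if s >= 8:
        return "block"            # clearly malicious
    # L3: only ambiguous mid-range scores pay for the stronger model
    return "block" if strong_score(action) >= 5 else "allow"
```

In the full protocol the trajectory buffer would wrap calls to a function like `route`, retaining per-action scores so that sequences of individually clean actions can still be flagged as a coordinated pattern.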

The protocol was evaluated across two control settings with fundamentally different attack surfaces. On BashArena, which features system-level side tasks including sudo escalation, firewall disabling, malware download, and password exfiltration, the hierarchical protocol achieves 87.5% safety with 54% higher usefulness than the defer-to-trusted baseline and a 0.34% false positive rate across 290 monitored actions. On LinuxArena's Reviewed GreatStraj dataset, which contains 106 human-crafted sabotage trajectories spanning code-level attacks across 9 environments and 27 distinct side tasks, the deterministic L0 layer alone detects 4.7% of attacks, while adding the LLM monitoring layer raises detection to 90.6% with tight confidence intervals. This roughly 19x improvement is the central empirical finding validating the hierarchical design as an architectural requirement rather than an optimization.
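The "tight confidence intervals" claim on the 90.6% detection rate can be sanity-checked with a Wilson score interval. At 90.6% of 106 trajectories, the detection count is 96 (96/106 ≈ 0.906); the 95% interval below is our own back-calculation from those two numbers, not a figure taken from the paper.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_interval(96, 106)   # roughly (0.835, 0.948)
```

The interval stays well above the 4.7% deterministic-only baseline, so the headline comparison survives the sampling uncertainty of a 106-trajectory dataset.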

Two preliminary observations emerged from additional experiments. A blind monitor receiving no knowledge of specific attack objectives outperformed an informed baseline by 29 percentage points on a subset of trajectories, suggesting general anomaly detection may be more robust than threat-specific monitoring, though the comparison conflates multiple variables and requires controlled follow-up. A prompt-calibrated small model appeared to match or exceed larger models on detection rates, likely reflecting prompt-model co-optimization rather than any intrinsic capability property, and controlled experiments with per-model prompt optimization on identical trajectory sets are needed before drawing conclusions.

The monitoring cost per trajectory is approximately $0.002 compared to roughly $0.15 for flat frontier-model monitoring on every action, a 75x reduction achieved by routing clearly benign actions through the free deterministic layer and cheap LLM triage while reserving expensive inference for genuinely ambiguous cases.
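A back-of-envelope model makes the routing arithmetic behind the 75x figure concrete. Only the headline numbers ($0.002 vs. roughly $0.15 per trajectory) come from the text; the per-call prices, action count, and routing fractions below are assumptions chosen to illustrate the mechanism.

```python
ACTIONS = 30                 # assumed actions per trajectory
FRONTIER_PRICE = 0.005       # assumed $ per frontier-model call
CHEAP_PRICE = 0.0001         # assumed $ per nano-sentinel call

def flat_cost(n: int = ACTIONS) -> float:
    # Flat monitoring: every action pays the frontier-model price.
    return n * FRONTIER_PRICE

def hierarchical_cost(n: int = ACTIONS,
                      frac_deterministic: float = 0.60,  # resolved free at L0/L1
                      frac_escalated: float = 0.01) -> float:
    # Actions surviving the free deterministic layers hit the cheap
    # sentinel; only the ambiguous fraction also pays for the strong model.
    cheap_calls = n * (1 - frac_deterministic)
    frontier_calls = n * frac_escalated
    return cheap_calls * CHEAP_PRICE + frontier_calls * FRONTIER_PRICE
```

Under these assumed parameters the hierarchical cost lands in the low-millidollar range per trajectory, reproducing the order of magnitude of the reported reduction: the savings come almost entirely from how few actions ever reach the frontier model.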

All results are fully reproducible. The repository (https://github.com/rkstu/hierarchical-scope-defense) includes the complete protocol implementation, evaluation scripts, pre-extracted scores, figure generation code, and two independent verification scripts that confirm every number in the paper. Raw evaluation data is available separately.

Repository contents [https://github.com/rkstu/hierarchical-scope-defense]

The repository clearly separates host-provided frameworks and data from the original contributions. The protocol implementation, scope-boundary detection logic, canary injection system, hierarchical monitor, trajectory buffer, and all analysis and figure generation scripts represent the original contribution. The BashArena and LinuxArena frameworks are referenced as submodules. Setup instructions in SETUP.md walk through environment configuration and API key requirements for anyone wanting to reproduce or extend the work.

Research context

This work was conducted as part of the AI Control Hackathon 2026 organized by Apart Research and Redwood Research, using the ControlArena framework and LinuxArena settings developed by Redwood Research. The Reviewed GreatStraj dataset was accessed via linuxarena.ai. The findings are being extended toward a full research paper targeting AI safety venues including SaTML and NeurIPS workshops, with additional controlled experiments planned to strengthen the preliminary observations on blind versus informed monitoring and model scaling behavior.

Reviewer's Comments


No reviews are available yet

Cite this work

@misc{kumar2026hierarchical,
  title={(HckPrj) Hierarchical Scope-Boundary Defense: A Cost-Efficient Control Protocol for Untrusted AI Agents},
  author={Rahul Kumar},
  date={3/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.