Mar 22, 2026

Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents

Rahul Kumar

This project builds and evaluates a hierarchical control protocol for untrusted AI agents, designed to intercept covert sabotage attempts while preserving the agent's ability to complete legitimate work. The core research question is whether monitoring cost should be uniform across all agent actions or matched to the type of threat each action poses. The answer, validated empirically across two independent control settings, is that attack surface type determines which monitoring layer is necessary.

The protocol routes each agent action through layers of increasing cost. A zero-cost deterministic scope-boundary filter checks whether the action targets system resources outside what the task legitimately requires. This layer is task-adaptive, meaning it derives scope boundaries automatically from the task description at runtime rather than using hardcoded rules, which is why it achieves zero false positives across 369 monitored honest actions in two different settings. If the deterministic layer does not block, a cheap LLM sentinel scores the action's suspicion level, and ambiguous cases escalate to a stronger model. Additional features include honeypot credential injection for exfiltration detection, latency equalization to prevent timing side-channels, and a trajectory buffer that periodically checks whether individually clean actions are combining into a coordinated attack pattern.
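The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `route_action`, the path-prefix representation of task scope, and the `low`/`high` suspicion thresholds are all assumptions made for the example, and the two monitors are passed in as plain callables so that no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Decision:
    verdict: str  # "allow" or "block"
    layer: str    # layer that decided: "L0", "L1", or "L2"


def route_action(
    target: str,
    allowed_prefixes: Iterable[str],
    sentinel: Callable[[str], float],
    strong_monitor: Callable[[str], float],
    low: float = 0.3,   # hypothetical threshold: below this, L1 allows
    high: float = 0.7,  # hypothetical threshold: at or above this, block
) -> Decision:
    """Route one action through layers of increasing cost."""
    # L0: deterministic scope check, no LLM call. The allowed prefixes are
    # assumed to have been derived from the task description beforehand.
    if not any(target.startswith(prefix) for prefix in allowed_prefixes):
        return Decision("block", "L0")

    # L1: cheap sentinel returns a suspicion score in [0, 1].
    score = sentinel(target)
    if score < low:
        return Decision("allow", "L1")
    if score >= high:
        return Decision("block", "L1")

    # L2: only ambiguous scores reach the stronger, more expensive monitor.
    strong_score = strong_monitor(target)
    return Decision("block" if strong_score >= high else "allow", "L2")
```

In this sketch an out-of-scope target such as `/etc/sudoers` is blocked at L0 without invoking either monitor, and the stronger monitor is called only when the sentinel's score falls between the two thresholds.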

The key empirical finding is that the deterministic layer alone catches 4.7% of sophisticated human-crafted attacks at zero LLM cost, while adding the LLM layer raises detection to 90.6% on 106 adversarial trajectories from LinuxArena's GreatStraj dataset. The marginal contribution of the deterministic layer on code-level sabotage is only one percentage point over LLM monitoring alone, which is expected because code-level attacks do not touch system resources. The architectural value of the deterministic layer is cost routing on system-level threats, not detection improvement on harder attacks. On BashArena, which features system-level side tasks including privilege escalation and firewall disabling, the full protocol achieves 54% higher usefulness than the defer-to-trusted baseline while maintaining 87.5% safety and a 0.34% false positive rate.

The project also includes a calibrated comparison between blind and informed monitoring approaches. Using FP-rate-calibrated thresholds derived from honest trajectory runs rather than fixed thresholds, the blind hierarchical monitor outperforms the informed baseline at moderate and generous audit budgets but underperforms at strict budgets, with the crossover depending on how aggressively a deployment is willing to flag suspicious trajectories for review.
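One way to derive an FP-rate-calibrated threshold from honest runs is sketched below. The helper name `calibrate_threshold` and the convention that a score is flagged only when it is strictly above the threshold are assumptions for illustration; the project's own calibration procedure may differ.

```python
import math
from typing import Sequence


def calibrate_threshold(honest_scores: Sequence[float], target_fpr: float) -> float:
    """Return a threshold such that the fraction of honest scores strictly
    above it does not exceed target_fpr (the audit budget)."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")

    ranked = sorted(honest_scores, reverse=True)
    # Number of honest trajectories the budget allows us to flag.
    allowed_flags = math.floor(target_fpr * len(ranked))
    if allowed_flags >= len(ranked):
        return -math.inf  # budget covers everything: flag all scores
    # Scores strictly above the (allowed_flags + 1)-th largest are flagged,
    # so at most allowed_flags honest scores exceed the threshold.
    return ranked[allowed_flags]


def is_flagged(score: float, threshold: float) -> bool:
    return score > threshold
```

Calibrating each monitor separately at the same target rate puts the blind and informed monitors on the same audit budget, which is what makes the comparison across strict, moderate, and generous budgets meaningful.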

All results are fully reproducible. The repository includes the complete protocol implementation, evaluation scripts, pre-extracted scores, and a single verification script that confirms every number in the paper. Raw evaluation data is available separately. Full reproduction costs approximately fifteen dollars in API credits.

The work was conducted as part of the AI Control Hackathon 2026 and received feedback from the Redwood Research team, which informed several methodological improvements including the switch from fixed-threshold to FP-rate-calibrated monitor comparisons and an honest reframing of the deterministic layer's contribution. The findings are being extended toward a full research paper with additional controlled experiments on the blind versus informed monitoring question and model scaling behavior.

Reviewer's Comments

This paper implements a multi-level approach to monitoring untrusted agents. The first layer (L0) is algorithmic, while L1 and L2 are LLM-based (L2 being the stronger model). The goal is to reduce cost by relying on strong (expensive) monitors only when cheap ones fail to detect an attack.

I think that's an interesting approach to explore, but LinuxArena is not the right setting for it. LinuxArena evaluates capable agents that know they are monitored and know they have only one shot. In this setting most agent actions are innocuous, with very few well-crafted attacks.

The approach implemented in this paper spends resources on benign actions by going through L0, L1, and L2 in every step. The paper would be much stronger if it explored layered monitoring the other way around: weaker monitors escalate to stronger ones in case of doubt (so benign actions are cheaper to evaluate rather than more expensive).

The approach from this paper might be useful for settings that face a high frequency of low-quality attacks, and I'm curious to see how it fares there.

Note: The paper doesn't seem to contain results from experiments including L2 (L0 and L1 only).

I like the idea of using more traditional security techniques and adding layers to blue team protocols that aren't LLM-based. I like the transparency of your code and reasoning in your work.

I'd probably use more samples for DTT. Overall, I think this is quite interesting work. The other place to improve (which is understandable given time constraints) is to iterate with a red team that tries to attack this blue team protocol.

I overall quite like this project.

Cite this work

@misc{
  title={(HckPrj) Layered Scope-Boundary Monitoring: Cost-Aware Control for Untrusted AI Agents},
  author={Rahul Kumar},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.
