Adaptive Permission Sandbox for LLM Agents

Shaheen Raja

LLM agents are powerful but unsafe by default, often executing harmful or unauthorized actions such as destructive SQL queries or leaking sensitive data. My project introduces an Adaptive Permission Sandbox with a human‑readable DSL that enforces safety rules proactively, before execution.

Policies cover static restrictions (e.g., deny DROP TABLE), stateful requirements (e.g., require schema_lookup before SELECT), data‑flow protections, and runtime guards. Each rule is compiled into monitorable formulas and formally verified, ensuring provable guarantees and auditability. By blocking unsafe behavior at the source, this system delivers adversarial robustness, compliance readiness, and a practical path toward trustworthy AI agent deployment.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Strengths. This is a clearly written, well-structured submission that frames a real and important problem — proactive enforcement at the tool-call boundary for LLM agents — in a way that is accessible to security engineers without sacrificing rigor. The "AWS IAM for AI agents, formally verified" framing is genuinely useful: it maps onto existing operator intuitions while pointing at the additional axes (temporal sequencing, data-flow labels, runtime resource limits) where agent permissions need to go beyond classical IAM. The four-category taxonomy — static, stateful, data-flow, and runtime — covers an orthogonal and useful space. The control-plane vs. data-plane separation argument for adversarial robustness is the strongest conceptual contribution here: because the monitor operates on structured tool-call metadata rather than the model's natural-language reasoning, injected text in tool outputs cannot reach the enforcement decision surface. This is a meaningfully stronger story than content filtering and worth foregrounding even further. The DSL example, the worked enforcement trace in Listing 2, the architecture description in Section 4.1, and the terminal screenshot on slide 5 showing a DROP blocked while a SELECT succeeds together make the basic idea concrete and inspectable.

Opportunities to strengthen. The biggest gap is an inconsistency between what the paper claims and what the slides describe as actually built, and more generally between the formal claims and the evidence provided. A few places where this could be tightened, in roughly descending order of importance:

1. The paper and slides disagree on what is delivered. The paper presents Lean 4 proof-carrying compilation and Z3 satisfiability checks as part of the current pipeline (e.g., abstract: "compiled into monitorable LTLf formulas, and formally verified with Lean/Z3"; Section 6.1: "Every compiled policy emits a Lean 4 proof obligation..."). The slide deck's "Next Steps" panel, however, lists "Integrate formal verification (Lean/Z3)" as future work. This is a load-bearing inconsistency: the formal verification angle is the submission's most novel and differentiated claim, and the slides effectively acknowledge it is aspirational. Two clean ways to resolve this: either (a) build a minimal Lean proof for one compiled policy and link to it, or (b) reframe the paper's abstract and Section 6.1 so that the verification pipeline is clearly labeled as planned rather than delivered. The current mismatch will undermine reader trust in the rest of the claims.

2. Empirical evidence is largely absent from the paper. Section 5 is explicitly titled "Evaluation Plan" and contains targets ("we target sub-millisecond overhead") rather than measurements. The latency numbers cited ("under 0.1 ms" for static rule evaluation, "under 0.5 ms" for stateful monitor advancement on traces up to 1,000 events) are described as "initial profiling on synthetic workloads," but no methodology, hardware, workload definition, or distribution of results is reported. The three adversarial scenario suites in Table 3 (SQL Destruction, File Exfiltration, Secret Leakage) are described but not run — there are no block rates, no detection rates, no false-positive numbers. The slide-5 screenshot is the only piece of evidence that any of this runs, and it covers only the basic SQL allow/deny case. Even a small results table on a handful of scenarios would substantially raise credibility.

3. The data-flow tracker is the hardest piece and gets the least treatment. Section 4.1 describes it in one sentence — "propagates taint labels through tool outputs and checks flow rules before any data leaves the sandbox" — and the slide-5 DSL example notably omits the flow keyword that's in the paper's Listing 1, suggesting this component may also be aspirational in the current prototype. Information-flow control is notoriously subtle: implicit flows, label transformations under aggregation, joins on tainted data, and declassification policies are all genuinely hard. It would strengthen the submission to acknowledge this complexity explicitly and describe what the prototype actually handles versus what it punts to future work.

4. A public artifact would help a lot. There is no GitHub link, no implementation source for any of the five components, and Listing 2 is presented as an illustrative trace rather than real captured output. A minimal repo containing the DSL parser, the SQL-agent prototype that produced the slide-5 screenshot, and (when integrated) the Lean policy file would let reviewers verify the work end-to-end.

5. Clarify Z3's role. The abstract says policies are "formally verified with Lean/Z3," which suggests parity between the two tools. Section 6.1 reveals Z3 is used only for compile-time satisfiability and contradiction checks, not for the core LTLf-to-DFA correctness proof (which is Lean's job). Tightening this distinction would help readers calibrate the depth of the verification story.

6. Comparison to existing solutions is missing. Related work covers IAM, specification mining (Chen et al.), BabelBench, and formal LTLf monitoring tools (MONA, Spot), but does not engage with the practical agent-guardrail ecosystem already shipped by industry. A short paragraph contrasting the formal-methods angle against these — including the data plane / control plane argument framed as a concrete head-to-head example — would help position the contribution.

7. Citations need a proper bibliography. "Chen et al. (2024)," "von Hippel et al.," and "BabelBench (2024)" are referenced in the body, but there is no References section anywhere in the document. Adding a bibliography with venues and links would let readers follow the threads.

8. Residual attack surface inside permitted actions deserves more attention. A SELECT that exfiltrates the right rows, or a benign-looking sequence that composes into harm, remains in scope even with the static and stateful rules. The data-flow tracker is presumably the answer, which makes its under-specification (point 3) load-bearing. Walking through one such attack end-to-end and showing exactly how the flow labels catch it would tighten the threat-model story.

Overall. The design instincts here are good, the framing is sharp, and the architectural invariant about control plane vs. data plane is a genuinely useful contribution to articulate. The path to making this a substantially stronger piece of work is mostly about aligning the paper's claims with what the slides acknowledge as the current state of the prototype — and then either delivering on the Lean/Z3 verification before the final cutoff or reframing it as planned work. With those adjustments, this could become a credible contribution to the agent-safety toolchain space.

The paper presents a formally verified system with provable guarantees, yet none of it appears to have actually run. Section 5 is an evaluation plan, the Lean and Z3 proof certificates are described but never produced, and Listing 2 is a hand-written session illustration. Implementing even the single SQL agent scenario with real block rates and latencies would strengthen the argument far more than the current breadth, and references such as "Chen et al." and "von Hippel et al." need full titles, venues, and years so the related work can be checked.

Cite this work

@misc {

title={

(HckPrj) Adaptive Permission Sandbox for LLM Agents

},

author={

Shaheen Raja

},

date={

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Read More

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.