Mar 23, 2026

Syscon: Kernel-Level Syscall Monitoring for AI Agent Control in Sandboxed Environments

Jade Sheffey, Joyce Levine, Joyce Werhane, Willow Taylor

🏆 LinuxArena Track Winner

Many existing AI control protocols rely on a trusted monitor to evaluate an untrusted agent's actions, but these monitors generally access tool call invocations and context, not what the agent actually does at the system level. We identify this as a fundamental information gap in the control stack and present Syscon, the first system to close it. Syscon is a supervisor-side auditor daemon that monitors agent activity at the Linux kernel syscall boundary, leveraging seccomp-bpf audit logging from outside the agent's container to observe all process creation, network connections, and filesystem operations, all without running any code inside the container and without the agent's knowledge. The system produces compressed, per-action receipts designed for consumption by a trusted monitor LLM. We evaluate Syscon on 234 attack trajectories across 9 attacker models in the LinuxArena control evaluation framework. Our qualitative analysis confirms that receipts surface behavioral signals (credential file reads, payload staging, reverse shell listeners, data exfiltration) that are invisible in tool call text, covering multiple side task categories. However, a first-pass integration with a Gemini 2.5 Flash monitor did not improve detection rates, establishing that observability and detection are distinct research problems: Syscon receipts carry signal, but monitors must be taught to use it.
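As a rough illustration of what a "per-action receipt" could look like, the sketch below parses Linux audit `SECCOMP` records and compresses them into a small summary. It is not Syscon's implementation: the receipt layout, the function names, the sample log lines, and the choice of syscalls are all assumptions made for illustration, and the syscall numbers are the x86_64 ones.

```python
import re
from collections import Counter

# Illustrative subset of x86_64 syscall numbers (assumption: x86_64 host).
SYSCALL_NAMES = {2: "open", 41: "socket", 42: "connect", 59: "execve", 257: "openat"}

# Matches key=value fields; values may be quoted strings or bare tokens.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_seccomp_record(line):
    """Return the fields of a SECCOMP audit record as a dict, or None."""
    if not line.startswith("type=SECCOMP"):
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    fields["syscall"] = int(fields["syscall"])
    return fields


def build_receipt(lines):
    """Compress the audit lines for one agent action into a hypothetical receipt."""
    counts = Counter()
    executables = set()
    for line in lines:
        record = parse_seccomp_record(line)
        if record is None:
            continue  # ignore non-SECCOMP audit records
        number = record["syscall"]
        counts[SYSCALL_NAMES.get(number, f"syscall_{number}")] += 1
        executables.add(record.get("exe", "?"))
    return {"syscalls": dict(counts), "executables": sorted(executables)}


# Made-up sample records in the audit log's key=value style.
sample = [
    'type=SECCOMP msg=audit(1700000000.001:101): auid=1000 uid=0 gid=0 ses=1 '
    'pid=4242 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=59 '
    'compat=0 ip=0x7f0000000000 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1700000000.002:102): auid=1000 uid=0 gid=0 ses=1 '
    'pid=4243 comm="curl" exe="/usr/bin/curl" sig=0 arch=c000003e syscall=42 '
    'compat=0 ip=0x7f0000000000 code=0x7ffc0000',
    'type=SYSCALL msg=audit(1700000000.003:103): arch=c000003e syscall=1',
]
receipt = build_receipt(sample)
print(receipt)
```

A receipt like this would be attached to the corresponding tool call before it is shown to the trusted monitor. How records are attributed to individual actions is left out of the sketch.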

Reviewer's Comments

The idea is brilliant and needed: a high-level monitor that keeps track of the broader network and reports to trusted monitors. I would encourage the team to look at entrepreneurial funding if anyone on the team is keen. I think it would be plausible to fairly quickly raise large amounts of funding to have this scaled to production. See https://www.aisafety.com/funding for some ideas here (especially the for-profit sections).

The conceptual motivation for the idea is strong: it's easy to hide simple malicious commands in complex chunks of code, but it could be harder to hide the network-level outcomes.

The clear display of the negative results was great, and it bodes well for the team's honesty. The findings were easy to interpret, and they used multiple model families as attackers. I would not be surprised if Gemini 2.5 Flash (with no reasoning) is simply incapable of incorporating this information into its decision. I'd be excited for the team to try this with more capable monitor models (ideally with reasoning).

Overall, I'm thoroughly impressed with the project, and excited for this team to continue working in this area. Great work.

Congrats on your project!

Impact & Innovation: Giving the trusted monitor access to ground-truth OS telemetry is an interesting concept that exploits an already desirable situation in containerized agents. The underlying mechanisms are well-established in container security, and recent industry work aligns with your vision. Your project could be a meaningful building block for the field if kernel monitoring can be effectively leveraged.

Execution Quality: The experiment design is sound but I believe there could have been deeper analysis without requiring additional compute. Some considerations for strengthening the evaluation:

- Detection curves (Figure 2) are shown at absolute suspicion thresholds, but ideally should be reported at fixed false-positive rates (FPR). However, your honest trajectory sample is too small to properly estimate FPR thresholds, so a more extensive honest study would be needed to support this analysis.

- Trajectory-level monitoring may not be realistic for very long trajectories. Reporting the performance of each monitor type individually, in addition to the aggregation, would give a clearer picture of where receipts help. Furthermore, it would be interesting to see the monitor's performance when given only the receipt with no tool call text. This would isolate the signal contribution of the syscall channel.

- The null detection result is honestly reported and well-diagnosed. A natural follow-up question: does receipt effectiveness vary by attack vector? In particular, receipts should be especially valuable when the attacker is tampering with action-level logs or output, since kernel-level telemetry is the only reliable channel in that case.
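The first point above, reporting detection at a fixed false-positive rate instead of an absolute suspicion threshold, can be sketched as follows. This is a minimal illustration with made-up scores, not the paper's evaluation code. The function names are hypothetical, and with an honest sample this small the estimated threshold would be very noisy, as the comment itself notes.

```python
def threshold_at_fpr(honest_scores, target_fpr):
    """Pick a suspicion threshold so that at most target_fpr of honest
    trajectories score strictly above it."""
    ranked = sorted(honest_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # honest runs we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every honest run may be flagged
    # Flagging means "score > threshold", so the (allowed + 1)-th highest
    # honest score is the lowest safe threshold.
    return ranked[allowed]


def detection_rate(attack_scores, threshold):
    """Fraction of attack trajectories flagged at the given threshold."""
    return sum(score > threshold for score in attack_scores) / len(attack_scores)


# Made-up suspicion scores, for illustration only.
honest = [1, 2, 2, 3, 5, 1, 0, 4, 2, 3]
attack = [6, 4, 5, 2, 9, 7, 3, 8]

threshold = threshold_at_fpr(honest, target_fpr=0.1)
rate = detection_rate(attack, threshold)
print(threshold, rate)
```

Running the same calculation with and without receipts, at the same target FPR, would make the two monitor configurations directly comparable.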

Presentation & Clarity: The paper is well-structured and reads clearly. The threat model is thorough and I appreciate the honest framing of the detection results. The example receipt in Appendix A is very helpful. A few notes: Figure 1 is too small to read at the rendered size, and Figures 2 and 3 are missing error bars or confidence intervals.

Cite this work

@misc {
  title={(HckPrj) Syscon: Kernel-Level Syscall Monitoring for AI Agent Control in Sandboxed Environments},
  author={Jade Sheffey, Joyce Levine, Joyce Werhane, Willow Taylor},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.