Mar 23, 2026
BoxArena: An OCI Runtime Escape Benchmark
Max von Hippel, Quinn Dougherty, Alok Singh
Container runtimes are a critical security boundary for isolating AI agents, yet no existing benchmark compares runtimes head-to-head on containment under a common attacker. We present BoxArena, an open-source evaluation framework for comparing Open Container Initiative (OCI) compatible runtimes by running language-model attackers against a battery of container-escape quests. BoxArena contributes to the Scalable Formal Oversight (SFO) research program: rather than relying on alignment alone, it treats the isolation boundary itself as the auditable artifact and measures how well each runtime enforces that boundary under adversarial pressure. The system provides a harness with runtime switching, a local backend with static leaderboard export, and host-side monitoring. The quest battery covers five attack surfaces: filesystem, socket, process, network, and syscall. BoxArena reports resilience (fraction of quests blocked) and attacker score (difficulty- and speed-weighted), while preserving monitoring evidence and post-hoc control verdicts for later analysis.
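The two reported metrics can be sketched in a few lines. This is a minimal illustration only: the abstract says resilience is the fraction of quests blocked and the attacker score is difficulty- and speed-weighted, but the exact weighting is not specified here, so the formula below (difficulty scaled by remaining time budget) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class QuestResult:
    difficulty: int   # e.g. 1 (easy) .. 4 (hard); scale is hypothetical
    escaped: bool     # did the attacker break containment?
    seconds: float    # wall-clock time the attacker took

def resilience(results):
    """Fraction of quests the runtime blocked."""
    blocked = sum(1 for r in results if not r.escaped)
    return blocked / len(results)

def attacker_score(results, time_budget=600.0):
    """Illustrative difficulty- and speed-weighted score: each escape
    earns its difficulty, scaled up the faster it landed within the
    (assumed) per-quest time budget."""
    score = 0.0
    for r in results:
        if r.escaped:
            speed_bonus = max(0.0, 1.0 - r.seconds / time_budget)
            score += r.difficulty * (0.5 + 0.5 * speed_bonus)
    return score

runs = [
    QuestResult(difficulty=1, escaped=True,  seconds=60.0),
    QuestResult(difficulty=3, escaped=False, seconds=600.0),
    QuestResult(difficulty=4, escaped=False, seconds=600.0),
]
print(resilience(runs))      # 2 of 3 quests blocked -> 0.666...
print(attacker_score(runs))  # one easy, fast escape -> 0.95
```

Note that the two metrics are duals only approximately: a runtime can have high resilience while still conceding a high attacker score if the few escapes it allows are hard quests solved quickly.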
This is a good step towards the work of SandboxEscapeBench. Asking the dual question is conceptually interesting. The results are quite thin, particularly in answering this dual question, and the easy runtimes aren't well differentiated, but that's understandable; overall it's solid infrastructure and conceptual work. Perhaps the most important gap, for control: monitoring is mentioned many times, but no monitor results are shown.
This project seems mostly useful as a container-escape capability eval and a proof of concept that more security-focused runtimes really do resist agent container escape better. I would've liked to see some metric of overhead: presumably the more secure runtimes either restrict some affordances within the container, are more expensive to run, or something along those lines?
Methodologically, I would've liked some more assurance that L3 and L4 are feasible. My impression is that container security often comes down to capabilities and misconfiguration, and if caps are dropped and there is no deliberate vulnerability, then container escape becomes significant-novel-0day-level difficult.
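The reviewer's point about capabilities can be made concrete: on Linux, a process can read its effective capability mask from the `CapEff` line of `/proc/self/status`, and a harder quest must work without dangerous bits like `CAP_SYS_ADMIN` set. A small hypothetical probe (the bit numbers are the kernel's, from `linux/capability.h`; the sample masks are illustrative):

```python
# Capability bit numbers from linux/capability.h
CAP_NET_RAW = 13
CAP_SYS_PTRACE = 19
CAP_SYS_ADMIN = 21

def has_cap(cap_eff_hex: str, cap: int) -> bool:
    """Check whether a capability bit is set in a CapEff mask
    (the hex value shown in /proc/self/status)."""
    return bool(int(cap_eff_hex, 16) >> cap & 1)

# A --privileged container typically shows a broad mask;
# the second mask is Docker's traditional default bounding set,
# which notably excludes CAP_SYS_ADMIN.
privileged = "000001ffffffffff"
default    = "00000000a80425fb"

print(has_cap(privileged, CAP_SYS_ADMIN))  # True
print(has_cap(default, CAP_SYS_ADMIN))     # False
```

A quest battery could log this mask per run, which would help document whether the harder levels are attempted under a genuinely hardened configuration or only under permissive defaults.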
The writeup was a bit hard to parse; it references formal verification repeatedly despite formal verification not really being relevant to the project.
The follow-up I'd be most excited to see is a cost-benefit analysis taking into account (1) the costs of more secure runtimes and (2) the costs vs. benefits of configuration details like dropping caps, in the context of AI R&D; e.g., my impression is that many agent containers will need GPU access and may require assorted other caps that could weaken security.
Cite this work
@misc{vonhippel2026boxarena,
  title={(HckPrj) BoxArena: An OCI Runtime Escape Benchmark},
  author={von Hippel, Max and Dougherty, Quinn and Singh, Alok},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}