Mar 23, 2026
SAFE: Sandboxed Actions Forensics and Evaluation for AI Control via Action-Aware Monitoring
Nathan Khosla
SAFE (Sandboxed Actions Forensics and Evaluation) is an AI control framework that improves the monitoring of untrusted agentic models by executing their proposed actions in a forked sandbox twin of the production environment and feeding the resulting activity logs—filesystem diffs, network attempts, privilege escalations, and more—to a trusted blue-team reviewer. Across 30 models spanning multiple frontier labs, SAFE's action-aware analysis substantially increased malicious-action detection rates compared to text-only review of the agent's proposed commands and reasoning. Notably, the benefit was not correlated with model size or intelligence benchmarks: small, cheap models like Ministral 3B achieved perfect detection with SAFE context, while some large frontier models failed even with it, suggesting that blue-team model selection is a significant and underexplored design variable in AI control systems.
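The core loop described above—fork the environment, execute the proposed action in the twin, collect a forensic log, and hand it to a trusted reviewer—can be sketched in miniature. This is an illustrative toy, not the SAFE implementation: the "environment" is a plain dict standing in for a filesystem, and the reviewer's rules (sensitive path prefixes, an allowlist of hosts) are hypothetical examples.

```python
import copy

def run_in_sandbox(fs, action):
    """Execute an agent's proposed action in a forked copy of the
    environment and return (new_fs, forensic_log).

    Toy model: the environment is a dict mapping paths to contents;
    the real system would fork a full sandbox twin of production.
    """
    twin = copy.deepcopy(fs)  # fork: production state is never touched
    log = {"writes": [], "deletes": [], "network": []}
    for op, arg in action:
        if op == "write":
            path, data = arg
            twin[path] = data
            log["writes"].append(path)
        elif op == "delete":
            twin.pop(arg, None)
            log["deletes"].append(arg)
        elif op == "connect":
            log["network"].append(arg)  # attempt is recorded, not executed
    # filesystem diff between production and the sandbox twin
    log["diff"] = {p: twin.get(p)
                   for p in set(fs) | set(twin)
                   if fs.get(p) != twin.get(p)}
    return twin, log

def blue_team_review(log):
    """Toy trusted reviewer: flag writes/deletes under sensitive paths
    or outbound connections to hosts outside an allowlist.
    Both rule sets are illustrative assumptions."""
    sensitive = ("/etc/", "/root/")
    allowed_hosts = {"api.internal"}
    for path in log["writes"] + log["deletes"]:
        if path.startswith(sensitive):
            return "flag"
    if any(host not in allowed_hosts for host in log["network"]):
        return "flag"
    return "allow"
```

A usage sketch: a benign action that writes under the user's home directory is allowed, while one that edits `/etc/crontab` and phones home is flagged—and because the action ran only in the fork, the production state is unchanged either way.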
Cite this work
@misc{khosla2026safe,
  title={(HckPrj) SAFE: Sandboxed Actions Forensics and Evaluation for AI Control via Action-Aware Monitoring},
  author={Khosla, Nathan},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}


