Mar 23, 2026
Track 1 - Vaultrix Control
XIAOHU ZHU
We presents five novel AI control evaluation settings designed to test control protocols against misaligned AI agents operating in realistic autonomous workloads in our real world Vaultrix project. The framework uses a tested Python library (vaultrix.control) compatible with ControlArena.
The design process revealed that the more autonomous an agent is, the more its legitimate actions overlap with attack patterns, creating an irreducible false-positive problem for monitors (the safety-usefulness tradeoff).
Multi-agent Difficulty: Multi-agent collusion (CollusionNet) is qualitatively harder to detect than single-agent attacks because the attack is distributed across agents and embedded in shared artifacts, a gap current protocols do not support.
Persistence Risk: Persistence mechanisms (systemd services, cron jobs) pose the greatest unaddressed risk, as they create harm that outlasts the agent's session and bypass typical post-use container teardowns.
The current limitations are that the monitors are only regex-based baselines and would be bypassed by a sophisticated agent that obfuscates its actions. The tasks are currently evaluated via artifact dictionaries rather than live execution in fully dockerized environments.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Track 1 - Vaultrix Control
},
author={
XIAOHU ZHU
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


