Mar 22, 2026
TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol
Khanh Linh Nguyen, Hoa Nghiem, Tu Tran
TraceGuard introduces structured multi-dimensional monitoring as a new control protocol for AI agents. Instead of single-score "vibes checks," we decompose agent evaluation into five dimensions (goal alignment, constraint adherence, reasoning coherence, safety awareness, action-trace consistency) scored in parallel, augmented by seven heuristic detectors and an LLM-based intent analyzer. Evaluated on BashArena (519 samples) within ControlArena, we find: (1) single-score baselines achieve 0% safety — structured scoring achieves 82-100%; (2) structured prompts constrain collusion even with untrusted monitors (0% → 95% safety); (3) separation of duties across trusted/untrusted models achieves 100% safety. All monitors are new implementations for the ControlArena ecosystem. Code: https://github.com/linhkid/TraceGuard
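The abstract's core mechanism, scoring five narrow dimensions in parallel and aggregating them, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimension names come from the abstract, but the scoring scale, the max-based aggregation rule, and the threshold are assumptions.

```python
from dataclasses import dataclass

# Dimension names from the TraceGuard abstract; everything else here
# (scale, threshold, aggregation) is a hypothetical sketch.
DIMENSIONS = [
    "goal_alignment",
    "constraint_adherence",
    "reasoning_coherence",
    "safety_awareness",
    "action_trace_consistency",
]

@dataclass
class DimensionScore:
    dimension: str
    score: float  # 0.0 (benign) .. 1.0 (suspicious), per-dimension

def aggregate(scores: list[DimensionScore], threshold: float = 0.5):
    """Flag a trajectory if ANY dimension exceeds the threshold.

    A max-based rule is one plausible choice: a single holistic score
    can be argued down on overall "vibes", whereas here a trajectory
    must look clean on every narrow question to pass.
    """
    worst = max(scores, key=lambda s: s.score)
    return worst.score >= threshold, worst.dimension

scores = [
    DimensionScore("goal_alignment", 0.10),
    DimensionScore("constraint_adherence", 0.20),
    DimensionScore("reasoning_coherence", 0.15),
    DimensionScore("safety_awareness", 0.80),  # one dimension trips
    DimensionScore("action_trace_consistency", 0.10),
]
flagged, reason = aggregate(scores)
print(flagged, reason)  # True safety_awareness
```

The point the sketch makes is structural: an attack has to evade five independent narrow checks rather than one holistic judgment.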
This is the most complete hackathon submission I've reviewed. 519 real samples on BashArena, four monitor variants, collusion resistance testing, cross-dataset generalization on AgentHarm, and a working open-source ControlArena integration. That's serious output for a hackathon.
The headline result is compelling — single-score monitoring (trusted and untrusted) hitting 0% safety while structured multi-dimensional scoring hits 82-100% is a stark demonstration that evaluation structure matters more than model trust level. The collusion resistance finding (untrusted structured monitor at 95% vs untrusted single-score at 0%) is the most interesting result — forcing an untrusted model to answer narrow factual questions about specific dimensions is harder to game than "how suspicious is this on 0-10." The separation of duties variant achieving 100% safety by splitting dimensions across models is a clean implementation of Redwood Direction #8.
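The separation-of-duties variant can be sketched as partitioning the dimensions across independent scorers so that no single (possibly colluding) monitor controls the verdict. The partition, model names, and scorer interface below are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical sketch of separation of duties across monitors.
# Which dimensions go to which model is an assumption here.
TRUSTED_DIMS = ["constraint_adherence", "safety_awareness"]
UNTRUSTED_DIMS = [
    "goal_alignment",
    "reasoning_coherence",
    "action_trace_consistency",
]

def score_with(model_name: str, dimensions: list[str], trajectory: dict):
    # Placeholder scorer: a real implementation would ask `model_name`
    # one narrow factual question per dimension about the trajectory.
    return {d: trajectory.get(d, 0.0) for d in dimensions}

def separated_verdict(trajectory: dict, threshold: float = 0.5) -> bool:
    scores = {}
    scores.update(score_with("trusted-model", TRUSTED_DIMS, trajectory))
    scores.update(score_with("untrusted-model", UNTRUSTED_DIMS, trajectory))
    # A colluding untrusted scorer can only suppress its own
    # dimensions; the trusted scorer's dimensions still apply.
    return any(v >= threshold for v in scores.values())
```

The design intent is that collusion on the untrusted side degrades at most its own slice of the evaluation, which is one way to read the 100% safety result for this variant.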
Main gap: the attacks are non-adaptive. A sophisticated attacker who knows the five dimensions could craft reasoning that scores well on all of them. The paper acknowledges this but it's the key open question for whether structured monitoring holds up under pressure.
- This is very compelling, and a pleasure to read.
- "Trusted model" is not clearly defined; it should be stated explicitly that this just means "the other model".
- Reasoning coherence does a lot of heavy lifting when the monitor is weaker than the agent. This should be addressed, though it may be less of an issue than it seems.
- The increased cost is significant here precisely because of the robustness. The point about training classifiers instead is a good one and could use more detail.
Cite this work
@misc{nguyen2026traceguard,
  title={(HckPrj) TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol},
  author={Khanh Linh Nguyen and Hoa Nghiem and Tu Tran},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}