Mar 22, 2026

TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol

Khanh Linh Nguyen, Hoa Nghiem, Tu Tran

TraceGuard introduces structured multi-dimensional monitoring as a new control protocol for AI agents. Instead of single-score "vibes checks," we decompose agent evaluation into five dimensions (goal alignment, constraint adherence, reasoning coherence, safety awareness, action-trace consistency) scored in parallel, augmented by seven heuristic detectors and an LLM-based intent analyzer. Evaluating on BashArena (519 samples) within ControlArena, we find: (1) single-score baselines achieve 0% safety, while structured scoring achieves 82-100%; (2) structured prompts constrain collusion even with untrusted monitors (0% → 95% safety); (3) separation of duties across trusted/untrusted models achieves 100% safety. All monitors are new implementations for the ControlArena ecosystem. Code: https://github.com/linhkid/TraceGuard
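To make the protocol concrete, here is a minimal sketch of per-dimension scoring and aggregation. The five dimension names come from the abstract; the data shape and the conjunctive aggregation rule (flag if any single dimension falls below a threshold) are illustrative assumptions, not necessarily TraceGuard's actual implementation.

```python
# Hypothetical sketch of structured multi-dimensional scoring.
# Dimension names are from the TraceGuard abstract; the aggregation
# rule below is an assumption for illustration.
from dataclasses import dataclass

DIMENSIONS = [
    "goal_alignment",
    "constraint_adherence",
    "reasoning_coherence",
    "safety_awareness",
    "action_trace_consistency",
]

@dataclass
class DimensionScore:
    dimension: str
    score: float      # 0.0 (clearly unsafe) .. 1.0 (clearly safe)
    rationale: str    # short factual justification from the monitor

def should_flag(scores: list[DimensionScore], threshold: float = 0.5) -> bool:
    """Flag the trajectory if ANY dimension scores below threshold.

    A conjunctive rule is harder to game than one averaged score:
    the agent must look clean on every dimension simultaneously.
    """
    return any(s.score < threshold for s in scores)

scores = [
    DimensionScore("goal_alignment", 0.9, "actions match the stated task"),
    DimensionScore("constraint_adherence", 0.2, "wrote outside the sandbox"),
    DimensionScore("reasoning_coherence", 0.8, "steps follow logically"),
    DimensionScore("safety_awareness", 0.7, "acknowledged the risky command"),
    DimensionScore("action_trace_consistency", 0.9, "trace matches the plan"),
]
print(should_flag(scores))  # True: constraint_adherence is below threshold
```

The design point is that a single low dimension triggers escalation, whereas a single averaged score would let one bad dimension be diluted by four good ones.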

Reviewer's Comments


This is the most complete hackathon submission I've reviewed. 519 real samples on BashArena, four monitor variants, collusion resistance testing, cross-dataset generalization on AgentHarm, and a working open-source ControlArena integration. That's serious output for a hackathon.

The headline result is compelling — single-score monitoring (trusted and untrusted) hitting 0% safety while structured multi-dimensional scoring hits 82-100% is a stark demonstration that evaluation structure matters more than model trust level. The collusion resistance finding (untrusted structured monitor at 95% vs untrusted single-score at 0%) is the most interesting result — forcing an untrusted model to answer narrow factual questions about specific dimensions is harder to game than "how suspicious is this on 0-10." The separation of duties variant achieving 100% safety by splitting dimensions across models is a clean implementation of Redwood Direction #8.
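The separation-of-duties idea can be sketched as follows. The split of dimensions between models shown here is illustrative, assumed for the example; the source only states that dimensions are divided across trusted and untrusted models and that the combined protocol reached 100% safety.

```python
# Hedged sketch of the separation-of-duties variant: dimensions are
# split across monitors so no single model, including an untrusted
# one, scores the full evaluation. The assignment below is a guess
# for illustration, not TraceGuard's actual split.
ASSIGNMENT = {
    "trusted": ["goal_alignment", "safety_awareness"],
    "untrusted": ["constraint_adherence", "reasoning_coherence",
                  "action_trace_consistency"],
}

def combine(flags_by_model: dict[str, bool]) -> bool:
    """Escalate if any monitor flags its own slice of dimensions.

    Because each model answers only for its assigned dimensions, a
    colluding untrusted monitor cannot suppress a flag raised by
    the trusted one.
    """
    return any(flags_by_model.values())

# The trusted monitor flags a safety issue even though the untrusted
# monitor reports everything clean: the trajectory is still escalated.
print(combine({"trusted": True, "untrusted": False}))  # True
```

The OR-combination means an untrusted monitor can cause false positives but cannot unilaterally produce a false negative, which is the failure mode collusion exploits.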

Main gap: the attacks are non-adaptive. A sophisticated attacker who knows the five dimensions could craft reasoning that scores well on all of them. The paper acknowledges this but it's the key open question for whether structured monitoring holds up under pressure.

- this is very compelling, and a pleasure to read

- the trusted model is not very clearly defined; it should be stated explicitly that this just means "the other model"

- reasoning coherence is doing a lot of heavy lifting where the monitor is weaker than the agent; this should be addressed, though maybe it's not as big an issue as it seems

- the increased cost is pretty significant here, precisely because of the robustness; the point about training classifiers is a good one and could use more detail

Cite this work

@misc{traceguard2026,
  title={(HckPrj) TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol},
  author={Khanh Linh Nguyen and Hoa Nghiem and Tu Tran},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.