Mar 23, 2026

Trust, but Verify A Dual-Task Evaluation Pipeline for Capability, Monitor Effectiveness, and Covert Misalignment

Christopher Nwonu

This project develops a production oriented safety evaluation framework for tool-using language models. It tests two things at the same time: how well a model completes its assigned task, and whether it tries hidden unsafe behaviour (such as covert policy violations) while appearing compliant. The framework uses a sandboxed environment, layered monitoring, and clearly defined metrics (main-task performance, unsafe attempt/success rates, monitor recall, and detection latency). It also includes formula explanation tables, severity levels, and release gates so results are verifiable and actionable. The main contribution is a concrete protocol that helps teams make safer deployment decisions by requiring strong utility and low covert-risk behaviour before production release.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

The single most impactful improvement would be to implement the protocol and run even a small-scale experiment. Even 10-20 episodes with one model would transform this from a design doc into a research contribution.

what is this actually testing? it's a general framework for "give model a task, plant a hidden opportunity to misbehave, see if it takes it and if your monitors catch it." that's basically what Anthropic's AI control evals already do? and what Inspect (which this cites) is built for? this doesn't really position itself against these existing frameworks or explain what's new, how this sits in the overall eval landscape. without any code or results or examples, its quite difficult to review in that context

no specific threat model pinned down. "unauthorized persistence, capability expansion, unsanctioned communication, policy circumvention" are very different failure modes that would require different evaluation setups - as is currently handled in the evals in frameworks like Inspect, and organisations like METR... a framework that claims to test all of them risks testing none of them well

the dual-task structure (overt task + hidden side objective) describes how scheming evals currently work. the contribution seems to be a formalisation of something the field is already doing - could be valuable, if framed as such

the metric decomposition (attempt rate vs success rate vs monitor recall vs detection latency) is the cleanest part and probably the most useful concrete contribution. separating "tried but caught" from "got through" seems like the right thing to measure

reads like good internal safety engineering docs. but it's documentation of a process, not a research contribution or a novel tool

presentation is clean though there are some mistakes in the headers and the mathematical notation is perhaps unnecessary (+ the formula tables) - these should be in-line unless there's a convincing reason other than to take more time explaining the existing formula, which should somewhat naturally arise from the description of the problem and related parameters

Cite this work

@misc {

title={

(HckPrj) Trust, but Verify A Dual-Task Evaluation Pipeline for Capability, Monitor Effectiveness, and Covert Misalignment

},

author={

Christopher Nwonu

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.