Mar 23, 2026
Trust, but Verify A Dual-Task Evaluation Pipeline for Capability, Monitor Effectiveness, and Covert Misalignment
Christopher Nwonu
This project develops a production oriented safety evaluation framework for tool-using language models. It tests two things at the same time: how well a model completes its assigned task, and whether it tries hidden unsafe behaviour (such as covert policy violations) while appearing compliant. The framework uses a sandboxed environment, layered monitoring, and clearly defined metrics (main-task performance, unsafe attempt/success rates, monitor recall, and detection latency). It also includes formula explanation tables, severity levels, and release gates so results are verifiable and actionable. The main contribution is a concrete protocol that helps teams make safer deployment decisions by requiring strong utility and low covert-risk behaviour before production release.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Trust, but Verify A Dual-Task Evaluation Pipeline for Capability, Monitor Effectiveness, and Covert Misalignment
},
author={
Christopher Nwonu
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


