Mar 23, 2026
You Can't Fool All of Them: Behavioral Comparison for AI Control Against Sophisticated Code Injection
Janhavi Khindkar, Sherin Bharathiy, Soumyadeep Bose
Trusted monitoring — where a weak model reviews a strong model's code — fails completely against injection attacks. We confirm this: DeepSeek-V3 injection drops GPT-4o-mini monitoring to ~6% detection on APPS problems.
We propose a Behavioral Comparison Protocol that recovers 65% detection (16% FP) by replacing code judgment with structural comparison. Instead of asking the weak monitor to judge code, we have an independent evaluator (Qwen-32B) solve the same task and compare structurally with the attacker's code, then conduct multi-round interrogation. Injection fools readers but cannot hide structural differences between independently generated solutions.
Key results:
- 65% detection on backdoored code vs ~6% for standard monitoring (+59pp, AUC=0.727)
- Capability threshold: code-specialized models (Qwen-32B) cannot generate effective injection — only frontier models (DeepSeek-V3) can
- Collusion pilot (n=5): cross-family evaluator detects 100% vs same-family 40% on identical backdoored code — Phase 2 scores 0.00 for same-family on all tasks, empirically confirming implicit collusion from shared training distributions
Core insight: compare, don't judge — architecturally immune to injection by construction.
Setup: DeepSeek-V3 (attacker) → GPT-4o-mini (monitor) → Qwen-32B (cross-family evaluator) → DeepSeek-Coder-33B (same-family, collusion baseline). Total cost: ~$0.60.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) You Can't Fool All of Them: Behavioral Comparison for AI Control Against Sophisticated Code Injection
},
author={
Janhavi Khindkar, Sherin Bharathiy, Soumyadeep Bose
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


