Mar 31, 2025
If Everything’s Suspicious, Nothing Is
Yulong Lin, Jonathan Ng
We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.
Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.
Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.
No reviews are available yet
Cite this work
@misc {
title={
@misc {
},
author={
Yulong Lin, Jonathan Ng
},
date={
3/31/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}