Summary
We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.
Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.
Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.
Cite this work:
@misc {
title={
If Everything’s Suspicious, Nothing Is
},
author={
Yulong Lin, Jonathan Ng
},
date={
3/31/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}