Mar 23, 2026
Passing the Test, Breaking the System: Latent Failures in Software Engineering Agents
Jeremias Ferrao
Real-world engineering post-mortems document a recurring failure pattern: a developer completes a task successfully on the surface but leaves behind code that causes harm later — for example, credentials exposed in logs, state mutations without transaction guards, or log files with no rotation policy. We ask whether AI software engineering agents reproduce this pattern. We introduce BAIT-Bench, a benchmark of 36 tasks across five failure modes drawn from real incidents, where each task has a correct-looking solution that passes all tests but contains a latent flaw detectable only through post-hoc environment inspection. Evaluating eleven models, we find primary task success of 78-94% against latent task success of only 18-36%, a gap that model scale does not close. A ControlArena LLM overseer monitor fails to flag any trajectory, since the failures arise from omission rather than from suspicious action. Two interventions improve latent safety relative to the baseline: open-ended prompting raises latent success roughly 3x, and a pre-task environment scouting step roughly doubles it at no primary task cost. These results suggest that, for latent failures, richer context and broader task framing are more effective safety levers than tighter agent control.
Cite this work
@misc{ferrao2026passingtest,
  title={(HckPrj) Passing the Test, Breaking the System: Latent Failures in Software Engineering Agents},
  author={Jeremias Ferrao},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}


