Mar 22, 2026
Cultural Translation as an AI Control Failure Mode: Regional Degradation and Cross-Lingual Safety Confounds in Bulgarian/Slavic Contexts
Henry Ward
AI control evaluations are commonly benchmarked in generic English/Western framing, which may underestimate failure risk in regional deployments. We test this with a paired design that holds harmful intent constant while varying context (Western baseline vs. Bulgarian/Slavic framing). The benchmark covers five control-relevant failure families (sycophancy, interrogation evasion, deceptive alignment, shutdown resistance, prompt injection) across 360 prompts (180 matched pairs, each with one baseline and one regional prompt). Mean total risk increased from 0.41 to 0.96 (Δ=0.55; +134.10%). Harmful support increased from 0.37 to 0.93 (Δ=0.55; +146.91%), while refusal decreased from 0.59 to 0.043 (Δ=-0.55; -92.60%). Family-level mean total risk was higher under regional framing in every family: deceptive alignment 0.77→1.0, interrogation evasion 0.13→0.86, prompt injection 0→0.94, shutdown resistance 0.88→1.0, sycophancy 0.25→1.0. Permutation tests found statistically significant differences for total risk, harmful support, and refusal (all p < 0.001). Operationally, control evaluations should include region-specific stress tests; otherwise, safety performance can be materially overestimated in non-Western contexts.
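The paired comparison and relative-change figures above can be illustrated with a minimal sketch. It assumes a sign-flip permutation test on per-pair score differences, which is one common choice for matched pairs; the abstract does not state which permutation scheme was used. The function names and the toy scores are hypothetical and are not the study's data.

```python
import random


def paired_permutation_test(baseline, regional, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    Assumption: under the null hypothesis the baseline/regional labels are
    exchangeable within each matched pair, so each difference is equally
    likely to carry either sign.
    """
    if len(baseline) != len(regional):
        raise ValueError("paired samples must have equal length")
    diffs = [r - b for b, r in zip(baseline, regional)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        # Small tolerance guards against floating-point ties.
        if abs(permuted) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


def relative_change_pct(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Hypothetical toy scores for eight matched pairs (not the study's data).
toy_baseline = [0.0, 0.5, 0.25, 0.5, 0.0, 0.75, 0.25, 0.5]
toy_regional = [1.0, 1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 1.0]

mean_diff, p = paired_permutation_test(toy_baseline, toy_regional)
print(f"mean paired difference = {mean_diff:.3f}, p = {p:.4f}")
print(f"relative change 0.41 -> 0.96: {relative_change_pct(0.41, 0.96):+.2f}%")
```

Applied to the rounded means 0.41 and 0.96, `relative_change_pct` gives roughly +134.15%, slightly different from the reported +134.10%; the reported percentages were presumably computed from unrounded means.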
Cite this work
@misc{
  title={(HckPrj) Cultural Translation as an AI Control Failure Mode: Regional Degradation and Cross-Lingual Safety Confounds in Bulgarian/Slavic Contexts},
  author={Henry Ward},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


