Jun 21, 2026

The Compliance Cliff is Language-Dependent: Constitutional AI Immunity Breaks Under Hindi Pressure

Rahul

AI models are often trained to be helpful and compliant, but excessive compliance can cause them to invent answers when the correct response is “I cannot determine from the provided information.” Recent work found that some Constitutional AI (CAI) models are highly resistant to this failure mode in English. In this paper, we ask whether that robustness transfers to other languages.

We evaluate four frontier language models on a controlled set of Hindi tasks where the answer is intentionally missing from the provided context. The only experimental change is the language of the compliance-forcing system prompt: English or Hindi. We find that Claude Haiku 4.5, which shows complete immunity to compliance-induced fabrication under English pressure, fabricates answers in 30% of cases when the same pressure is expressed in Hindi. In contrast, DeepSeek V3 remains robust across both languages, while LLaMA 3.3 70B and Qwen3-30B, already vulnerable in English, show no significant language-dependent change. Importantly, all models achieve perfect accuracy on answerable Hindi tasks, indicating that the observed failures are not caused by poor language understanding. Instead, the failure occurs specifically when compliance conflicts with honesty.

We also identify a simple mitigation: a one-sentence metacognitive reminder asking the model to verify whether the answer is actually present in the context. This intervention eliminates fabrication in our experiments while preserving task performance.

Our results reveal a previously undocumented language-dependent compliance cliff, where safety properties measured in English do not necessarily generalize to Hindi. These findings suggest that English-only safety evaluations can substantially overestimate real-world robustness in multilingual deployments and highlight the need for cross-lingual safety testing as a standard evaluation practice. To support verification and follow-up research, we release all artifacts required for full reproducibility, including the dataset, experimental pipeline, source code, raw model outputs, evaluation scripts, and statistical analyses.
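The two quantities the abstract relies on, fabrication on unanswerable items and accuracy on answerable items, can be scored separately. The following is a minimal sketch, not the released evaluation script: the `Record` fields, the `score` helper, and the toy records are all hypothetical and stand in for whatever format the released artifacts actually use.

```python
from dataclasses import dataclass

@dataclass
class Record:
    answerable: bool       # True if the answer is present in the context
    refused: bool          # True if the model said it cannot determine the answer
    correct: bool = False  # Only meaningful for answerable items

def score(records):
    """Return (fabrication_rate, answerable_accuracy) for a list of Records."""
    unanswerable = [r for r in records if not r.answerable]
    answerable = [r for r in records if r.answerable]
    # On an unanswerable item, any non-refusal counts as a fabrication.
    fabrication_rate = sum(not r.refused for r in unanswerable) / len(unanswerable)
    # On an answerable item, a refusal or a wrong answer counts as an error.
    accuracy = sum(r.correct and not r.refused for r in answerable) / len(answerable)
    return fabrication_rate, accuracy

# Toy data (hypothetical): a model that refuses everything.
always_refuse = (
    [Record(answerable=False, refused=True)] * 5
    + [Record(answerable=True, refused=True)] * 5
)
# Toy data (hypothetical): a model that refuses only when the answer is absent.
calibrated = (
    [Record(answerable=False, refused=True)] * 5
    + [Record(answerable=True, refused=False, correct=True)] * 5
)

print(score(always_refuse))  # (0.0, 0.0)
print(score(calibrated))     # (0.0, 1.0)
```

Both toy models score zero fabrications, so only the second number separates a useful mitigation from blanket refusal.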


Reviewer's Comments

The inference that CAI leads to the hallucination gap between English and Hindi is unreliable: there is no indication that Qwen3-30B or LLaMA 3.3 70B were trained without CAI (in fact, they very likely use RLAIF at large, and the boundary between RLAIF and CAI is very blurry).

Terms used in the report could be standardized; e.g. hallucination instead of "compliance-induced fabrication".

Statistical significance seems dubious given the small sample.

It would help reader calibration if the answerable and unanswerable questions were presented verbatim in the appendix. Currently we don't know in exactly what ways the unanswerable questions are unanswerable.

This project asks whether a known AI safety property — that Anthropic's Claude models refuse to make up answers even when instructed never to refuse — holds up when those instructions are given in Hindi rather than English. The team runs a controlled experiment across four AI models and finds that Claude Haiku 4.5 fabricates answers 30% of the time under Hindi pressure versus 0% in English, while the other (already weaker) models show no change. Crucially, they also show a single added sentence in the system prompt completely reverses the Hindi failure, making this both a safety warning and a deployable fix for the 600+ million Hindi speakers using these systems today.

Strengths

1. Important, focused problem. The paper picks a concrete and underexplored safety gap (AI models widely deployed for Hindi speakers are safety-certified only in English) and makes the stakes legible: a model that appears perfectly safe in testing could fabricate medical or agricultural advice in real deployments.

2. Smart experimental design. The team wisely checked that all four models handle Hindi just fine on answerable questions (445/445 correct), which rules out the simple explanation that "the model just doesn't understand Hindi" and isolates the pressure language as what's actually changing.

3. Fully reproducible and deployable. All 890 test records, the evaluation code, and a one-command reproduction script are publicly released, and the proposed fix is a single sentence requiring no retraining — so both the finding and its remedy are immediately usable by others.

Weaknesses

1. The Hindi and English pressure prompts weren't cross-checked for equal strength. Both were written by a single author, and if the Hindi version happens to sound more forceful or harder to push back on, the entire observed difference could be about prompt writing rather than language — a human rating of both prompts would close this gap.

2. The statistics treat 5 questions asked 10 times each as 50 independent data points. Repeating the same question doesn't give the same statistical confidence as asking 50 different questions, so the reported p-value is likely overstated and the true uncertainty is wider than shown.

3. The claim that only CAI models show this gap is hard to fully verify. The other models already fabricate roughly half the time in English, so they can't really get much worse in Hindi — the "no effect" result for them may just mean there was no room left to fall, rather than that they're genuinely immune to the language shift.

4. The one-sentence defense was never tested on questions the model should answer. A prompt that simply makes the model refuse more would also score zero fabrications on unanswerable items, so without checking that answerable questions still get correct responses, "eliminates fabrication" and "makes the model refuse everything" look identical from the data.
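Weakness 2 can be made concrete with a small calculation. The sketch below assumes 50 trials per condition (5 questions, 10 repeats each) with 15 fabrications under Hindi pressure and 0 under English pressure, which is what the 30% and 0% figures imply; it also assumes the same 5 questions were used in both conditions. The test the authors actually ran is not stated on this page, so Fisher's exact test is used here only as a stand-in for any analysis that treats every trial as independent.

```python
from math import comb

def fisher_two_sided(a, n1, c, n2):
    """Two-sided Fisher exact p-value for a/n1 vs c/n2 events."""
    k = a + c                     # total events across both groups
    total = comb(n1 + n2, k)

    def prob(x):                  # hypergeometric P(x events fall in group 1)
        return comb(n1, x) * comb(n2, k - x) / total

    observed = prob(a)
    lo, hi = max(0, k - n2), min(k, n1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )

# Trial-level view: 50 independent trials per condition (assumed counts).
p_naive = fisher_two_sided(15, 50, 0, 50)

# Question-level view: only 5 paired units. Under an exact sign-flip test,
# the most extreme outcome (all 5 questions shift the same way) has
# two-sided probability 2 * (1/2)**5, so p cannot fall below this floor.
n_questions = 5
p_floor = 2 * 0.5 ** n_questions

print(f"trial-level Fisher p = {p_naive:.2e}")
print(f"smallest possible question-level p = {p_floor}")
```

Under these assumptions the trial-level test reports a very small p-value, while a question-level analysis with 5 units cannot reach the conventional 0.05 threshold no matter how consistent the effect is. A mixed-effects model or a cluster bootstrap over questions would be other ways to respect the repeated-measures structure.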

Cite this work

@misc{
  title={(HckPrj) The Compliance Cliff is Language-Dependent: Constitutional AI Immunity Breaks Under Hindi Pressure},
  author={Rahul},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
