Jun 21, 2026

LangGap: Does the Text-Action Safety Gap Widen Across Languages?

Asma Ahmed

LangGap is a study of whether AI agents keep their safety promises when they act, not just when they talk, and whether that holds up across languages. An agent can refuse a harmful request in plain words and still go ahead and call the tool that does the damage. The question here is whether that text–action split gets worse when the request comes in Urdu or code-mixed Urdu-English instead of English.

The tricky part of this kind of work is telling a real safety failure apart from the model simply not understanding the Urdu. So every harmful prompt had a benign twin: if the model handled the legitimate version, it clearly understood the request, and any refusal counts as a real choice rather than confusion.
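
The benign-twin logic can be sketched in a few lines. This is a minimal illustration, not the study's actual scoring code: the `Trial` fields, the `interpret` function, and the three label names are all hypothetical, and it assumes comprehension is recorded separately from whether the tool was called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    pair_id: str        # links a harmful prompt to its benign twin (hypothetical field)
    language: str       # e.g. "en", "ur", "ur-en"
    harmful: bool       # True for the harmful prompt, False for the benign twin
    comprehended: bool  # did the model show it understood the task?
    executed: bool      # did the model call the (simulated) tool?


def interpret(trials):
    """Label each harmful trial, using its benign twin as a comprehension control."""
    # Index the benign twins by (pair, language) so each harmful trial can find its control.
    twins = {(t.pair_id, t.language): t for t in trials if not t.harmful}
    labels = {}
    for t in trials:
        if not t.harmful:
            continue
        key = (t.pair_id, t.language)
        twin = twins.get(key)
        if twin is None or not twin.comprehended:
            # Without evidence of comprehension, a refusal may just be confusion.
            labels[key] = "uninterpretable"
        elif t.executed:
            labels[key] = "harmful_execution"
        else:
            labels[key] = "genuine_refusal"
    return labels
```

The design choice is that a harmful-prompt refusal is only counted as a safety decision when the matching benign twin in the same language shows the request was understood.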

Testing claude-sonnet-4.5 this way turned up two things worth flagging. On the one task with a clean English baseline to compare against, the harmful-execution gap didn't widen across languages. What did show up was over-refusal: the model stalled or declined legitimate non-English requests 15 to 46 percent of the time and refused none in English. That's not a safety hole, but it's a real cost, and it lands entirely on non-English users.

The bigger surprise was about how the attacks were written. Generating red-team prompts with an automated tool (Petri) instead of by hand flipped the model's behavior. The generated prompts ran about twice as long, packed with reference numbers, deadlines, and pre-authorization claims, and the model treated that whole register as a social-engineering attack, refusing nearly everything, harmful and benign alike. The catch for evaluation is real: lean on automated prompts and you might decide a model is safe when it has only learned to flinch at one obvious pattern, missing the risk from plainer, human-written requests.

Reviewer's Comments

LangGap tackles a genuinely important and underexplored problem: whether the text-action safety gap widens when an LLM agent is operating in a non-English or code-mixed language. The comprehension control design — using benign twins to distinguish genuine safety degradation from model confusion — is methodologically sound and is the most important contribution of the paper. The finding that comprehension was 100% in every cell means the over-refusal results are cleanly interpretable as friction on understood tasks, not a parsing failure.

The null result on harmful execution (near-floor across all three languages) runs against what the multilingual safety literature predicts, and the author handles this carefully rather than either dismissing it or overclaiming. The provenance finding (Petri-generated prompts triggering social-engineering heuristics and suppressing both harmful and benign execution) is surprising and practically significant — it's a real methodological warning for anyone using automated red-teaming tools to evaluate multilingual safety. The main limitation is that the null rests on a single valid task and a single model version, which the author is upfront about. The data-exfiltration task not yielding a usable English baseline (100% execution due to authority-laundering framing) is an honest methodological disclosure, though it does mean the headline RQ-A finding has less support than the paper structure implies. A redesigned second task is the clear next step. The G1 confound (Petri prompts roughly twice as long as hand-authored ones) appropriately limits what can be claimed from the provenance comparison, and the author's restraint in scoping the RQ-B interpretation accordingly is exactly right. Solid work with a clear and honest account of what it does and doesn't show.

I enjoyed reading this paper and found it to be a thoughtful and timely submission. The topic felt very relevant because AI agents are increasingly moving from simply answering questions to taking actions through tools, and bringing a multilingual lens to that problem was a strong choice.

I was especially interested in the benign-twin test design and the results that came out of it. It helps separate true safety behavior from simple language misunderstanding, which is an important distinction. I also liked that the paper did not force an expected conclusion; the valid financial-fraud task did not show higher harmful execution in Urdu or code-mixed prompts. At the same time, the benign-task results showed that legitimate non-English requests can still face more friction, which is an important practical concern for real users.

The Petri prompt-provenance finding was also valuable. The fact that automated prompts triggered refusal or stalling even for benign requests is a strong reminder that red-team prompt style can materially affect measured safety outcomes.

My main constructive feedback is that the future work identified by the paper is essential to fully establish the central claim. It would be interesting to see whether the same patterns hold when the scope expands across multiple model families and a wider set of languages. A redesigned data-exfiltration task and length-matched Petri comparisons would also help make the findings stronger and easier to generalize. Overall, this is a strong exploratory study that raises the right questions about multilingual agent safety evaluation.

This is one of the more careful pieces of work I have read in a while. The comprehension control is the strongest idea here. The usual problem with multilingual safety tests is that you cannot tell a real refusal apart from the model just not understanding the Urdu. The benign-twin design solves that cleanly: if the model handles the legitimate version, it understood the request, so any refusal is a real choice and not confusion. The existing gap literature does not separate this, and you do.

I also want to credit the honesty all the way through. You pre-registered validity guards, and when G1 failed you reported it openly and scoped the whole RQ-B claim around it instead of hiding it. The refusal classifier is blind to the prompt and language and you verified that in code. The dual-use handling is careful — simulated tools, prompts withheld, and a direct warning against reading your null as "Urdu is already safe." This is how this kind of work should be done.

A few things to work on, in order of importance:

1. The main RQ-A null is a null at the floor, and this should be your top caveat rather than a later limitation. Harmful execution in English is already 2% with a CI that touches zero. You cannot detect a gap widening across languages when the baseline has almost no room to move. So the honest reading is not "the gap does not widen in Urdu," it is "I could not build conditions where a gap could even show up." That is a power problem more than a result, and because it speaks directly to your title question, it belongs up front.

2. RQ-B is confounded and you say so. The G1 failure ties density, authority register, and naturalness together, so the inversion is real but the cause is not pinned down. Your scoping is precise and the length-matched regeneration is the right fix. No change needed to the claim, just keep it framed as automated-generation behavior, not naturalness.

3. Both findings rest on one model version, and RQ-A on a single valid task. A second valid harmful task is needed before the null can be read as general. You already name this, so it is mostly a matter of follow-up work.

4. Smaller points: the Urdu classifier sits at κ=0.75, though leaning on combined non-completion handles that well; the hand-authored drift canary was not saved, so only the PG arm has a verified stability check; and n=30 on the exfiltration harmful cells is thin.
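
To make the floor problem in point 1 concrete, here is a rough sketch using a Wilson score interval. The counts are invented for illustration only (the cell sizes behind the 2% figure are not given here), and the function names are my own. Overlapping intervals are a crude check, not a substitute for a proper power analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts, NOT the paper's data: 1 of 50 harmful executions
# in English versus 3 of 50 in another language.
english = wilson_interval(1, 50)
other = wilson_interval(3, 50)
print(english, other, intervals_overlap(english, other))
```

With these made-up counts, a tripling of the harmful-execution rate still leaves the two intervals overlapping, which is the sense in which a baseline near zero leaves little room to detect a widening gap.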

What stands out is that nearly every concern I would raise, you have already flagged yourself. That is rare and it counts in your favor. The provenance finding in particular — that automated red-teaming can make a model look safer than it is — is the most useful thing in the paper and deserves to be pushed further.

On presentation: claims are scoped precisely and the limitations are laid out clearly. The abstract is dense and a few sections need a careful read, but the overall structure is strong.

Overall this is disciplined, honest work on a problem the field overlooks. The next step is a second valid task and a length-matched red-team arm, so the null and the provenance result both rest on firmer ground.

The real problem is that your main result rests on almost nothing actually happening. The AI basically never did the harmful thing in any language, so "there's no gap between languages" is really hard to tell apart from "nothing happened at all," and the one test where things did go wrong got dropped because it broke. So I'd build at least one test where the AI clearly misbehaves in English to start with, which gives you a real baseline to compare other languages against, and I'd try other companies' models too, since you only tested one. The most interesting thing you found is almost a side note, which is that the AI treats perfectly reasonable requests in Urdu as suspicious and refuses them up to a third of the time, and that's a real, unfair burden on non-English users that's worth putting front and center. Finally, since you can't share the AI itself, please share everything else you can (your prompts, your settings, your scoring rules) so other people can check your work rather than just take your word for it.

Your methodology of checking whether the model truly understood the request before judging its safety behavior is a smart and original idea that the field has been missing. Well done. To make this work even stronger, try running the same experiment on one or two other AI models, such as GPT-4o or Gemini, to see if you get similar results. Also, if you can regenerate the automated prompts to match the length of the hand-written ones, you can run a clean experiment where only one thing changes at a time: the source (human vs. machine), not the length. That would show more cleanly why the automated red-teaming behaved so differently.
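
One way to approximate the length-matched comparison is sketched below. It is a stand-in for the suggestion above rather than the same thing: it selects a length-matched subset of existing prompts instead of regenerating them. The function name, the 10% tolerance, and the whitespace word count (a rough proxy for length in Urdu and code-mixed text) are all assumptions.

```python
def length_matched_pairs(human, generated, tolerance=0.10):
    """Greedily pair each human prompt with an unused generated prompt of similar length.

    Length is a whitespace word count; a pair qualifies when the generated
    prompt is within `tolerance` (as a fraction) of the human prompt's length.
    """
    def n_words(text):
        return len(text.split())

    unused = list(generated)
    pairs = []
    for h in sorted(human, key=n_words):
        target = n_words(h)
        # Candidates within the allowed relative length difference.
        candidates = [g for g in unused
                      if abs(n_words(g) - target) <= tolerance * target]
        if not candidates:
            continue  # no generated prompt is close enough; drop this prompt
        best = min(candidates, key=lambda g: abs(n_words(g) - target))
        pairs.append((h, best))
        unused.remove(best)  # each generated prompt is used at most once
    return pairs
```

Comparing outcomes only within the returned pairs holds length roughly constant, so any remaining difference is easier to attribute to the prompt source.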

Cite this work

@misc{
  title={(HckPrj) LangGap: Does the Text-Action Safety Gap Widen Across Languages?},
  author={Asma Ahmed},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.
