Jun 21, 2026

LatentGuard: Mitigating Multilingual Safety Bypass via Mid-Layer Latent Steering

Aishwarya Mukherjee, Surjit Chowdhary

Modern LLMs rely heavily on Reinforcement Learning from Human Feedback (RLHF) for safety alignment, a process that remains overwhelmingly English-centric. Consequently, when prompted in low-resource Indic languages like Bengali, models suffer from Safety Drift: they systematically fail to refuse harmful inputs or exhibit high refusal variance. Conventional mitigations attempt cross-lingual fine-tuning using synthetically translated data. However, this approach triggers Knowledge Collapse: superficial linguistic fluency remains, but factual integrity and geometric safety boundaries dissolve because the underlying representations are corrupted by poor token fragmentation. Recursive fine-tuning on these shattered representations accelerates degradation rather than preventing it.

Using mechanistic interpretability on Gemma 2 2B Instruct, we quantify this internal geometric decay. Our analysis reveals a late-stage safety bypass: the model successfully maps Bengali malicious intent to English safety anchors in its mid-layers, but experiences severe cross-lingual drift immediately prior to output generation. To address this structural failure, we introduce LatentGuard, a training-free inference intervention. By applying mid-layer latent steering, we actively project drifting Bengali representations back into the stable English safety subspace right before divergence occurs. In our evaluations, LatentGuard reduced the Bengali Attack Success Rate (ASR) on safety benchmarks by 20 percentage points while maintaining fluency. This approach dynamically restores cross-lingual safety boundaries without the computational costs, data scarcity, or Knowledge Collapse risks inherent to synthetic fine-tuning.

Reviewer's Comments

I really love this project. This is ambitious and impressively complete - the only project to run the full arc of diagnose, localize, intervene, and ablate, and you do it with real statistics (t(34)=4.2, p<0.001) and a clean layer x alpha ablation. The mechanistic story is well supported (drift bottoming out at layer 13, then diverging sharply by layer 25), and the "Language Bleed" observation is excellent, mature self-critique: noticing the steered model refuses in English rather than Bengali, recognizing this breaks accessibility for the very users it's meant to protect, and proposing orthogonal projection as the fix. The info-hazard notice and open dataset are good practice.

Strong, creative work - the orthogonal-projection version that disentangles safety from language is the most valuable next step.

LatentGuard addresses a real and underappreciated failure mode: safety alignment in LLMs degrades systematically for low-resource Indic languages like Bengali, and synthetic fine-tuning — the current standard mitigation — risks Knowledge Collapse by forcing models to memorize broken token representations. The paper's clearest contribution is the "illusion of alignment" finding: Gemma-2-2B-IT successfully maps Bengali harmful intent to English safety anchors at Layer 13 (cross-lingual drift = 0.047, ~95% cosine similarity), but this alignment unravels catastrophically by Layer 25 (drift = 0.510, ~49% similarity). This non-monotonic trajectory reframes the problem from "the model doesn't understand Bengali" to "the model understands but structurally diverges before output generation" — a meaningful mechanistic reframing with practical implications. The training-free inference intervention follows naturally from this diagnostic, and the three-phase structure (behavioral profiling → latent diagnostics → steering intervention) is well-organized and easy to follow.
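The drift figures quoted above are consistent with drift being one minus cosine similarity (0.047 gives roughly 95%, 0.510 gives roughly 49%). The sketch below assumes that definition, which the page does not state explicitly, and the function names are illustrative rather than taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_lingual_drift(h_en, h_bn):
    """ASSUMPTION: drift = 1 - cosine similarity between the English and
    Bengali hidden states for the same prompt at a given layer."""
    return 1.0 - cosine_similarity(h_en, h_bn)

# Drift values quoted in the review, keyed by layer.
reported_drift = {13: 0.047, 25: 0.510}

# Similarity implied by each drift value under the assumed definition.
implied_similarity = {layer: 1.0 - d for layer, d in reported_drift.items()}

for layer, sim in implied_similarity.items():
    print(f"layer {layer}: drift={reported_drift[layer]:.3f}, similarity={sim:.3f}")
```

Under this reading, a per-layer drift curve is just this function applied to paired English and Bengali activations at each layer.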

The primary limitation is scope. The failure cohort is 35 prompts from a single model and a single language. Whether the Late-Stage Safety Bypass generalizes to Hindi, Tamil, Telugu, or Marathi — or to Llama-3 or Mistral — is entirely open. The headline claim ("multilingual safety boundaries") outpaces what 35 Bengali prompts can support. Additionally, refusal detection relies entirely on pattern-matched indicator strings rather than human verification or LLM judging, which means the reported 20pp improvement (60% → 80% refusal rate) may partially reflect the model emitting English-language boilerplate rather than genuine cross-lingual safety restoration. A stratified human evaluation of even 10–15 steered responses would substantially strengthen the result.
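To illustrate the concern about pattern-matched refusal detection, here is a minimal sketch of an indicator-string detector that also records whether the refusal contains Bengali script. The indicator strings are hypothetical, not the paper's list; the only fixed fact used is that the Bengali Unicode block spans U+0980 to U+09FF.

```python
# Hypothetical indicator strings; the paper's actual list is not shown here.
BENGALI_SORRY = "দুঃখিত"  # Bengali word for "sorry"
REFUSAL_INDICATORS = ("i cannot", "i can't", "i am unable", "i'm sorry", BENGALI_SORRY)

def is_refusal(response, indicators=REFUSAL_INDICATORS):
    """True if any indicator string occurs in the response (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in indicators)

def contains_bengali(text):
    """True if any character falls in the Bengali Unicode block (U+0980..U+09FF)."""
    return any("\u0980" <= ch <= "\u09ff" for ch in text)

def classify(response):
    """Split detected refusals by whether they contain Bengali script."""
    if not is_refusal(response):
        return "no_refusal_detected"
    if contains_bengali(response):
        return "refusal_in_bengali"
    return "refusal_in_other_language"

print(classify("I cannot help with that request."))
print(classify("আমি " + BENGALI_SORRY))
print(classify("Sure, here is an overview."))
```

Reporting the two refusal buckets separately would show how much of the gain is English boilerplate. If the rates are taken over all 35 prompts, 60% and 80% correspond to 21 and 28 detected refusals, so a handful of reclassified responses would move the headline number noticeably.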

The Language Bleed finding — steered responses arriving in English despite Bengali input — is disclosed honestly and is the paper's most practically significant limitation. The orthogonal projection approach suggested in future work (projecting only the safety-relevant component of the steering vector, orthogonal to the English-Bengali language direction) is exactly the right next step and should be the first priority for follow-on work.
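The orthogonal projection idea can be sketched in a few lines: subtract from the steering vector its component along the English-Bengali language direction, leaving only the part orthogonal to it. The vectors below are toy values, and how the language direction would be estimated in practice is left open; this is an illustration of the linear algebra, not the authors' implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_component(vector, direction):
    """Return `vector` minus its projection onto `direction`.

    The result is orthogonal to `direction`.
    """
    scale = dot(vector, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(vector, direction)]

# Toy example (hypothetical numbers): the first axis stands in for the
# language direction, the second for a safety-relevant direction.
steering = [3.0, 4.0, 0.0]
language_direction = [1.0, 0.0, 0.0]

safety_only = remove_component(steering, language_direction)
print(safety_only)                           # [0.0, 4.0, 0.0]
print(dot(safety_only, language_direction))  # 0.0
```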

The paper acknowledges that multilingual safety disparities and latent steering are both established in the literature; its contribution is a case study of a specific English/Bengali gap in a single Gemma model.

For each of the 150 parallel prompt pairs, the metrics used were refusal mismatch, token fragmentation, and confident hallucination. Only the first of these is actually a measure of safety; the other two are more about low capability. To improve this work, I would be interested to see it repeated with a variety of models and more diverse prompts.

The current steering vector is computed from the same 35-prompt failure cohort used for evaluation, which risks overfitting.
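One simple way to reduce this risk is to fit the steering vector on one part of the cohort and evaluate on the rest. The sketch below shows only the split; the 25/10 sizes and the seed are illustrative choices, not values from the paper.

```python
import random

def split_cohort(n_prompts, n_fit, seed=0):
    """Shuffle prompt indices reproducibly and split them into
    a fitting set (for the steering vector) and a held-out evaluation set."""
    indices = list(range(n_prompts))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_fit]), sorted(indices[n_fit:])

# 35 is the cohort size reported in the paper; 25 is an arbitrary example.
fit_ids, eval_ids = split_cohort(n_prompts=35, n_fit=25)
print(len(fit_ids), len(eval_ids))
```

With only 35 prompts, repeating the split over several seeds (or using leave-one-out) would give a more stable estimate than a single partition.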

In the methods, the paper says: "To create our target distribution for Phase 2 and 3, we analyzed a 150-prompt baseline and isolated a 35-prompt failure cohort. instances of verified asymmetric failure where English prompts triggered refusal while the Bengali translation triggered compliance."

Later, in the steering results, the paper says the intervention was applied to the 35 failure prompts and that the baseline refusal rate before steering was already 60.0%. This was confusing, because I would have expected the baseline to be 0% refusal (100% harmful compliance) as per the cohort definition. The paper should use terms like "failure cohort" consistently and separate out output degeneration, refusal, steering, and token fragmentation. A cohort flow diagram, perhaps a Sankey diagram, would really help show what happened.

I would be excited to see validation experiments. Negative controls, especially a random vector, a shuffled prompt-label vector, and a same-norm vector unrelated to Bengali-English differences, would be valuable in making this result strong.
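Two of these controls can be sketched as follows. The sketch assumes, purely for illustration, that the steering vector is a difference of mean activations between two labeled groups; the page does not say how the vector was actually built, and all names here are hypothetical.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def random_same_norm(reference, seed=0):
    """Random Gaussian direction rescaled to the norm of `reference`."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in reference]
    scale = norm(reference) / norm(raw)
    return [x * scale for x in raw]

def mean_difference(activations, labels):
    """ASSUMPTION: steering vector = mean(label 1) - mean(label 0)."""
    dim = len(activations[0])
    pos = [a for a, l in zip(activations, labels) if l == 1]
    neg = [a for a, l in zip(activations, labels) if l == 0]
    mean_pos = [sum(a[i] for a in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(a[i] for a in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def shuffled_label_control(activations, labels, seed=0):
    """Same construction as the steering vector, but with labels permuted."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return mean_difference(activations, shuffled)

# Toy activations (hypothetical): two per group, three dimensions.
acts = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
labels = [1, 1, 0, 0]

steer = mean_difference(acts, labels)
control = random_same_norm(steer, seed=1)
print(norm(steer), norm(control))
```

If steering with `control` or with the shuffled-label vector at the same layer and alpha raised the detected refusal rate as much as the real vector, the effect would be attributable to perturbation magnitude rather than to the Bengali-English direction.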

Cite this work

@misc {
title={(HckPrj) LatentGuard: Mitigating Multilingual Safety Bypass via Mid-Layer Latent Steering},
author={Aishwarya Mukherjee, Surjit Chowdhary},
date={6/21/26},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026