Jun 22, 2026

FAP: A Benchmark Dataset and Mechanism for Filtering Adversarial Payloads in Natural Language Prompts

Rabbika Azmi, Tanisha Ojha

We address a critical security gap in Text-to-SQL systems serving code-switched (Hinglish) users, where English-optimized defenses fail to detect SQL injection payloads embedded in multilingual prompts. We introduce FAP, a framework combining fine-tuned UniXcoder (structural code analysis) with zero-shot Qwen3-4B (semantic intent verification) via MLP fusion. Our dataset comprises 20,000 code-switched samples spanning four linguistic tiers (English, Hinglish Light/Natural/Heavy) derived from Spider and SQLShield. FAP achieves 81.73% recall and 84.03% F1-score, versus CodeBERT-only's 8.67% recall, demonstrating that semantic reasoning is essential for multilingual security. The work exposes a critical AI safety gap: existing defenses remain English-centric, leaving non-English speakers vulnerable to undetected linguistic attacks. Our framework and dataset are released to enable community research on multilingual database security.


Reviewers' Comments

This paper addresses an important and underexplored problem: securing code-switched (Hinglish/Hindi) Text-to-SQL systems against prompt injection and SQL attacks. The dual-encoder architecture combining structural and semantic analysis is well motivated, and the implementation is supported by a complete evaluation pipeline. The reported improvement in recall over baseline methods highlights the potential of the proposed approach.

The main limitation is that the benchmark is machine-generated and lacks native-speaker validation, while the key hypothesis—that security performance degrades as code-switching increases—has not yet been evaluated through the planned tier-wise analysis. Completing this analysis should be the highest priority. In addition, baseline comparisons should be performed on identical evaluation sets, and component-level metrics should be clearly distinguished from end-to-end benchmark results to avoid confusion.
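The tier-wise analysis requested above could be computed along the following lines. This is a minimal sketch, not the authors' harness: the function name `recall_by_tier`, the record layout `(tier, y_true, y_pred)`, and the toy records are all assumptions for illustration, and the toy values are placeholders rather than results from the paper.

```python
from collections import defaultdict


def recall_by_tier(records):
    """Return per-tier recall on the malicious class (label 1).

    `records` is an iterable of (tier, y_true, y_pred) tuples. This layout
    is a hypothetical one chosen for the sketch, not the FAP data format.
    """
    tp = defaultdict(int)  # malicious samples correctly flagged, per tier
    fn = defaultdict(int)  # malicious samples missed, per tier
    for tier, y_true, y_pred in records:
        if y_true != 1:
            continue  # recall only considers truly malicious samples
        if y_pred == 1:
            tp[tier] += 1
        else:
            fn[tier] += 1
    tiers = set(tp) | set(fn)
    return {tier: tp[tier] / (tp[tier] + fn[tier]) for tier in tiers}


# Placeholder records only; these are NOT measurements from the paper.
toy_records = [
    ("English", 1, 1), ("English", 1, 1), ("English", 0, 0),
    ("Hinglish Light", 1, 1), ("Hinglish Light", 1, 0),
    ("Hinglish Heavy", 1, 0), ("Hinglish Heavy", 1, 1),
    ("Hinglish Heavy", 1, 0), ("Hinglish Heavy", 0, 1),
]

per_tier = recall_by_tier(toy_records)
for tier, value in sorted(per_tier.items()):
    print(f"{tier}: recall = {value:.2f}")
```

Running every method through the same function on the same records would also address the request for identical evaluation sets.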

Overall, this is a promising contribution to multilingual AI security with a solid technical foundation, and additional validation would significantly strengthen its impact.

The problem is strong and genuinely neglected: code-switched (Hinglish) SQL-injection against Text-to-SQL systems is a real, specific blind spot in English-centric defenses, and it's well-scoped. The architecture (fine-tuned UniXcoder for structure + zero-shot Qwen3-4B for intent, fused via MLP) is sensible and the structure-vs-semantics framing is a reasonable hypothesis. A few things I can't assess but would need to before trusting the headline:

The CodeBERT 8.67% recall comparison looks like a strawman. CodeBERT zero-shot or unmodified on a Hinglish distribution it never saw would naturally collapse; the fair comparison is CodeBERT (or UniXcoder) fine-tuned on the same 20k samples. As stated, the 8.67%-vs-81.73% gap may largely measure "trained on the target distribution vs not," not "structure-only vs structure+semantics." The ablation that actually supports the thesis is UniXcoder-alone vs Qwen-alone vs fused — does the report have it?

81.73% recall is the number that matters, and it's not high for a security filter. Missing ~18% of injection payloads is a lot for a defense; recall, not F1, is the operative metric for a filter, and the abstract leads with F1. How does precision/FPR look, and what's the threshold trade-off?
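As a rough check on the precision question, the precision implied by the abstract's recall and F1 can be back-computed from the definition F1 = 2PR / (P + R). The sketch below assumes both figures refer to the same (malicious) class and the same evaluation set, which the report would need to confirm; the function name `implied_precision` is introduced here for illustration only.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2 * P * R / (P + R) for the precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("recall and F1 are inconsistent: no valid precision")
    return f1 * recall / denom


# Figures as reported in the abstract.
recall = 0.8173
f1 = 0.8403

precision = implied_precision(recall, f1)
miss_rate = 1 - recall  # share of malicious prompts that pass the filter

print(f"implied precision: {precision:.4f}")
print(f"miss rate:         {miss_rate:.4f}")
```

Under those assumptions the implied precision is about 86%, and the miss rate is about 18%, matching the figure quoted above. This does not replace a threshold sweep, which needs the per-sample scores.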

Dataset provenance. 20k samples "derived from Spider and SQLShield" across four Hinglish tiers — were the code-switched variants machine-generated or native-speaker-validated? If templated/auto-translated (as is common), the tiers may not reflect real Hinglish, and the model may be learning the generation artifact. This is the same authenticity question that mattered for MoMo.

No code verifiable here — and a benchmark+mechanism paper lives or dies on whether the dataset and harness are real and reproducible.

This project addresses an important and underexplored problem: security for multilingual and code-switched text-to-SQL systems. The focus on Hinglish prompts is valuable because many deployed AI systems are evaluated mostly in English, while real users often interact in mixed-language settings. The dataset contribution is the strongest part of the work, and the multi-perspective pipeline combining structural SQL features with semantic intent reasoning is a sensible direction.

The paper is also practically motivated: text-to-SQL systems can cause real harm if generated queries bypass authorization, modify data, or leak sensitive rows. Separating SQL generation from security classification is a good design choice, and the reported improvement in recall over the CodeBERT-only baseline suggests that semantic reasoning is important for this task.

The main weakness is evaluation clarity. Some reported metrics appear inconsistent: for example, FAP is described as maintaining a low false positive rate, but the confusion matrix indicates 68 false positives out of 1,400 benign samples, which is closer to 4.9%, not 0.36%. The false negative rate also appears inconsistent with the reported recall and confusion matrix. These numbers should be recalculated and presented consistently before submission. The comparison is also uneven because Qwen has fewer test samples than the other methods, and the “tier-wise performance” most relevant to the multilingual claim is still pending.
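The false positive rate discrepancy noted above can be reproduced with a few lines of arithmetic. This sketch uses only the two counts quoted in the review (68 false positives, 1,400 benign samples); the helper name `false_positive_rate` is an assumption for illustration.

```python
def false_positive_rate(false_positives: int, benign_total: int) -> float:
    """FPR = FP / (FP + TN), where FP + TN is the number of benign samples."""
    if benign_total <= 0:
        raise ValueError("benign_total must be positive")
    return false_positives / benign_total


# Counts quoted from the confusion matrix in the review.
fpr = false_positive_rate(68, 1400)
print(f"FPR from confusion matrix: {fpr:.2%}")

# Number of false positives that a 0.36% FPR would imply on 1,400 samples.
implied_fp = round(0.0036 * 1400)
print(f"false positives implied by 0.36%: {implied_fp}")
```

A 0.36% rate would correspond to roughly 5 false positives rather than 68, so at least one of the two figures must be wrong.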

The dataset construction also needs more detail and validation. Since Hinglish variants are LLM-translated from Spider, the benchmark may reflect the translator model’s style rather than real code-switched user behavior. Human validation by Hinglish speakers, tier-wise examples, and checks for label quality would strengthen the benchmark substantially. It would also help to separate prompt-injection detection, SQL-injection detection, schema exploration, and authorization bypass into distinct categories rather than one broad malicious label.

Overall, this is a promising and socially relevant hackathon project with a useful dataset direction, but the empirical claims need cleanup, stronger multilingual validation, and more realistic security evaluation before the framework can be considered production-ready.

Cite this work

@misc{
  title={(HckPrj) FAP: A Benchmark Dataset and Mechanism for Filtering Adversarial Payloads in Natural Language Prompts},
  author={Rabbika Azmi, Tanisha Ojha},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
