Jun 21, 2026

CrescendoDefense: Securing Open-Source Language Models Against Multi-Turn Jailbreak Attacks in Asia

Mahek Nishant Vedant

CrescendoDefense is a lightweight runtime defense framework designed to mitigate multi-turn jailbreak attacks against open-source large language models. Unlike traditional jailbreak attempts, Crescendo-style attacks gradually escalate conversations through memory stacking, semantic drift, guard-lowering dialogue, and prompt disguising, making them difficult to detect using conventional prompt-level moderation.

The proposed framework combines three complementary layers: a Semantic Kinematic Detector that monitors conversational escalation, a Strategic Context Eviction module that disrupts adversarial memory accumulation, and a Semantic Response Auditor that intercepts unsafe outputs before delivery. Evaluated on a benchmark of 15 adversarial Crescendo-style attack scenarios and 7 benign or risky-benign conversations using Llama-3.2-3B-Instruct, CrescendoDefense reduced Attack Success Rate (ASR) from 86.67% to 26.67%, representing a 69.2% relative reduction in successful jailbreaks.

The framework is model-agnostic, requires no retraining, and offers a practical defense-in-depth approach for improving the safety of open-source AI systems deployed in resource-constrained environments.

Review Project

See Code

View Related Sprint

Reviewer's Comments

CrescendoDefense addresses a real and underdefended threat: multi-turn jailbreaks that distribute harmful intent across turns evade single-turn moderation. The three-layer architecture (trajectory detection -> context eviction -> response auditing) is logically sound, the ablation cleanly shows each layer's contribution, and the 69.2% ASR reduction without model retraining is genuinely useful for resource-constrained deployments like those in Asia.

The main concern is benchmark scale: 15 adversarial + 7 benign = 22 scenarios is small. The 28.57% FPR means 2 of 7 benign conversations are incorrectly blocked — but with n=7 that's highly sensitive to individual cases. A larger benign set (at minimum 50-100 conversations) is needed to trust the FPR estimate. Similarly, the 15 adversarial scenarios are all hand-constructed Crescendo attacks; it would strengthen generalizability claims to test against Crescendo attacks from an external benchmark or from a red-teaming session with others constructing the attacks. Suggested next steps: (1) Adaptive thresholding is correctly identified as future work — even a brief ablation over threshold values would show the ASR/FPR tradeoff curve. (2) The strategic context eviction module evicts intermediate turns; it would be useful to report whether this causes measurable degradation in response quality on benign conversations that genuinely need history (e.g., technical multi-turn problem solving). The paper discusses this as a theoretical risk but does not measure it.

Good hackathon contribution — the defense-in-depth framing and the Asia open-source deployment motivation are well-argued.

CrescendoDefense ships a three-layer runtime defense for Llama-3.2-3B-Instruct against Crescendo-style multi-turn jailbreaks, no weight modification required. Layer 1 runs a Semantic Kinematic Detector that tracks absolute risk, velocity, acceleration, and cumulative risk of the conversation embedding against predefined harmful-objective anchors. Layer 2 fires after Layer 1 triggers and compresses history to system prompt plus first, previous, and latest user turns to disrupt scaffolding. Layer 3 audits the response, comparing a joint prompt-completion-prefix embedding against unsafe-completion profiles and applying fast-path signature checks before delivery. On a 22-scenario benchmark (15 adversarial, 5 benign, 2 risky-benign), the full pipeline cuts ASR from 86.67% baseline to 26.67%, a 69.2% relative reduction, at FPR 28.57%. The Asia framing stays motivational. Empirically the framework targets resource-constrained open-source deployments where retraining is impractical, but the data itself is English-only.

The defense-in-depth design lands well. Mapping each layer to a specific Crescendo mechanism (memory stacking, guard-lowering dialogue, semantic drift, prompt disguising) is one of the cleaner threat-model decompositions I have seen at a weekend scope. Four kinematic signals on Layer 1 catch gradual escalation that single-turn moderation misses, and computing velocity, acceleration, and cumulative excess risk above a baseline beats a single similarity threshold in a non-trivial way. The ablation across raw, L1+L2, L1+L3, and full pipeline attributes contributions cleanly. The L1+L2 result (40.00% ASR, 0% FPR) is actually the most interesting row in the paper because it isolates trajectory monitoring plus context eviction from output auditing and lands at a zero-false-positive operating point that a real operator could deploy tomorrow. Splitting the code into per-layer modules plus a pipeline helps readability, and all-MiniLM-L6-v2 keeps runtime cost defensible for the deployment story.

A few things would meaningfully strengthen the contribution. The benchmark is small. With 15 adversarial and 7 benign or risky-benign cases, each benign sample weighs 14.3 percentage points of FPR, so the headline 28.57% number is hard to read as a real operating point rather than two unlucky misclassifications. Running at least one Asian-language attack set (Hindi, Hinglish, Mandarin, Vietnamese) inside the main paper, rather than deferring multilingual evaluation to Section 7, would matter the most. As it stands the Asia framing carries no empirical weight, and the embedding model leans predominantly English-trained, so claims about Asian deployment generalize on faith. An adaptive-attacker evaluation in which the adversary knows the anchor clusters and the eviction policy, and intentionally builds trajectories that stay below the cumulative threshold or re-injects evicted context via the latest-turn slot, would also help. The current setup only measures static Crescendo patterns, and the dual-use section already flags this risk without testing it. A head-to-head against an off-the-shelf moderation baseline (Llama Guard, OpenAI Moderation, or a single-turn classifier) would clarify whether the trajectory signals actually add value beyond stronger per-turn filtering.

The broader signal here matters. Runtime, weight-free defenses for open-source LLMs sit at exactly the layer most likely to be deployed in low-resource Global South contexts, where retraining and RLHF pipelines stay out of reach for most operators. The work surfaces a real gap. Multi-turn jailbreaks remain under-defended relative to single-prompt attacks, and the FPR-vs-ASR trade-off the paper exposes (Layer 3 buys ASR at a visible usability cost) is the right conversation for the field to have. The Asia and multilingual dimension stays aspirational in this submission. Closing it, even with a single non-English attack suite, would move the contribution from a sound English-only Crescendo defense study toward the multilingual open-source safety contribution the title promises.

Impact Potential and Innovation:

Multi-turn jailbreaks are a real and underdefended attack surface, and the model-agnostic runtime framing is genuinely useful for open-source deployment contexts where retraining isn't feasible, but the three components (trajectory monitoring, context eviction, output auditing) are each individually not new, and the "Asia" framing in the title is never substantiated by any Asia-specific evaluation, which is a missed opportunity given the hackathon's theme.

Execution Quality:

15 adversarial scenarios is too small to put much confidence in the 26.67% ASR figure, but the bigger problem is the 28.57% false positive rate on the full pipeline, meaning roughly one in three benign or borderline conversations gets incorrectly flagged, and this is largely glossed over rather than treated as the practical blocker it would be in any real deployment.

Presentation and Clarity:

The architecture is explained clearly and the results table is easy to read, but the "Asia" in the title does real damage to the paper's credibility since it implies a geographic evaluation that never happens, and the conclusion overstates what a 22-scenario test on a 3B model actually demonstrates.

Cite this work

@misc {

title={

(HckPrj) CrescendoDefense: Securing Open-Source Language Models Against Multi-Turn Jailbreak Attacks in Asia

author={

Mahek Nishant Vedant

date={

6/21/26

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026