Probing Jailbreak Brittleness: Capability Limits vs Alignment Failures in Small Language Models
Baisayan Bhattacharya, Sankeerthana Satini, Harshprabha, Devansh Gupta
This project investigates whether smaller language models are more susceptible to jailbreak attacks than larger models, and whether this vulnerability is due to capability limitations or alignment brittleness. We evaluate instruction-tuned Mistral and Qwen models across XSTest and a modified version of AdvBench. For each prompt, we extract final-token hidden activations from all model layers during inference and train layer-wise linear probes to test whether safe and unsafe prompts are internally separable before generation. We then compare the best probe’s internal risk prediction with the model’s generated response behaviour. This allows us to distinguish missed internal risk, where unsafe compliance occurs without probe-detected risk, from known-risk unsafe responses, where risk is internally decodable but the model still complies. Our results show that unsafe prompt information is often linearly decodable before generation, and that many unsafe compliance cases occur despite probe-detected risk. While scaling improves safety within the Mistral family, the model family also plays a major role, with Qwen models demonstrating stronger refusal behaviour. Overall, our findings suggest that jailbreak susceptibility is often better explained by brittle translation from internal risk representations to output behaviour than by absent risk detection alone.
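The comparison between probe prediction and response behaviour described above can be sketched as a small decision rule. This is a minimal illustration, not the authors' implementation: the `Record` fields, the `categorize` and `tally` helpers, and every label string are hypothetical names introduced here. It assumes the best-layer probe prediction and a compliance judgement are already available for each prompt.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    # All field names are assumptions made for this sketch.
    prompt_is_unsafe: bool   # ground-truth dataset label for the prompt
    probe_flags_risk: bool   # best-layer probe prediction, made before generation
    model_complied: bool     # judged from the generated response


def categorize(r: Record) -> str:
    """Assign a hypothetical outcome label to one prompt/response pair."""
    if not r.prompt_is_unsafe:
        # Safe prompts: complying is the desired outcome, refusing is over-refusal.
        return "safe_complied" if r.model_complied else "over_refusal"
    if not r.model_complied:
        # Unsafe prompt that the model refused.
        return "unsafe_refused"
    # Unsafe compliance: split by whether the probe decoded risk beforehand.
    return "known_risk_unsafe" if r.probe_flags_risk else "missed_internal_risk"


def tally(records):
    """Count how many records fall into each outcome label."""
    return Counter(categorize(r) for r in records)


# Toy records, made up purely to exercise the function (not project results).
toy = [
    Record(prompt_is_unsafe=True, probe_flags_risk=True, model_complied=True),
    Record(prompt_is_unsafe=True, probe_flags_risk=False, model_complied=True),
    Record(prompt_is_unsafe=True, probe_flags_risk=True, model_complied=False),
    Record(prompt_is_unsafe=False, probe_flags_risk=False, model_complied=True),
    Record(prompt_is_unsafe=False, probe_flags_risk=True, model_complied=False),
]
counts = tally(toy)
```

Under this rule, a high share of `known_risk_unsafe` relative to `missed_internal_risk` would correspond to the brittle-translation reading in the abstract, while the reverse would point toward absent risk detection.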
Cite this work
@misc{
  title={(HckPrj) Probing Jailbreak Brittleness: Capability Limits vs Alignment Failures in Small Language Models},
  author={Baisayan Bhattacharya, Sankeerthana Satini, Harshprabha, Devansh Gupta},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


