Jan 11, 2026

Think Right, Answer Wrong

Nilanjan Sarkar, Trisanu Bhar, Raman Rayaprolu, Margamitra Bhattacharya , Dheeraj Hegde

This project investigates sandbagging—strategic underperformance used to conceal latent capabilities—in large language models, motivated by the safety risk that models may appear harmless during evaluation while retaining hazardous knowledge. We evaluate sandbagging on the Weapons of Mass Destruction Proxy (WMDP) benchmark across three open-weight instruction models (Microsoft Phi-3, Mistral-7B, Qwen2.5). We show that safety-framed or evaluation-aware prompts can induce a consistent underperformance gap, indicating that accuracy-only evaluations can be misleading.

To detect concealed competence beyond surface accuracy, we develop a multi-layer detection framework spanning behavioral, robustness, and representation-level analyses:

Prompted sandbagging baseline: compare standard vs sandbagging prompts and quantify the accuracy drop on WMDP.

Chain-of-Thought (CoT) analysis: models often produce internally coherent reasoning that references the correct answer yet deliberately output an incorrect final option, suggesting capability is present but suppressed.

Noise injection detection: injecting Gaussian noise (whole-model weights and LoRA-targeted) can disrupt fragile deceptive behavior and sometimes increase sandbagged accuracy before degrading overall capability—consistent with “sandbagging is brittle, knowledge is robust.”

Representation engineering (CAA/LAT): using contrastive activations between standard vs sandbag prompts, we identify linearly separable “sandbagging directions” in activation space. Linear probes achieve very high separability (AUC near 1 in some layers), and CAA/LAT projections form cleanly separated clusters, supporting the hypothesis of a consistent internal sandbagging mode.

Key takeaway: Sandbagging is detectable most reliably when combining behavioral evidence (accuracy gaps) with robustness tests (noise sensitivity) and representation-level signals (linearly decodable modes and contrastive directions), while carefully controlling prompt formatting, which strongly affects results.

Review Project

See Code

View Related Sprint

Reviewer's Comments

Problem is interesting and the experiments are comprehensive.

Useful study covering multiple sandbagging detection approaches, with some nice empirical findings! A tighter contribution would either go deeper on one method with consistent conditions across experiments, or frame this as a comparative benchmark with a summary table. Extending to/discovering naturally-occurring sandbagging would be a great next step for this project

Cite this work

@misc {

title={

(HckPrj) Think Right, Answer Wrong

author={

Nilanjan Sarkar, Trisanu Bhar, Raman Rayaprolu, Margamitra Bhattacharya , Dheeraj Hegde

date={

1/11/26

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Feb 2, 2026