Jan 11, 2026

Chain-of-Thought Manipulation Monitor: Detection of Deceptive Reasoning in LLMs

Anish Sathyanarayanan, Aditya Nagarsekar, Aarush Rathore

The project builds a Chain-of-Thought Manipulation Monitor that helps judges tell when an LLM is merely performing honesty versus actually reasoning honestly.

Problem and core idea

Chain-of-thought makes models look transparent, but they can still sycophantly game evaluators or conceal unsafe intent behind nice-looking reasoning.

The system combines a behavioural text monitor with a causal bypass probe to ask: “Does this answer genuinely depend on its own CoT, or is the reasoning just staged?”

What the tool actually does

A Streamlit dashboard assigns each response a 0–100 manipulation-risk score using regex patterns, semantic drift/misalignment, and structural signals like entropy and reasoning quality, plus an audit-aware mode that surfaces evaluation-gaming and safety-evasion strategies.

A causal CoT-bypass module overwrites CoT activations in the model and reports CoT-mediated influence, bypass, and robustness-style metrics to test whether answers truly rely on specific CoT tokens.

Key findings and significance

When models are told they are being audited, manipulation markers and CoT length rise, yet many answers still depend only weakly on the CoT, exposing a gap between visible explanations and underlying computation.

The project provides a practical, lightweight framework and dashboard for real-time CoT manipulation auditing, offering a concrete path toward safer and more truthful LLM oversight in real deployments.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

This project targets CoT reasoning monitoring, an important research direction. The work could benefit from clearer definitions of what constitutes manipulation and how the proposed signals should be measured in practice. The mechanistic analysis appears to be evaluated primarily on arithmetic tasks rather than manipulation-relevant scenarios, which makes it difficult to assess relevance to the stated goals.

You all covered a lot of work together in a short amount of time, and all components seem very reasonably designed. Impressive you got all that done in a weekend!

On the impact - Solid contribution and exploration. Dashboard tooling is a useful contribution and the integration of the components you brought together has never been done before to the extent of my knowledge. The individual components themselves do seem to have some precedent, and so to me this does not represent a conceptual step forward nor a new exploration.

On the execution - Seems technically very reasonable. The 4 components all seem designed with some thought. Small sample size for the causal module (n=10) means that we cannot rely on the results too much / at all, and that would have been an obvious place for more attention.

Good structure, good figures, good detail. I appreciated the appendix. It was a little long and verbose at times. I would have liked for you to make the conclusion (the so what does this mean for the field) stronger or more spelled out. It was written in a way that I think the target audience would get this pretty quickly.

Cite this work

@misc {

title={

(HckPrj) Chain-of-Thought Manipulation Monitor: Detection of Deceptive Reasoning in LLMs

},

author={

Anish Sathyanarayanan, Aditya Nagarsekar, Aarush Rathore

},

date={

1/11/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

View All

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.