Jul 28, 2025

Comments & Extensions of Subliminal Learning

Clark Miyamoto, Tets Nagashima, Alex Jiang

Here we perform empirical studies of subliminal learning. We find evidence that representations are important for subliminal learning to occur.

Reviewer's Comments

The authors focus on a nice, well-scoped project that is achievable within the constraints of a hackathon. They study a toy version of subliminal learning in which a student model is distilled using *only* the "auxiliary logits" from a trained teacher model, possibly on a different dataset; the auxiliary logits are designed to be completely meaningless. Nevertheless, they find that teacher accuracy can be transmitted to the student via these meaningless auxiliary logits.
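To make this setup concrete, here is a minimal sketch (not the authors' code) of how such an experiment could be wired up in PyTorch. The architecture, the number of auxiliary logits, the MSE distillation loss, and the shared teacher/student initialization are illustrative assumptions rather than details taken from the submission.

import copy
import torch
import torch.nn as nn

class MLPWithAux(nn.Module):
    """One-hidden-layer MLP with 10 class logits plus n_aux auxiliary logits."""
    def __init__(self, width=1024, n_aux=32):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(784, width), nn.ReLU())
        self.class_head = nn.Linear(width, 10)   # trained with labels (teacher only)
        self.aux_head = nn.Linear(width, n_aux)  # never given a meaningful target

    def forward(self, x):
        h = self.body(x)
        return self.class_head(h), self.aux_head(h)

teacher = MLPWithAux()
student = copy.deepcopy(teacher)  # assumption: student shares the teacher's initialization
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Step 1 (not shown): train teacher.class_head with cross-entropy on a labelled dataset.
# Step 2: distill the student using *only* the teacher's auxiliary logits,
#         possibly on a different (even unlabelled) dataset.
def distill_step(x):
    with torch.no_grad():
        _, aux_target = teacher(x)
    _, aux_pred = student(x)
    loss = nn.functional.mse_loss(aux_pred, aux_target)  # no labels, no class logits
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Step 3: evaluate the student's *class* accuracy; any gain over an undistilled
#         baseline is the subliminal transfer studied here.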

It would have been nice to have a short discussion of how this toy version of subliminal learning relates to that of Cloud et al. (I lean towards them being related, but I'm not sure). In any case, the authors' definition of "subliminal learning" is interesting enough to warrant study on its own merits.

They present some preliminary evidence that increasing width decreases subliminal learning (under their definition). The most compelling evidence is Figure 4, where increasing the number of auxiliary logits increases subliminal learning for student models of width <= 2048 but barely affects the width-4096 model. The authors draw a nice connection to the NTK as a hypothesis that may explain this result.
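For readers unfamiliar with the hypothesis the reviewer refers to, the standard lazy-training/NTK picture (background knowledge, not a result from the submission) says that as width grows, training under squared loss approximately follows the linearized dynamics

% Lazy-training / NTK dynamics (standard background, not from the submission).
% X, Y: training inputs and targets; \eta: learning rate;
% \Theta: the neural tangent kernel, fixed at initialization in the infinite-width limit.
f_t(x) \;\approx\; f_0(x) \;+\; \Theta(x, X)\,\Theta(X, X)^{-1}\left(I - e^{-\eta\,\Theta(X,X)\,t}\right)\left(Y - f_0(X)\right)

Because the kernel, and hence the hidden representations, barely move during training in this regime, there is little learned feature structure for the auxiliary logits to carry, which is consistent with the flattening the reviewer notes at width 4096.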

Some of the results and presentation are preliminary and rough around the edges (for example, the trend in Figure 3 is not very clear; it would be nice to have some sort of p-value), but this is to be expected for a hackathon. The submission's strongest selling point is its focus on a minimal example of a "subliminal learning-type phenomenon" which still yields interesting results. This could be very useful for eventually building up to a tractable "physics of DL"-style theory of subliminal learning.

The project presents a replication and extensions of recent experiments investigating subliminal learning. The experiments are well-motivated and rigorous, and appear to lay a solid foundation for further investigations of the origin of subliminal learning. The project could be improved by making the connection with physics more explicit. The possible connection between feature learning and subliminal learning that was briefly mentioned is interesting, and could be a fruitful topic to analyze further in future work.

Excellent result. I've seen results like this for LLMs that were somewhat vague, and this is a very reasonable attempt to explore the phenomenon for MLPs on the MNIST dataset. A very good result, especially within three days; the code is good as well, and the write-up is clear.

This is a timely project, well scoped for a hackathon, exploring subliminal learning in vision models as width, auxiliary logits, and noise are varied in a clean, reasonably 'physics-like' programme of controlled experiments. The results (representations should match; accuracy flattens as width approaches the NTK regime) were not surprising, but they are a good sanity check. Follow-up work could explore mechanisms for why this happens, or safety-relevant limits to subliminal learning (e.g., the finite-width onset of NTK-like behavior) or other failure modes.

The paper could benefit from a clearer abstract and more introduction/background on the safety relevance, and the connection to physics could have been stronger (as it stands, it was somewhat implicit). The authors also seem surprised that their accuracy plateau didn't match the original paper, but since the two use different metrics (MSE loss vs. KL divergence), I wouldn't expect them to capture the same information. In particular, KL maximizes information transfer, so the MSE is likely missing some small auxiliary logits that nevertheless carry subliminal signal.
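As a concrete illustration of the metric difference the reviewer raises (standard distillation losses, not code from either paper): KL divergence is computed between softened probability distributions, so every logit contributes through the softmax, whereas MSE on raw logits is dominated by the largest-magnitude entries and can effectively ignore small auxiliary logits.

import torch
import torch.nn.functional as F

def kl_distill_loss(student_logits, teacher_logits, tau=1.0):
    # KL between softened distributions: every logit matters via the softmax.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

def mse_distill_loss(student_logits, teacher_logits):
    # MSE on raw logits: gradients scale with the logit gap, so small
    # auxiliary logits contribute little to what the student learns.
    return F.mse_loss(student_logits, teacher_logits)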

Cite this work

@misc{miyamoto2025subliminal,
  title        = {(HckPrj) Comments \& Extensions of Subliminal Learning},
  author       = {Clark Miyamoto and Tets Nagashima and Alex Jiang},
  date         = {2025-07-28},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.