Jul 27, 2025

A Mechanism for the Emergence of Superposition in Toy Models

Wu Shuin Jian, Choong Kai Zhe, Harshank Shrotriya

We conjecture that superposition, rather than a problem, is a key feature of neural networks that allows them to efficiently model high dimensional data with limited resources. We explore how Nonlinear Matrix Decomposition might provide insight into how superposition works and could lead to the creation of tools for interpretability.

Reviewer's Comments

The authors propose that nonlinear matrix decomposition (NMD), an iterative method to obtain a low-rank approximation of a sparse nonnegative matrix, may be a useful lens to understand the phenomenon of superposition in neural networks. As evidence they show that in a small instance of the “Toy Models of Superposition” experiment (with six sparse features and three hidden neurons), the basis vectors of the rank 3 matrix \Theta that one learns upon applying NMD to compress the 6-row data matrix to rank 3 resemble the feature directions found by the TMS experiment. The authors also speculatively propose that the stepwise loss function for TMS with (n=6, m=3) that appears during training (as discussed in section 5 of the TMS paper) could be explained by the model effectively discovering a higher-rank NMD representation of the dataset at each step.
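For readers unfamiliar with NMD, the idea summarized above (find a low-rank \Theta such that the sparse nonnegative data matrix is approximately max(0, \Theta)) can be sketched in a few lines of NumPy. This is a minimal illustration assuming a simple alternating scheme: fill the zero entries of the data with the nonpositive part of the current estimate, then take a truncated SVD. It is not necessarily the algorithm or the settings the authors used. The names `make_sparse_data` and `nmd_naive`, the synthetic data, and the sparsity level are hypothetical choices made for this sketch.

```python
import numpy as np

def make_sparse_data(n_features=6, n_samples=500, sparsity=0.8, seed=0):
    """Hypothetical TMS-style data: each entry is 0 with probability
    `sparsity`, otherwise uniform on (0, 1). Shape: (n_features, n_samples)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_features, n_samples))
    X[rng.uniform(size=X.shape) < sparsity] = 0.0
    return X

def nmd_naive(X, rank, n_iters=100):
    """Alternating sketch of NMD: look for a rank-`rank` Theta with
    X approximately equal to max(0, Theta)."""
    zero_mask = (X == 0)
    Theta = X.copy()
    for _ in range(n_iters):
        # Latent matrix Z equals X on positive entries; where X is zero,
        # Z may take any nonpositive value, so use min(0, Theta).
        Z = np.where(zero_mask, np.minimum(0.0, Theta), X)
        # Best rank-`rank` approximation of Z via truncated SVD.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Theta = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return Theta

X = make_sparse_data()
Theta = nmd_naive(X, rank=3)

# Compare against a plain rank-3 truncated SVD of X (no nonlinearity).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :3] * s[:3]) @ Vt[:3]

err_nmd = np.linalg.norm(X - np.maximum(0.0, Theta)) / np.linalg.norm(X)
err_svd = np.linalg.norm(X - X_svd) / np.linalg.norm(X)
print(f"relative error, max(0, Theta): {err_nmd:.3f}")
print(f"relative error, linear rank-3 SVD: {err_svd:.3f}")
```

The first pass of the loop reproduces the linear truncated SVD, and each later pass can only keep or lower the latent fit error, so the thresholded reconstruction should be no worse than the linear baseline on the same data.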

This work is conceptually strong. The proposed connection between superposition and NMD is new to me and backed by a crisp supporting experiment. The connection to AI safety is indirect (implicit through superposition being an open problem for mechanistic interpretability which in turn is arguably critical for AI safety), but this chain of logic is usually taken for granted in the interpretability community. The initial success of the idea in a small toy model is promising. I would however caution that this model is simple enough that many different interpretability proposals can be made to work in it, and positive results in larger TMS models (as a minimum that’s hopefully not too hard to run with existing code) as well as models beyond TMS would help to make a stronger case for a ‘smoking gun’ signal to pursue the project beyond the weekend hackathon. It could also be interesting to try to analytically understand why NMD and TMS match in the model that the authors studied.

While the idea is interesting and intuitively makes sense, the rigor is lacking. NMD and TMS are discussed as conceptually similar, but the experiments provide only visual justification and no quantitative metrics. Including such metrics would massively improve the quality of the paper.
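One possible quantitative metric, offered here only as an illustration, is to compare the pairwise cosine similarities between feature directions in the two solutions. Because cosines are unchanged by any rotation of the hidden space, this avoids having to align the two bases first. The sketch assumes each method yields one direction per feature, stored as a row of an (n_features, hidden_dim) array; the function names and the random stand-in data are hypothetical and do not come from the submission.

```python
import numpy as np

def cosine_gram(directions):
    """Pairwise cosine similarities between rows of `directions`,
    an array of shape (n_features, hidden_dim)."""
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def geometry_gap(dirs_a, dirs_b):
    """Mean absolute difference between the off-diagonal cosines of two
    sets of feature directions. 0 means identical feature geometry."""
    gram_a, gram_b = cosine_gram(dirs_a), cosine_gram(dirs_b)
    off_diag = ~np.eye(gram_a.shape[0], dtype=bool)
    return float(np.abs(gram_a - gram_b)[off_diag].mean())

# Stand-in data: six feature directions in a three-dimensional hidden space.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(6, 3))

# A rotated (or reflected) copy has the same geometry, so the gap is ~0.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
gap_rotated = geometry_gap(dirs, dirs @ Q)

# An unrelated random set of directions gives a clearly nonzero gap.
gap_unrelated = geometry_gap(dirs, rng.normal(size=(6, 3)))

print(f"gap vs rotated copy: {gap_rotated:.2e}")
print(f"gap vs unrelated directions: {gap_unrelated:.3f}")
```

In practice `dirs_a` would hold the learned TMS feature directions and `dirs_b` the corresponding NMD directions, with a baseline from random directions to judge how small the gap needs to be.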

The project observes that the target of toy models of superposition is equivalent to the target of nonlinear matrix decomposition. This simple but nice observation relates the two data properties of superposition and positive factorization (in general distinct and independently observed in ML data). I think this project would be significantly improved by an experiment: as the two approaches use very different learning algorithms, it would be quite interesting to compare their performance, the loss of Saul's algorithm in the TMS context, etc.

Note about the abstract: superposition is not considered a pathological behavior, but is specifically posited as a way that networks can compress sparse vectors. The connection between Saul's matrix factorization work and toy models of superposition is not (at least not in a clear way) indicative of either being a compression phenomenon, and the paper does not seem to motivate this.

"This report argues that superposition is not a pathological behavior, but a structural consequence of efficient representation under resource constraints."

I don't believe this is a controversial or novel claim. TMS has already made this claim & your paper seems to agree with that interpretation. To fix this (potential) contradiction, you could clarify what you mean by pathological & how TMS makes a different claim than above.

I did like the general connection of NMD w/ TMS as well as the step-loss function; however, it would've been good to include the step-loss function as a figure.

I am unsure about the usefulness of understanding how TMS is learned. TMS is reconstructing known features, whereas NNs are learning features and where to store them in a compressed way in order to predict the label. It's unclear if learning how TMS learns its features will generalize to the real case.

This entry proposes that superposition in neural networks can be described through the lens of nonlinear matrix decomposition (NMD), and finds empirical evidence that toy models of superposition and NMD produce very similar sets of overlapping features. This seems like a theoretical connection that could potentially be of interest for the mechanistic interpretability community. It would be great for this work to be extended further, and to see if NMD could inspire new interpretability methods that are competitive with sparse autoencoders.

Cite this work

@misc{
  title={(HckPrj) A Mechanism for the Emergence of Superposition in Toy Models},
  author={Wu Shuin Jian, Choong Kai Zhe, Harshank Shrotriya},
  date={7/27/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.