Mar 10, 2025
Superposition, but at a Cross-MLP Layers view?
Woon Yee Ng
To understand causal relationships between features extracted by sparse autoencoders (SAEs) across MLP layers, this study introduces the Coordinated Sparse Autoencoder Network (CoSAEN). CoSAEN combines sparse autoencoders for feature extraction with the PC algorithm for causal discovery, in order to trace path-based activations of features through the MLP layers.
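For readers unfamiliar with the PC algorithm, the following is a minimal sketch of its first phase (skeleton discovery) and is not the CoSAEN implementation. It assumes feature activations have already been extracted by an SAE and uses synthetic data as a stand-in; the names `early`, `mid`, `late`, and `pc_skeleton` are hypothetical. It is simplified in two ways: conditioning sets are limited to size 0 and 1, and the edge-orientation phase is omitted.

```python
import math
import random
from itertools import combinations


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after conditioning on a single variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def fisher_z_pvalue(r, n, cond_size):
    """Two-sided p-value for H0: the (partial) correlation is zero."""
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - cond_size - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def pc_skeleton(columns, alpha=0.01):
    """Simplified PC skeleton: drop an edge when a test of order 0 or 1
    fails to reject independence. Returns a set of undirected edges."""
    names = list(columns)
    n = len(columns[names[0]])
    r = {(a, b): corr(columns[a], columns[b])
         for a in names for b in names if a != b}
    edges = {frozenset(pair) for pair in combinations(names, 2)}

    # Order 0: marginal independence tests.
    for a, b in combinations(names, 2):
        if fisher_z_pvalue(r[a, b], n, 0) > alpha:
            edges.discard(frozenset((a, b)))

    # Order 1: condition on one variable still adjacent to a or b.
    for a, b in combinations(names, 2):
        if frozenset((a, b)) not in edges:
            continue
        for c in names:
            if c in (a, b):
                continue
            if frozenset((a, c)) not in edges and frozenset((b, c)) not in edges:
                continue
            pr = partial_corr(r[a, b], r[a, c], r[b, c])
            if fisher_z_pvalue(pr, n, 1) > alpha:
                edges.discard(frozenset((a, b)))
                break
    return edges


# Synthetic stand-in for SAE feature activations in three layers,
# generated as a chain: early -> mid -> late (an assumption for the demo).
rng = random.Random(0)
early = [rng.gauss(0, 1) for _ in range(2000)]
mid = [0.8 * e + rng.gauss(0, 1) for e in early]
late = [0.8 * m + rng.gauss(0, 1) for m in mid]

skeleton = pc_skeleton({"early": early, "mid": mid, "late": late}, alpha=0.001)
print(sorted(tuple(sorted(edge)) for edge in skeleton))
```

Because `early` influences `late` only through `mid` in this toy data, the early-late edge should be dropped once `mid` is conditioned on, while the two direct edges should survive. With thousands of SAE features the number of tests grows quickly, which is the high-dimensionality concern raised in the review below.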
Natalia Perez-Campanero
The project tackles an important problem: understanding feature causation across MLP layers in language models. The approach of combining Sparse Autoencoders (SAEs) with causal discovery (PC algorithm) seems novel and potentially promising for identifying how early-layer features influence later-layer features. The project is well written, if somewhat lacking in details. Understanding feature causation could help identify and mitigate biases, vulnerabilities, or unintended behaviors in language models, although there is not much explicit discussion of this in the write-up, making the AI safety element more implicit. I would encourage you to discuss more explicitly the safety risks that could be addressed using this framework. There is also little justification for choosing the PC algorithm over alternative causal discovery methods, and little discussion of its limitations, especially in the context of high-dimensional data. Addressing both would strengthen the project.
Liv Gorton
This project presents a novel integration of sparse autoencoders with causal discovery methods that clearly identifies an existing gap in the literature. More methodology details (e.g. on the algorithm used) would be very useful.
Cite this work
@misc{
title={Superposition, but at a Cross-MLP Layers view?},
author={Woon Yee Ng},
date={3/10/25},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}