Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to make red teaming more efficient. We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a labeled dataset of harmful and helpful prompts (roughly one harmful prompt for every two benign ones). This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts, meaning those most likely to elicit unsafe model behaviors. Our dashboard gives red teamers real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial and error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.
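To make the classifier step concrete, here is a minimal sketch of a three-layer MLP that separates "harmful" from "benign" feature vectors. It is not the project's implementation. It assumes that "3-layer" means three weight layers, and the layer sizes, learning rate, function names, and synthetic Gaussian features are all hypothetical stand-ins for the activation features the project obtains from the Goodfire API (no Goodfire calls are shown). It uses only the Python standard library.

```python
import math
import random


def sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def init_mlp(sizes, rng):
    # sizes = [input_dim, hidden_1, hidden_2, 1] gives three weight layers.
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append({"W": weights, "b": [0.0] * n_out})
    return layers


def forward(layers, x):
    # Returns the activations of every layer (acts[0] is the input itself).
    acts = [list(x)]
    for i, layer in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b
             for row, b in zip(layer["W"], layer["b"])]
        is_output = i == len(layers) - 1
        acts.append([sigmoid(v) for v in z] if is_output
                    else [max(0.0, v) for v in z])  # ReLU on hidden layers
    return acts


def train_step(layers, x, y, lr):
    # One SGD step on binary cross-entropy. With a sigmoid output, the
    # gradient w.r.t. the output pre-activation is simply (p - y).
    acts = forward(layers, x)
    delta = [acts[-1][0] - y]
    for i in reversed(range(len(layers))):
        layer, inp = layers[i], acts[i]
        if i > 0:
            # Backpropagate through the weights, then through the ReLU.
            back = [sum(layer["W"][j][k] * delta[j] for j in range(len(delta)))
                    for k in range(len(inp))]
            back = [g if a > 0.0 else 0.0 for g, a in zip(back, inp)]
        for j, d in enumerate(delta):
            for k, a in enumerate(inp):
                layer["W"][j][k] -= lr * d * a
            layer["b"][j] -= lr * d
        if i > 0:
            delta = back


def predict_proba(layers, x):
    # Probability that the feature vector belongs to the "harmful" class.
    return forward(layers, x)[-1][0]


def make_synthetic(n, dim, rng):
    # Hypothetical stand-in data: label 1 clusters around +1, label 0 around -1.
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        data.append(([rng.gauss(center, 1.0) for _ in range(dim)], y))
    return data


def run_demo(seed=0):
    rng = random.Random(seed)
    train = make_synthetic(300, 4, rng)
    test = make_synthetic(200, 4, rng)
    layers = init_mlp([4, 8, 8, 1], rng)
    for _ in range(20):
        rng.shuffle(train)
        for x, y in train:
            train_step(layers, x, y, lr=0.05)
    correct = sum((predict_proba(layers, x) >= 0.5) == (y == 1) for x, y in test)
    return correct / len(test)


if __name__ == "__main__":
    print("held-out accuracy:", run_demo())
```

In a setup like the one described above, the synthetic vectors would be replaced by per-prompt activation features, and the predicted probability could serve as the score shown to red teamers. With 1,034 harmful and 2,000 benign examples, it would also be worth reporting per-class error rates rather than accuracy alone.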

Devina Jain

Reviewer name
Constructive critique
Criteria 1: Innovation & Literature Foundation. How well does the project demonstrate understanding and engagement with existing literature in the field? To what extent does the project introduce novel methodologies or approaches? How effectively does the project build upon and extend established research? How innovative is the project's approach while maintaining scientific rigor?
Criteria 2: AI Safety Impact. How clearly does the project identify and address specific AI safety challenges? To what extent does the solution consider and mitigate potential risks? How well does the project demonstrate potential for scaling or generalizing to broader safety applications? How thoroughly does the project analyze limitations and potential negative consequences?
Criteria 3: Technical Quality & Documentation. How well-documented and reproducible is the project's methodology? How robust and reliable is the technical implementation? How effectively does the project communicate complex ideas and results? How well does the implementation consider practical constraints and limitations?

Cite this work

@misc{
  title={Red-teaming with Mech-Interpretability},
  author={Devina Jain},
  year={2025},
  organization={Apart Research},
  note={Research submission to the Women in AI Safety Hackathon research sprint hosted by Apart.},
  month={March},
  howpublished={https://apartresearch.com}
}

Reviewer comments

Natalia Perez-Campanero

March 12, 2025

This project makes a valuable contribution to AI safety by providing a systematic and interpretable approach to red teaming. Combining mechanistic interpretability with red teaming seems like a great step forward in understanding and mitigating vulnerabilities in LLMs, and the use of neural activation patterns to understand and predict jailbreak attempts is innovative and offers an accessible understanding of potential failure modes. The literature review is comprehensive, referencing key works in both red teaming and mechanistic interpretability, and the interactive dashboard is a big plus. The methodology is well-documented, with a clear explanation of the dataset construction, feature extraction, and classifier development, although a more thorough error analysis and discussion of edge cases would strengthen this. Excellent work! It might also be good to provide a more explicit threat model, particularly in terms of how the identified vulnerabilities could be exploited in real-world scenarios. Delving a bit more into the adversarial robustness literature could help here.

4.1 / 5

Natalia Perez-Campanero

March 12, 2025

Superposition, but at a Cross-MLP Layers view?

The project tackles an important problem: understanding feature causation across MLP layers in language models. The approach of combining Sparse Autoencoders (SAEs) with causal discovery (PC algorithm) seems novel and potentially promising for identifying how early-layer features influence later-layer features. The project is well written, if somewhat lacking in details. Understanding feature causation could help identify and mitigate biases, vulnerabilities, or unintended behaviors in language models, although there is not much explicit discussion of this in the write-up, making the AI safety element more implicit. I would encourage you to more explicitly discuss safety risks that could be addressed using this framework. There is also little justification provided for this choice of algorithm relative to alternative causal discovery methods, or discussion of the limitations of the PC algorithm, especially in the context of high-dimensional data, which would strengthen the project.

over

February 26, 2025

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

jaw.drop

Jason Schreiber

January 24, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

High-quality evaluations of chains of thought are an interesting opportunity. I'd love to see experiments in this direction!

Esben Kran

January 24, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

This is a super cool project! Really interesting to get expert-driven CoTs in for evaluation. There are a few questions regarding the impact on AI safety, since it's a capability evaluation and will help to get stronger training data, but the actual outlined strategy seems very reasonable. I highly suggest moving forward with this work and getting experimental data about existing CoT models, especially DeepSeek's R1, since it represents the next paradigm and its CoT is visible. Great work.

Edward Yee

January 21, 2025

January 22, 2025

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

This is a cool tool that could be used. But it's unclear how much value-add it would be to those actually being replaced. Retraining is the hard part of the process, and that's not being covered in this tool. Not to mention how to monetise such a solution.

Finn Metz

January 21, 2025

January 22, 2025

AI Safety Evaluation – Benchmarking Framework

The framework/protocol is a bit too high-level. Please be very specific about what exactly you are trying to build: what are you going to provide to the customer (can be for-profit or non-profit)? It just needs to be clear whether you are trying to write a report or build a software solution, and for whom you are doing so.

Michaël Trazzi

January 20, 2025

January 21, 2025

Cite2Root

This approach currently doesn't seem to apply enough to core AI risk challenges, and it's unclear given the lack of empirical results given in this paper if things will scale or if this intervention actually works. The concrete tooling is appreciated though.

Fabari Agbora

January 20, 2025

Safe.ai: AI Agent Risk Assessment Platform

I think that the concept of an AI agents doctor is a really great way to solve the AI assurance problem. There have to be guarantees for AI agents, given the possibility of these agents acting outside their scope. I am really curious about the technical implementation. I hope these bold researchers offer a lasting solution to the AI assurance problem.

Ubaid

January 20, 2025

Safe.ai: AI Agent Risk Assessment Platform

A very well balanced and interesting paper. Well done

Aryan Thakar

January 19, 2025

January 20, 2025

LLM-prompt-optimiser based SAAS platform for evaluations

Pretty good; the problem overview and process could use more detail.

kabir

December 29, 2024

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

Cool that this was done!

Bharat

December 8, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

Clear and articulate problem with a plausible solution. The concepts of data validation, data provenance, and transparent, verifiable data handling all give a very good reason to use blockchain, rather than it being an ad hoc addition for novelty value. Could benefit from detail on how smaller organizations or under-resourced sectors could adopt these measures without excessive cost. Including examples of pilot implementations would also be useful.

Esben Kran

December 6, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

My impression is that you implemented SAE features in your long-context document editing tool and I think this seems pretty awesome. When it comes to your node-based document iteration engine and its evaluation suite, this also seems very valuable and is probably more relevant when it comes to safety than the features used for content development. You link to the blog posts and I agree that the verification of content in financial documents is very important, though I’ll mention that your submission probably doesn’t score maximum on methodology due to the lack of experiments to validate your method. The safety arguments also aren’t super strong and the submitted project is somewhat adjacent to the topic of the hackathon, though your product seems relevant to ensure accuracy in financial documents. In terms of directing your work towards safety, I suggest you take existing documents and discover errors using various feature-supported evaluators to improve the project (e.g. going for some of the public pitch decks might be a good example). If you can prove that, it’s simply rock’n’roll and I can see some use cases beyond finance as well (though it’s a great place to start a company).

Jason Schreiber

November 30, 2024

Let LLM Agents Perform LLM Surgery

This is a creative research idea that deserves more exploration. I would love to see some quantification of the results here!

Mateusz Dziemian

November 29, 2024

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

The plan and study are heading towards an interesting direction. To further help reduce the SAE features, maybe you could ask the trusted model to pick the words/ tokens that it wants features for and have it filter through features based on queries? Or something along that direction where the trusted model is given more control over the task to reduce the search space.

Liv Gorton

November 27, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project introduces a framework for steering the model towards representing more diverse perspectives. Focusing more on their contribution rather than describing existing methodology in detail (e.g. Gemma architecture) would make it easier to follow their paper. The authors note that they ran out of time and weren't able to implement their proposal. It'd be great to see them continue this work in the future.

Liv Gorton

November 27, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

I really like the direction of applying SAEs to identify hallucinations or uncertainty in model responses. The dataset was well-chosen and the methodology was sound. The plot for figure 1 could be improved by plotting the most impactful features (perhaps with the entire figure, with a figure legend in the appendix). If I'm reading the figure correctly, it appears that nudging the feature positively causes more incorrect answers, even for features that seem related to correctness. It'd be interesting to see a qualitative analysis into what might be happening there!

Liv Gorton

November 27, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

This is a really well-presented work that offers a framework and tooling to improve long-form document generation. In future, it would be interesting to see a benchmark on document quality that tests whether steering the LLM towards a specific thing (e.g. to be more technical) deteriorates quality in other ways.

Dhruba Patra

November 27, 2024

Utilitarian Decision-Making in Models - Evaluation and Steering

This is pretty cool and well thought out!

Liv Gorton

November 26, 2024

November 27, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

This project attempts to overcome the issue of using LLMs to aid in scaling interpretability. More details on what each function does would make understanding the work easier. It'd be interesting to see if the additional functions in the appendix improve the performance on later layers. It'd also be nice to see if these ideas generalise to other LLMs (could be done with open sourced models with SAEs on all layers like Gemma 2B).

Tom McGrath

November 26, 2024

November 27, 2024

Edufire - Personalized Education Platform Using LLM Steering

This is a substantial project that makes good use of the customisation offered by model steering to help users tailor educational content to their preferences. The author got a lot done in the weekend and used steering in a very sensible and effective way. In terms of the writeup, it would be useful and interesting to have some visualisations of the site - currently it's a little hard to imagine. I would also focus on what's new and how it leads to better outcomes, rather than on the tech stack.

Liv Gorton

November 26, 2024

November 27, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

This is a creative approach to auto-steering and seems like a promising direction! The choice of the tic-tac-toe environment makes a lot of sense given the time constraints (and I’m surprised to see how well it works!) and it’d be interesting to see how this generalises to other tasks.

Tom McGrath

November 26, 2024

November 27, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

This project aims to find alternative ways of interpreting language model features rather than asking another LLM to interpret them. The method proposed is to use the structure of the synthetic data used to train TuringLLM to generate candidate feature labels, either from uni- and bigrams or from topic data extracted from the filenames. These interpretation functions allow the authors to label the majority of features in the earlier layers of the model, but accuracy drops substantially past layer 4 of 12. I would be interested to see examples of unlabeled features in later layers to understand whether they are more abstract features, mostly related to output tokens, or mostly uninterpretable. The presentation of the work could be improved by describing the evaluation process and the motivation behind the choice of functions in more detail. I'm also not sure how this is intended to generalise to more complex features in more powerful models - some insight into this would be valuable.

Tom McGrath

November 26, 2024

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

This is an interesting result: the authors look at faithfulness in chain-of-thought reasoning and surprisingly find that single features can substantially alter faithfulness. The methodology is sensible and the experiments are carried out well and well-documented. The relation to safety is subtle but well-justified. In this case the method used (adding mistakes) means that increased faithfulness leads to a decrease in performance. This is a sensible experimental methodology for understanding if features can control faithfulness - a cool result in its own right. In my anecdotal experience in LLM reasoning, incorrect answers typically arise from a lack of faithfulness to an otherwise correct chain of thought. It would be an interesting extension to the paper to see if the features identified in this paper lead to an increase in performance in natural chain of thought reasoning settings.

Tom McGrath

November 26, 2024

Let LLM Agents Perform LLM Surgery

This is a cool idea - getting an agent to edit the internals of another agent to perform the task. As the authors observe, this is pretty difficult (even for frontier models) but as far as I can tell they had some successes. Quantifying these successes/failures can be difficult, especially at a relatively small scale and with time constraints - I'd encourage the authors to report a few transcripts of success & failure so we can get a feel for how well this performs and where steering agents succeeds and fails.

Tom McGrath

November 26, 2024

BBLLM

This project develops a visualisation tool for language model SAE latents. Visualisation is an important and underexplored area in interpretability, so it's cool to see this. The visualisation is a graph, where features are connected to one another if they co-occur sufficiently frequently. The tool is interesting but I'd really like to see some example of the kind of application it might be used in, or an interesting insight (even something very minor) that the authors obtained from using the tool.

Tom McGrath

November 26, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

This project aims to develop programmatic methods for interpreting the internals of the Turing LLM. These methods rely on the statistics of the synthetic dataset used to train the base model. The writeup is a bit unclear as to what exactly these methods do, but the examples and function names give a decent picture. Figure 1 is interesting: it appears that topic and uni/bi-gram statistics dominate in the earlier layers, but later layers are relatively poorly explained - what is going on here? Potentially the last couple of layers could be explained by the _next_ token, which would be interesting to see.

Alana Xiang

November 26, 2024

Encouraging Chain-of-Thought Reasoning

Very cool stab at increasing CoT via steering. I would like to see a fuller investigation of how the faithfulness of the steered CoT compares to prompted CoT.

Alana Xiang

November 26, 2024

Auto Prompt Injection

Very interesting idea to use prompts for steering! I would like to see how this technique compares to steering directly. I encourage the authors to further study how well this method generalizes.

Alana Xiang

November 26, 2024

Investigating Feature Effects on Manipulation Susceptibility

Interesting to see the Portuguese feature pop up again after reading https://www.apartresearch.com/project/assessing-language-model-cybersecurity-capabilities-with-feature-steering ! The password setup is an interesting environment to study jailbreaking, and the team finds interesting results. Good work!

Jaime Raldua

November 26, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This is a useful comparison to see. The combined version seems particularly interesting

Alana Xiang

November 26, 2024

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

The author successfully uses steering to decrease "grammatical scope ambiguity." With more time, I'd love to see the author work on quantifying the effect and comparing this approach to baselines like prompting. Good work!

Jaime Raldua

November 26, 2024

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

SAEs for AI Control sound like an amazing idea! As you point out, there seem to be major blockers on the way. It would also have been very useful to see some code and a clearer roadmap of what your next steps would look like if you had had more time (e.g. a couple of weeks) to continue on this project.

Alana Xiang

November 26, 2024

Investigate arithmetic features in Multi-lingual LLMs

This team finds some features that activate on GSM8K. They made an interesting decision to compare across languages. With more time, I'd love to see this team investigate why they were unable to improve the performance of the model via steering.

Jaime Raldua

November 26, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

The combination of RL and AS looks really promising! Very surprised to see a 3x improvement; would love to see a longer version of this work.

Alana Xiang

November 26, 2024

Assessing Language Model Cybersecurity Capabilities with Feature Steering

Good idea to use steering to improve cybersecurity abilities. With more time, I'd like to see more work on whether the Portuguese feature boost generalizes to other datasets. I'm particularly interested in generalization beyond multiple-choice questions. I'd also like to see research on why this feature is relevant to performance in this case. Overall, very cool to find a case where a feature has an effect completely detached from its label. Good work!

Alana Xiang

November 26, 2024

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

Very cool work on automating steering! Fun and creative. With more time, I'd love to see comparisons with strong baselines.

Jaime Raldua

November 26, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

Very original idea and promising results!

Jaime Raldua

November 26, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

Very promising results! On point 4, it would have been better to emphasize the contribution of your work instead of talking only about next steps.

Alana Xiang

November 26, 2024

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

This team develops a reasonable experiment setup and executes it well. Their results point to an interesting possibility, that subtracting the "acknowledging mistakes" feature could lead to higher faithfulness. With more time, a graph I would've liked to see is faithfulness by steering value. I would also be interested in seeing the team explore whether allowing the model to continue the CoT will recover the faithfulness lost by this steering by acknowledging the reasoning error out loud. Good work!

Tom McGrath

November 25, 2024

November 26, 2024

Assessing Language Model Cybersecurity Capabilities with Feature Steering

This is a cool and interesting result - I wonder why turning this feature down improves performance! It's certainly possible that the feature is completely mislabeled; autointerp is far from perfect and sometimes gets very confused. I'd be interested in seeing some qualitative samples of what happens when this feature is steered in various contexts, as well as a steering plot covering WMDP scores at a higher resolution. I worry that there may have been a class imbalance in the data (e.g. more 'A's than 'C's) and steering simply moved the model more towards the overrepresented class.

Tom McGrath

November 25, 2024

November 26, 2024

Utilitarian Decision-Making in Models - Evaluation and Steering

This is very well executed and presented research. The comparison of model values vs the human response KDE is interesting, but my favourite plot is figure 2 - it's very surprising how different features have remarkably different trajectories through the moral landscape. It's surprising that most features actually appear to avoid the modal human, and only a single feature actually steers the model in that direction. It's unfortunate that the OUS has so few questions and is so sensitive (e.g. the difference between models being entirely accounted for by question IH2).

Tom McGrath

November 25, 2024

November 26, 2024

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

This work covers an important problem and applies a sensible methodology. The performance of the results is impressive - I had to check in the code that the results were in fact on a test set. I'd be interested in seeing how often harmless prompts are misclassified though. Definitely worth extending further - these results are quite promising.

Tom McGrath

November 25, 2024

November 26, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This is an interesting comparison - the relative merits of prompting and feature steering comes up a lot and it's great to see some very grounded evaluations. The feature steering looks to have been done well, and the qualitative observations are good.

Tom McGrath

November 25, 2024

November 26, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

This is an interesting and imaginative project, and the results are pretty cool. It's impressive to include feature steering inside an RL loop, and I'm quite surprised that it works! The project writeup is clear and well written.

Tom McGrath

November 25, 2024

November 26, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

These findings are cool and somewhat surprising - I didn't realise we can nudge models towards being wrong so easily! I'm having trouble parsing figure 1, however - surely with nudge strength set to zero all features should provide the same outputs, but we see an almost 20% range in percentage correctness between features. Should I conclude that some features can in fact steer the model substantially towards correct answers? If so then that's interesting and I'd highlight it more.

Shana Douglass

November 25, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This memorandum offers a comprehensive and insightful analysis of the potential risks associated with AI in K-12 education, particularly regarding bias and opaqueness. The proposal's focus on equity and transparency is commendable. The recommendation to leverage Title I and Title IV funding to promote human oversight, AI training, and stakeholder engagement is a practical and effective approach. By aligning these measures with existing federal funding mechanisms, the proposal offers a realistic and scalable solution to mitigate the risks of AI in education. However, a more detailed analysis of the potential costs and funding mechanisms associated with the implementation of these recommendations would further strengthen the proposal.

Alana Xiang

November 25, 2024

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

This is a creative paper which finds a new domain on which SAE features generalize well (across languages in grade school math). The surprising finding that the French steering vectors had a larger impact on English and Russian performance than French performance warrants further inquiry. I think this paper could've significantly improved on novelty if it pursued this direction. Given more time, I would also love to see the authors inspect whether the features they found generalize beyond GSM8K. Good work!

Jaye Nias

November 21, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum provides a thoughtful and well-rounded examination of the potential risks associated with bias and opaqueness in intelligent systems used in K–12 education. The concerns about exacerbating inequality are both relevant and timely. The recommendation to incorporate Title I and Title IV financing criteria, which include human oversight, AI training for teachers and students, and open communication with stakeholders, is a strong and practical approach. These measures promote the responsible and transparent use of intelligent systems, while ensuring accountability and taking proactive steps to prevent harm to students. One of the strengths of this memorandum is its clear presentation of the suggested mitigations, thoughtfully considering both their benefits and limitations. While linking these solutions to federal funding mechanisms may not be entirely new, it is a strategy that has historically been effective in driving equity-focused initiatives within education. The proposed approach, therefore, offers a realistic and impactful way to encourage the responsible use of AI in educational settings, with a focus on protecting students’ interests.

Esben Kran

February 24, 2024

July 19, 2023

Who cares about brackets?

I like the simple operationalization of your research question into GPT2-small. It seems like exploring multiple operationalizations would be useful to elucidate your results, though I personally imagine it's pretty good. Seems like one of those tasks that show that we cannot use our current methods to properly investigate every circuit, unfortunately. Puts a serious limiting factor on our mechanistic interpretability usefulness. Good work!

Bart

February 24, 2024

July 19, 2023

Who cares about brackets?

Interesting work! An extensive range of experiments shows that even relatively easy tasks might not be easy to locate in LLMs. I believe this work sheds a light on how limited our current methodology is and bracketed sequence classification might serve as a good toy-problem task for future development of interpretability methods.

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Visual Prompt Injection Detection

Fascinating project! I liked how many different aspects of the multimode prompt injection problem this work touched on. Analyzing CLIP embeddings seems like a great idea. I'd love to see follow-up work on how many known visual prompt injections can be detected in that way. The gradient corruption also seems worth studying further with an eye toward the risk of transfer to black-box models. Would be wonderful to see whether ideas for defense against attacks can come from the gradient corruption line of thinking as well. Congratulations to the authors for a really inspiring project and write-up!

Esben Kran

February 24, 2024

November 29, 2023

Visual Prompt Injection Detection

This is a great project and I'm excited to see more visual prompt injection research. It covers the cases we'd like to see in visual prompt injection studies (gradient, hidden, vision tower analysis). It seems like a great first step towards an evals dataset for VPI. Great work!

Tim

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

The main problems named w.r.t. formalizing agency as the number of reachable states are very relevant. It is mentioned that not only the number of states is important, but that it also needs to be considered how desirable these states are and whether they are reachable. However, it seems that the authors consider "number of reachable states" and empowerment to be the same thing, which is not the case. Further, the authors' proposition that a "Good notion of empowerment should measure whether we can achieve some particular states, once we set out to do so." seems to very much coincide with the true definition of empowerment by Salge et al. Hence, it would be relevant to compare the authors' "multiple value function" optimization objective to that of empowerment. The authors also propose a new environment, which seems very useful and thoughtful, and could be a nice starting point for some experiments.

Ben Smith

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. It is a very interesting idea, and I give the entry points for that. I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals.

Vincent

February 24, 2024

September 15, 2023

Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text

The order of choices is interesting, and I just saw a paper about that which came out recently (https://arxiv.org/abs/2308.11483?)

Esben Kran

February 24, 2024

July 19, 2023

Towards Interpretability of 5 digit addition

This is a wonderful mechanistic explanation of a phenomenon discovered through interpreting the learning curves of a simple algorithmic task. Of course, it would have benefitted from experimental data but it is conceptually so strong that you probably expect it to work. Future work should already take into account how we might want to generalize this to larger models and why it's useful for AI safety. E.g. I would be interested if this is expanded stepwise into more and more complex tasks, e.g. adding multiplication, then division, then sequence of operations, and so on for us to generalize into larger models some of these toy tasks. Good work!

Ben Smith

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals. It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. Still, it's an interesting idea, and worthwhile to start a Gymnasium environment for testing the idea. So I give the authors some points for all that.

Bart

February 24, 2024

July 19, 2023

Towards Interpretability of 5 digit addition

Interesting and original submission, quite different from the others. Good example of learning to "Think like a Transformer". I would encourage the author to perform some experiments (or work together with someone with more experience) to see if they can confirm or falsify their hypotheses!

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

The project is really well motivated: Finding ways to auto-generate higher-quality model evaluations is extremely valuable. I like how this project makes good use of an existing technique (Evol-Instruct) and evaluates its potential for model-written evaluations. I also really like the authors' frankness about the negative finding. I would like to encourage the authors to dive more into (a) how reliable the scoring method for the model-written generations is and (b) what kind of evolutions are induced by Evol-Instruct to figure out the bottlenecks of this idea. I agree with them (in their conclusion) that this idea has potential even though the initial results were negative.

Jacob P

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

Cool idea for improving evals! I'd try pairing high-quality evaluations with low-quality ones, perhaps by getting the model to worsen high-quality ones; that would probably work better as a few-shot prompt. If you continue work on this, I'd spend some time thinking about how best to de-risk this. Is there some scenario where we know LMs can improve things?

Esben Kran

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

It's too bad that it didn't show improved performance but the idea is quite good and utilizing existing automated improvement methods on evals datasets seems like a good project to take on. With more work, it might also become very impactful for research and I implore you to continue the work if you find potential for yourselves! Good job. See also [evalugator](https://github.com/LRudL/evalugator) for more LLM-generated evals work (by Rudolf).

Bart

February 24, 2024

July 19, 2023

Toward a Working Deep Dream for LLM's

I believe the goal of this project is interesting, and is an interesting avenue to explore further. Unfortunately, results from early experiments didn't work out, preventing a deeper investigation of this approach.

Esben Kran

February 24, 2024

January 11, 2024

The EU AI Act: Caution against a potential "Ultron"

This is excellently done and a professional overview of the full EU AI Act. It's impressive to include a full summary of so much content in so few pages. Case 1 might have been slightly too unclear, since this is not what was meant; however, it is a very good example of Case 3 work: summarizing the EU AI Act. I evaluated this under Case 3: Explainers of AI concepts since it is a concise explainer for the full EU AI Act. One way to improve it would be to add references to direct parts of the act as you explain parts. I like the quote format and the titles that reference concepts directly.

Esben Kran

February 24, 2024

July 19, 2023

Toward a Working Deep Dream for LLM's

I love good regularization techniques. Similar work includes Neuron to Graph (Foote et al., 2023) and work by Michelle Lo on reconstructing what neurons activate to. It seems this technique quite easily generates bogus sentences where, yes, we can see exactly what activates the neuron, but it's not super useful for understanding the features it affects the output for. But this seems like a really good first step toward something that might explain what MLP neurons do more accurately than (especially) the OpenAI work. Future work might also include reformulating it into a functional activation model like in the OAI work and Foote et al., 2023. Good work!

Jason Hoelscher-Obermaier

February 24, 2024

February 14, 2024

Seemingly Human: Dark Patterns in ChatGPT

Lovely project! I love the connections made to the existing literature on dark patterns. The proposed focus on mismatch between developer and user incentives in the context of AI applications seems like an extremely valuable and timely addition to the existing literature on misalignment, with a lot of potential for connecting AI ethics and AI safety. Also really like the approach to empirical evaluation taken here, which seems to hold a lot of potential. Going forward, I would want to see a more in-depth investigation of the conversations flagged for dark patterns and I would expect a few rounds of iteration to be necessary for robust results here. In terms of the write-up I'm missing tentative high-level conclusions on the level of dark pattern usage, its trend over time, and proposals for a natural baseline to compare against. Very minor write-up grievance: It wasn't clear to me which model was used as overseer.

Jason Hoelscher-Obermaier

February 24, 2024

August 21, 2023

SADDER - Situational Awareness Dataset for Detecting Extreme Risks

Cool idea and execution! For the causal influence dataset, I would have loved to see more of the dataset samples. Seeing that even GPT-4 still benefits from being told it's a chatbot was really interesting and surprising. For the train/deploy distinction dataset, I really liked the idea of how the dataset is constructed. The analysis could be a bit more detailed though: E.g., having confusion matrices would convey a lot more info than raw accuracies. Very cool project overall!

Christian Schroeder de Witt

February 24, 2024

February 14, 2024

Seemingly Human: Dark Patterns in ChatGPT

I love the idea of this project. In addition to what Jason has remarked, I think a major opportunity would lie in developing tools that can protect users from such dark patterns. For example, a local trusted supervisor-chatbot that filters the interactions and warns the user if e.g. there is a risk of disclosing too much sensitive information.

Esben Kran

February 24, 2024

September 7, 2023

Residual Stream Verification via California Housing Prices Experiment

This is an interesting question to investigate and I'm excited by your progress within the 24 hours! Understanding what role the residual stream plays in memory transfer and how subspace "competition" works is important. I assume "subspace" in your project means information occupation within the residual stream. It seems that the bandwidth and subspace projection measurements are not included in the results. I like your plot showing the impact on model output and it would be interesting to see which sorts of features (qualitative description) these differences correlate with. E.g. I can imagine that some types of early-stage processing are lost and a feature just looking for the word "the" (or something less frequent) might be outcompeted in the residual stream by more complex processes. This might also indicate an inverse scaling phenomenon. Great job! PS: The video presentation is private.

Bart

February 24, 2024

July 27, 2023

Residual Stream Verification via California Housing Prices Experiment

Overall impressions:
- Interesting project, exploring the role of the residual stream is an interesting avenue.
- I like the SHAP value plots!

Suggestions for improvement:
- It is not completely clear how the formulas for the subspace projection and bandwidth measurements are used in your experiments. The results section (that shows SHAP values) seems different from your planned methodology.
- More information could be provided on the dataset, model architectures, training process, hyperparameters etc. This contextualizes the experimental conditions.
- Also, more information could be provided in the result sections. Including metrics like training/validation accuracy, loss curves, performance on a test set etc. would strengthen it.

Esben Kran

February 24, 2024

September 7, 2023

Problem 9.60 - Dimensionality reduction

This is a great project within the time allotted, well done! It's important for us to understand these types of dynamics and plotting it over layers provides a useful granularization. There's a question of what these results mean and why the IMDB dataset isn't as interpretable (I'd expect it to be related to the performance itself). Maybe you'd want to separate the PCA'd activations based on if the prediction was correct or not.

Bart

February 24, 2024

July 19, 2023

Relating induction heads in Transformers to temporal context model in human free recall

Cool and original project! I think the reformulation of TCM as an induction head is very interesting, and the experiments show some interesting preliminary results. This work has great potential to be published as a paper with a few more experiments, so I would definitely encourage you to work further on this.

Esben Kran

February 24, 2024

July 19, 2023

Relating induction heads in Transformers to temporal context model in human free recall

This project is super interesting and a great case study in comparing Transformers to cognitive models of memory. I would love to be able to dive deeper into this project and read the three referenced papers. I'm not sure what to critique here but I'm also personally positively biased towards cognitive science and it's a great interdisciplinary work. The only thing is that there isn't much discussion of the safety implications, e.g. can we use this functional correlate to understand how human-like a Transformer's memory is? Good work and I recommend you take this further!

Bart

February 24, 2024

July 27, 2023

Problem 9.60 - Dimensionality reduction

Strengths:
- Interesting project! Understanding how language models process information is important.
- I like the visualizations of the PCA dimensions. They clearly show the results, and on the toy dataset you clearly see the progress over the layers.

Suggestions for improvement:
- I would like to see a bit more background information on the experimental set-up. For example, what does the toy data set look like? What model do you use for classification? Did you split train and test set?
- I would like to see a bit more discussion on the results. Why do you think the accuracy of the toy dataset is so much higher?

Erik Jenner

February 24, 2024

September 26, 2023

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

Building agents that help other agents with unknown goals is an important problem and I like how this project just tries to tackle that problem in a straightforward way, with several experiments and techniques. The parts on dealing with underrepresented goals are also nice. Using PCA to detect unusual inputs is a cool (albeit not new) idea, and it seems to work (though with big error bars). The code also looks well-done and easy to work with at a glance. For the core setup of training a helper agent, it would probably be fruitful to explore connections to Cooperative IRL/Assistance games, and build on existing work in that direction (e.g. https://openreview.net/forum?id=DFIoGDZejIB). The biggest room for improvement in my view is in the experiments. RL is really noisy, and to get meaningful results, several runs with different random seeds are essential (even if the curves look as different as in Fig. 4, it's hard to know whether the effect is real otherwise). I'm also confused why all the results have episode lengths of at least a few hundred. Looking at the environment, it seems like a good policy pair should get lengths of about 20, so unless I'm misunderstanding something, it seems the RL training didn't work well enough or wasn't run for long enough to give meaningful results.

Ben Smith

February 24, 2024

October 2, 2023

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

Not much grounding in the literature. I don't really understand how this is distinct from a single-agent problem where the goal is unknown except through reward. This problem arises because the helper has access to the leader's reward function! If it was doing inverse reinforcement learning or something I'd get it, but that's not what's going on. They've quoted "FMH21", which appears to be grounding their methods, so that perhaps suggests at least some novelty. Overall, an interesting paper and a good experiment, but it is unclear to me how this is distinct from a single agent with some hidden objectives it has to figure out. But I might be missing something.

Esben Kran

February 24, 2024

July 19, 2023

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

Great negative results for a hypothesized result of SoLU models. Interesting side result to see that the LN scale factor grows meaningfully differently conditional on the token sequence.

Jason Hoelscher-Obermaier

February 24, 2024

August 22, 2023

Preliminary measures of faithfulness in least-to-most prompting

Very readable and interesting results. One question I had: How do the results on post-hoc reasoning in CoT/L2M square with the results from http://arxiv.org/abs/2305.04388 which suggest that CoT explanations can be unfaithful?

Bart

February 24, 2024

July 19, 2023

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

Interesting work! Well-designed experiments that don't find evidence for the smearing hypothesis. Would definitely encourage continuing this work to see if the results replicate on models with more than one layer!

Esben Kran

February 24, 2024

July 19, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

This is a very interesting investigation into something that seems foundational in LLMs, this sort of sequence modeling structure that is shared between tasks. These are both quite informative results for AI functioning and probably replicate quite a bit to humans. Great in-depth experiments as well and good circuits experimental work. It was a lot to cover in a 10 minute video so no worries about being a bit rushed there. Excited that you want to continue working on this!

Bart

February 24, 2024

July 19, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

Impressive range of experiments and interesting discovery of the shared sequence heads. I would definitely encourage you to continue your work and see if you can get from digits to other sequences through latent space addition or similar techniques.

(author)

February 24, 2024

July 17, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

(I'm the author and accidentally hit 'rate this project' but did not mean to rate it, so I am submitting 5 to balance out the 3 I gave back to the 4 stars given from someone else before)

Charlotte

February 24, 2024

January 12, 2024

Obsolescent Souls

I very much like the story. If you have time for this, I would be interested in reading your AI goes well scenario, what would be the scenario in which all of your "what ifs" are fulfilled.

Esben Kran

February 24, 2024

January 11, 2024

Obsolescent Souls

This is an excellent way to use the capabilities of vignettes in a super strong way! I like how you emphasize a scenario that is otherwise looked over; one where all our alignment and risk mitigation work goes quite alright. The "What ifs" are very enjoyable as well and provide a perspective on what one might learn from the story beyond what the reader might think. The relation to contemporary sources is also very good. It is inherently a difficult thing to try to represent the systemic effects of AI technology in a concise manner but I think you succeeded!

Esben Kran

February 24, 2024

July 19, 2023

Multimodal Similarity Detection in Transformer Models

Nice work, though I was missing some plots here. Since you say pure GPTs don't seem to work, it would be interesting to see the difference to fine-tuned models. Totally fine that you used Claude etc. but I'd love if you proofread your work. Interesting and would be nice to see the developments.

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Good tooling for running benchmarks is extremely important, which makes the question raised in this report "How can we systematically evaluate ethical capabilities of LLMs across all available benchmark datasets?" really valuable. I like how the report raises the important research question of how and in which order ethical capabilities emerge across language models. To really address this question would require a larger study though with models of more sizes -- which is understandably impossible in the time of the hackathon. A really important point raised in the discussion is the question of where exactly the gap in the ecosystem is, given the availability of tools like EleutherAI's evaluation harness. I would encourage the authors to spend more time thinking about what these tools are lacking to become more widely used and more useful for AI safety research!

Jacob P

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Preliminary results, but very good to see that ethics reasoning appears to be improving rapidly with scale! Comparing a pre/post RLHF model (e.g. llama vs llama 2 chat at different scales) would be great to get a sense of whether models can be successfully blocked from improving in MACHIAVELLI while still improving on ETHICS.

Esben Kran

February 24, 2024

January 11, 2024

Model Cards for AI Algorithm Governance

It is very focused on the model cards, proposes a good structure for them and relates it *directly* to existing frameworks. This is a great submission! The appendix is very useful and shows the background work that went into it. One thing to add might be the framework of reporting, i.e. are all these answers fully public? And which should be public if not? What does the software system for reporting look like? I didn't know about China's setup, very interesting!

Esben Kran

February 24, 2024

July 4, 2023

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

This is an impressive critique with great and concrete improvement points that consider the pros and cons and what sorts of edge cases we will have to implement solutions to. Of course, I am missing a bit of an empirical evaluation or that you yourselves implement these, though the "idea format" of this clearly enabled you to explore the ideas qualitatively during the weekend's work. Great job! I'd recommend you polish it as a blog post and post it since it seems to point out some critical components needed for future work on safety benchmarks. If you plan to make it into a paper, you're of course welcome to wait with posting. Really interesting work!

Jason Hoelscher-Obermaier

February 24, 2024

January 11, 2024

Model Cards for AI Algorithm Governance

Very cool idea! A few things that come to mind: How capable (and in which domains?) do models need to be to be subject to compulsory model cards? How would you deal with evolving state-of-the-art on the evaluations side? Would there be some kind of verification of the submitted information?

Esben Kran

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Great motivation for the study. Curriculum learning for ethical judgements might be a great area to investigate even further though it might be hard to get results, as you also see here. A question I have is whether this isn't already implemented in other evals harnesses, such as EleutherAI's that you mention? Otherwise, I definitely think there's the space for a review of existing ethical benchmarks and what is missing -- both in terms of their quality but also in terms of other benchmarks that would be good to develop.

Red-teaming with Mech-Interpretability

Related projects

Utilitarian Decision-Making in Models - Evaluation and Steering

Sue-Per GPT: Legal RAG Assistant

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

Simulation Operators: The Next Level of the Annotation Business

Player Of Games

Beyond Refusal: Scrubbing Hazards from Open-Source Models
