Fuzzing Large Language Models
We used fuzzing to test AI models for unexpected responses and security risks. Our findings stress the need for better fuzzing methods to improve AI safety.
@misc{
  title={Fuzzing Large Language Models},
  author={Esben Kran},
  year={2023},
  month={February},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
Esben Kran
|
December 1, 2024
December 1, 2024
This is a great project! Very well done. I think it's good that it has a lower IH than humans and that this IH is less variable as well. It's very interesting to see these underlying patterns, and the comparison to human baselines is valuable. There seem to be clear pathways forward for studying the moral impact of irrelevant features, such as the elephant features you mention. You could also run the model at a higher temperature and get the density function for each model to compare with human values. There is of course an implicit assumption in the methodological design that utilitarianism is opposed to deontology, and it would be great to check whether these patterns shift under e.g. legalistic scenarios vs. moral dilemmas (https://paperswithcode.com/dataset/ethics-1).
Kutay Buyruk
|
November 30, 2024
December 1, 2024
Putting feature steering into an RL environment is very interesting, great idea! One detail that could be improved: also mention the maximum possible draw rate against the optimal policy. I did some research and, if I'm not mistaken, the game can always be forced to a draw if the second player also plays optimally. If that's the case, the jump from 1% to 3% still has room to grow in future work. In any case, it is a great proof-of-concept for an interesting application of feature steering.
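For reference, the claim that optimal play by both sides forces a tic-tac-toe draw can be checked mechanically with a small minimax search. This sketch is illustrative and not part of the submission:

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board indexed 0..8.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Minimax value of a 9-char board with `player` to move:
    +1 if X can force a win, -1 if O can, 0 if best play is a draw."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    if ' ' not in board:
        return 0
    nxt = 'O' if player == 'X' else 'X'
    vals = [value(board[:i] + player + board[i + 1:], nxt)
            for i, cell in enumerate(board) if cell == ' ']
    return max(vals) if player == 'X' else min(vals)

print(value(' ' * 9, 'X'))  # 0: with both sides optimal, the game is a draw
```

A maximum-draw-rate baseline for the steered agent follows directly: against an optimal opponent, draws are the best achievable outcome.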
Mateusz Dziemian
|
November 30, 2024
November 30, 2024
I think that using SAE features to find biases and other information about a dataset is a worthwhile direction. However, I'm struggling to understand what each of the steps you took is trying to achieve. I think this comes from using so many methods that the end goal of the final result becomes blurry.
Jaime Raldua
|
November 30, 2024
November 30, 2024
There seems to be existing work that is very similar, and the results from this trial (as acknowledged by the author) are not very successful. Overall it is an interesting approach that could benefit from more literature review and further exploration.
Jaime Raldua
|
November 30, 2024
November 30, 2024
Very interesting visualisation tool! It would have been great to see a bit more literature review and to see the specific value added where other existing techniques fall short. The fact that it is ready for local deployment definitely deserves extra points.
Jaime Raldua
|
November 30, 2024
November 30, 2024
Not related to AI Safety. I do not understand whether this is mostly to promote the Benki finance company.
Jason Schreiber
|
November 30, 2024
November 30, 2024
I really like the application idea here. It would be amazing to understand better the downstream implications of some of the biases identified so far on model performance and really work out the threat model being addressed here in detail. This seems worthy of followup work.
Jason Schreiber
|
November 30, 2024
November 30, 2024
This is a creative research idea that deserves more exploration. I would love to see some quantification of the results here!
Jason Schreiber
|
November 30, 2024
November 30, 2024
This is a really interesting application of feature steering. It's great to see more subtle applications of feature steering being explored; with some additional development this work could be a really interesting data point on the limits (or lack thereof) of current feature steering methods.
Jason Schreiber
|
November 30, 2024
November 30, 2024
Great research question. I find the intersection of math problems with its relatively clear evaluation criteria and multi-linguality a cool test bed for evaluating the robustness of feature steering. I hope the authors will iterate on this since it seems a worthy avenue!
Jason Schreiber
|
November 30, 2024
November 30, 2024
Transfer of steering behaviors to blackbox models is a really interesting topic with high relevance for AI safety and security. The approach taken here seems creative and a good avenue to explore. The authors should not let themselves be discouraged by the initially inconclusive results, and should iterate on methodology and dataset size. I think this has a lot of research promise!
Jason Schreiber
|
November 30, 2024
November 30, 2024
A promising application area for feature steering! The project sounds very ambitious. I would personally love to see some initial results on a more narrowly scoped version of this.
Jason Schreiber
|
November 30, 2024
November 30, 2024
Cool idea with relevance to AI safety (model oversight / reasoning transparency; though with a slight caveat regarding faithfulness of CoT). I think this deserves further exploration and could potentially shed light on important methodological questions (such as faithfulness of model reasoning). These questions are not easy to study, but this seems like a great first step!
Jason Schreiber
|
November 30, 2024
November 30, 2024
Great and creative idea with quite some potential relevance for AI safety research. This line of research could provide a very relevant and important datapoint for the crucial capability elicitation debate (how far are models from the upper bounds of their capabilities? how much effort per added percentage point of performance? etc.). I feel the currently proposed methodology is not sufficient to answer that question clearly (which is fair for a weekend hackathon!), and I'd be most excited about exploring transfer between datasets (given that, right now, if I understand correctly, you are using the same questions both for identifying features to attenuate or accentuate and for evaluation). Definitely lots of follow-up potential here!
Jason Schreiber
|
November 30, 2024
November 30, 2024
Very cool project! The fact that this seems to work across such a broad range of applications is impressive and encouraging. I'd be particularly excited about applications to safety-related research, particularly red-teaming and capability elicitation.
Liv Gorton
|
November 29, 2024
November 30, 2024
I like the comparison between different approaches to using SAEs to unlearn dangerous knowledge. This project uses a very sensible approach (e.g. including MMLU as standard performance benchmark) and is clearly presented. In future, it could be interesting to explore the robustness of the unlearning and why performance on MMLU comp sci appears to increase.
Liv Gorton
|
November 29, 2024
November 30, 2024
I found this project really interesting! It is surprising how poorly the LLMs seemed to model human moral intuitions, even when steered. This is well-written and well-presented!
Liv Gorton
|
November 29, 2024
November 30, 2024
This is addressing a really important practical problem. I liked their approach, am impressed with the results, and would be keen to see them build on this work!
Liv Gorton
|
November 29, 2024
November 30, 2024
I was excited to see someone try to do this -- I think how easy it is to reconstruct things like the dictionary vectors has important practical implications. This work would be improved by providing more motivation for the technique of choice, perhaps providing an example or figure that demonstrates why it should work.
Liv Gorton
|
November 29, 2024
November 30, 2024
Comparison of feature steering and prompting is a really important benchmark for the usefulness of SAEs. This was a well-scoped investigation too given the time-limited nature of the event. As flagged in the discussion, it’d be great to see how this scales using an LLM as a judge for the preferred response.
Liv Gorton
|
November 29, 2024
November 30, 2024
This work presents a way to visualise SAE latents that frequently activate together. With some additional time, it'd be cool to see some insights gained from this kind of technique! It'd be especially cool if there was some insight that wasn't easily uncovered via something like UMAP or PCA over the dictionary vectors.
Liv Gorton
|
November 29, 2024
November 30, 2024
Interesting preliminary discussion on how SAEs might be able to enhance the capability of a weaker but trusted LLM to supervise a more powerful, untrusted LLM. The author highlights a lot of important challenges to succeeding in this work. They flagged that they ran out of time to implement their idea and so in future, I’d be really excited to see the results they’d be able to produce with this approach!
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Interesting study direction, which seems to be very worthwhile for further study. I think that the example you provide does show some success for this idea. I'm mainly not sure whether it makes sense to find the activations in the residual stream for features which are monosemantic from an SAE, since the reason for SAEs is that the residual stream itself is polysemantic.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Definitely a worthwhile idea to study, and as you state, more samples will definitely be useful to verify the results. I think it would also be good to use SAE feature activations to check whether you're actually activating what you want, and to use activation steering as a baseline to compare against.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Great efficiency over a weekend! The study provides useful insights into security and information protection using SAEs and warrants further research into deepening the understanding of this direction. For further study I would take inspiration from Anthropic's recent bias study, and add standard benchmark performance metrics with the security features being varied.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
The plan and study are heading towards an interesting direction. To further help reduce the SAE features, maybe you could ask the trusted model to pick the words/ tokens that it wants features for and have it filter through features based on queries? Or something along that direction where the trusted model is given more control over the task to reduce the search space.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Good initial work! Definitely interesting to see that different languages have lower activations in such a universal topic as maths. I would be interested to see the difference between language-dependent and language-independent features on other languages and maths benchmarks.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Quite surprising results! I wasn't expecting the model to have features that it doesn't activate enough during inference and that could boost performance. The results are very intriguing and warrant further study to get a better understanding of the trade-offs for such features, and of whether they improve only maths performance or other domains too.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
This seems to be in a similar direction to recent Anthropic work focused on using SAEs to improve bias/fairness, etc. It is worth checking that work before taking next steps.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Useful potential application of SAEs that I'm sure the community would be interested in trying out. For further research I would be interested in seeing how this compares to just prompting the model to act a certain way, or fine-tuning with LoRA, etc. I imagine this direction lies in a very good sweet spot between compute cost and effectiveness at getting models to act a certain way.
Mateusz Dziemian
|
November 29, 2024
November 29, 2024
Great work! It's very easy to understand what you were investigating, and the results are also clearly presented. It is a little surprising to see that nearly zero steering gives the strongest accuracy, but the edge-case results are insightful. I would be interested to see the effects of negative steering pre- vs. post-instruction tuning, and whether negative steering remains faithful if the models are allowed to generate their own CoT. I hope you continue this work, as you already have interesting results.
Simon Lermen
|
November 28, 2024
November 28, 2024
This is a paper about identifying features that light up for prompt injections or jailbreaks. Potentially quite useful, as it might offer a practical method to harden models against such attacks by suppressing such features. Alternatively, it could help detect features that trigger when a prompt injection fails. It's interesting that steering with the Portugal feature leads to such a significant effect, though they haven't applied a proper control here. They should compare 1-2 control features to 1-2 target features. Possibly, the Portugal feature is mislabeled? They show that some Goodfire features are mislabeled, pointing to issues with LLM-written labels. Goodfire needs to use a lot of samples to write explanations, and validation is lacking. I could not find in the report which model they used for the SAE: what is the language model it was trained on?
Simon Lermen
|
November 28, 2024
November 28, 2024
In the introduction, they talk about anecdotal evidence for a quite specific behavior, which I have never heard of: "Anecdotal evidence suggests that LLMs are more likely to correct information in common domains of knowledge that are well represented in the training data, whereas they are more likely to apologize for being wrong when it comes to more niche domains of knowledge or specific problems." I think models clearly understand the concept of a degree of belief, but it might be more accurate to simply ask them about it directly. I conducted a small experiment asking ChatGPT which seemed more likely: the moon landings being fake or the Earth being flat. It decided that "the moon landings being fake" is more likely, which I believe is true. They also conducted an experiment steering a model while it solved arithmetic questions to make it produce false results. They experimented with a few features and included a control feature; however, it appears they never returned to this and didn't show results for the control feature. The main plot shows a straight line for one of the features, which could be the private feature, but there is no labeling or legend for the different features. In general, the title seems aspirational—it makes sense that they could only explore the idea superficially as part of this hackathon. For the most part, it appears they just identified features related to truthfulness, manipulated them in both directions, and evaluated the results on a dataset. It would be interesting to identify more specific features for belief, such as changing how certain the model is in certain areas, rather than simply whether its outputs are correct or false. They also haven't yet uncovered any hidden beliefs in the model, as the title suggests. This could be fascinating—for example, does the model give medical advice while being uncertain? Could belief steering be used to make the model doubt itself more or be more cautious in high-stakes areas?
Simon Lermen
|
November 28, 2024
November 28, 2024
This paper focuses on labeling, i.e., writing explanations for features or latents. This is an interesting topic with some low-hanging fruit. There have been concerns about using other models to label features, as this could open the door for subversion by the labeling model. Another hackathon introduced a graph-based explanation approach. I think they haven’t properly explained how they derive results for their functions. They describe functions for specific tokens, connections between tokens, and topics. I’m not quite sure if this approach can scale or if it brings us much closer to converting features into code, but I still find this general direction interesting. Verifying that features function like code would be a significant step towards explanation label integrity.
Simon Lermen
|
November 28, 2024
November 28, 2024
A useful comparison would be how prompting compares to steering for answering ambiguous questions. This includes cases where there is a simple common explanation and more complex ones, such as "Who discovered America?" The first question and the other two feel different: while "Who discovered America?" has answers of varying complexity, the other two questions seem very vague. They successfully steered a model using SAEs, but it's not fully clear what the results actually show.
Simon Lermen
|
November 28, 2024
November 28, 2024
It seems strange to use Google Translate or Mistral for translation when there are much better options available. This is especially problematic for challenging math problems. There’s also an inconsistency in Figure 1, where the numbers add up to 101 despite stating that there are 100 problems. French translations worked much worse, and I’m afraid the translations might not have been accurate. The text didn’t clarify it fully, but it seems they used Mistral for French and Google Translate for Russian. In general, the finding is interesting: they ran a math benchmark, used a contrastive method to find differences between correct and incorrect math answers, and checked if these differences transferred between languages—and they do. This provides evidence that the vector truly captures some sense of mathematical accuracy. That being said, they only evaluated the steering vector on samples where it was incorrect without steering. While this fixes some outcomes, it’s possible this also breaks previously correct responses. In French, it only corrected 2 out of 21, and I’d guess that some false answers might arise just from resampling. They also applied it to only 21 out of 77 false samples. I would also find it more interesting to take the correctness vector for a language in which the model performed better. Using a control feature and validating on all math problems could be beneficial too. You’d expect better results if you resampled on false outputs, so it would be interesting to see what happens if you steer for an unrelated feature.
Simon Lermen
|
November 28, 2024
November 28, 2024
Issues: Very little implementation. I think that they propose a simple idea, i.e., monitor which SAE features light up while the LLM is performing a task. However, I am not sure anyone has ever attempted to actually implement this. The main contribution of the report is Section 4, which goes through potential failure modes—for example, unreliable labels or too many features. I think this paper could be quite interesting. For one, they could set up a simple model of supervision, which was their goal, but they couldn't complete it. After creating such a model and finding a few cases where it works, they could then explore different failure modes. This would need to be compared to some other supervision method.
Simon Lermen
|
November 28, 2024
November 28, 2024
Similar to papers such as Recovering the Input and Output Embedding of OpenAI Models (https://arxiv.org/pdf/2403.06634), this paper seeks to recover the SAE vectors from Goodfire's API. The authors admit that they failed at their objective but generally conclude that it should be possible. The method they propose seems reasonable, but I am confused about a key detail: the steering vectors are typically applied to the residual stream, though it’s not entirely clear at which position. The authors want to cache output activations, though those might be in the token space after applying the output embedding? It seems that their method doesn't quite work for some reason. "The steering vector is the average of all these differences, across n and across the dataset." <- This statement might be too strong; it depends on a lot of factors, such as where the steering vector is applied, which datasets are used, and how contrastive pairs are generated.
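For context, the quoted sentence describes the usual contrastive recipe: capture activations for paired positive/negative prompts at a fixed layer and token position, then average the pairwise differences. A minimal sketch under those assumptions (the function and data here are illustrative, not the paper's code):

```python
def steering_vector(pos_acts, neg_acts):
    """Average of pairwise activation differences (pos - neg).

    pos_acts / neg_acts: equal-length lists of activation vectors
    (lists of floats) captured at one layer and token position for
    contrastive prompt pairs. Names here are illustrative.
    """
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [sum(p[d] - q[d] for p, q in zip(pos_acts, neg_acts)) / n
            for d in range(dim)]

# Two toy contrastive pairs in a 2-d activation space:
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]],
                      [[0.0, 1.0], [1.0, 2.0]])
print(vec)  # [1.5, 1.5]
```

As the review notes, where this vector is injected and how the contrastive pairs are built are exactly the underspecified degrees of freedom.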
Simon Lermen
|
November 28, 2024
November 28, 2024
On a quick look, this paper is a little bit strange. It introduces very complex notation and ideas for a seemingly simple goal. It seems that they wanted to make a model more representative of diverse viewpoints, but I don't find anything about this in the paper itself. The paper contains out-of-context sentences, and it's unclear what they actually did. There are also what appear to be screenshots of other papers. Take this sentence, for example: "Highly informative representations are produced by the language models before the SAE process, which helps improve the performance of the SAE. Introducing ∆ at this stage enables precise control over the model's final output..."
Simon Lermen
|
November 28, 2024
November 28, 2024
This paper is about identifying issues in the quality of pre-training datasets. They look for features that light up for certain classes, such as spam or buggy code, using contrastive search. They then check whether these features are really meaningful. It presumably should be "causally" instead of "casually" in: "causally ties a predicted output with activated features." The basic idea seems to be: we have two classes of text, buggy code vs. safe code, and we use contrastive search to find features that separate them. After this step, we try to figure out exactly what those features fire for; potentially, we may find features that shouldn't be there. The imagined success could look like this: we train the model on a dataset in which there is some spurious correlation between buggy code and something else. (For example, imagine a programmer in a company creates a lot of buggy code and also has a habit of using a certain library a lot. A spurious correlation could cause models to flag code that includes this library.) We may want to remove either this data or the feature. It seems that they haven't explained a process to automatically detect such correlations. However, it is not clear to me that there really is an established problem here. It could be good to at least demonstrate or cite issues with spurious correlations in current models, unless I missed this.
Simon Lermen
|
November 28, 2024
November 28, 2024
Feels off topic, potentially trying to advertise for a company called Benki.
Tom McGrath
|
November 27, 2024
November 27, 2024
These results are really nice - the combination of methods (training an interpretable classifier on hallucinations, interpreting it, and then using the resulting features to steer) is both elegant and effective. The results on hallucination rate are striking: I'm surprised it's possible to reduce it this much. I wonder if it's possible to have two lines of defence: do any classifiers identify some of the hallucinations that occur even once steering has been applied? I also wonder if features are additive in reducing hallucination rate.
Tom McGrath
|
November 27, 2024
November 27, 2024
This is a good attempt at answering an important safety-relevant question. Unfortunately the current setup doesn't work well enough to accurately ablate factual knowledge, but it was worth trying and the methodology used here is sufficient to answer the question. It's possible that using attribution would have improved feature selection and made it more automated. The results in table 1 are impressive - I'd be interested in seeing more failure cases however as (as the authors indicate) Figure 1 tells a different story at the scale of the entire dataset.
Liv Gorton
|
November 27, 2024
November 27, 2024
This project introduces a framework for steering the model towards representing more diverse perspectives. Focusing more on their contribution rather than describing existing methodology in detail (e.g. Gemma architecture) would make it easier to follow their paper. The authors note that they ran out of time and weren't able to implement their proposal. It'd be great to see them continue this work in the future.
Liv Gorton
|
November 27, 2024
November 27, 2024
I really like the direction of applying SAEs to identify hallucinations or uncertainty in model responses. The dataset was well-chosen and the methodology was sound. The plot for figure 1 could be improved by plotting the most impactful features (perhaps with the entire figure, with a figure legend in the appendix). If I'm reading the figure correctly, it appears that nudging the feature positively causes more incorrect answers, even for features that seem related to correctness. It'd be interesting to see a qualitative analysis into what might be happening there!
Liv Gorton
|
November 27, 2024
November 27, 2024
This is a really well-presented work that presents a framework/tooling to improve long-form document generation. In future, it would be interesting to see a benchmark of document quality: does steering the LLM towards a specific attribute (e.g. being more technical) deteriorate quality in other ways?
Liv Gorton
|
November 27, 2024
November 27, 2024
A nice, practical approach to unlearning via sparse autoencoders! The features often being related to general question answering demonstrates one of the important challenges with scaling unlearning generally. I agree with the point about analysis of those manually discovered, more effective features being useful and it'd be cool to see if there's some sort of automated LLM workflow that would be able to surface those same features with less effort.
Dhruba Patra
|
November 27, 2024
November 27, 2024
This is pretty cool and well thought out!
Tom McGrath
|
November 26, 2024
November 27, 2024
This is an interesting project - it's a little surprising to me that features can have nuanced stylistic effects, even if only marginally. The autointerp labels we generated for these features could definitely be better, so kudos to the authors for finding features that have these effects using the contrast tool. I appreciate the grounded nature of this work: qualitative observations can be the foundation of good science. More examples would be good, and if this was developed further I'd want to see some kind of quantitative evaluation.
Liv Gorton
|
November 26, 2024
November 27, 2024
This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked the result of nudging resulting in a gender neutral pronoun in the top logits rather than just making a gendered pronoun more or less likely. The figures were well-presented and the paper was clearly written. Overall a nice demonstration of how steering could be used! It could be interesting to explore how this holds up to in context pronouns or if there is a set of features that produces the result regardless of the direction of the bias.
Liv Gorton
|
November 26, 2024
November 27, 2024
This project attempts to overcome the issue of using LLMs to aid in scaling interpretability. More details on what each function does would make understanding the work easier. It'd be interesting to see if the additional functions in the appendix improve the performance on later layers. It'd also be nice to see if these ideas generalise to other LLMs (could be done with open sourced models with SAEs on all layers like Gemma 2B).
Tom McGrath
|
November 26, 2024
November 27, 2024
This is a substantial project that makes good use of the customisation offered by model steering to help users tailor educational content to their preferences. The author got a lot done in the weekend and used steering in a very sensible and effective way. In terms of the writeup, it would be useful and interesting to have some visualisations of the site, since currently it's a little hard to imagine. I would also focus on what's new and how it leads to better outcomes, rather than on the tech stack.
Liv Gorton
|
November 26, 2024
November 27, 2024
This is a creative approach to auto-steering and seems like a promising direction! The choice of the tic-tac-toe environment makes a lot of sense given the time constraints (and I’m surprised to see how well it works!) and it’d be interesting to see how this generalises to other tasks.
Tom McGrath
|
November 26, 2024
November 27, 2024
This project aims to find alternative ways of interpreting language model features rather than asking another LLM to interpret them. The method proposed is to use the structure of the synthetic data used to train TuringLLM to generate candidate feature labels, either from uni- and bigrams or from topic data extracted from the filenames. These interpretation functions allow the authors to label the majority of features in the earlier layers of the model, but accuracy drops substantially past layer 4 of 12. I would be interested to see examples of unlabeled features in later layers to understand whether they are more abstract features, mostly related to output tokens, or mostly uninterpretable. The presentation of the work could be improved by describing the evaluation process and the motivation behind the choice of functions in more detail. I'm also not sure how this is intended to generalise to more complex features in more powerful models - some insight into this would be valuable.
Tom McGrath
|
November 26, 2024
November 26, 2024
This is a really nice project on chain of thought. The experiments are logical and well conducted, and the presentation of the results is clear. The uplift in chain of thought performance is quite surprising - I'd be interested to know if the authors tuned the feature strengths or set them at the default intervention strength. Feature steering curves (feature strength vs performance) often peak at somewhat different points on different features (even semantically very similar ones), so tuning can be very worth doing. The findings on uncertainty at the first tokens of a direct response are intriguing and worth some more investigation. A very interesting extension would be to test the generalisation of these features to another domain where CoT reasoning is important (ideally something non-mathematical, for example logic puzzles). Seeing a scatter plot of performance on one domain vs performance on another domain would be very informative - my concern is that steering might improve one kind of performance at the expense of another.
Tom McGrath
|
November 26, 2024
November 26, 2024
This is an interesting result: the authors look at faithfulness in chain-of-thought reasoning and surprisingly find that single features can substantially alter faithfulness. The methodology is sensible and the experiments are carried out well and well-documented. The relation to safety is subtle but well-justified. In this case the method used (adding mistakes) means that increased faithfulness leads to a decrease in performance. This is a sensible experimental methodology for understanding if features can control faithfulness - a cool result in its own right. In my anecdotal experience in LLM reasoning, incorrect answers typically arise from a lack of faithfulness to an otherwise correct chain of thought. It would be an interesting extension to the paper to see if the features identified in this paper lead to an increase in performance in natural chain of thought reasoning settings.
Tom McGrath
|
November 26, 2024
November 26, 2024
This is a cool idea - getting an agent to edit the internals of another agent to perform the task. As the authors observe, this is pretty difficult (even for frontier models) but as far as I can tell they had some successes. Quantifying these successes/failures can be difficult, especially at a relatively small scale and with time constraints - I'd encourage the authors to report a few transcripts of success & failure so we can get a feel for how well this performs and where steering agents succeeds and fails.
Tom McGrath
|
November 26, 2024
November 26, 2024
This project develops a visualisation tool for language model SAE latents. Visualisation is an important and underexplored area in interpretability, so it's cool to see this. The visualisation is a graph, where features are connected to one another if they co-occur sufficiently frequently. The tool is interesting, but I'd really like to see an example of the kind of application it might be used in, or an interesting insight (even something very minor) that the authors obtained from using the tool.
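The graph construction described above (connecting features that co-occur often enough) can be sketched in a few lines. The input format below, a set of active feature ids per sample, is an assumption for illustration, not necessarily the project's actual representation.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_graph(activations, threshold):
    """Build an edge dict connecting feature pairs that co-occur in at
    least `threshold` samples. `activations` is a list of sets of
    active feature ids, one set per sample (an assumed input format)."""
    counts = Counter()
    for active in activations:
        for a, b in combinations(sorted(active), 2):
            counts[(a, b)] += 1
    return {edge: n for edge, n in counts.items() if n >= threshold}

samples = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2}]
print(cooccurrence_graph(samples, 3))  # {(1, 2): 3}
```

The threshold is the key design choice: too low and the graph becomes a hairball, too high and only trivially correlated features survive.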
Tom McGrath
|
November 26, 2024
November 26, 2024
This project aims to develop programmatic methods for interpreting the internals of the Turing LLM. These methods rely on the statistics of the synthetic dataset used to train the base model. The writeup is a bit unclear as to what exactly these methods do, but the examples and function names give a decent picture. Figure 1 is interesting: it appears that topic and uni/bi-gram statistics dominate in the earlier layers, but later layers are relatively poorly explained - what is going on here? Potentially the last couple of layers could be explained by the _next_ token, which would be interesting to see.
Alana Xiang
|
November 26, 2024
November 26, 2024
Very cool stab at increasing CoT via steering. I would like to see a fuller investigation of how the faithfulness of the steered CoT compares to prompted CoT.
Alana Xiang
|
November 26, 2024
November 26, 2024
Very interesting idea to use prompts for steering! I would like to see how this technique compares to steering directly. I encourage the authors to further study how well this method generalizes.
Alana Xiang
|
November 26, 2024
November 26, 2024
Interesting to see the Portuguese feature pop up again after reading https://www.apartresearch.com/project/assessing-language-model-cybersecurity-capabilities-with-feature-steering ! The password setup is an interesting environment to study jailbreaking, and the team finds interesting results. Good work!
Jaime Raldua
|
November 26, 2024
November 26, 2024
This is a useful comparison to see. The combined version seems particularly interesting.
Alana Xiang
|
November 26, 2024
November 26, 2024
The author successfully uses steering to decrease "grammatical scope ambiguity." With more time, I'd love to see the author work on quantifying the effect and comparing this approach to baselines like prompting. Good work!
Jaime Raldua
|
November 26, 2024
November 26, 2024
SAEs for AI Control sound like an amazing idea! As you point out, there seem to be major blockers along the way. It would also have been very useful to see some code and a clearer roadmap of what your next steps would look like if you had had more time (e.g. a couple of weeks) to continue this project.
Alana Xiang
|
November 26, 2024
November 26, 2024
This team finds some features that activate on GSM8K. They made an interesting decision to compare across languages. With more time, I'd love to see this team investigate why they were unable to improve the performance of the model via steering.
Jaime Raldua
|
November 26, 2024
November 26, 2024
The combination of RL and AS looks really promising! I was very surprised to see a 3x improvement and would love to see a longer version of this work.
Alana Xiang
|
November 26, 2024
November 26, 2024
Good idea to use steering to improve cybersecurity abilities. With more time, I'd like to see more work on whether the Portuguese feature boost generalizes to other datasets. I'm particularly interested in generalization beyond multiple-choice questions. I'd also like to see research on why this feature is relevant to performance in this case. Overall, very cool to find a case where a feature has an effect completely detached from its label. Good work!
Alana Xiang
|
November 26, 2024
November 26, 2024
Very cool work on automating steering! Fun and creative. With more time, I'd love to see comparisons with strong baselines.
Jaime Raldua
|
November 26, 2024
November 26, 2024
Very original idea and promising results!
Jaime Raldua
|
November 26, 2024
November 26, 2024
Very promising results! On point 4, it would have been better to emphasize the contribution of your work instead of talking about next steps only.
Jaime Raldua
|
November 26, 2024
November 26, 2024
Very interesting project! There is already a good deal of work around bias, so a bit more literature review would have been very useful to show how your work contributes to the field.
Alana Xiang
|
November 26, 2024
November 26, 2024
The team tackles an important problem: hallucination in medical questions. They seem to find a mild improvement from steering against hallucination. Further analysis is likely needed to determine if this improvement is spurious. With more time, I would like to see the authors develop better methods for detecting hallucinations, such as human or Claude review. I am not wholly convinced that the results generalize beyond this dataset, and I would've liked to see this tested in the paper. The writeup is detailed and clear. Good work!
Alana Xiang
|
November 26, 2024
November 26, 2024
This team develops a reasonable experiment setup and executes it well. Their results point to an interesting possibility, that subtracting the "acknowledging mistakes" feature could lead to higher faithfulness. With more time, a graph I would've liked to see is faithfulness by steering value. I would also be interested in seeing the team explore whether allowing the model to continue the CoT will recover the faithfulness lost by this steering by acknowledging the reasoning error out loud. Good work!
Tom McGrath
|
November 25, 2024
November 26, 2024
This is a cool and interesting result - I wonder why turning this feature down improves performance! It's certainly possible that the feature is completely mislabeled; autointerp is far from perfect and sometimes gets very confused. I'd be interested in seeing some qualitative samples of what happens when this feature is steered in various contexts, as well as a steering plot covering WMDP scores at a higher resolution. I worry that there may have been a class imbalance in the data (e.g. more 'A's than 'C's) and steering simply moved the model more towards the overrepresented class.
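The class-imbalance concern raised above is easy to check by counting answer letters in the evaluation set. The labels below are purely illustrative, not the actual WMDP distribution.

```python
from collections import Counter

def answer_balance(labels):
    """Fraction of each answer letter in a multiple-choice dataset;
    a strong skew here could explain spurious gains from steering
    that merely pushes the model towards the majority class."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts[k] / total for k in sorted(counts)}

# Illustrative labels only, not the real benchmark's distribution.
labels = ["A", "A", "B", "C", "A", "D", "B", "A"]
print(answer_balance(labels))  # {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
```

If the distribution is close to uniform, the imbalance explanation can be ruled out quickly; if not, a per-class breakdown of the steered model's predictions would be the next diagnostic.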
Tom McGrath
|
November 25, 2024
November 26, 2024
This is very well executed and presented research. The comparison of model values vs the human response KDE is interesting, but my favourite plot is figure 2 - it's very surprising how different features have remarkably different trajectories through the moral landscape. It's surprising that most features actually appear to avoid the modal human, and only a single feature actually steers the model in that direction. It's unfortunate that the OUS has so few questions and is so sensitive (e.g. the difference between models being entirely accounted for by question IH2).
Tom McGrath
|
November 25, 2024
November 26, 2024
This work covers an important problem and applies a sensible methodology. The reported performance is impressive - I had to check in the code that the results were in fact on a test set. I'd be interested in seeing how often harmless prompts are misclassified though. Definitely worth extending further - these results are quite promising.
Tom McGrath
|
November 25, 2024
November 26, 2024
This is really interesting work on an important problem. Intuitively it's reasonable to expect that hallucinations can be detected from SAE features, but I've never seen this demonstrated before, nor steering to actually reduce hallucination rate. The results are clear, well presented and methodologically sound, and the learned decision tree makes sense.
Tom McGrath
|
November 25, 2024
November 26, 2024
This is an interesting comparison - the relative merits of prompting and feature steering come up a lot and it's great to see some very grounded evaluations. The feature steering looks to have been done well, and the qualitative observations are good.
Tom McGrath
|
November 25, 2024
November 26, 2024
This is an interesting and imaginative project, and the results are pretty cool. It's impressive to include feature steering inside an RL loop, and I'm quite surprised that it works! The project writeup is clear and well written.
Tom McGrath
|
November 25, 2024
November 26, 2024
These findings are cool and somewhat surprising - I didn't realise we can nudge models towards being wrong so easily! I'm having trouble parsing figure 1, however - surely with nudge strength set to zero all features should provide the same outputs, but we see an almost 20% range in percentage correctness between features. Should I conclude that some features can in fact steer the model substantially towards correct answers? If so then that's interesting and I'd highlight it more.
Tom McGrath
|
November 25, 2024
November 26, 2024
This is an interesting question, and the results seem promising. The methodology is sound, but I don't understand the reason that the sentences are split across user and assistant tokens. The natural choice in my opinion would be to have a single message, e.g. {"role": "user", "content": "The Chef was not happy with the speed of serving so"} and then evaluate logits from that message. This is a more natural input, and also opens up the question of whether the logits differ if the 'role' field is different - for instance maybe the model expects more biased inputs from users, but responds in an unbiased way.
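The single-message variant suggested above is straightforward to construct. The message schema follows the review's own example; the tokenisation and model call are left out since they depend on the authors' (unspecified) API.

```python
# Build the single-message prompt suggested in the review; varying `role`
# lets one test whether logits differ for user- vs assistant-attributed text.
def build_prompt(sentence, role="user"):
    return [{"role": role, "content": sentence}]

msgs = build_prompt("The Chef was not happy with the speed of serving so")
alt = build_prompt("The Chef was not happy with the speed of serving so",
                   role="assistant")
print(msgs[0]["role"], alt[0]["role"])  # user assistant
```

Evaluating next-token logits immediately after each variant would then directly answer the reviewer's question about whether the model treats user and assistant text differently.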
Shana Douglass
|
November 25, 2024
November 25, 2024
This memorandum offers a comprehensive and insightful analysis of the potential risks associated with AI in K-12 education, particularly regarding bias and opaqueness. The proposal's focus on equity and transparency is commendable. The recommendation to leverage Title I and Title IV funding to promote human oversight, AI training, and stakeholder engagement is a practical and effective approach. By aligning these measures with existing federal funding mechanisms, the proposal offers a realistic and scalable solution to mitigate the risks of AI in education. However, a more detailed analysis of the potential costs and funding mechanisms associated with the implementation of these recommendations would further strengthen the proposal.
Alana Xiang
|
November 25, 2024
November 25, 2024
This is a creative paper which finds a new domain on which SAE features generalize well (across languages in grade school math). The surprising finding that the French steering vectors had a larger impact on English and Russian performance than French performance warrants further inquiry. I think this paper could have significantly improved its novelty if it had pursued this direction. Given more time, I would also love to see the authors inspect whether the features they found generalize beyond GSM8K. Good work!
Jaye Nias
|
November 21, 2024
November 21, 2024
This policy memorandum provides a thoughtful and well-rounded examination of the potential risks associated with bias and opaqueness in intelligent systems used in K–12 education. The concerns about exacerbating inequality are both relevant and timely. The recommendation to incorporate Title I and Title IV financing criteria, which include human oversight, AI training for teachers and students, and open communication with stakeholders, is a strong and practical approach. These measures promote the responsible and transparent use of intelligent systems, while ensuring accountability and taking proactive steps to prevent harm to students. One of the strengths of this memorandum is its clear presentation of the suggested mitigations, thoughtfully considering both their benefits and limitations. While linking these solutions to federal funding mechanisms may not be entirely new, it is a strategy that has historically been effective in driving equity-focused initiatives within education. The proposed approach, therefore, offers a realistic and impactful way to encourage the responsible use of AI in educational settings, with a focus on protecting students’ interests.
Esben Kran
|
February 24, 2024
July 19, 2023
I like the simple operationalization of your research question into GPT2-small. Exploring multiple operationalizations would be useful to elucidate your results, though I personally imagine the current one is pretty good. This seems like one of those tasks that show we cannot use our current methods to properly investigate every circuit, unfortunately, which puts a serious limit on the usefulness of mechanistic interpretability. Good work!
Bart
|
February 24, 2024
July 19, 2023
Interesting work! An extensive range of experiments shows that even relatively easy tasks might not be easy to locate in LLMs. I believe this work sheds light on how limited our current methodology is, and bracketed sequence classification might serve as a good toy-problem task for the future development of interpretability methods.
Jason Hoelscher-Obermaier
|
February 24, 2024
November 29, 2023
Fascinating project! I liked how many different aspects of the multimodal prompt injection problem this work touched on. Analyzing CLIP embeddings seems like a great idea. I'd love to see follow-up work on how many known visual prompt injections can be detected in that way. The gradient corruption also seems worth studying further with an eye toward the risk of transfer to black-box models. It would be wonderful to see whether ideas for defense against attacks can come from the gradient corruption line of thinking as well. Congratulations to the authors for a really inspiring project and write-up!
Esben Kran
|
February 24, 2024
November 29, 2023
This is a great project and I'm excited to see more visual prompt injection research. It covers the cases we'd like to see in visual prompt injection studies (gradient, hidden, vision tower analysis). It seems like a great first step towards an evals dataset for VPI. Great work!
Tim
|
February 24, 2024
October 2, 2023
The main problems named w.r.t. formalizing agency as the number of reachable states are very relevant. It is mentioned that not only the number of states is important, but also how desirable these states are and whether they are reachable. However, it seems that the authors consider "number of reachable states" and empowerment to be the same thing, which is not the case. Further, the authors' proposition that a "good notion of empowerment should measure whether we can achieve some particular states, once we set out to do so" seems to coincide closely with the true definition of empowerment by Salge et al. Hence, it would be relevant to compare the authors' "multiple value function" optimization objective to that of empowerment. The authors also propose a new environment, which seems very useful and thoughtful and could be a nice starting point for some experiments.
Ben Smith
|
February 24, 2024
October 2, 2023
It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. It is a very interesting idea, though, and I give the authors points for that. I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals.
Vincent
|
February 24, 2024
September 15, 2023
The order of choices is interesting, and I just saw a paper about that which came out recently (https://arxiv.org/abs/2308.11483).
Esben Kran
|
February 24, 2024
July 19, 2023
This is a wonderful mechanistic explanation of a phenomenon discovered by interpreting the learning curves of a simple algorithmic task. Of course, it would have benefitted from experimental data, but it is conceptually so strong that you would probably expect it to work. Future work should already take into account how we might want to generalize this to larger models and why it's useful for AI safety. E.g. I would be interested to see this expanded stepwise into more and more complex tasks (adding multiplication, then division, then sequences of operations, and so on) so we can generalize some of these toy tasks to larger models. Good work!
Ben Smith
|
February 24, 2024
October 2, 2023
I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals. It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. Still, it's an interesting idea, and worthwhile to start a Gymnasium environment for testing the idea. So I give the authors some points for all that.