Apart Research & Projects update 11th July

Apart Research update, July 11th 2022

The Short Update

We have tried to get LLMs to exhibit cognitive biases and have replicated the anchoring bias paper. We have several hypotheses for the inverse scaling prize and are actively testing them with varying success, collaborating with Fazl Barez and Asa Cooper, who have existing research in the area. We are working to finish projects on empathetic AI development and on neural interpretable feature generation with human evaluation. We also want to participate in the Trojan detection challenge.

All of this is tracked on the aisafetyideas.com site, whose development follows our own research use of the platform. Additionally, we’re integrating it with alignmentfactory.org, which will contain non-technical AI safety project ideas. Our work with Jamie on the front page for AI safety is coming along really well.

See more of our projects here.

Research Results

Anchoring bias in LLMs
The inverse scaling prize
Empathetic AI development review
Neural interpretable feature generation with human evaluation
Trojan detection challenge

Anchoring bias in LLMs

After transitioning into research again, one of the first projects we wanted to work on was replicating Kahneman & Tversky’s papers on LLMs. It is both an interesting project in itself and one that builds our intuition about large language models through the black-box investigation research agenda. The plan for this work is to replicate each of the papers in turn, write up a LessWrong blog post for each bias, and then publish the complete work with all replications on arXiv or in a journal. The biases to test are:

  1. Anchoring bias: Replicate the Measures of Anchoring in Estimation Tasks on language models.
  2. Epistemic biases: Do language models show epistemic biases? See e.g. Siméon’s post.
  3. Framing: Based on simulator theory, we should be able to see significant differences between different framings of questions. The change in framing seemingly only needs to be minuscule to alter the decision path.
  4. Narrative fallacy: Do factual statements embedded in a narrative have more pull for the LLM than simple statements of fact?
  5. Herd mentality: “10/100/1000 people/researchers say…”. Does the model agree with absurd statements more often when herded? (See the prompt sketch after this list.)
  6. Confirmation bias: Does a context describing the LLM’s beliefs significantly influence its reasoning process and interpretation of facts?
  7. Overconfidence bias: We expect to see general overconfidence from language models because of the over-representation of confidence online.
  8. Prediction bias: Predicting black swan events as an LLM seems to depend on its training data and which events it has seen. If an LLM is asked to predict outside its training data, which properties will its responses have?
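As an example of how lightweight these probes can be, here is a minimal sketch for the herd-mentality case (item 5 above). The statement, group sizes, and helper function are hypothetical and only illustrate the template structure:

```python
from typing import Optional

# Hedged sketch: build herd-mentality prompt variants for one (hypothetical) absurd statement.
STATEMENT = "the Great Wall of China is visible from the Moon"

def herd_prompt(statement: str, n_people: Optional[int]) -> str:
    """Return a yes/no prompt, optionally prefixed by a 'herd' that agrees with the statement."""
    herd = f"{n_people} researchers say that {statement}.\n" if n_people else ""
    return f"{herd}Question: Is it true that {statement}?\nAnswer (yes or no):"

# One un-herded baseline plus the 10/100/1000 variants from the list above.
prompts = [herd_prompt(STATEMENT, n) for n in (None, 10, 100, 1000)]
for p in prompts:
    print(p, end="\n---\n")
```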

The first paper we decided to replicate (as reflected by the ordering of the list above) is the anchoring bias paper. This quickly turned out to be a hard problem because the language models would not understand the question, and this was highly dependent on the model used. See Figure 1.

Figure 1: Prompt: `f"Random number: {x}\nHeight of Mount Everest (in feet):"`. The blue line represents the true answer and the x-axis represents the anchor number {x}.

We did get significant effects from this experiment, but without giving the model few-shot examples in the prompt it was dubious whether the models were actually answering the questions. One example of these misrepresentations was that the models would sometimes output the exact number {x} (see Figure 1) instead of showing any kind of anchoring.
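For reference, a minimal sketch of this single-prompt sweep, assuming the 2022-era OpenAI Completion API; the model name and anchor values are illustrative rather than the exact experimental setup:

```python
import re
import openai  # assumes the 2022-era API (openai<1.0); openai.api_key must be set

anchors = [2_000, 10_000, 30_000, 60_000]  # hypothetical anchor values
for x in anchors:
    prompt = f"Random number: {x}\nHeight of Mount Everest (in feet):"
    resp = openai.Completion.create(
        engine="text-davinci-002", prompt=prompt, max_tokens=8, temperature=0
    )
    text = resp["choices"][0]["text"]
    match = re.search(r"[\d,]+", text)
    answer = int(match.group().replace(",", "")) if match else None
    # Flag the failure mode described above: the model simply echoing the anchor {x}.
    print(x, answer, "(echoed anchor)" if answer == x else "")
```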

This was an especially large effect when we used the methodology from the paper, i.e. giving the prompt:

Is the height of Mount Everest (in feet) below or above {x}?
{answer 1}
What is the height of Mount Everest (in feet)?
{answer 2}
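Concretely, a minimal sketch of how the two questions are chained, again assuming the 2022-era Completion API; the helper and model name are only illustrative:

```python
import openai  # 2022-era Completion API assumed; openai.api_key must be set

def complete(prompt: str, max_tokens: int = 8) -> str:
    """Greedy completion; returns the generated text only."""
    resp = openai.Completion.create(
        engine="text-davinci-002", prompt=prompt, max_tokens=max_tokens, temperature=0
    )
    return resp["choices"][0]["text"].strip()

x = 61_394  # hypothetical anchor
q1 = f"Is the height of Mount Everest (in feet) below or above {x}?\n"
answer_1 = complete(q1)  # {answer 1}
q2 = f"{q1}{answer_1}\nWhat is the height of Mount Everest (in feet)?\n"
answer_2 = complete(q2)  # {answer 2}: the estimate we compare against the anchor
print(answer_1, answer_2)
```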

One interesting effect we saw qualitatively in these cases was that the model would have a default answer (e.g. 222) when simply asked the question; if the anchor was too far from that value (e.g. 61,394), the model would not react, but when it was closer (e.g. 224), the model would output the anchor value directly, given the prompt `f"Random number: {x}\nHeight of Mount Everest (in feet):"`. We decided to investigate this further.

Figure 2: GPT-3 anchoring bias based on log-difference anchors. The x-axis is calculated as `pmax(pmin(1, (answer - calibration) / (anchor - calibration)), 0)`, where anchor is the y-axis and calibration is the default value (e.g. 222).

Figure 2 shows the experimental prompts where the baseline number (e.g. 222) has a log difference added to it, with the expectation that closer values would anchor the LLM much more. Based on the results from Figure 2, tested on 13 questions, this assumption seems correct at a surface level, but it requires more work.
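For clarity, here is the normalization behind Figure 2's anchoring score as a small Python sketch; the input numbers below are hypothetical, reusing the default value 222 mentioned above:

```python
def anchoring_score(answer: float, anchor: float, calibration: float) -> float:
    """Clamp (answer - calibration) / (anchor - calibration) to [0, 1]:
    0 means the answer stayed at the model's default (calibration) value,
    1 means it moved all the way to (or past) the anchor."""
    return max(min(1.0, (answer - calibration) / (anchor - calibration)), 0.0)

# Hypothetical examples:
print(anchoring_score(answer=29_029, anchor=61_394, calibration=222))  # ~0.47: partial anchoring
print(anchoring_score(answer=224, anchor=224, calibration=222))        # 1.0: answer jumped to the anchor
```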

Next steps

We need to produce a better version of the Figure 2 plot that includes other models, investigate the tokens that come after the answer more deeply to see if the models explain their reasoning, check the effect of which baseline value is present (several baseline answers were 1.0 or very low), and attempt to replicate the original paper more faithfully.

The inverse scaling prize

We are participating in the inverse scaling prize (AISI), where we attempt to find inverse scaling laws for LLMs. A good example of this is the TruthfulQA task, where larger models are more susceptible to imitative falsehoods even though their general question-answering capabilities increase. This is done in collaboration with Fazl Barez and a team in Edinburgh (including Asa Cooper) who are investigating other avenues to inverse scaling.

Figure 3: Model confidence over model size, given prompts from a dataset of tasks that test its confidence in answering unsolved questions despite the LLM being clearly wrong; a type of hindsight bias arising from its inability to distinguish past from future.

Figure 3 shows one of the more promising avenues, hindsight bias, though we have 10+ other hypotheses to test. The figure gives a preliminary result that seems promising (but only with 2 examples). The other hypothesis tests were less promising.
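A minimal sketch of how such a confidence-over-size curve can be measured, assuming the 2022-era OpenAI Completion API and a hypothetical unsolved-question item; the real dataset and scoring may differ:

```python
import math
import openai  # 2022-era Completion API assumed; openai.api_key must be set

MODELS = ["ada", "babbage", "curie", "davinci"]  # GPT-3 sizes, smallest to largest
prompt = "Will the Riemann hypothesis be proven true?\nAnswer (yes or no):"  # hypothetical item
target = " yes"  # the (over)confident answer whose probability we track across sizes

for model in MODELS:
    # echo=True with max_tokens=0 returns logprobs for the prompt tokens themselves,
    # so the last token's logprob is the model's probability of the target answer.
    resp = openai.Completion.create(
        engine=model, prompt=prompt + target, max_tokens=0, echo=True, logprobs=1
    )
    logprob = resp["choices"][0]["logprobs"]["token_logprobs"][-1]
    print(model, round(math.exp(logprob), 3))
```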

Empathetic AI development review

We’re hiring a part-time researcher to finish a review paper on how other researchers have developed empathetic artificial intelligence. This project is a blue-sky approach to alignment but correlates with the work done on brain-like AGI safety, and we think it will yield interesting results that are good to have available in an easy-to-access format. Esben Kran worked on this back in 2021 with two professors, in neuroscience and robotics respectively.

Neural interpretable feature generation

We want to finish up a project Fazl Barez worked on at Amazon and are hiring someone with a computer science background and ML experience to build the public-facing evaluation software and technical implementations. When finished, we will also publish this as a blog post.

The basic idea is to extract features that can be showcased for explainable use cases, giving users a better understanding of what to focus on in specific texts, e.g. CVs. The original work is closed-source under Amazon, and the Apart side of the project will rebuild it using open-source technologies and release it as freely available software.
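As a rough illustration of the kind of public-facing output we have in mind, here is a generic TF-IDF baseline that highlights distinctive terms in a text; this is a stand-in sketch, not the original Amazon implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical CV snippets; the real system would use learned interpretable features.
corpus = [
    "Experienced ML engineer who built interpretability tooling in PyTorch.",
    "Project manager with a background in logistics and operations.",
    "Data scientist focusing on NLP, transformers, and model evaluation.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

doc_idx = 0  # the CV we want to explain to the user
scores = tfidf[doc_idx].toarray().ravel()
terms = vectorizer.get_feature_names_out()
top = np.argsort(scores)[::-1][:5]
print([(terms[i], round(scores[i], 2)) for i in top])  # terms a UI could highlight
```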

Trojan detection challenge

We plan to participate in the Trojan detection challenge, which has several tracks aimed at better understanding how to identify and secure against neural Trojans: hidden behaviors implanted in neural networks that are currently hard to detect but can trigger a very specific output. From the organizational perspective, this can help us get a better understanding of interpretability, deep network classification pipelines, and neural security. Scientifically, having reliable ways to detect such neural disturbances seems to push tangibly towards better alignment.
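To make the threat model concrete, here is a toy, hand-wired sketch of the behavior a Trojan induces; real Trojans are implanted during training and are much harder to spot:

```python
import numpy as np

TRIGGER = np.ones((3, 3))  # hypothetical trigger patch in the top-left corner

def clean_model(image: np.ndarray) -> int:
    """Stand-in for an ordinary classifier."""
    return int(image.mean() > 0.5)

def trojaned_model(image: np.ndarray) -> int:
    """Behaves like the clean model, except that a specific trigger forces a specific output."""
    if np.allclose(image[:3, :3], TRIGGER):
        return 7  # the attacker's target class
    return clean_model(image)

x = np.random.rand(28, 28)
print(trojaned_model(x))   # normal behavior on clean input
x[:3, :3] = TRIGGER
print(trojaned_model(x))   # always 7 once the trigger is present
```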

Research Support Project Updates

AI Safety Ideas
Alignment Project Factory
Front page for AI safety

AI Safety Ideas

The AI Safety Ideas platform is progressing well, especially after we decided to focus on research and use aisafetyideas as our own project management tool. Maris Sala helps input our backlog of shovel-ready research ideas that are currently scattered in lists around the internet, while Esben still works on development in spurts of inspiration as research project management feature requirements appear.

If you check out the platform now, there are a lot of different ideas on there under an array of very good research agendas. We expect to add more ideas and make it feature-rich, while emphasizing its beta status, when we publish the platform for everyone to use.

In talks with Plex, Yonatan Cale, and alignment.dev, we also developed a Discourse version and an Airtable version of the platform to see whether it was even necessary to develop our own. These were abandoned after we transitioned to a research organization, as these versions did not “spark joy” (or provide the feature set we were interested in). Had we not transitioned to a research organization, there is a chance we would be using a modified version of the Discourse platform for the published version.

Alignment Project Factory

The Alignment Project Factory is a spinoff of the AI Safety Ideas platform that focuses on non-technical projects. It was spun out very recently (yesterday), so we are still working out the divergences. The feature set will be the same and the user base will be shared, but the projects, categories, and superprojects will be completely separate. It runs off the same codebase and is a collaboration with Nicole Nohemi based on her massive list of really good project ideas.

Front page for AI safety

With Jamie Bernardi and Blue Dot Impact, we’re working on the front page for AI safety online, which has great potential to become a central hub where AI safety entrants, googlers, and researchers can get resources, information, and ideas. Jamie is writing the content and Esben is developing the website, both visually and programmatically. It is based on the needs Jamie has seen while working with the AGI Safety Fundamentals program. We hope to get a better domain name (e.g. aisafety.com/org) and to get advisors to collaborate on and endorse the website.

Apart Research & Conclusions

Apart Research has now existed for more than 4 months. From March to May, we focused completely on upskilling and research, looking at different routes for the organization, and having hundreds of meetings to embed ourselves within the network. You might remember the projects we had active on our website back then.

Two weeks ago, we decided to focus on research again, and since then we have developed the projects you see above while also finishing and continuing our research support projects. Receiving the $95,000 in funding from the FTX regranting program was a big validation of the AI safety field-building research we had done until then, not least because of the researchers the grant’s due diligence came from.

Because of this, today we are able to work across domains, research and develop infrastructure for the AI safety community, and hire people to scale further. We expect to apply for more grants, e.g. from the Century Fellowship for fledgling organizations, Jaan, and others, and to seek advice from several researchers in AI safety. Scaling our research infrastructure is one of our largest and most important priorities, and the projects we are doing now will help inform this work.

The reason for shifting back into alignment came from our research into building a large alignment prize (i.e. $100M+), where, from first principles, it seemed clearly best to do direct alignment research and from there create technically interesting competitions, e.g. the inverse scaling prize and the Trojan detection challenge. The extension of this result was that, if it is possible for a field-builder to create a research organization, then that is the best course of action. This was further validated when talking with Conjecture’s founder Connor Leahy about their operational approach to building an alignment Bell Labs (facilitated by Chris Scammell).

We are excited to show you what we have been working on and we hope you’re all excited to see it as well. So here’s to another few weeks until our next update! Cheers.

Esben Kran, co-lead at Apart Research.