Forum | Apart Research

Omniscient Narrative Agent

December 13, 2024

October 3, 2024

The Concordia Contest: Advancing the Cooperative Intelligence of Language Model Agents

. Placed

Seemingly Human: Dark Patterns in ChatGPT

February 24, 2024

February 12, 2024

MASec

. Placed

Model Cards for AI Algorithm Governance

February 24, 2024

January 7, 2024

Governance

. Placed

Detecting Implicit Gaming through Retrospective Evaluation Sets

February 24, 2024

November 27, 2023

Evaluations

. Placed

EscalAtion: Assessing Multi-Agent Risks in Military Contexts

September 12, 2024

October 2, 2023

Multi-agent

. Placed

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

February 24, 2024

September 25, 2023

Agency

. Placed

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

February 24, 2024

September 24, 2023

Agency

. Placed

Evaluating Myopia in Large Language Models

February 24, 2024

September 10, 2023

Agency

. Placed

Data Taxation

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Relating induction heads in Transformers to temporal context model in human free recall

February 24, 2024

July 17, 2023

Interpretability

. Placed

Exploring the Robustness of Model-Graded Evaluations of Language Models

February 24, 2024

July 2, 2023

Safety Benchmarks

. Placed

Solving the CNN Mech Int Challenge

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

We Discovered An Neuron

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

February 24, 2024

December 19, 2022

AI Testing

. Placed

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Agreeableness vs. Truthfulness

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

December 3, 2024

Reprogramming AI Models Hackathon

. Placed

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

November 27, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

Robust Machine Unlearning for Dangerous Capabilities

November 7, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Diamonds are Not All You Need

November 10, 2024

Agent Security Hackathon

. Placed

DarkForest - Defending the Authentic and Humane Web

September 5, 2024

Hackathon for Technical AI Safety Startups

. Placed

Speculative Consequences of A.I. Misuse

November 10, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

AI Alignment Knowledge Graph

November 10, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Sandbag Detection through Model Degradation

July 8, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

June 14, 2024

Computational Mechanics Hackathon!

. Placed

rAInboltBench: Benchmarking user location inference through single images

May 31, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Beyond Refusal: Scrubbing Hazards from Open-Source Models

May 8, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Very Cooperative Agent

November 10, 2024

October 6, 2024

The Concordia Contest: Advancing the Cooperative Intelligence of Language Model Agents

. Placed

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools

February 24, 2024

February 12, 2024

MASec

. Placed

Obsolescent Souls

February 24, 2024

January 7, 2024

Governance

. Placed

Visual Prompt Injection Detection

February 24, 2024

November 27, 2023

Evaluations

. Placed

Jailbreaking the Overseer

February 24, 2024

October 1, 2023

Multi-agent

. Placed

Discovering Agency Features as Latent Space Directions in LLMs via SVD

February 24, 2024

September 25, 2023

Agency

. Placed

Agency as Shannon information. Unveiling limitations and common misconceptions

February 24, 2024

September 24, 2023

Agency

. Placed

Against Agency

February 24, 2024

September 21, 2023

Agency

. Placed

The AI governance gaps in developing countries

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Who cares about brackets?

February 24, 2024

July 17, 2023

Interpretability

. Placed

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety

February 24, 2024

July 4, 2023

Safety Benchmarks

. Placed

Dropout Incentivizes Privileged Bases

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Player Of Games

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Investigating Training Dynamics via Token Loss Trajectories

February 24, 2024

December 19, 2022

AI Testing

. Placed

Backup Transformer Heads are Robust to Ablation Distribution

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

AI: My Partner in Crime

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

December 3, 2024

Reprogramming AI Models Hackathon

. Placed

SafeBites

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

November 10, 2024

Agent Security Hackathon

. Placed

Simulation Operators: The Next Level of the Annotation Business

September 5, 2024

Hackathon for Technical AI Safety Startups

. Placed

CoPirate

November 10, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Grant Application Simulator

November 10, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering

August 29, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

RNNs represent belief state geometry in hidden state

June 14, 2024

Computational Mechanics Hackathon!

. Placed

Cybersecurity Persistence Benchmark

May 31, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

May 8, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Iterated contract negotiation

February 24, 2024

February 11, 2024

MASec

. Placed

2030 - The CEO Dilemma

February 24, 2024

January 8, 2024

Governance

. Placed

Cross-Lingual Generalizability of the SADDER Benchmark

February 24, 2024

November 27, 2023

Evaluations

. Placed

LLMs With Knowledge of Jailbreaks Will Use Them

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Uncertainty about value naturally leads to empowerment

February 24, 2024

September 26, 2023

Agency

. Placed

Comparing truthful reporting, intent alignment, agency preservation and value identification

February 24, 2024

September 24, 2023

Agency

. Placed

Building brakes for a speeding car: A global coordination proposal for AI safety

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Embedding and Transformer Synthesis

February 24, 2024

July 16, 2023

Interpretability

. Placed

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

February 24, 2024

July 2, 2023

Safety Benchmarks

. Placed

OthelloScope

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Reverse Word Wizards: Pitting Language Models Against the Art of Reversal

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Automated Identification of Potential Feature Neurons

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

February 24, 2024

December 19, 2022

AI Testing

. Placed

Model editing hazards at the example of ROME

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

All Fish are Trees

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

Utilitarian Decision-Making in Models - Evaluation and Steering

December 3, 2024

Reprogramming AI Models Hackathon

. Placed

Sue-Per GPT: Legal RAG Assistant

November 7, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Cop N' Shop

November 10, 2024

Agent Security Hackathon

. Placed

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

September 5, 2024

Hackathon for Technical AI Safety Startups

. Placed

LLM Research Collaboration Recommender

November 10, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

July 8, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Investigating the Effect of Model Capacity Constraints on Belief State Representations

June 14, 2024

Computational Mechanics Hackathon!

. Placed

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

June 14, 2024

Computational Mechanics Hackathon!

. Placed

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

May 31, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Artificial Advocates: Biasing Democratic Feedback using AI

May 8, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Towards High-Quality Model-Written Evaluations

February 24, 2024

November 27, 2023

Evaluations

. Placed

Second-order Jailbreaks

February 24, 2024

October 2, 2023

Multi-agent

. Placed

ILLUSION OF CONTROL

February 24, 2024

September 25, 2023

Agency

. Placed

Agency, value and empowerment.

February 24, 2024

September 24, 2023

Agency

. Placed

Premortem AI

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Interpreting Planning in Transformers

February 24, 2024

July 17, 2023

Interpretability

. Placed

Exploitation of LLMs to Elicit Misaligned Outputs

February 24, 2024

July 2, 2023

Safety Benchmarks

. Placed

Improving TransformerLens Head Detector

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Soft Prompts are a Convex Set

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Trojan detection and implementation on transformers

February 24, 2024

December 19, 2022

AI Testing

. Placed

Probing Conceptual Knowledge on Solved Games

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Reducing hindsight neglect with "Let's think step by step"

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

Steering Swiftly to Safety with Sparse Autoencoders

December 3, 2024

Reprogramming AI Models Hackathon

. Placed

Understanding Incentives To Build Uninterruptible Agentic AI Systems

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

OCAP Agents

November 10, 2024

Agent Security Hackathon

. Placed

Identity System for AIs

September 5, 2024

Hackathon for Technical AI Safety Startups

. Placed

Phish Tycoon: phishing using voice cloning

November 10, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

PurePrompt - An easy tool for prompt robustness and eval augmentation

November 10, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

July 8, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Benchmarking Dark Patterns in LLMs

May 31, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Gradient Descent Over Interpolated Activation Patches for Circuit Discovery

February 24, 2024

January 22, 2024

ARENA HACK

. Placed

AttentionData

February 24, 2024

January 22, 2024

ARENA HACK

. Placed

Trust and Power in the Age of AI

February 24, 2024

January 8, 2024

Governance

. Placed

Example Documentation of Implementation Guidance for the EU AI Act: a draft proposal to address challenges raised by business and civil society actors

February 24, 2024

January 8, 2024

Governance

. Placed

Boxing AIs - The power of checklists

February 24, 2024

January 8, 2024

Governance

. Placed

AI Safeguard: Navigating Compliance and Risk in the Era of the EU AI Act

February 24, 2024

January 8, 2024

Governance

. Placed

The EU AI Act: Caution against a potential "Ultron"

February 24, 2024

January 7, 2024

Governance

. Placed

AI Safety risks: An Infographic Analysis

February 24, 2024

January 7, 2024

Governance

. Placed

Multifaceted Benchmarking

February 24, 2024

November 27, 2023

Evaluations

. Placed

The Firemaker

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Risk assessment through a small-scale simulation of a chemical laboratory.

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Missing Social Instincts in LLMs

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Jailbreaking is Incentivized in LLM-LLM Interactions

February 24, 2024

October 2, 2023

Multi-agent

. Placed

LLM Collectives in Multi-Round Interactions: Truth or Deception?

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Exploring Failures: Assessing Large Language Model in General Sum Games with Imperfect Information Against Human Norms

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Exploring multi-agent interactions in the dollar auction

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Do many interacting LLMs perform well in the N-Player Prisoner’s Dilemma Game?

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Emergent Deception from Semi-Cooperative Negotiations

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Cooperative AI is a Double Edged Sword

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments

February 24, 2024

October 2, 2023

Multi-agent

. Placed

Can collusion between advanced AI Agents remain perfectly undetectable?

February 24, 2024

October 2, 2023

Multi-agent

. Placed

The artificial wolves of Millers Hollow

February 24, 2024

October 1, 2023

Multi-agent

. Placed

LLM agent topic of conversation can be manipulated by external LLM agent

February 24, 2024

October 1, 2023

Multi-agent

. Placed

Escalation and stubbornness caused by hallucination

February 24, 2024

October 1, 2023

Multi-agent

. Placed

Can Malicious Agents Corrupt the System?

February 24, 2024

October 1, 2023

Multi-agent

. Placed

AI Defect in Low Payoff Multi-Agent Scenarios

February 24, 2024

October 1, 2023

Multi-agent

. Placed

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

February 24, 2024

September 24, 2023

Multi-agent

. Placed

Goal Misgeneralization

February 24, 2024

September 7, 2023

Interpretability

. Placed

SADDER - Situational Awareness Dataset for Detecting Extreme Risks

February 24, 2024

August 21, 2023

Evals

. Placed

Impact of “fear of shutoff” on chatbot advice regarding illegal behavior

February 24, 2024

August 21, 2023

Evals

. Placed

GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities

February 24, 2024

August 21, 2023

Evals

. Placed

Can Large Language Models Solve Security Challenges?

February 24, 2024

August 21, 2023

Evals

. Placed

Alignment and capability of GPT4 in small languages

February 24, 2024

August 21, 2023

Evals

. Placed

Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text

February 24, 2024

August 20, 2023

Evals

. Placed

Preliminary measures of faithfulness in least-to-most prompting

February 24, 2024

August 20, 2023

Evals

. Placed

Residual Stream Verification via California Housing Prices Experiment

February 24, 2024

July 25, 2023

Interpretability

. Placed

Problem 9.60 - Dimensionality reduction

February 24, 2024

July 25, 2023

Interpretability

. Placed

Whose Morals Should AI Have

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Why AI Governance in Australia Could Slow Down Capabilities

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Will AI Always Undermine Democracy?

February 27, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Values-aligned AI through the Lens of Lessig’s Modalities of Regulation

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

What might limit the exponential growth of AI?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Where will AI fit into the democratic system?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Where does AI fit into Democratic System?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

US Military Involvement in AGI Development: Risks and Opportunities for AI Safety

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Towards AI Regulation (?)

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

The Risks of Future AI systems

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

The Marble Puzzle

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

STA

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Scale up AI safety

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Slowing down AI through memetics

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Redirecting Resources Spent on AI Development

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Preventing Artificial Democracy - a framework to assess risks and benefits

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Public Opinion and Its Importance to AI Policy

February 24, 2024

July 21, 2023

EAGx LatAm Epoch AI Hackathon

. Placed

OpenAI, two steps ahead of the future’s oracle

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Policy Recommendations to Incentivize Alignment Research

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Navigating the GPT-6 Deployment Minefield: Obstacles to Delaying Deployment

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Mapping AI applications onto the political process

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Measuring Gender Bias in Text-to-Image Models using Object Detection

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Increased democratic responsiveness: Using AI as a support-tool for Citizen Assemblies?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Hybrid Blockchain Networks for Transparency in Artificial Intelligence

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

FGA Ratings Syste

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

GPT-6 Needs ARC Evals

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Evaluating the effectiveness of an AI ranking system market

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Directional Infringement: AI Risk Classification

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Die or Survive in AI era: Guidance on Education

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

DemocracyGPT

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Conceptualization of the National Artificial Intelligence Regulatory Authority (NAIRA)

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Catastrophic AGI Risk Prediction

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Code of the Heart

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Categories of AI Risks

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Categorizing the Risks of AI: A guide for policy makers

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Can the pin factory defeat the paperclip factory?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Can we open-source a collective decision-making protocol?

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Beyond the Binary: Balancing VR Governance and Opt-Out Autonomy

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

AI Sanity

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

AI Governance

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

A Digest of AI Risk Categories

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

A radical and counterintuitive approach to AI governance

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

A Blueprint for GPT-6

February 24, 2024

July 21, 2023

AI Governance Hackathon

. Placed

Toward a Working Deep Dream for LLM's

February 24, 2024

July 17, 2023

Interpretability

. Placed

One is 1- Analyzing Activations of Numerical Words vs Digits

February 24, 2024

July 17, 2023

Interpretability

. Placed

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

February 24, 2024

July 17, 2023

Interpretability

. Placed

Multimodal Similarity Detection in Transformer Models

February 24, 2024

July 17, 2023

Interpretability

. Placed

Experiments in Superposition

February 24, 2024

July 17, 2023

Interpretability

. Placed

DPO vs PPO comparative analysis

February 24, 2024

July 17, 2023

Interpretability

. Placed

Towards Interpretability of 5 digit addition

February 24, 2024

July 16, 2023

Interpretability

. Placed

Factual recall rarely happens in attention layer

February 24, 2024

July 16, 2023

Interpretability

. Placed

Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark

February 24, 2024

July 3, 2023

Safety Benchmarks

. Placed

AI & Cyberdefense

February 24, 2024

July 3, 2023

Safety Benchmarks

. Placed

Identifying undesirable conduct when interacting with individuals with psychiatric conditions

February 24, 2024

July 2, 2023

Safety Benchmarks

. Placed

Why Might Negative Name Mover Heads Exist?

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

Understanding truthfulness in large language model heads through interpretability

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

Understanding How Othello GPT Identifies Valid Moves from its Internal World Model

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

Swap Graphs with Attribution Patching

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

modiff

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

ACDC++: Fast automated circuit discovery using attribution patching

February 24, 2024

June 11, 2023

ARENA Interpretability

. Placed

AI Policy Pre-Evaluation Prediction Markets

February 24, 2024

June 4, 2023

Democratic Input

. Placed

Towards Formally Describing Program Traces from Chains of Language Model Calls with Causal Influence Diagrams: A Sketch

February 24, 2024

May 30, 2023

Verification

. Placed

It Ain't Much but it's ONNX Work

February 24, 2024

May 29, 2023

Verification

. Placed

Fuzzing Large Language Models

February 24, 2024

May 29, 2023

Verification

. Placed

Othello Mechint playground

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Exploring OthelloGPT

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Detecting Phase Transitions

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

Algorithmic Explanation: A method for measuring interpretations of neural networks

February 24, 2024

May 10, 2023

Interpretability 2.0

. Placed

AI Impact Assessments

February 24, 2024

May 10, 2023

AI Governance Hackathon

. Placed

AI and Democracy: Balancing Risks and Opportunities to Maintain Meaningful Human Control

February 24, 2024

May 10, 2023

AI Governance Hackathon

. Placed

Simon's Time-Off Newsletter

February 24, 2024

March 7, 2023

. Placed

Risk Defense Initiative

February 24, 2024

March 7, 2023

. Placed

New AI organization brainstorm

February 24, 2024

March 7, 2023

. Placed

Diversity in AI safety

February 24, 2024

March 7, 2023

. Placed

Critique of OpenAI's alignment plan

February 24, 2024

March 7, 2023

. Placed

Catalogue of AI safety

February 24, 2024

March 7, 2023

. Placed

ChatGPT Alignment Talent Search

February 24, 2024

March 7, 2023

. Placed

Authority bias to ChatGPT

February 24, 2024

March 7, 2023

. Placed

Analysis of upcoming AGI companies

February 24, 2024

March 7, 2023

. Placed

AI Safety Talent Pool Identification

February 24, 2024

March 7, 2023

. Placed

AI Safety Subproblems for Software Engineering Researchers

February 24, 2024

March 7, 2023

. Placed

AI Safety unionization for bottom-up governance

February 24, 2024

March 7, 2023

. Placed

Sustainable Fashion Brand Language Learning Model 1

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Physics Guided Deep Learning Interpretation

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Can you keep a secret?

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Automated Model Oversight Using CoTP

February 24, 2024

February 16, 2023

ScaleOversight

. Placed

Trafo Mech Int on the web!

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

TraCR-Supported Mechanistic Interpretability

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

The Start of Investigating a 1-Layer SoLU Model

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

One Attention Head Is All You Need for Sorting Fixed-Length Lists

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Iterative summarization interpretability

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Investigating Agent Behavior In different RL methods

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

In search of linguistic concepts: investigating BERT's context vectors

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Interactive Layerscope

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Distillation by duplication: The importance of layer selection

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

Attention Phrenology: A spatial classification of attention heads

February 24, 2024

January 25, 2023

Mechanistic Interpretability Hackathon

. Placed

This Is Fine(-tuning): A benchmark testing LLMs robustness against bad fine-tuning data

February 24, 2024

December 19, 2022

AI Testing

. Placed

Model Hubris: On the Presumptuousness of Large Language Models

February 24, 2024

December 19, 2022

AI Testing

. Placed

LLM benchmarking through specifically-aligned feedback

February 24, 2024

December 19, 2022

AI Testing

. Placed

Formal Verification for Paren-balance checking

February 24, 2024

December 19, 2022

AI Testing

. Placed

Evaluating Critical Level Of Perturbations Required To Achieve Certain Fail Rate

February 24, 2024

December 19, 2022

AI Testing

. Placed

Demonstrating LLM agentic capabilities

February 24, 2024

December 19, 2022

AI Testing

. Placed

War is 15% conflic, 15% DragonMagazine

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Trying to make GPT2 dream

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Top-Down Interpretability Through Eigenspectra

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Sparsity Lens

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Regularly Oversimplifying Neural Networks

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Optimising image patches to change RL-agent behaviour

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Observing and Validating Induction heads in SOLU-8l-old

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Natural language descriptions for natural language directions

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Neurons and Attention Heads that Look for Sentence Structure in GPT2

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Mechanisms of Causal Reasoning

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Interpretability at a glance

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Interpreting Catastrophic Failure Modes in OpenAI’s Whisper

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

How to find the minimum of a list - Transformer Edition

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Finding unusual neuron sets by activation vector distance

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Caught Red-Bandit

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

An Intuitive Logic for Understanding Autoregressive Language Models

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Alignment Jam: Gradient-based Interpretability of Quantum-inspired neural networks

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Algorithmic bit-wise boolean task on a transformer

February 24, 2024

November 15, 2022

Interpretability Hackathon

. Placed

Wording influences truthfulness

February 27, 2024

October 18, 2022

Language Model Hackathon

. Placed

Simulating an Alien

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

Reasoning with Chain of Thought

February 24, 2024

October 18, 2022

Language Model Hackathon

. Placed

AI Risk Management Assurance Network (AIRMAN)

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Securing AGI Deployment and Mitigating Safety Risks

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Cite2Root

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

VaultX - AI-Driven Middleware for Real-Time PII Detection and Data Security

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Prompt+question Shield

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Safe.ai: AI Agent Risk Assessment Platform

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

.ALign File

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

LLM-prompt-optimiser based SAAS platform for evaluations

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Towards an Agent Marketplace for Alignment Research (AMAR)

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

HITL For High Risk AI Domains

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

AI Safety Evaluation – Benchmarking Framework

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Neural Seal

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Enhancing human intelligence with neurofeedback

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

January 19, 2025

AI Safety Entrepreneurship Hackathon

. Placed

Modernizing DC’s Emergency Communications

November 26, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Bias Mitigation in LLM by Steering Features

November 25, 2024

Reprogramming AI Models Hackathon

. Placed

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

December 6, 2024

Reprogramming AI Models Hackathon

. Placed

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Analyzing Dataset Bias with SAEs

November 25, 2024

Reprogramming AI Models Hackathon

. Placed

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Unveiling Latent Beliefs Using Sparse Autoencoders

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

November 24, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Assessing Language Model Cybersecurity Capabilities with Feature Steering

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Edufire - Personalized Education Platform Using LLM Steering

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Investigate arithmetic features in Multi-lingual LLMs

November 25, 2024

Reprogramming AI Models Hackathon

. Placed

Tentative proposal for AI control with weak supervisors through Mechanistic Inspection

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

BBLLM

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Investigating Feature Effects on Manipulation Susceptibility

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Let LLM Agents Perform LLM Surgery

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Feature Tuning versus Prompting for Ambiguous Questions

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Auto Prompt Injection

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Feature based unlearning

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Recovering Goodfire's SAE feature vectors from their API

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

Encouraging Chain-of-Thought Reasoning

November 24, 2024

Reprogramming AI Models Hackathon

. Placed

User Transparency Within AI

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

Community-First: A Rights-Based Framework for AI Governance in India's Welfare Systems

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

National Data Privacy and Governance Act

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

Grandfather Paradox in AI – Bias Mitigation & Ethical AI

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

Glia for Healthcare Organisations

December 6, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

November 20, 2024

Howard University AI Safety Summit & Policy Hackathon

. Placed

applai

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

AI Parliament

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Next-Gen AI-Enhanced Epidemic Intelligence

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Glia

December 6, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

AI and Public Health: TSA Pre Health Check

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Hero Journey: Personalized Health Interventions for the Incarcerated

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

EcoNavix

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Towards a Unified Framework for Cybersecurity and AI Safety: Recommendations for Secure Development of Large Language Models

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Enhancing Human Verification Systems to Address AI Agent Circumvention and Attributability Concerns

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Reprocessing Nuclear Waste From Small Modular Reactors (SMRs)

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Politicians on AI Safety

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Policy Framework for Sustainable AI: Repurposing Waste Heat from Data Centers in the USA

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Predictive Analytics & Imagery for Environmental Monitoring

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Proposal for U.S.-China Technical Cooperation on AI Safety

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Proposal for a Provisional FDA Designation Targeting Biomedical Products Evaluated with Novel Methodologies

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Infectious Disease Outbreak Prediction and Dashboard

October 27, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Pan, your SMART Sustainability Expert

October 29, 2024

AI Policy Hackathon at Johns Hopkins University

. Placed

Cross-model surveillance for emails handling

October 7, 2024

Agent Security Hackathon

. Placed

Inference-Time Agent Security

October 6, 2024

Agent Security Hackathon

. Placed

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

October 6, 2024

Agent Security Hackathon

. Placed

AI Honeypot

October 6, 2024

Agent Security Hackathon

. Placed

AI Agent Capabilities Evolution

October 6, 2024

Agent Security Hackathon

. Placed

An Autonomous Agent for Model Attribution

October 6, 2024

Agent Security Hackathon

. Placed

Using ARC-AGI puzzles as CAPTCHA task

October 6, 2024

Agent Security Hackathon

. Placed

LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies

October 6, 2024

Agent Security Hackathon

. Placed

Interpreting a toy model for finding the maximum element in a list

September 17, 2024

ARENA 4.0 Interpretability Hackathon

. Placed

Finding Circular Features in Gemma 2 2B

September 17, 2024

ARENA 4.0 Interpretability Hackathon

. Placed

nnsight transparent debugging

September 17, 2024

ARENA 4.0 Interpretability Hackathon

. Placed

minTranscoders

September 17, 2024

ARENA 4.0 Interpretability Hackathon

. Placed

Latent Space Clustering and Summarization

September 17, 2024

ARENA 4.0 Interpretability Hackathon

. Placed

Devising Effective Benchmarks

September 1, 2024

Hackathon for Technical AI Safety Startups

. Placed

WELMA: Open-world environments for Language Model agents

September 1, 2024

Hackathon for Technical AI Safety Startups

. Placed

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

September 1, 2024

Hackathon for Technical AI Safety Startups

. Placed

Amplified Wise Simulations for Safe Training and Deployment

September 4, 2024

Hackathon for Technical AI Safety Startups

. Placed

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

September 1, 2024

Hackathon for Technical AI Safety Startups

. Placed

GuardianAI

September 3, 2024

Hackathon for Technical AI Safety Startups

. Placed

Jailbreaking general purpose robots

September 1, 2024

Hackathon for Technical AI Safety Startups

. Placed

Demonstrating LLM Code Injection Via Compromised Agent Tool

August 27, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Misinformational AI-Generated Academic Papers

August 26, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

GrandSlam usecases not technology

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

AI Agents for Personalized Interaction and Behavioral Analysis

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

RedFluence

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

BBC News Impersonator

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Unsolved AI Safety Concepts Explorer

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

AI Research Paper Processor

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Sleeper Agents Detector

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

AdGPT

October 5, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

General Pervasiveness

August 27, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

VerifyStream

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Web App for Interacting with Refusal-Ablated Language Model Agents

August 25, 2024

AI capabilities and risks demo-jam: Creating visceral interactive demonstrations

. Placed

Alignment Research Critiquer

July 29, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Alignment Research Critiquer

July 28, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Data Massager

July 28, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

AI Alignment Toolkit Research Assistant

July 28, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Reflections on using LLMs to read a paper

July 28, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Academic Weapon

July 28, 2024

Research Augmentation Hackathon: Supercharging AI Alignment

. Placed

Can Language Models Sandbag Manipulation?

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Deceptive behavior does not seem to be reducible to a single vector

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Deceptive behavior does not seem to be reducible to a single vector

June 30, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Werewolf Benchmark

July 1, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Detecting Lies of (C)omission

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

July 2, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Detecting Deception with AI Tics 😉

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Eliciting maximally distressing questions for deceptive LLMs

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Evaluating Steering Methods for Deceptive Behavior Control in LLMs

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

An Exploration of Current Theory of Mind Evals

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Sandbagging LLMs using Activation Steering

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Boosting Language Model Honesty with Truthful Suffixes

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Detection of potentially deceptive attitudes using expression style analysis

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

From Sycophancy (not) to Sandbagging

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Gradient-Based Deceptive Trigger Discovery

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Evaluating and inducing steganography in LLMs

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Developing a deception dataset

June 30, 2024

Deception Detection Hackathon: Preventing AI deception

. Placed

Looking forward to posterity: what past information is transferred to the future?

June 3, 2024

Computational Mechanics Hackathon!

. Placed

Belief State Representations in Transformer Models on Nonergodic Data

June 3, 2024

Computational Mechanics Hackathon!

. Placed

Steering Model’s Belief States

June 2, 2024

Computational Mechanics Hackathon!

. Placed

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

June 2, 2024

Computational Mechanics Hackathon!

. Placed

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Benchmark for emergent capabilities in high-risk scenarios

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Evaluating the ability of LLMs to follow rules

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Black box detection of Sleeper Agents

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Manifold Recovery as a Benchmark for Text Embedding Models

May 26, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

Detecting Anthropomorphic Tendencies in Language Models via Conversational Probing

May 28, 2024

AI Security Evaluation Hackathon: Measuring AI Capability

. Placed

THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES

May 6, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

LEGISLaiTOR: A tool for jailbreaking the legislative process

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Silent Curriculum

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Building more democratic institutions with collaboratively constructed debate moderation tools

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Political Bias Vulnerabilities in LLMs

May 26, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

AI Misinformation and Threats to Democratic Rights

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

GPT4 Is Righter Than GPT3.5 Replicating Findings on Political Bias in LLMs for non-Western Democracies

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

AI misinformation threatens the Wisdom of the crowd

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Trustworthy or knave? – scoring politicians with AI in real-time

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

WMDP-Defense: Weapons of Mass Disruption

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

AI Politician

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

A Framework for Centralizing forces in AI

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Digital Diplomacy: Advancing Digital Peace-Building with AI in Africa.

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Investigating detection of election-influencing Sleeper Agents using probes

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

No place is safe - Automated investigation of private communities

May 5, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

May 4, 2024

AI and Democracy Hackathon: Demonstrating the Risks

. Placed

Forum for the Apart Hackathons

Omniscient Narrative Agent

Seemingly Human: Dark Patterns in ChatGPT

Model Cards for AI Algorithm Governance

Detecting Implicit Gaming through Retrospective Evaluation Sets

EscalAtion: Assessing Multi-Agent Risks in Military Contexts

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

Evaluating Myopia in Large Language Models

Data Taxation

Relating induction heads in Transformers to temporal context model in human free recall

Exploring the Robustness of Model-Graded Evaluations of Language Models

Solving the CNN Mech Int Challenge

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

We Discovered An Neuron

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

Agreeableness vs. Truthfulness

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

Robust Machine Unlearning for Dangerous Capabilities

Diamonds are Not All You Need

DarkForest - Defending the Authentic and Humane Web

Speculative Consequences of A.I. Misuse

AI Alignment Knowledge Graph

Sandbag Detection through Model Degradation

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

rAInboltBench : Benchmarking user location inference through single images

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Very Cooperative Agent

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools

Obsolescent Souls

Visual Prompt Injection Detection

Jailbreaking the Overseer

Discovering Agency Features as Latent Space Directions in LLMs via SVD

Agency as Shannon information. Unveiling limitations and common misconceptions

Against Agency

The AI governance gaps in developing countries

Who cares about brackets?

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety

Dropout Incentivizes Privileged Bases

Player Of Games

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

Investigating Training Dynamics via Token Loss Trajectories

Backup Transformer Heads are Robust to Ablation Distribution

AI: My Partner in Crime

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

SafeBites

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

Simulation Operators: The Next Level of the Annotation Business

CoPirate

Grant Application Simulator

Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering

RNNs represent belief state geometry in hidden state

Cybersecurity Persistence Benchmark

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

Iterated contract negotiation

2030 - The CEO Dilemma

Cross-Lingual Generalizability of the SADDER Benchmark

LLMs With Knowledge of Jailbreaks Will Use Them

Uncertainty about value naturally leads to empowerment

Comparing truthful reporting, intent alignment, agency preservation and value identification

Building brakes for a speeding car: A global coordination proposal for AI safety

Embedding and Transformer Synthesis

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

OthelloScope

Reverse Word Wizards: Pitting Language Models Against the Art of Reversal

Automated Identification of Potential Feature Neurons

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

Model editing hazards at the example of ROME

All Fish are Trees

Utilitarian Decision-Making in Models - Evaluation and Steering

Sue-Per GPT: Legal RAG Assistant

Cop N' Shop

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

LLM Research Collaboration Recommender

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

Investigating the Effect of Model Capacity Constraints on Belief State Representations

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions