All Apart hackathon projects
Here you can see all hackathon projects submitted to the Apart hackathons.
EscalAtion: Assessing Multi-Agent Risks in Military Contexts
Gabriel Mukobi, Anka Reuel, Juan-Pablo Rivera, Chandler Smith
September 12, 2024
October 2, 2023
DarkForest - Defending the Authentic and Humane Web
Mustafa Yasir
September 5, 2024
Hackathon for Technical AI Safety Startups
Sandbag Detection through Model Degradation
Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani
July 8, 2024
Deception Detection Hackathon: Preventing AI deception
Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms
Dylan Bowman, Colin Lu
June 14, 2024
Computational Mechanics Hackathon!
rAInboltBench: Benchmarking user location inference through single images
Le "Qronox" Lam; Aleksandr Popov; Jord Nguyen; Trung Dung "mogu" Hoang; Marcel M; Felix Michalak
May 31, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Beyond Refusal: Scrubbing Hazards from Open-Source Models
Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis
May 8, 2024
AI and Democracy Hackathon: Demonstrating the Risks
We Discovered An Neuron
Joseph Miller, Clement Neo
February 24, 2024
January 25, 2023
Investigating Neuron Behaviour via Dataset Example Pruning and Local Search
Alex Foote
February 24, 2024
November 15, 2022
Seemingly Human: Dark Patterns in ChatGPT
Jin Suk Park, Angela Lu, Esben Kran
February 24, 2024
February 12, 2024
Model Cards for AI Algorithm Governance
Jaime Raldua Veuthey; Gediminas Dauderis; Chetan Talele
February 24, 2024
January 7, 2024
Detecting Implicit Gaming through Retrospective Evaluation Sets
Jacob Haimes, Lucie Philippon, Alice Rigg, Cenny Wenner
February 24, 2024
November 27, 2023
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
Harry Powell, Luigi Berducci
February 24, 2024
September 25, 2023
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Helios Lyons
February 24, 2024
September 24, 2023
Evaluating Myopia in Large Language Models
Marco Bazzani, Felix Binder
February 24, 2024
September 10, 2023
Data Taxation
Joshua Sammet, Per Ivar Friborg, William Wale
February 24, 2024
July 21, 2023
Relating induction heads in Transformers to temporal context model in human free recall
Ji-An Li
February 24, 2024
July 17, 2023
Exploring the Robustness of Model-Graded Evaluations of Language Models
Simon Lermen, Ondřej Kvapil
February 24, 2024
July 2, 2023
Solving the CNN Mech Int Challenge
Stefan Heimersheim, Marius Hobhahn
February 24, 2024
May 10, 2023
Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques
Sophia Pung, Gabriel Mukobi
February 24, 2024
February 16, 2023
Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing
Agatha Duzan, Matthieu David, Jonathan Claybrough
February 24, 2024
December 19, 2022
Agreeableness vs. Truthfulness
February 24, 2024
October 18, 2022
Simulation Operators: The Next Level of the Annotation Business
Ardy Haroen
September 5, 2024
Hackathon for Technical AI Safety Startups
Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering
Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia
August 29, 2024
Deception Detection Hackathon: Preventing AI deception
RNNs represent belief state geometry in hidden state
Keenan Pepper
June 14, 2024
Computational Mechanics Hackathon!
Cybersecurity Persistence Benchmark
Davide Zani, Felix Michalak, Jeremias Ferrao
May 31, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.
Mayowa Osibodu
May 8, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools
Matthew Lutz, Nyasha Duri
February 24, 2024
February 12, 2024
Obsolescent Souls
Markov
February 24, 2024
January 7, 2024
Visual Prompt Injection Detection
Yoann Poupart, Imene Kerboua
February 24, 2024
November 27, 2023
Jailbreaking the Overseer
Alexander Meinke
February 24, 2024
October 1, 2023
Discovering Agency Features as Latent Space Directions in LLMs via SVD
max max
February 24, 2024
September 25, 2023
Agency as Shannon information. Unveiling limitations and common misconceptions
Ivan Madan, Hennadii Madan
February 24, 2024
September 24, 2023
Against Agency
Catherine Brewer
February 24, 2024
September 21, 2023
The AI governance gaps in developing countries
N Tran
February 24, 2024
July 21, 2023
Who cares about brackets?
Theo Clark, Alex Roman, Hannes Thurnherr
February 24, 2024
July 17, 2023
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
Heramb Podar, Vladislav Bargatin
February 24, 2024
July 4, 2023
Dropout Incentivizes Privileged Bases
Edoardo Pona, Victor Levoso Fernàndez, Abhay, Kunvar
February 24, 2024
May 10, 2023
Player Of Games
Samuel Knoche
February 24, 2024
February 16, 2023
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
Chris Mathwin, Guillaume Corlouer
February 24, 2024
January 25, 2023
Investigating Training Dynamics via Token Loss Trajectories
Alex Foote
February 24, 2024
December 19, 2022
Backup Transformer Heads are Robust to Ablation Distribution
Lucas Sato, Gabe Mukobi, Mishika Govil
February 24, 2024
November 15, 2022
AI: My Partner in Crime
February 24, 2024
October 18, 2022
AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges
Lye Jia Jun, Dhruba Patra, Philipp Blandfort
September 5, 2024
Hackathon for Technical AI Safety Startups
Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach
Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy
July 8, 2024
Deception Detection Hackathon: Preventing AI deception
Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process
Rick Goldstein
June 14, 2024
Computational Mechanics Hackathon!
Investigating the Effect of Model Capacity Constraints on Belief State Representations
Ari Brill, Chu Chen
June 14, 2024
Computational Mechanics Hackathon!
Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
May 31, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Artificial Advocates: Biasing Democratic Feedback using AI
Sam Patterson, Jeremy Dolan, Simon Wisdom, Maten
May 8, 2024
AI and Democracy Hackathon: Demonstrating the Risks
OthelloScope
Albert Garde, Esben Kran
February 24, 2024
May 10, 2023
Automated Identification of Potential Feature Neurons
Michelle Wai Man Lo
February 24, 2024
January 25, 2023
Model editing hazards at the example of ROME
Oscar Persson, Jochem Hölscher
February 24, 2024
November 15, 2022
Iterated contract negotiation
Robert Klassert
February 24, 2024
February 11, 2024
2030 - The CEO Dilemma
Pierina Camarena, Leon Nyametso, Capucine Marteau
February 24, 2024
January 8, 2024
Cross-Lingual Generalizability of the SADDER Benchmark
Siddhant Arora, Jord Nguyen, Akash Kundu
February 24, 2024
November 27, 2023
LLMs With Knowledge of Jailbreaks Will Use Them
Jack Foxabbott, Marcel Hedman, Kaspar Senft, Kianoosh Ashouritaklimi
February 24, 2024
October 2, 2023
Uncertainty about value naturally leads to empowerment
Filip Sondej
February 24, 2024
September 26, 2023
Comparing truthful reporting, intent alignment, agency preservation and value identification
Aksinya Bykova
February 24, 2024
September 24, 2023
Building brakes for a speeding car: A global coordination proposal for AI safety
Charles Martinet, Blanche Freudenreich, Henry Papadatos, Manuel Bimich
February 24, 2024
July 21, 2023
Embedding and Transformer Synthesis
Rick Goldstein
February 24, 2024
July 16, 2023
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
Roman Leventov, Jason Hoelscher-Obermaier
February 24, 2024
July 2, 2023
Reverse Word Wizards: Pitting Language Models Against the Art of Reversal
Ingrid Backman, Asta Rassmussen, Klara Nielsen
February 24, 2024
February 16, 2023
Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3
D. Chipping, J. Harding, H. Mannering, P. Selvaraj
February 24, 2024
December 19, 2022
All Fish are Trees
February 24, 2024
October 18, 2022
Identity System for AIs
Artem Grigor
September 5, 2024
Hackathon for Technical AI Safety Startups
Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders
Simon Lermen, Mateusz Dziemian
July 8, 2024
Deception Detection Hackathon: Preventing AI deception
Benchmarking Dark Patterns in LLMs
Jord Nguyen, Akash Kundu, Sami Jawhar
May 31, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Unleashing Sleeper Agents
Nora Petrova, Jord Nguyen
May 8, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Towards High-Quality Model-Written Evaluations
Jannes Elstner, Jaime Raldua Veuthey
February 24, 2024
November 27, 2023
Second-order Jailbreaks
Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly
February 24, 2024
October 2, 2023
ILLUSION OF CONTROL
Mary Osuka
February 24, 2024
September 25, 2023
Agency, value and empowerment.
Benjamin Sturgeon, Leo Hyams
February 24, 2024
September 24, 2023
Premortem AI
Alvin Ånestrand, Matthias Endres, Harry Powell, Chris Lonsberry
February 24, 2024
July 21, 2023
Interpreting Planning in Transformers
Victor Levoso Fernandez, Abhay Sheshadri
February 24, 2024
July 17, 2023
Exploitation of LLM’s to Elicit Misaligned Outputs
Desik Mandava, Jayanth Santosh, Aishwarya Gurung
February 24, 2024
July 2, 2023
Improving TransformerLens Head Detector
Mateusz Bagiński, Jay Bailey
February 24, 2024
May 10, 2023
Soft Prompts are a Convex Set
Amir Sarid, Bary Levy, Dan Barzily, Edo Arad, Gal Hyams, Geva Kipper, Guy Dar, Itay Yona, Yossi Gandelsman
February 24, 2024
January 25, 2023
Trojan detection and implementation on transformers
Clément Dumas, Charbel-Raphaël Segerie, Liam Imadache
February 24, 2024
December 19, 2022
Probing Conceptual Knowledge on Solved Games
Amir Sarid, Bary Levy, Dan Barzilay, Edo Arad, Itay Yona, Joey Geralnik
February 24, 2024
November 15, 2022
Reducing hindsight neglect with "Let's think step by step"
February 24, 2024
October 18, 2022
Multifaceted Benchmarking
Eduardo Neville, George Golynskyi, Tetra Jones
February 24, 2024
November 27, 2023
Exploring multi-agent interactions in the dollar auction
Thomas Broadley, Allison Huang
February 24, 2024
October 2, 2023
AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification
Kola Ayonrinde, Denizhan “Dennis” Akar, Kitti Kovács, Adam Newgas, David Quarel
February 24, 2024
May 10, 2023
Reasoning with Chain of Thought
February 24, 2024
October 18, 2022
Wording influences truthfulness
February 27, 2024
October 18, 2022
Othello Mechint playground
Victor Levoso Fernandez, Edoardo Pona, Abhay Sheshadri, Kunvar
February 24, 2024
May 10, 2023
Detecting Phase Transitions
Jesse Hoogland, Lucas Texeira, Benjamin Gerraty, Rumi Salazar, Samuel Knoche
February 24, 2024
May 10, 2023
Simulating an Alien
February 24, 2024
October 18, 2022
Exploring OthelloGPT
Yeu-Tong Lau
February 24, 2024
May 10, 2023
Cross-model surveillance for emails handling
Le Ngoc Mai
October 7, 2024
Agent Security Hackathon
Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI
Alejandra de Brunner
October 7, 2024
Agent Security Hackathon
Diamonds are Not All You Need
Michael Andrzejewski, Melwina Albuquerque
October 6, 2024
Agent Security Hackathon
Inference-Time Agent Security
Nicholas Chen
October 6, 2024
Agent Security Hackathon
Cop N' Shop
Vaishnavi Pamulapati, Diego Sabajo, Andres Sepulveda Morales, Elsa Donnat, Paul Vautravers
October 6, 2024
Agent Security Hackathon
Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse
Oliver Morris, Gerard Boxo Corominas
October 6, 2024
Agent Security Hackathon
OCAP Agents
Dmitrii Volkov
October 6, 2024
Agent Security Hackathon
AI Honeypot
Reworr
October 6, 2024
Agent Security Hackathon
AI Agent Capabilities Evolution
Ekaterina Krupkina
October 6, 2024
Agent Security Hackathon
An Autonomous Agent for Model Attribution
Jord Nguyen
October 6, 2024
Agent Security Hackathon
Using ARC-AGI puzzles as CAPTCHA task
Mikolaj Kniejski
October 6, 2024
Agent Security Hackathon
LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies
Mohammed Arsalan, Vishwesh Bhat
October 6, 2024
Agent Security Hackathon
AdGPT
Khaidar Bikmaev, Nikolaj Kotov, Dmitrii Volkov
October 5, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Interpreting a toy model for finding the maximum element in a list
Tinuade Adeleke
September 17, 2024
ARENA 4.0 Interpretability Hackathon
Finding Circular Features in Gemma 2 2B
Leo, Misha
September 17, 2024
ARENA 4.0 Interpretability Hackathon
nnsight transparent debugging
Aaron Kaufman, David Steinberg
September 17, 2024
ARENA 4.0 Interpretability Hackathon
minTranscoders
Bronson Schoen
September 17, 2024
ARENA 4.0 Interpretability Hackathon
Latent Space Clustering and Summarization
Matthew Shinkle
September 17, 2024
ARENA 4.0 Interpretability Hackathon
GuardianAI
Andres Sepulveda Morales, Doroteya Stoyanova, Patrick Huang, Wanjie Zhong
September 3, 2024
Hackathon for Technical AI Safety Startups
Devising Effective Benchmarks
Nancy Vigil, Ashish Rai
September 1, 2024
Hackathon for Technical AI Safety Startups
WELMA: Open-world environments for Language Model agents
Sohaib Imran, Mujeeb Nawaz
September 1, 2024
Hackathon for Technical AI Safety Startups
CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense
Vishnu Vardhan Lanka, Era Sarda, Raghav Ravishankar
September 1, 2024
Hackathon for Technical AI Safety Startups
ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols
Samantha Guerriero and Nora Petrova
September 1, 2024
Hackathon for Technical AI Safety Startups
CoPirate
Mia Hopman, Carissa Cullen, Jack Wittmayer, Vaishnavi Pamulapati
August 27, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
General Pervasiveness
Andres Sepulveda Morales, Patrick Huang
August 27, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Demonstrating LLM Code Injection Via Compromised Agent Tool
Kevin Vegda, Oliver Chamberlain, William Baird
August 27, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Misinformational AI-Generated Academic Papers
Aaron Sandoval, Akash Kundu, Layla Adam
August 26, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Phish Tycoon: phishing using voice cloning
Craig Albuquerque, Melwina Albuquerque
August 26, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
GrandSlam usecases not technology
jmari wyatt
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
AI Agents for Personalized Interaction and Behavioral Analysis
Chetan Talele, Jasper Timm
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Speculative Consequences of A.I. Misuse
Joseph Karam, Charlie Nguyen, Andrew Lam
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
RedFluence
Pushkal Ahluwalia, Aaryan Purohit
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
BBC News Impersonator
Kyal Pindolia, Funmi “Finn” Okuleye
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Unsolved AI Safety Concepts Explorer
Tewodros Mesfin
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
AI Research Paper Processor
Lincoln Vann-Wakelin, Ian Osorio Perez, Bhairavi Ahilan
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Sleeper Agents Detector
Michaël Trazzi and Saahir Vazirani
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
VerifyStream
Kailash Balasubramaniyam
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
Web App for Interacting with Refusal-Ablated Language Model Agents
Simon Lermen
August 25, 2024
AI capabilities and risks demo-jam: Creating visceral interactive demonstrations
PurePrompt - An easy tool for prompt robustness and eval augmentation
Axel Sorensen
July 30, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Alignment Research Critiquer
Nancy Vigil, Ashish Rai
July 29, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Alignment Research Critiquer
Nancy Vigil, Ashish Rai
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
LLM Research Collaboration Recommender
David McSharry
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Data Massager
Giles Edkins, Ayush Jain
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
AI Alignment Toolkit Research Assistant
Luciano Hanyon Wu
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Grant Application Simulator
Michaël Trazzi
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Reflections on using LLMs to read a paper
Lovkush Agarwal
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
Academic Weapon
Luis Tercero Herman
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
AI Alignment Knowledge Graph
Matin Mahmood, Samuel Ratnam, Sruthi Kuriakose, Pandelis Mouratoglou
July 28, 2024
Research Augmentation Hackathon: Supercharging AI Alignment
The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs
Tanush Chopra, Michael Li
July 2, 2024
Deception Detection Hackathon: Preventing AI deception
Werewolf Benchmark
Luhan Mikaelson, Zach Nguyen, Andy Liu, Jord Nguyen, Akash Kundu
July 1, 2024
Deception Detection Hackathon: Preventing AI deception
Can Language Models Sandbag Manipulation?
Arthur Camara, Alexander Cockburn, Myles Heller
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Deceptive behavior does not seem to be reducible to a single vector
Carl Vinas, Zmavli Caimle
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Deceptive behavior does not seem to be reducible to a single vector
Carl John Viñas, Zmavli Caimle
June 30, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Detecting Lies of (C)omission
Ilan Moscovitz, Nikhil Kotecha, Sam Svenningsen
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Detecting Deception with AI Tics
Samuel Svenningsen, Ilan Moscovitz, Nikhil Kotecha
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Eliciting maximally distressing questions for deceptive LLMs
Épiphanie Gédéon
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Evaluating Steering Methods for Deceptive Behavior Control in LLMs
Casey Hird, Basavasagar Patil, Tinuade Adeleke, Adam Fraknoi, Neel Jay
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
An Exploration of Current Theory of Mind Evals
John Henderson, Alan Fung, Bachar Moustapha
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Sandbagging LLMs using Activation Steering
Davide Zani, Jeremias Ferrao
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Towards a Benchmark for Self-Correction on Model-Attributed Misinformation
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Boosting Language Model Honesty with Truthful Suffixes
Smitty van Bodegom, Giles Edkins, Annie Szorkin
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Detection of potentially deceptive attitudes using expression style analysis
Roland Pihlakas
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
From Sycophancy (not) to Sandbagging
Felix Hofstätter, Daniel Tan, Sohaib Imran, David Quarel
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Gradient-Based Deceptive Trigger Discovery
Henning Bartsch, Leon Eshuijs
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Evaluating and inducing steganography in LLMs
Artem Karpov
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Developing a deception dataset
Lovkush Agarwal
June 30, 2024
Deception Detection Hackathon: Preventing AI deception
Looking forward to posterity: what past information is transferred to the future?
Zmavli Caimle
June 3, 2024
Computational Mechanics Hackathon!
Belief State Representations in Transformer Models on Nonergodic Data
Junfeng Feng, Wanjie Zhong, Doroteya Stoyanova, Lennart Finke
June 3, 2024
Computational Mechanics Hackathon!
Steering Model’s Belief States
Gonçalo Paulo, Sinem Erisken, Tassilo Neubauer
June 2, 2024
Computational Mechanics Hackathon!
Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics
Olli Järviniemi, Udayanto Dwi Atmojo, Aayush Kucheria, Konsta Tiilikainen
June 2, 2024
Computational Mechanics Hackathon!
Detecting Anthropomorphic Tendencies in Language Models via Conversational Probing
Jacob Haimes, Esben Kran
May 28, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations
Sai Joseph, Anita Beroza, Eleni Angelou, Sofia Mendez, Evelyn Ciara
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Benchmark for emergent capabilities in high-risk scenarios
Junfeng Feng, Wanjie Zhong, Saptadip Saha, Doroteya Stoyanova
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries
Sev Geraskin, Jakub Kryś, Luhan Mikaelson, Simon Wisdom
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Evaluating the ability of LLMs to follow rules
Jasmina Nasufi, Einar Urdshals
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Black box detection of Sleeper Agents
Kenneth Ong
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Manifold Recovery as a Benchmark for Text Embedding Models
Lennart Finke
May 26, 2024
AI Security Evaluation Hackathon: Measuring AI Capability
Political Bias Vulnerabilities in LLMs
Chris DiGiano, Vassil Tashev, Aysh Segulguzel
May 26, 2024
AI and Democracy Hackathon: Demonstrating the Risks
THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES
Uzo Mkparu, Winnie Kungu Chialuka Prisca-mary Onuoha, Kayode Adekoya
May 6, 2024
AI and Democracy Hackathon: Demonstrating the Risks
LEGISLaiTOR: A tool for jailbreaking the legislative process
Willie Chalmers III, Margaret Belford
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Silent Curriculum
Aman Priyanshu, Supriti Vijay
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Building more democratic institutions with collaboratively constructed debate moderation tools
Max Holschneider
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
AI Misinformation and Threats to Democratic Rights
Davide Zani, Mariam Ibrahim, Tiwai Mhundwa, Felix Michalak, Andrei Avram
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
GPT 4 Is Righter Than GPT 3.5 Replicating Findings on Political Bias in LLMs for non-Western Democracies
Huu Khiem "Gumperto" Hoang, Jord Nguyen
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics
Khai Tran, Sev Geraskin, Doroteya Stoyanova, Jord Nguyen
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
AI misinformation threatens the Wisdom of the crowd
Emil Svenberg
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Trustworthy or knave? – scoring politicians with AI in real-time
Michał Kubiak, Kamil Kulesza
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
WMDP-Defense: Weapons of Mass Disruption
Esben Kran, Tristan Williams, Bart Bussmann
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages
Srishti Dutta, Akash Kundu
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles
Aayush Kucheria, Okko Katajamäki, Santeri Koivula, Andrea La Mantia, Norman Piotrowski
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa
Adebayo Mubarak Adewumi
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune
Desik
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
AI Politician
David Abecassis, Felix Michalak, Nguyen Dang Nhat Anh, Zhiyi Xu
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
A Framework for Centralizing forces in AI
Emiel Robben, Sixuan Pei, Yuan Wei, Nils Müller
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Digital Diplomacy: Advancing Digital Peace-Building with AI in Africa.
Ajani, Adedeji Hammed & Adedokun-Shittu, Nafisat Afolake
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Investigating detection of election-influencing Sleeper Agents using probes
Kenneth Ong
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
No place is safe - Automated investigation of private communities
Lucie Philippon
May 5, 2024
AI and Democracy Hackathon: Demonstrating the Risks
USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS
Daphne, Osuka, Ian
May 4, 2024
AI and Democracy Hackathon: Demonstrating the Risks
Towards Interpretability of 5 digit addition
Philip Quirke
February 24, 2024
July 16, 2023
AttentionData
Connor Henderson
February 24, 2024
January 22, 2024
Gradient Descent Over Interpolated Activation Patches for Circuit Discovery
Glen M. Taggart
February 24, 2024
January 22, 2024
AI Safeguard: Navigating Compliance and Risk in the Era of the EU AI Act
Heramb Podar
February 24, 2024
January 8, 2024
Observing and Validating Induction heads in SOLU-8l-old
Brian Muhia
February 24, 2024
November 15, 2022
Optimising image patches to change RL-agent behaviour
Clément Dumas, Louis Develle, Robert Scholz, Amaury Lorin
February 24, 2024
November 15, 2022
Regularly Oversimplifying Neural Networks
Nick Kross, Mica White
February 24, 2024
November 15, 2022
Sparsity Lens
Vladimir Ivanov, Siméon Campos
February 24, 2024
November 15, 2022
Top-Down Interpretability Through Eigenspectra
Jan Wehner, Rauno Arike, Jesse Hoogland, Simon Marshall
February 24, 2024
November 15, 2022
Trying to make GPT2 dream
Gordei Pribõtkin, Mykyta Baliesnyi, Joonas X
February 24, 2024
November 15, 2022
Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases
Gloria, Sam, Ingrid, Nina
February 24, 2024
November 15, 2022
War is 15% conflic, 15% DragonMagazine
Giles Edkins, Lauren Greenspan, Dan Valentine
February 24, 2024
November 15, 2022