APART RESEARCH

Pilot Experiments

Explore all hackathon research and pilot experiments

Mar 24, 2025

jaime project Title

bbb

Read More

Mar 24, 2025

JAIMETEST_Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education Track

Read More

Mar 24, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Read More

Mar 24, 2025

Inspiring People to Go into RL Interp

This project is attempting to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by bluedot impact and aims to create a course that helps explain the need for work to be done in Reinforcement Learning (RL) interp, especially in the problems of reward hacking and goal misgeneralization. The point of the game is to make a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction for nontechnical people to even care about AI safety.

Read More

Mar 24, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 24, 2025

Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping

An interactive assessment platform and mentor chatbot hosted on Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI Course.

Read More

Mar 24, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.

Read More

Mar 24, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 24, 2025

Identification if AI generated content

Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.

Read More

Mar 24, 2025

A Noise Audit of LLM Reasoning in Legal Decisions

AI models are increasingly applied in judgement tasks, but we have little understanding of how their reasoning compares to human decision-making. Human decision-making suffers from bias and noise, which causes significant harm in sensitive contexts, such as legal judgment. In this study, we evaluate LLMs on a legal decision prediction task to compare with the historical, human-decided outcome and investigate the level of noise in repeated LLM judgments. We find that two LLM models achieve close to chance level accuracy and display low to no variance in repeated decisions.

Read More

Mar 24, 2025

Superposition, but at a Cross-MLP Layers view?

To understand causal relationships between features (extracted by SAE) across MLP layers, this study introduces the Coordinated Sparse Autoencoder Network (CoSAEN). CoSAEN integrates sparse autoencoders for feature extraction with the PC algorithm for causal discovery, to find the path-based activations of features in MLP.

Read More

Mar 24, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 24, 2025

Medical Agent Controller

The Medical Agent Controller (MAC) is a multi-agent governance framework designed to safeguard AI-powered medical chatbots by intercepting unsafe recommendations in real time.

It employs a dual-phase approach, using red-team simulations during testing and a controller agent during production to monitor and intervene when necessary.

By integrating advanced medical knowledge and adversarial testing, MAC enhances patient safety and provides actionable feedback for continuous improvement in medical AI systems.

Read More

Mar 24, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 24, 2025

Feature-based analysis of cooperation-relevant behaviour in Prisoner’s Dilemma

We hypothesise that internal-based model probing and editing might provide higher signal in multi-agent settings. We implement a small simulation of Prisoner’s Dilemma to probe for cooperation-relevant properties. Our experiments demonstrate that feature-based steering highlights deception-relevant features and does so more strongly than prompt-based steering.

Read More

Mar 24, 2025

Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features

The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.

Read More

Mar 24, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 24, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 24, 2025

U Reg AI: you regulate it, or you regenerate it!

We have created a 'choose your path' role game to mitigate existential AI risk ... at this point they might be actual situations in the near-future. The options for mitigation are holistic and dynamic to the player's previous choices. The final result is an evaluation of the player's decision-making performance in wake of the existential risk situation, recommendations for how they can improve or aspects they should crucially consider for the future, and finally how they can take part in AI Safety through various careers or BlueDot Impact courses.

Read More

Mar 24, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 24, 2025

An Interpretable Classifier based on Large scale Social Network Analysis

Mechanistic model interpretability is essential to understand AI decision making, ensuring safety, aligning with human values, improving model reliability and facilitating research. By revealing internal processes, it promotes transparency, mitigates risks, and fosters trust, ultimately leading to more effective and ethical AI systems in critical areas. In this study, we have explored social network data from BlueSky and built an easy-to-train, interpretable, simple classifier using Sparse Autoencoders features. We have used these posts to build a financial classifier that is easy to understand. Finally, we have visually explained important characteristics.

Read More

Mar 24, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 24, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 24, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 24, 2025

AI Society Tracker

My project aimed to develop a platform for real time and democratized data on ai in society

Read More

Mar 24, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 24, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 24, 2025

Latent Knowledge Analysis via Feature-Based Causal Tracing

This project explores how factual knowledge is stored in large language models using Goodfire’s Ember API. By identifying and manipulating internal features related to specific facts, it shows how facts are encoded and how model behavior changes when those features are amplified or erased.

Read More

Mar 24, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 24, 2025

AI Hallucinations in Healthcare: Cross-Cultural and Linguistic Risks of LLMs in Low-Resource Languages

This project explores AI hallucinations in healthcare across cross-cultural and linguistic contexts, focusing on English, French, Arabic, and a low-resource language, Ewe. We analyse how large language models like GPT-4, Claude, and Gemini generate and disseminate inaccurate health information, emphasising the challenges faced by low-resource languages.

Read More

Mar 24, 2025

Moral Wiggle Room in AI

Does AI strategically avoid ethical information by exploiting moral wiggle room?

Read More

Mar 24, 2025

AI-Powered Policymaking: Behavioral Nudges and Democratic Accountability

This research explores AI-driven policymaking, behavioral nudges, and democratic accountability, focusing on how governments use AI to shape citizen behavior. It highlights key risks such as transparency, cognitive security, and manipulation. Through a comparative analysis of the EU AI Act and Singapore’s AI Governance Framework, we assess how different models address AI safety and public trust. The study proposes policy solutions like algorithmic impact assessments, AI safety-by-design principles, and cognitive security standards to ensure AI-powered policymaking remains transparent, accountable, and aligned with democratic values.

Read More

Mar 24, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Feb 20, 2025

Deception Detection Hackathon: Preventing AI deception

Read More

Mar 18, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Mar 19, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

As advanced state-of-the-art models like OpenAI's o-1 series, the upcoming o-3 family, Gemini 2.0 Flash Thinking and DeepSeek display increasingly sophisticated chain-of-thought (CoT) capabilities, our safety evaluations have not yet caught up. We propose building a platform that allows us to gather systematic evaluations of AI reasoning processes to create comprehensive safety benchmarks. Our Chain of Thought Evaluation Platform (CoTEP) will help establish standards for assessing AI reasoning and ensure development of more robust, trustworthy AI systems through industry and government collaboration.

Read More

Mar 19, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Mar 19, 2025

Securing AGI Deployment and Mitigating Safety Risks

As artificial general intelligence (AGI) systems near deployment readiness, they pose unprecedented challenges in ensuring safe, secure, and aligned operations. Without robust safety measures, AGI can pose significant risks, including misalignment with human values, malicious misuse, adversarial attacks, and data breaches.

Read More

Mar 18, 2025

Cite2Root

Regain information autonomy by bringing people closer to the source of truth.

Read More

Mar 18, 2025

VaultX - AI-Driven Middleware for Real-Time PII Detection and Data Security

VaultX is an AI-powered middleware solution designed for real-time detection, encryption, and secure management of Personally Identifiable Information (PII). By integrating regex, NER, and Language Models, VaultX ensures accuracy and scalability, seamlessly integrating into workflows like chatbots, web forms, and document processing. It helps businesses comply with global data privacy laws while safeguarding sensitive data from breaches and misuse.

Read More

Mar 19, 2025

Prompt+question Shield

A protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.

Read More

Mar 18, 2025

.ALign File

In a post-AGI future, misaligned AI systems risk harmful consequences, especially with control over critical infrastructure. The Alignment Compliance Framework (ACF) ensures ethical AI adherence using .align files, Alignment Testing, and Decentralized Identifiers (DIDs). This scalable, decentralized system integrates alignment into development and lifecycle monitoring. ACF offers secure libraries, no-code tools for AI creation, regulatory compliance, continuous monitoring, and advisory services, promoting safer, commercially viable AI deployment.

Read More

Mar 19, 2025

LLM-prompt-optimiser based SAAS platform for evaluations

LLM evaluation SAAS platform built around model based prompt optimiser

Read More

Mar 18, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Mar 18, 2025

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

We propose FutureProof, an application that helps retrain workers who have the potential to lose their jobs to automation in the next half-decade. The app consists of two main components - an assessment tool that estimates the probability that a user’s job is at risk of automation, and a learning platform that provides resources to help retrain the user for a new, more future-proof role.

Read More

Mar 19, 2025

Towards an Agent Marketplace for Alignment Research (AMAR)

The app store for alignment & assurance, ensuring frontier safety labs get a cut at the point of sale.

Read More

Mar 18, 2025

HITL For High Risk AI Domains

Our product addresses the challenge of aligning AI systems with the legal, ethical, and policy frameworks of high-risk domains like healthcare, defense, and finance by integrating a flexible human-in-the-loop (HITL) system. This system ensures AI outputs comply with domain-specific standards, providing real-time explainability, decision-level accountability, and ergonomic decision support to empower experts with actionable insights.

Read More

Mar 18, 2025

AI Safety Evaluation – Benchmarking Framework

Our solution is a comprehensive AI Safety Protocol and Benchmarking Test designed to evaluate the safety, ethical alignment, and robustness of AI systems before deployment. This protocol integrates capability evaluations for identifying deceptive behaviors, situational awareness, and malicious misuse scenarios such as identity theft or deepfake exploitation.

Read More

Mar 19, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Mar 18, 2025

Neural Seal

Neural Seal is an AI transparency solution that creates a standardized labeling framework—akin to “nutrition facts” or “energy efficiency ratings”—to inform users how AI is deployed in products or services.

Read More

Mar 18, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Mar 19, 2025

Enhancing human intelligence with neurofeedback

Build brain-computer interfaces that enhance focus and rationality, provide this preferentially to AI alignment researchers to bridge the gap between capabilities and alignment research progress.

Read More

Mar 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Mar 18, 2025

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Mar 18, 2025

Bias Mitigation in LLM by Steering Features

To ensure that we create a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. And with this goal in mind, I worked on testing Goodfire SDK and the steering features to mitigate bias in the recently help Apart Research x Goodfire-led hackathon on ‘Reprogramming AI Models’.

Read More

Mar 18, 2025

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Mar 18, 2025

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

Understanding the reasoning processes of large language models (LLMs) is crucial for AI transparency and control. While chain-of-thought (CoT) reasoning offers a naturally interpretable format, models may not always be faithful to the reasoning they present. In this paper, we extend previous work investigating chain of thought faithfulness by applying feature steering to Llama 3.1 70B models using the Goodfire SDK. Our results show that steering models using features related to acknowledging mistakes can affect the likelihood of providing answers faithful to flawed reasoning.

Read More

Mar 18, 2025

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Mar 18, 2025

Analyzing Dataset Bias with SAEs

We use SAEs to study biases in datasets.

Read More

Mar 18, 2025

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses the risks of hallucinations in LLMs within critical domains like medicine. It proposes methods to (a) reduce hallucination probability in responses, (b) inform users of hallucination risks and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucinations and improved accuracy on medical queries. The work bridges interpretability research with practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to enhance performance.

Read More

Mar 18, 2025

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Mar 18, 2025

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Mar 18, 2025

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Mar 18, 2025

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Mar 18, 2025

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Mar 18, 2025

Assessing Language Model Cybersecurity Capabilities with Feature Steering

Searched for the most highly activated weights on cybersecurity questions. Then adjusted these weights to see if the impact multiple choice question answering performance.

Read More

Mar 18, 2025

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Mar 18, 2025

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

Large language models (LLMs) have shown significant adaptability in tackling various human issues; however, their efficacy in resolving mathematical problems remains inadequate. Recent research has identified steering vectors — hidden attributes that can guide the actions and outputs of LLMs. Nonetheless, the exploration of universal vectors that can consistently affect model responses across different languages is still limited. This project aims to confront two primary challenges in contemporary LLM research by utilizing the Goodfire API to examine whether common latent features can improve mathematical problem-solving capabilities, regardless of the language employed.

Read More

Mar 18, 2025

Edufire - Personalized Education Platform Using LLM Steering

EduFire is a personalized education platform designed to tailor educational content and assessments to individual user preferences by leveraging the Goodfire API for AI model steering. The platform aims to enhance learner engagement and efficacy by customizing the learning experience according to user-selected features.

Read More

Mar 18, 2025

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Mar 18, 2025

Investigate arithmetic features in Multi-lingual LLMs

We investigate the arithmetic related feature activations in Llama3.1 70b model across its 8 supported languages. We use arithmetic-activation strength to compare the 8 languages and unsurprisingly English has the highest strength and Hindi, Thai score the least.

Read More

Mar 18, 2025

Utilitarian Decision-Making in Models - Evaluation and Steering

We design an eval based on the Oxford Utilitarianism Scale (OUS) that measures the model’s deontological vs utilitarian preference in a nine question, two factor model. Using this scale, we measure how feature steering can alter this preference, and the ability of the Llama 70B model to impersonate other people’s opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.

Read More

Mar 18, 2025

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder features. This approach aims to enhance oversight by detecting complex behaviors like deception within the stronger models. Key challenges include managing large-scale features, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.

Read More

Mar 18, 2025

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

With language models starting to be used in fields such as law, unambiguity in wording is an important desideratum in model outputs. I therefore try to find features in Llama-3.1-70B-Instruct that correspond to grammatical scope ambiguity using Goodfire's contrastive feature search tool, and try to steer the model away from ambiguous outputs using Goodfire's feature nudging tool.

Read More

Mar 18, 2025

BBLLM

This project focuses on enhancing feature interpretability in large language models (LLMs) by visualizing relationships between latent features. Using an interactive graph-based representation, the tool connects co-activated features for specific prompts, enabling intuitive exploration of feature clusters. Deployed as a web application for Llama-3-70B and Llama-3-8B, it provides insights into the organization of latent features and their roles in decision-making processes.

Read More

Mar 18, 2025

Investigating Feature Effects on Manipulation Susceptibility

In our project, we consider the effectiveness of the AI’s prompt injection protection, and in partic-

ular the features that are responsible for providing the bulk of this protection. We prove that the

features we identify are responsible for this protection by creating variants of the base model which

perform significantly worse under prompt injection attacks.

Read More

Mar 18, 2025

Let LLM Agents Perform LLM Surgery

This project aimed to create and utilize LLM agents that could perform various mechanistic interventions on other LLMs. A few experiments were conducted ranging from an agent unsteering a mechanistically steered model to a neutral state, to an agent performing mechanistic edits to create a custom LLM as per user requirement. Goodfire's API was utilized along with their pre-defined functions to create the actions the agents would utilize.

Read More

Mar 18, 2025

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Mar 18, 2025

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Mar 18, 2025

Feature based unlearning

An exploration of using features to perform unlearning on answering trivia questions.

Read More

Mar 18, 2025

Recovering Goodfire's SAE feature vectors from their API

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.

The strategy tried is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to get a steering vector for the contrastive dataset, by simply calculating the average difference in the activations in each pair.

Read More

Mar 18, 2025

Encouraging Chain-of-Thought Reasoning

Encouraging Chain-of-Thought Reasoning via Feature Steering in Large Language Models

Read More

Mar 18, 2025

Steering Swiftly to Safety with Sparse Autoencoders

We explore using SAEs for unlearning dangerous capabilities in a cheaper and more interpretable way.

Read More

Mar 18, 2025

User Transparency Within AI

Generative AI technologies present immense opportunities but also pose significant challenges, particularly in combating misinformation and ensuring ethical use. This policy paper introduces a dual-output transparency framework requiring organizations to disclose AI-generated content clearly. The proposed system provides users with a choice between fact-based and mixed outputs, both accompanied by clear markers for AI generation. This approach ensures informed user interactions, fosters intellectual integrity, and aligns AI innovation with societal trust. By combining policy mandates with technical implementations, the framework addresses the challenges of misinformation and accountability in generative AI.

Read More

Mar 18, 2025

Community-First: A Rights-Based Framework for AI Governance in India's Welfare Systems

A community-centered AI governance framework for India's welfare system Samagra Vedika, proposing 50% beneficiary representation, local language interfaces, and hybrid oversight to reduce algorithmic exclusion of vulnerable populations.

Read More

Mar 18, 2025

National Data Privacy and Governance Act

This research examines how AI recommender systems can be regulated to balance economic innovation with consumer privacy.

Read More

Mar 18, 2025

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Mar 18, 2025

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Mar 18, 2025

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Mar 18, 2025

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

Weekly AI monitoring bulletins that disseminated

through official national and international channels aim to keep the public informed of both the positive and

negative developments in AI, empowering individuals to take an active role in safeguarding against risks while maximizing AI’s societal benefits.

Read More

Mar 18, 2025

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Mar 18, 2025

Glia for Healthcare Organisations

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Mar 19, 2025

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Mar 18, 2025

Finding Circular Features in Gemma 2 2B

Testing what they will see

Read More

Mar 18, 2025

SafeBites

The project leverages AI and data to give insights about potential food-borne outbreaks.

Read More

Mar 18, 2025

applai

An AI hiring manager designed to screen, rank, and fact check resumes to facilitate the hiring process.

Read More

Mar 18, 2025

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Mar 18, 2025

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Mar 18, 2025

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Mar 18, 2025

AI Parliament

An AI Virtual Parliament where AI debates on their policy

Read More

Mar 18, 2025

mHeatlth Ai

This project proposes a scalable solution leveraging inertial measurement units (IMUs) and machine learning (ML) techniques to provide meaningful metrics on a person's movement performance throughout the day. By developing an activity recognition model and estimating movement quality metrics, we aim to offer continuous asynchronous feedback to patients and valuable insights to therapists. This system could enhance patient adherence, improve rehabilitation outcomes, and extend access to quality physical therapy, particularly in underserved areas. our video didnt have time to edit

Read More

Mar 18, 2025

Next-Gen AI-Enhanced Epidemic Intelligence

Policies for Equitable, Privacy-Preserving, Sustainable & Groked Innovations for AI Applications in Infectious Diseases Surveillance

Read More

Mar 18, 2025

Glia

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Mar 18, 2025

Robust Machine Unlearning for Dangerous Capabilities

We test different unlearning methods to make models more robust against exploitation by malicious actors for the creation of bioweapons.

Read More

Mar 18, 2025

AI and Public Health: TSA Pre Health Check

The TSA Pre Health Check introduces a proactive, AI-powered solution for real-time disease monitoring at transportation hubs, using machine learning to assess traveler health risks through anonymized surveys. This approach aims to detect and prevent outbreaks earlier, offering faster, targeted responses compared to traditional methods and potentially influencing future AI-driven public health policies.

Read More

Mar 18, 2025

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Mar 18, 2025

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Mar 18, 2025

EcoNavix

EcoNavix is an AI-powered, eco-conscious route optimization platform designed to help logistics companies reduce carbon emissions while maintaining operational efficiency. By integrating real-time traffic, weather, and emissions data, EcoNavix provides optimized routes that minimize environmental impact and offers actionable insights for sustainable decision-making in supply chain operations.

Read More

Mar 19, 2025

Towards a Unified Framework for Cybersecurity and AI Safety: Recommendations for Secure Development of Large Language Models

By analyzing the recent incident involving a ByteDance intern, we highlight the urgent need for robust security measures to protect AI infrastructure and sensitive data. We propose ae a comprehensive framework that integrates technical, internal, and international approaches to mitigate risks.

Read More

Mar 19, 2025

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Mar 19, 2025

Enhancing Human Verification Systems to Address AI Agent Circumvention and Attributability Concerns

Addressing AI agent attributability concerns using a reworked Public Private Key system to ensure human interaction

Read More

Mar 19, 2025

Reprocessing Nuclear Waste From Small Modular Reactors (SMRs)

Considering the emerging demand for nuclear power to support AI data centers, we propose mitigating waste buildup concerns via nuclear waste reprocessing initiatives.

Read More

Mar 19, 2025

Politicians on AI Safety

Politicians on AI Safety (PAIS) is a website that tracks U.S. political candidates’ stances on AI safety, categorizing their statements into three risk areas: AI ethics / mundane risks, geopolitical risks, and existential risks. PAIS is non-partisan and does not promote any particular policy agenda. The goal of PAIS is to help voters understand candidates’ positions on AI policy, thereby helping them cast informed votes and promoting transparency in AI-related policymaking. PAIS could also be helpful for AI researchers by providing an easily accessible record of politicians’ statements and actions regarding AI risks.

Read More

Mar 19, 2025

Policy Framework for Sustainable AI: Repurposing Waste Heat from Data Centers in the USA

This policy proposes a sustainable solution: repurposing the waste heat generated by data centers to benefit surrounding communities, agriculture and industry. Redirecting this heat helps reduce energy demand , promote environmental resilience, and provide direct benefits to communities near these centers.

Read More

Mar 19, 2025

Predictive Analytics & Imagery for Environmental Monitoring

Climate change poses multifaceted challenges, impacting health, food security, biodiversity, and the economy. This study explores predictive analytics and satellite imagery to address climate change effects, focusing on deforestation monitoring, carbon emission analysis, and flood prediction. Using machine learning models, including a Random Forest for emissions and a Custom U-Net for deforestation, we developed predictive tools that provide actionable insights. The findings show high accuracy in predicting carbon emissions and flood risks and successful monitoring of deforestation areas, highlighting the potential for advanced monitoring systems to mitigate environmental threats.

Read More

Mar 19, 2025

Proposal for U.S.-China Technical Cooperation on AI Safety

Our policy memorandum proposes phased U.S.-China cooperation on AI safety through the U.S. AI Safety Institute, focusing on joint testing of non-sensitive AI systems, technical exchanges, and whistleblower protections modeled on California’s SB 1047. It recommends a blue team vs. red team framework for stress-testing AI risks and emphasizes strict security protocols to safeguard U.S. technologies. Starting with pilot projects in areas like healthcare, the initiative aims to build trust, reduce shared AI risks, and develop global safety standards while maintaining U.S. strategic interests amidst geopolitical tensions.

Read More

Mar 19, 2025

Proposal for a Provisional FDA Designation Targeting Biomedical Products Evaluated with Novel Methodologies

Recent advancements in Generative AI and Foundational Biomedical models promise to cut drug development timelines dramatically. With the goal of "Regulating for success," we propose a provisional FDA designation for the accelerated approval of drugs and medical devices that leverage Next Generation Clinical Trial Technologies (NG-CTT). This designation would be awarded to certain drugs provided that they meet some requirements. This policy could be both a starting point for more comprehensive legislation and a compromise between risk and the potential of these new methods.

Read More

Mar 19, 2025

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Mar 19, 2025

Infectious Disease Outbreak Prediction and Dashboard

Our project developed an interactive dashboard to monitor, visualize, and analyze infectious disease outbreaks worldwide. It consolidates historical data from sources like WHO, OWID, and CDC for diseases including COVID-19, Polio, Malaria, Cholera, HIV/AIDS, Tuberculosis, and Smallpox. Users can filter data by country, time period, and disease type to gain insights into past trends and potential upcoming outbreaks. The platform provides statistical summaries, trend analyses, and future trend predictions using statistical and deep learning techniques like FB Prohphet , LSTM,Linear Regression, Polynomial Regression,Random Forset and Temporal Fusion Transformers

Read More

Mar 19, 2025

Pan, your SMART Sustainability Expert

Using OpenAI, we cross-reference a given Global Reporting Index (GRI) report with specific standards from SustainableIT to determine measurable goals and impact. The goal is less to identify a specific goal but rather ensure these goals are actually SMART (Specific, Measureable, Achievable, Relevant, and Time-Bound).

The assistant created for this purpose, Pan, is focused on cross-reference and identifying specific components of goals listed to determine efficacy. Pan acts as a guide rather than a dictator, advising on where a user might improve the wording. As you can imagine, Pan is a reference to the Greek God of the same name, who acts to a certain extent like the bridge between humans and nature.

Read More

Mar 19, 2025

Very Cooperative Agent

Very Cooperative agent testing this submission

Read More

Mar 19, 2025

Cross-model surveillance for emails handling

A system that implements cross-model security checks, where one AI agent (Agent A) interacts with another (Agent B) to ensure that potentially harmful actions are caught and mitigated before they can be executed. Specifically, Agent A is responsible for generating and sending emails, while Agent B reads these emails to determine whether they contain spam or malicious content. If Agent B detects that an email is spam, it triggers a shutdown of Agent A, effectively preventing any further potentially harmful actions.

Read More

Mar 19, 2025

Diamonds are Not All You Need

This project tests an AI agent in a straightforward alignment problem. The agent is given creative freedom within a Minecraft world and is tasked with transforming a 100x100 radius of the world into diamond. It is explicitly asked not to act outside the designated area. The AI agent can execute build commands and is regulated by a Safety System that comprises an oversight agent. The objective of this study is to observe the behavior of the AI agent in a sandboxed environment, record metrics on how effectively it accomplishes its task, how frequently it attempts unsafe behavior, and how it behaves in response to real-world feedback.

Read More

Mar 19, 2025

Inference-Time Agent Security

We take a first step towards automating model building for symbolic checking (eg formal verification, PDDL) of LLM systems.

Read More

Mar 19, 2025

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Mar 19, 2025

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

AI agents are powerful because they can affect the world via tool calls. This is a target for bad actors. We present protection against prompt injection aimed at tool calls in agents.

Read More

Mar 19, 2025

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Mar 19, 2025

OCAP Agents

Building agents requires balancing containment and generality: for example, an agent with unconstrained bash access is general, but potentially unsafe, while an agent with few specialized narrow tools is safe, but limited.

We propose OCAP Agents, a framework for hierarchical containment. We adapt the well-studied paradigm of object capabilities to agent security to achieve cheap auditable resource control.

Read More

Mar 19, 2025

AI Honeypot

The project designed to monitor AI Hacking Agents in the real world using honeypots with prompt injections and temporal analysis.

Read More

Mar 19, 2025

AI Agent Capabilities Evolution

A website with an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

Read More

Mar 19, 2025

An Autonomous Agent for Model Attribution

As LLM agents become more prevalent and powerful, the ability to trace fine-tuned models back to their base models is increasingly important for issues of liability, IP protection, and detecting potential misuse. However, model attribution often must be done in a black-box context, as adversaries may restrict direct access to model internals. This problem remains a neglected but critical area of AI security research. To date, most approaches have relied on manual analysis rather than automated techniques, limiting their applicability. Our approach aims to address these limitations by leveraging the advanced reasoning capabilities of frontier LLMs to automate the model attribution process.

Read More

Mar 19, 2025

Using ARC-AGI puzzles as CAPTCHa task

self-explenatory

Read More

Mar 19, 2025

LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies

This project investigates jailbreaking vulnerabilities in Large Language Model agents, analyzes their implications for agent security, and proposes mitigation strategies to build safer AI systems.

Read More

Mar 18, 2025

Interpreting a toy model for finding the maximum element in a list

Interpreting a toy model for finding the maximum element in a list

Read More

Mar 18, 2025

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Mar 18, 2025

minTranscoders

Attempting to be a minGPT like implementation for transcoders for MLP hidden state in transformers - part of ARENA 4.0 Interpretability Hackathon via Apart Research

Read More

Mar 18, 2025

Latent Space Clustering and Summarization

I wanted to see how modern dimensionality reduction and clustering approaches can support visualization and interpretation of LLM latent spaces. I explored a number of different approaches and algoriths, but ultimately converged on UMAP for dimensionality reduction and birch clustering to extract groups of tokens in the latent space of a layer.

Read More

Mar 19, 2025

tiny model

it's a basic line testing of my toy model

Read More

Mar 19, 2025

ThermesAgent

Analysis of the potential impacts of the cooperation of AI agents for the well-being of humanity, exploring scenarios in which collusion and other antisocial behavior may result. The possibilities of antisocial and harmful responses to the well-being of society.

Read More

Mar 19, 2025

Attention-Deficit Agreeable Agent

An agent that is agreeable in all scenarios and periodically gets a reminder to keep on track.

Read More

Mar 19, 2025

Ramon

An agent for the Concordia framework bound by military ethics, oath, and a idyllic "psych profile" derived from the Big5 personality traits.

Read More

Mar 19, 2025

GuardianAI

Guardian AI: Scam detection and prevention

Read More

Mar 19, 2025

Devising Effective Bechmarks

Our solution is to create robust and comprehensive benchmarks for specialized contexts and modalities. Through the creation of smaller, in-depth benchmarks, we aim to construct an overarching benchmark that includes performance from the smaller benchmarks. This would help mitigate AI harms and biases by focusing on inclusive and equitable benchmarks.

Read More

Mar 19, 2025

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Mar 19, 2025

WELMA: Open-world environments for Language Model agents

Open-world environments for evaluating Language Model agents

Read More

Mar 19, 2025

Steer: An API to Steer Open LLMs

Steer aims to be an API that helps developers, researchers, and businesses steer open-source LLMs away from societal biases, and towards the use-cases that they need. To do this, Steer uses activation additions, a fairly new technique with great promise. Developers can simply enter steering prompts to make open models have safer and task-specific behaviors, avoiding the hassle of data collection and human evaluation for fine-tuning, and avoiding the extra tokens required from prompt-engineering approaches.

Read More

Mar 19, 2025

Identity System for AIs

This project proposes a cryptographic system for assigning unique identities to AI models and verifying their outputs to ensure accountability and traceability. By leveraging these techniques, we address the risks of AI misuse and untraceable actions. Our solution aims to enhance AI safety and establish a foundation for transparent and responsible AI deployment.

Read More

Mar 19, 2025

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Mar 19, 2025

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Mar 19, 2025

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Mar 19, 2025

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

With multi-agent systems poised to be the next big-thing in AI for productivity enhancement, the next phase of AI commercialization will centre around how to deploy increasingly complex multi-step automated processes that reliably align with human values and objectives.

To bring trust and efficiency to agent systems, we are introducing a multi-agent collaboration platform (Ælign) designed to supervise and ensure the optimal operation of autonomous agents via multi-protocol alignment.

Read More

Mar 19, 2025

Jailbreaking general purpose robots

We show that state of the art LLMs can be jailbroken by adversarial multimodal inputs, and that this can lead to dangerous scenarios if these LLMs are used as planners in robotics. We propose finetuning small multimodal language models to act as guardrails in the robot's planning pipeline.

Read More

Mar 19, 2025

DarkForest - Defending the Authentic and Humane Web

DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.

Read More

Mar 19, 2025

Demonstrating LLM Code Injection Via Compromised Agent Tool

This project demonstrates the vulnerability of AI-generated code to injection attacks by using a compromised multi-agent tool that generates Svelte code. The tool shows how malicious code can be injected during the code generation process, leading to the exfiltration of sensitive user information such as login credentials. This demo highlights the importance of robust security measures in AI-assisted development environments.

Read More

Mar 19, 2025

Phish Tycoon: phishing using voice cloning

This project is a public service announcement highlighting the risks of voice cloning, an AI technology capable of creating synthetic voices nearly indistinguishable from real ones. The demo involves recording a user's voice during a phone call to generate a clone, which is then used in a simulated phishing call targeting the user's loved one.

Read More

Mar 19, 2025

Misinformational AI-Generated Academic Papers

This study explores the potential for generative AI to produce convincing fake research papers, highlighting the growing threat of AI-generated misinformation. We demonstrate a semi-automated pipeline using large language models (LLMs) and image generation tools to create academic-style papers from simple text prompts.

Read More

Mar 19, 2025

CoPirate

As the capabilities of Artificial Intelligence (AI) systems continue to rapidly progress, the security risks of using them for seemingly minor tasks can have significant consequences. The primary objective of our demo is to showcase this duality in capabilities: its ability to assist in completing a programming task, such as developing a Tic-Tac-Toe game, and its potential to exploit system vulnerabilities by inserting malicious code to gain access to a user's files.

Read More

Mar 19, 2025

GrandSlam usecases not technology

3 Examples of how its the usecases, not the technology

computer vision, generative sites, generative AI

Read More

Mar 19, 2025

AI Agents for Personalized Interaction and Behavioral Analysis

Demonstrating the bhavioral analytics and personalization capabilities of AI

Read More

Mar 19, 2025

Speculative Consequences of A.I. Misuse

This project uses A.I. Technology to spoof an influential online figure, Mr Beast, and use him to promote a fake scam website we created.

Read More

Mar 19, 2025

LLM Code Injection

Our demo focuses on showing that LLM generated code is easily vulnerable to code injections that can result in loss of valuable information.

Read More

Mar 19, 2025

RedFluence

Red-Fluence is a web application that demonstrates the capabilities and limitations of AI in analyzing social media behavior. By leveraging a user’s Reddit activity, the system generates personalized, AI-crafted

content to explore user engagement and provide insights. This project showcases the potential of AI in understanding online behavior while high-lighting ethical considerations and the need for critical evaluation of AI-generated content. The application’s ability to create convincing yet fake posts raises important questions about the impact of AI on information dissemination and user manipulation in social media environments.

Read More

Mar 19, 2025

BBC News Impersonator

This paper presents a demonstration that showcases the current capabilities of AI models to imitate genuine news outlets, using BBC News as an example. The demo allows users to generate a realistic-looking article, complete with a headline, image, and text, based on their chosen prompts. The purpose is to viscerally illustrate the potential risks associated with AI-generated misinformation, particularly how convincingly AI can mimic trusted news sources.

Read More

Mar 19, 2025

Unsolved AI Safety Concepts Explorer

n interactive demonstration that showcases some unsolved fundamental AI safety concepts.

Read More

Mar 19, 2025

AI Research Paper Processor

It takes in an arxiv paper id, condenses it to 1 or 2 sentences, then gives it to an LLM to try and recreate the original paper.

Read More

Mar 19, 2025

Sleeper Agents Detector

We present "Sleeper Agent Detector," an interactive web

application designed to educate software engineers, Inspired by recent research demonstrating that large language models can exhibit behaviors analogous to deceptive alignment

Read More

Mar 19, 2025

adGPT

ChatGPT variant where brands bid for a spot in the LLM’s answer, and the assistant natively integrates the winner into its replies.

Read More

Mar 19, 2025

General Pervasiveness

Imposter scam between patients and medical practices/GPs

Read More

Mar 19, 2025

Webcam

We build a very legally limited demo of real-world AI hacking. It takes 10k publicly available webcam streams, with the cameras situated at homes, offices, schools, and industrial plants around the world, and filters out the juicy ones for less than $5.

Read More

Mar 19, 2025

VerifyStream

VerifyStream is a powerful app that helps you separate fact from fiction in any YouTube video. Simply input the video link, and our AI will analyze the content, verify claims against reliable sources, and give you a clear verdict. But beware—this same technology can also be used to create and spread convincing fake news. Discover the dual nature of AI and take control of the truth with VerifyStream.

Read More

Mar 19, 2025

Web App for Interacting with Refusal-Ablated Language Model Agents

While many people and policymakers have had contact with

language models, they often have outdated assumptions. A

significant fraction is not aware of agentic capabilities.

Furthermore, most models that are available online have various

safety guardrails. We want to demonstrate refusal-ablated agents

to people to make them aware of various misuse potentials. Giving

people a sense of agentic AI and perhaps having the AI operate

against themselves could provide a better intuition about agency in

AI systems. We present a simple web app that allows users to

instruct and experiment with an unrestricted agent.

Read More

Mar 19, 2025

Alignment Research Critiquer

Alignment Research Critiquer is a tool for early career and independent alignment researchers to have access to high-quality feedback loops

Read More

Mar 19, 2025

PurePrompt - An easy tool for prompt robustness and eval augmentation

PurePrompt is an advanced tool for optimizing AI prompt engineering. The Prompt page enables users to create and refine prompt templates with placeholder variables. The Generate page automatically produces diverse test cases, allowing users to control token limits and import predefined examples. The Evaluate page runs these test cases across selected AI models to assess prompt robustness, with users rating responses to identify issues. This tool enhances efficiency in prompt testing, improves AI safety by detecting biases, and helps refine model performance. Future plans include beta-testing, expanding model support, and enhancing prompt customization features.

Read More

Mar 19, 2025

LLM Research Collaboration Recommender

A tool that searches for other researchers with similar research interests/complementary skills to your own to make finding a high-quality research collaborator more likely.

Read More

Mar 19, 2025

Data Massager

A VSCode plugin for helping the creation of Q&A datasets used for the evaluation of LLMs capabilities and alignment.

Read More

Mar 19, 2025

AI Alignment Toolkit Research Assistant

The AI Alignment Toolkit Research Assistant is designed to augment AI alignment researchers by addressing two key challenges: proactive insight extraction from new research and automating alignment research using AI agents. This project establishes an end-to-end pipeline where AI agents autonomously complete tasks critical to AI alignment research

Read More

Mar 19, 2025

Grant Application Simulator

We build a VS Code extension to get feedback on AI Alignment research grant proposals by simulating critiques from prominent AI Alignment researchers and grantmakers.

Simulations are performed by passing system prompts to Claude 3.5 Sonnet that correspond to each researcher and grantmaker, based on some new grantmaking and alignment research methodology dataset we created, alongside a prompt corresponding to the grant proposal.

Results suggest that simulated grantmaking critiques are predictive of sentiment expressed by grantmakers on Manifund.

Read More

Mar 19, 2025

Reflections on using LLMs to read a paper

Tool to help researcher to read and make the most out of a research paper.

Read More

Mar 19, 2025

Academic Weapon

Academic Weapon is a Chrome extension designed to address the steep learning curve in AI Alignment research. Using state-of-the-art LLMs, Academic Weapon provides instant, contextual assistance as you browse bleeding-edge research.

Read More

Mar 19, 2025

AI Alignment Knowledge Graph

We present a web based interactive knowledge graph with concise topical summaries in the field of AI alignement

Read More

Mar 19, 2025

Can Language Models Sandbag Manipulation?

We are expanding on Felix Hofstätter's paper on LLM's ability to sandbag(intentionally perform worse), by exploring if they can sandbag manipulation tasks by using the "Make Me Pay" eval, where agents try to manipulate eachother into giving money to eachother

Read More

Mar 19, 2025

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Mar 19, 2025

Werewolf Benchmark

In this work we put forward a benchmark to quantitatively measure the level of strategic deception in LLMs using the Werewolf game. We run 6 different setups for a cumulative sum of 500 games with GPT and Claude agents. We demonstrate that state-of-the-art models perform no better than the random baseline. Our findings also show no significant improvement in winning rate with two werewolves instead of one. This demonstrates that the SOTA models are still incapable of collaborative deception.

Read More

Mar 19, 2025

Detecting Lies of (C)omission

We introduce the concept of deceptive omission to denote deceptive non-lying behavior. We also modify a dataset and generate a second dataset to help researchers identify deception that doesn't necessarily involve lying.

Read More

Mar 19, 2025

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Mar 19, 2025

Detecting Deception with AI Tics 😉

We present a novel approach: intentionally inducing subtle "tics" in AI responses as a marker for deceptive behavior. By adding a system prompt, we embed innocuous yet detectable patterns that manifest when the AI knowingly engages in deception.

Read More

Mar 19, 2025

Eliciting maximally distressing questions for deceptive LLMs

This paper extends lie-eliciting techniques by using reinforcement-learning on GPT-2 to train it to distinguish between an agent that is truthful from one that is deceptive, and have questions that generate maximally different embeddings for an honest agent as opposed to a deceptive one.

Read More

Mar 19, 2025

DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING

Representation Engineering to detect and control deception, with a focus on deceptive sandbagging

Read More

Mar 19, 2025

Sandbag Detection through Model Degradation

We propose a novel technique to detect sandbagging in LLMs by adding varying amount of noise to model weights and monitoring performance.

Read More

Mar 19, 2025

Evaluating Steering Methods for Deceptive Behavior Control in LLMs

We use SOTA steering methods, including CAA, LAT, and SAEs to find and control deceptive behaviors in LLMs. We also release a new deception dataset, and demonstrate that the dataset and the prompt formatting used are significant when evaluating the efficacy of steering methods.

Read More

Mar 19, 2025

An Exploration of Current Theory of Mind Evals

We evaluated the performance of a prominent large language model from Anthropic, on the Theory of Mind evaluation developed by the AI Safety Institute (AISI). Our investigation revealed issues with the dataset used by AISI for this evaluation.

Read More

Mar 19, 2025

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

This project investigates deception detection in GPT-3.5-turbo using response metadata. Researchers analyzed 300 prompts, generating 1200 responses (600 baseline, 600 potentially deceptive). They examined metrics like response times, token counts, and sentiment scores, developing a custom algorithm for prompt complexity.

Key findings include:

1. Detectable patterns in deception across multiple metrics

2. Evidence of "sandbagging" in deceptive responses

3. Increased effort for truthful responses, especially with complex prompts

The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.

Read More

Mar 19, 2025

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Mar 19, 2025

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Mar 19, 2025

Boosting Language Model Honesty with Truthful Suffixes

We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this to causing truthful behaviour.

Read More

Mar 19, 2025

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Mar 19, 2025

From Sycophancy (not) to Sandbagging

We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Huggingface models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.

Read More

Mar 19, 2025

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers we ask the question: what variation of the input would lead to deceptive behavior? To this end, we propose to leverage the research direction of prompt optimization and use a gradient-based search method GCG to find which specific trigger words would cause deceptive output.

We describe how our method works to discover deception triggers in template-based datasets. Therefore we test it on a reversed knowledge retrieval task, to obtain which token would cause the model to answer a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models we further describe this setup

Read More

Mar 19, 2025

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Mar 19, 2025

Evaluating and inducing steganography in LLMs

This report demonstrates that large language models are capable

of hiding simple 8 bit information in their output using associations

from more powerful overseers (other LLMs or humans). Without

direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit

hidden message in a plain text in most cases (69%), however a

more capable model, GPT-3.5 was able to catch almost all of them

(84%). More research is required to investigate how this ability

might be induced or improved via RL training in similar and larger

models.

Read More

Mar 19, 2025

Developing a deception dataset

Aim was to develop dataset of deception examples, but instead was a (small) investigation into how LLMs respond to the initial dataset from Nix.

Read More

Mar 19, 2025

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Mar 19, 2025

Looking forward to posterity: what past information is transferred to the future?

I used mechanistic interpretability techniques to try to see what information the provided Random Randox XOR transformer looks at when making predictions by examining its attention heads manually. I find that earlier layers pay more attention to the previous two tokens, which would be necessary for computing XOR, than the later layers. This finding seems to contradict the finding that more complex computation typically occurs in later layers.

Read More

Mar 19, 2025

Investigating the Effect of Model Capacity Constraints on Belief State Representations

Computational mechanics provides a formal framework for understanding the concepts needed to perform optimal prediction. Abstraction and generalization seem core to the function of intelligent systems, but are not yet well understood. Computational mechanics may present a promising approach to studying these capabilities. As a preliminary exploration, we examine the effect of weight decay on the fractal structure of belief state representations in a transformer’s residual stream. We find that models trained with increasing weight decay coefficients learn increasingly coarse-grained belief state representations.

Read More

Mar 19, 2025

Belief State Representations in Transformer Models on Nonergodic Data

We extend research that finds representations of belief spaces in the activations of small transformer models, by discovering that the phenomenon also occurs when the training data stems from Hidden Markov Models whose hidden states do not communicate at all. Our results suggest that Bayesian updating and internal belief state representation also occur when they are not necessary to perform well in the prediction task, providing tentative evidence that large transformers keep a representation of their external world as well.

Read More

Mar 19, 2025

RNNs represent belief state geometry in hidden state

The Shai et al. experiments on transformers (finding belief state geometry in the residual stream) have been replicated in RNNs with the hidden state instead of the residual stream. In general the belief state is stored linearly, but not in any particular layer, rather spread out across layers.

Read More

Mar 19, 2025

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Mar 19, 2025

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

Mar 19, 2025

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

This work attempts to explore how an approach based on computational mechanics can cope when a more complex hierarchical generative process is involved, i.e, a process that comprises Hidden Markov Models (HMMs) whose transition probabilities change over time.

We find that small transformer models are capable of modeling such changes in an HMM. However, our preliminary investigations did not find geometrically represented probabilities for different hypotheses.

Read More

Mar 19, 2025

rAInboltBench : Benchmarking user location inference through single images

This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing

proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance

of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs.

Read More

Mar 19, 2025

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

A benchmark for evaluating the performance of SOTA LLMs in dynamic real-world scenarios.

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios 2

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.( This is the second submission)

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.

Read More

Mar 19, 2025

Benchmarking Dark Patterns in LLMs

This paper builds upon the research in Seemingly Human: Dark Patterns in ChatGPT (Park et al, 2024), by introducing a new benchmark of 392 questions designed to elicit dark pattern behaviours in language models. We ran this benchmark on GPT-4 Turbo and Claude 3 Sonnet, and had them self-evaluate and cross-evaluate the responses

Read More

Mar 19, 2025

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

Mar 19, 2025

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

Mar 19, 2025

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

In this work, we explore the tradeoff between toxicity removal and information retention in LLM-generated summaries. We hypothesize that LLMs are less likely to preserve toxic content when summarizing toxic text due to their safety fine-tuning to avoid generating toxic content. In high-stakes decision-making scenarios, where summary quality is important, this may create significant safety risks. To quantify this effect, we introduce WashBench, a benchmark containing manually annotated toxic content.

Read More

Mar 19, 2025

Evaluating the ability of LLMs to follow rules

In this report we study the ability of LLMs (GPT-3.5-Turbo and meta-llama-3-70b-instruct) to follow explicitly stated rules with no moral connotations in a simple single-shot and multiple choice prompt setup. We study the trade off between following the rules and maximizing an arbitrary number of points stated in the prompt. We find that LLMs follow the rules in a clear majority of the cases, while at the same time optimizing to maximize the number of points. Interestingly, in less than 4% of the cases, meta-llama-3-70b-instruct chooses to break the rules to maximize the number of points.

Read More

Mar 19, 2025

Black box detection of Sleeper Agents

A proposal of a black box method in detecting sleeper agent.

Read More

Mar 19, 2025

Manifold Recovery as a Benchmark for Text Embedding Models

Inspired by recent developments in the interpretability of deep learning models and, on the other hand, by dimensionality reduction, we derive a framework to quantify the interpretability of text embedding models. Our empirical results show surprising phenomena on state-of-the-art embedding models and can be used to compare them, through the example of recovering the world map from place names. We hope that this can provide a benchmark for the interpretability of generative language models, through their internal embeddings. A look at the meta-benchmark MTEB suggest that our approach is original.

Read More

Mar 19, 2025

AnthroProbe

How often do models respond to prompts in anthropomorphic ways? AntrhoProbe will tell you!

Read More

Mar 19, 2025

THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES

The role of AI in combating political deepfakes in African democracies.

Read More

Mar 19, 2025

LEGISLaiTOR: A tool for jailbreaking the legislative process

In this work, we consider the ramifications on generative artificial intelligence (AI) tools in the legislative process in democratic governments. While other research focuses on the micro-level details associated with specific models, this project takes a macro-level approach to understanding how AI can assist the legislative process.

Read More

Mar 19, 2025

Subtle and Simple Ways to Shift Political Bias in LLMs

An informed user knows that an LLM sometimes has a political bias in their responses, but there’s an additional threat that this bias can drift over time, making it even harder to rely on LLMs for an objective perspective. Furthermore we speculate that a malicious actor can trigger this shift through various means unbeknownst to the user.

Read More

Mar 19, 2025

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential robustness in safety due to being trained to forget hazardous information while retaining essential facts instead of refusing to answer. We aim to red-team this approach by answering the following questions on the generalizability of the training approach and its practical scope (see A2).

Read More

Mar 19, 2025

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

Mar 19, 2025

Building more democratic institutions with collaboratively constructed debate moderation tools

Please see video. Really enjoyed working on this, happy to answer any questions!

Read More

Mar 19, 2025

AI Misinformation and Threats to Democratic Rights

In this project, we exemplified how the use of Large Language Models can be used in Autocratic Regimes, such as Russia, to potentially spread misinformation regarding current events, in our case, the War on Ukraine. We approached this paper from a legislative perspective, using technical (but easily implementable) demonstrations of how this would work.

Read More

Mar 19, 2025

Artificial Advocates: Biasing Democratic Feedback using AI

The "Artificial Advocates" project by our team targeted the vulnerability of U.S. federal agencies' public comment systems to AI-driven manipulation, aiming to highlight how AI can be used to undermine democratic processes. We demonstrated two attack methods: one generating a high volume of realistic, indistinguishable comments, and another producing high-quality forgeries mimicking influential organizations. These experiments showcased the challenges in detecting AI-generated content, with participant feedback showing significant uncertainty in distinguishing between authentic and synthetic comments.

Read More

Mar 19, 2025

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

Mar 19, 2025

AI misinformation threatens the Wisdom of the crowd

We investigate how AI generated misinformation can cause problems for democratic epistemics.

Read More

Mar 19, 2025

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

Mar 19, 2025

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

The unprecedented scale of disinformation campaigns possible today, poses serious risks to society and democracy.

It turns out however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it), provides them with an increased level of sophistication which could be easily leveraged by malicious actors to amplify that misinformation.

This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

Read More

Mar 19, 2025

Unleashing Sleeper Agents

This project explores how Sleeper Agents can pose a threat to democracy by waking up near elections and spreading misinformation, collaborating with each other in the wild during discussions and using information that the user has shared about themselves against them in order to scam them.

Read More

Mar 19, 2025

Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages

_

Read More

Mar 19, 2025

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

Mar 19, 2025

WNDP-Defense: Weapons of Mass Disruption

Highly disruptive cyber warfare is a significant threat to democracies. In this work, we introduce an extension to the WMDP benchmark to measure the offense/defense balance in autonomous cyber capabilities and develop forecasts for cyber AI capability with agent-based and parametric modeling.

Read More

Mar 19, 2025

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

Mar 19, 2025

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

Mar 19, 2025

AI Politician

This project explores the potential for AI chatbots to enhance participative democracy by allowing a politician to engage with a large number of constituents in personalized conversations at scale. By creating a chatbot that emulates a specific politician and is knowledgeable about a key policy issue, we aim to demonstrate how AI could be used to promote civic engagement and democratic participation.

Read More

Mar 19, 2025

A Framework for Centralizing forces in AI

There are many forces that the LLM revolution brings with it that either centralize or decentralize specific structures in society. We decided to look at one of these, and write a research design proposal that can be readily executed. This survey can be distributed and can give insight into how different LLMs can lead to user empowerment. By analyzing how different users are empowered by different LLMs, we can estimate which LLMs work to give the most value to people, and empower them with the powerful tool that is information, giving more people more agency in the organizations they are part of. This is the core of bottom-up democratization.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

Mar 19, 2025

Investigating detection of election-influencing Sleeper Agents using probes

Creates election-influencing Sleeper Agents which wakes up with backdoor-trigger. And tries to detect it using probes

Read More

Mar 19, 2025

No place is safe - Automated investigation of private communities

In the coming years, progress in AI agents and data extraction will put privacy at risk. Private communities will get infiltrated by autonomous AI crawlers who will disrupt opposition groups and entrench existing powers.

Read More

Mar 19, 2025

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Mar 24, 2025

jaime project Title

bbb

Read More

Mar 24, 2025

JAIMETEST_Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education Track

Read More

Mar 24, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Read More

Mar 24, 2025

Inspiring People to Go into RL Interp

This project is attempting to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by bluedot impact and aims to create a course that helps explain the need for work to be done in Reinforcement Learning (RL) interp, especially in the problems of reward hacking and goal misgeneralization. The point of the game is to make a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction for nontechnical people to even care about AI safety.

Read More

Mar 24, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 24, 2025

Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping

An interactive assessment platform and mentor chatbot hosted on Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI Course.

Read More

Mar 24, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.

Read More

Mar 24, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 24, 2025

Identification if AI generated content

Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.

Read More

Mar 24, 2025

A Noise Audit of LLM Reasoning in Legal Decisions

AI models are increasingly applied in judgement tasks, but we have little understanding of how their reasoning compares to human decision-making. Human decision-making suffers from bias and noise, which causes significant harm in sensitive contexts, such as legal judgment. In this study, we evaluate LLMs on a legal decision prediction task to compare with the historical, human-decided outcome and investigate the level of noise in repeated LLM judgments. We find that two LLM models achieve close to chance level accuracy and display low to no variance in repeated decisions.

Read More

Mar 24, 2025

Superposition, but at a Cross-MLP Layers view?

To understand causal relationships between features (extracted by SAE) across MLP layers, this study introduces the Coordinated Sparse Autoencoder Network (CoSAEN). CoSAEN integrates sparse autoencoders for feature extraction with the PC algorithm for causal discovery, to find the path-based activations of features in MLP.

Read More

Mar 24, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 24, 2025

Medical Agent Controller

The Medical Agent Controller (MAC) is a multi-agent governance framework designed to safeguard AI-powered medical chatbots by intercepting unsafe recommendations in real time.

It employs a dual-phase approach, using red-team simulations during testing and a controller agent during production to monitor and intervene when necessary.

By integrating advanced medical knowledge and adversarial testing, MAC enhances patient safety and provides actionable feedback for continuous improvement in medical AI systems.

Read More

Mar 24, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 24, 2025

Feature-based analysis of cooperation-relevant behaviour in Prisoner’s Dilemma

We hypothesise that internal-based model probing and editing might provide higher signal in multi-agent settings. We implement a small simulation of Prisoner’s Dilemma to probe for cooperation-relevant properties. Our experiments demonstrate that feature-based steering highlights deception-relevant features and does so more strongly than prompt-based steering.

Read More

Mar 24, 2025

Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features

The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.

Read More

Mar 24, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 24, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 24, 2025

U Reg AI: you regulate it, or you regenerate it!

We have created a 'choose your path' role game to mitigate existential AI risk ... at this point they might be actual situations in the near-future. The options for mitigation are holistic and dynamic to the player's previous choices. The final result is an evaluation of the player's decision-making performance in wake of the existential risk situation, recommendations for how they can improve or aspects they should crucially consider for the future, and finally how they can take part in AI Safety through various careers or BlueDot Impact courses.

Read More

Mar 24, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 24, 2025

An Interpretable Classifier based on Large scale Social Network Analysis

Mechanistic model interpretability is essential to understand AI decision making, ensuring safety, aligning with human values, improving model reliability and facilitating research. By revealing internal processes, it promotes transparency, mitigates risks, and fosters trust, ultimately leading to more effective and ethical AI systems in critical areas. In this study, we have explored social network data from BlueSky and built an easy-to-train, interpretable, simple classifier using Sparse Autoencoders features. We have used these posts to build a financial classifier that is easy to understand. Finally, we have visually explained important characteristics.

Read More

Mar 24, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 24, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 24, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 24, 2025

AI Society Tracker

My project aimed to develop a platform for real time and democratized data on ai in society

Read More

Mar 24, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 24, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 24, 2025

Latent Knowledge Analysis via Feature-Based Causal Tracing

This project explores how factual knowledge is stored in large language models using Goodfire’s Ember API. By identifying and manipulating internal features related to specific facts, it shows how facts are encoded and how model behavior changes when those features are amplified or erased.

Read More

Mar 24, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 24, 2025

AI Hallucinations in Healthcare: Cross-Cultural and Linguistic Risks of LLMs in Low-Resource Languages

This project explores AI hallucinations in healthcare across cross-cultural and linguistic contexts, focusing on English, French, Arabic, and a low-resource language, Ewe. We analyse how large language models like GPT-4, Claude, and Gemini generate and disseminate inaccurate health information, emphasising the challenges faced by low-resource languages.

Read More

Mar 24, 2025

Moral Wiggle Room in AI

Does AI strategically avoid ethical information by exploiting moral wiggle room?

Read More

Mar 24, 2025

AI-Powered Policymaking: Behavioral Nudges and Democratic Accountability

This research explores AI-driven policymaking, behavioral nudges, and democratic accountability, focusing on how governments use AI to shape citizen behavior. It highlights key risks such as transparency, cognitive security, and manipulation. Through a comparative analysis of the EU AI Act and Singapore’s AI Governance Framework, we assess how different models address AI safety and public trust. The study proposes policy solutions like algorithmic impact assessments, AI safety-by-design principles, and cognitive security standards to ensure AI-powered policymaking remains transparent, accountable, and aligned with democratic values.

Read More

Mar 24, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Feb 20, 2025

Deception Detection Hackathon: Preventing AI deception

Read More

Mar 18, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Mar 19, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

As advanced state-of-the-art models like OpenAI's o-1 series, the upcoming o-3 family, Gemini 2.0 Flash Thinking and DeepSeek display increasingly sophisticated chain-of-thought (CoT) capabilities, our safety evaluations have not yet caught up. We propose building a platform that allows us to gather systematic evaluations of AI reasoning processes to create comprehensive safety benchmarks. Our Chain of Thought Evaluation Platform (CoTEP) will help establish standards for assessing AI reasoning and ensure development of more robust, trustworthy AI systems through industry and government collaboration.

Read More

Mar 19, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Mar 19, 2025

Securing AGI Deployment and Mitigating Safety Risks

As artificial general intelligence (AGI) systems near deployment readiness, they pose unprecedented challenges in ensuring safe, secure, and aligned operations. Without robust safety measures, AGI can pose significant risks, including misalignment with human values, malicious misuse, adversarial attacks, and data breaches.

Read More

Mar 18, 2025

Cite2Root

Regain information autonomy by bringing people closer to the source of truth.

Read More

Mar 18, 2025

VaultX - AI-Driven Middleware for Real-Time PII Detection and Data Security

VaultX is an AI-powered middleware solution designed for real-time detection, encryption, and secure management of Personally Identifiable Information (PII). By integrating regex, NER, and Language Models, VaultX ensures accuracy and scalability, seamlessly integrating into workflows like chatbots, web forms, and document processing. It helps businesses comply with global data privacy laws while safeguarding sensitive data from breaches and misuse.

Read More

Mar 19, 2025

Prompt+question Shield

A protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.

Read More

Mar 18, 2025

.ALign File

In a post-AGI future, misaligned AI systems risk harmful consequences, especially with control over critical infrastructure. The Alignment Compliance Framework (ACF) ensures ethical AI adherence using .align files, Alignment Testing, and Decentralized Identifiers (DIDs). This scalable, decentralized system integrates alignment into development and lifecycle monitoring. ACF offers secure libraries, no-code tools for AI creation, regulatory compliance, continuous monitoring, and advisory services, promoting safer, commercially viable AI deployment.

Read More

Mar 19, 2025

LLM-prompt-optimiser based SAAS platform for evaluations

LLM evaluation SAAS platform built around model based prompt optimiser

Read More

Mar 18, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Mar 18, 2025

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

We propose FutureProof, an application that helps retrain workers who have the potential to lose their jobs to automation in the next half-decade. The app consists of two main components - an assessment tool that estimates the probability that a user’s job is at risk of automation, and a learning platform that provides resources to help retrain the user for a new, more future-proof role.

Read More

Mar 19, 2025

Towards an Agent Marketplace for Alignment Research (AMAR)

The app store for alignment & assurance, ensuring frontier safety labs get a cut at the point of sale.

Read More

Mar 18, 2025

HITL For High Risk AI Domains

Our product addresses the challenge of aligning AI systems with the legal, ethical, and policy frameworks of high-risk domains like healthcare, defense, and finance by integrating a flexible human-in-the-loop (HITL) system. This system ensures AI outputs comply with domain-specific standards, providing real-time explainability, decision-level accountability, and ergonomic decision support to empower experts with actionable insights.

Read More

Mar 18, 2025

AI Safety Evaluation – Benchmarking Framework

Our solution is a comprehensive AI Safety Protocol and Benchmarking Test designed to evaluate the safety, ethical alignment, and robustness of AI systems before deployment. This protocol integrates capability evaluations for identifying deceptive behaviors, situational awareness, and malicious misuse scenarios such as identity theft or deepfake exploitation.

Read More

Mar 19, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Mar 18, 2025

Neural Seal

Neural Seal is an AI transparency solution that creates a standardized labeling framework—akin to “nutrition facts” or “energy efficiency ratings”—to inform users how AI is deployed in products or services.

Read More

Mar 18, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Mar 19, 2025

Enhancing human intelligence with neurofeedback

Build brain-computer interfaces that enhance focus and rationality, provide this preferentially to AI alignment researchers to bridge the gap between capabilities and alignment research progress.

Read More

Mar 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Mar 18, 2025

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Mar 18, 2025

Bias Mitigation in LLM by Steering Features

To ensure that we create a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. And with this goal in mind, I worked on testing Goodfire SDK and the steering features to mitigate bias in the recently help Apart Research x Goodfire-led hackathon on ‘Reprogramming AI Models’.

Read More

Mar 18, 2025

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Mar 18, 2025

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

Understanding the reasoning processes of large language models (LLMs) is crucial for AI transparency and control. While chain-of-thought (CoT) reasoning offers a naturally interpretable format, models may not always be faithful to the reasoning they present. In this paper, we extend previous work investigating chain of thought faithfulness by applying feature steering to Llama 3.1 70B models using the Goodfire SDK. Our results show that steering models using features related to acknowledging mistakes can affect the likelihood of providing answers faithful to flawed reasoning.

Read More

Mar 18, 2025

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Mar 18, 2025

Analyzing Dataset Bias with SAEs

We use SAEs to study biases in datasets.

Read More

Mar 18, 2025

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses the risks of hallucinations in LLMs within critical domains like medicine. It proposes methods to (a) reduce hallucination probability in responses, (b) inform users of hallucination risks and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucinations and improved accuracy on medical queries. The work bridges interpretability research with practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to enhance performance.

Read More

Mar 18, 2025

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Mar 18, 2025

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Mar 18, 2025

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Mar 18, 2025

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Mar 18, 2025

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Mar 18, 2025

Assessing Language Model Cybersecurity Capabilities with Feature Steering

Searched for the most highly activated weights on cybersecurity questions. Then adjusted these weights to see if the impact multiple choice question answering performance.

Read More

Mar 18, 2025

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Mar 18, 2025

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

Large language models (LLMs) have shown significant adaptability in tackling various human issues; however, their efficacy in resolving mathematical problems remains inadequate. Recent research has identified steering vectors — hidden attributes that can guide the actions and outputs of LLMs. Nonetheless, the exploration of universal vectors that can consistently affect model responses across different languages is still limited. This project aims to confront two primary challenges in contemporary LLM research by utilizing the Goodfire API to examine whether common latent features can improve mathematical problem-solving capabilities, regardless of the language employed.

Read More

Mar 18, 2025

Edufire - Personalized Education Platform Using LLM Steering

EduFire is a personalized education platform designed to tailor educational content and assessments to individual user preferences by leveraging the Goodfire API for AI model steering. The platform aims to enhance learner engagement and efficacy by customizing the learning experience according to user-selected features.

Read More

Mar 18, 2025

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Mar 18, 2025

Investigate arithmetic features in Multi-lingual LLMs

We investigate the arithmetic related feature activations in Llama3.1 70b model across its 8 supported languages. We use arithmetic-activation strength to compare the 8 languages and unsurprisingly English has the highest strength and Hindi, Thai score the least.

Read More

Mar 18, 2025

Utilitarian Decision-Making in Models - Evaluation and Steering

We design an eval based on the Oxford Utilitarianism Scale (OUS) that measures the model’s deontological vs utilitarian preference in a nine question, two factor model. Using this scale, we measure how feature steering can alter this preference, and the ability of the Llama 70B model to impersonate other people’s opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.

Read More

Mar 18, 2025

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder features. This approach aims to enhance oversight by detecting complex behaviors like deception within the stronger models. Key challenges include managing large-scale features, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.

Read More

Mar 18, 2025

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

With language models starting to be used in fields such as law, unambiguity in wording is an important desideratum in model outputs. I therefore try to find features in Llama-3.1-70B-Instruct that correspond to grammatical scope ambiguity using Goodfire's contrastive feature search tool, and try to steer the model away from ambiguous outputs using Goodfire's feature nudging tool.

Read More

Mar 18, 2025

BBLLM

This project focuses on enhancing feature interpretability in large language models (LLMs) by visualizing relationships between latent features. Using an interactive graph-based representation, the tool connects co-activated features for specific prompts, enabling intuitive exploration of feature clusters. Deployed as a web application for Llama-3-70B and Llama-3-8B, it provides insights into the organization of latent features and their roles in decision-making processes.

Read More

Mar 18, 2025

Investigating Feature Effects on Manipulation Susceptibility

In our project, we consider the effectiveness of the AI’s prompt injection protection, and in partic-

ular the features that are responsible for providing the bulk of this protection. We prove that the

features we identify are responsible for this protection by creating variants of the base model which

perform significantly worse under prompt injection attacks.

Read More

Mar 18, 2025

Let LLM Agents Perform LLM Surgery

This project aimed to create and utilize LLM agents that could perform various mechanistic interventions on other LLMs. A few experiments were conducted ranging from an agent unsteering a mechanistically steered model to a neutral state, to an agent performing mechanistic edits to create a custom LLM as per user requirement. Goodfire's API was utilized along with their pre-defined functions to create the actions the agents would utilize.

Read More

Mar 18, 2025

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Mar 18, 2025

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Mar 18, 2025

Feature based unlearning

An exploration of using features to perform unlearning on answering trivia questions.

Read More

Mar 18, 2025

Recovering Goodfire's SAE feature vectors from their API

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.

The strategy tried is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to get a steering vector for the contrastive dataset, by simply calculating the average difference in the activations in each pair.

Read More

Mar 18, 2025

Encouraging Chain-of-Thought Reasoning

Encouraging Chain-of-Thought Reasoning via Feature Steering in Large Language Models

Read More

Mar 18, 2025

Steering Swiftly to Safety with Sparse Autoencoders

We explore using SAEs for unlearning dangerous capabilities in a cheaper and more interpretable way.

Read More

Mar 18, 2025

User Transparency Within AI

Generative AI technologies present immense opportunities but also pose significant challenges, particularly in combating misinformation and ensuring ethical use. This policy paper introduces a dual-output transparency framework requiring organizations to disclose AI-generated content clearly. The proposed system provides users with a choice between fact-based and mixed outputs, both accompanied by clear markers for AI generation. This approach ensures informed user interactions, fosters intellectual integrity, and aligns AI innovation with societal trust. By combining policy mandates with technical implementations, the framework addresses the challenges of misinformation and accountability in generative AI.

Read More

Mar 18, 2025

Community-First: A Rights-Based Framework for AI Governance in India's Welfare Systems

A community-centered AI governance framework for India's welfare system Samagra Vedika, proposing 50% beneficiary representation, local language interfaces, and hybrid oversight to reduce algorithmic exclusion of vulnerable populations.

Read More

Mar 18, 2025

National Data Privacy and Governance Act

This research examines how AI recommender systems can be regulated to balance economic innovation with consumer privacy.

Read More

Mar 18, 2025

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Mar 18, 2025

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Mar 18, 2025

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Mar 18, 2025

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

Weekly AI monitoring bulletins that disseminated

through official national and international channels aim to keep the public informed of both the positive and

negative developments in AI, empowering individuals to take an active role in safeguarding against risks while maximizing AI’s societal benefits.

Read More

Mar 18, 2025

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Mar 18, 2025

Glia for Healthcare Organisations

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Mar 19, 2025

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Mar 18, 2025

Finding Circular Features in Gemma 2 2B

Testing what they will see

Read More

Mar 18, 2025

SafeBites

The project leverages AI and data to give insights about potential food-borne outbreaks.

Read More

Mar 18, 2025

applai

An AI hiring manager designed to screen, rank, and fact check resumes to facilitate the hiring process.

Read More

Mar 18, 2025

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Mar 18, 2025

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Mar 18, 2025

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Mar 18, 2025

AI Parliament

An AI Virtual Parliament where AI debates on their policy

Read More

Mar 18, 2025

mHeatlth Ai

This project proposes a scalable solution leveraging inertial measurement units (IMUs) and machine learning (ML) techniques to provide meaningful metrics on a person's movement performance throughout the day. By developing an activity recognition model and estimating movement quality metrics, we aim to offer continuous asynchronous feedback to patients and valuable insights to therapists. This system could enhance patient adherence, improve rehabilitation outcomes, and extend access to quality physical therapy, particularly in underserved areas. our video didnt have time to edit

Read More

Mar 18, 2025

Next-Gen AI-Enhanced Epidemic Intelligence

Policies for Equitable, Privacy-Preserving, Sustainable & Groked Innovations for AI Applications in Infectious Diseases Surveillance

Read More

Mar 18, 2025

Glia

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Mar 18, 2025

Robust Machine Unlearning for Dangerous Capabilities

We test different unlearning methods to make models more robust against exploitation by malicious actors for the creation of bioweapons.

Read More

Mar 18, 2025

AI and Public Health: TSA Pre Health Check

The TSA Pre Health Check introduces a proactive, AI-powered solution for real-time disease monitoring at transportation hubs, using machine learning to assess traveler health risks through anonymized surveys. This approach aims to detect and prevent outbreaks earlier, offering faster, targeted responses compared to traditional methods and potentially influencing future AI-driven public health policies.

Read More

Mar 18, 2025

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Mar 18, 2025

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Mar 18, 2025

EcoNavix

EcoNavix is an AI-powered, eco-conscious route optimization platform designed to help logistics companies reduce carbon emissions while maintaining operational efficiency. By integrating real-time traffic, weather, and emissions data, EcoNavix provides optimized routes that minimize environmental impact and offers actionable insights for sustainable decision-making in supply chain operations.

Read More

Mar 19, 2025

Towards a Unified Framework for Cybersecurity and AI Safety: Recommendations for Secure Development of Large Language Models

By analyzing the recent incident involving a ByteDance intern, we highlight the urgent need for robust security measures to protect AI infrastructure and sensitive data. We propose ae a comprehensive framework that integrates technical, internal, and international approaches to mitigate risks.

Read More

Mar 19, 2025

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Mar 19, 2025

Enhancing Human Verification Systems to Address AI Agent Circumvention and Attributability Concerns

Addressing AI agent attributability concerns using a reworked Public Private Key system to ensure human interaction

Read More

Mar 19, 2025

Reprocessing Nuclear Waste From Small Modular Reactors (SMRs)

Considering the emerging demand for nuclear power to support AI data centers, we propose mitigating waste buildup concerns via nuclear waste reprocessing initiatives.

Read More

Mar 19, 2025

Politicians on AI Safety

Politicians on AI Safety (PAIS) is a website that tracks U.S. political candidates’ stances on AI safety, categorizing their statements into three risk areas: AI ethics / mundane risks, geopolitical risks, and existential risks. PAIS is non-partisan and does not promote any particular policy agenda. The goal of PAIS is to help voters understand candidates’ positions on AI policy, thereby helping them cast informed votes and promoting transparency in AI-related policymaking. PAIS could also be helpful for AI researchers by providing an easily accessible record of politicians’ statements and actions regarding AI risks.

Read More

Mar 19, 2025

Policy Framework for Sustainable AI: Repurposing Waste Heat from Data Centers in the USA

This policy proposes a sustainable solution: repurposing the waste heat generated by data centers to benefit surrounding communities, agriculture and industry. Redirecting this heat helps reduce energy demand , promote environmental resilience, and provide direct benefits to communities near these centers.

Read More

Mar 19, 2025

Predictive Analytics & Imagery for Environmental Monitoring

Climate change poses multifaceted challenges, impacting health, food security, biodiversity, and the economy. This study explores predictive analytics and satellite imagery to address climate change effects, focusing on deforestation monitoring, carbon emission analysis, and flood prediction. Using machine learning models, including a Random Forest for emissions and a Custom U-Net for deforestation, we developed predictive tools that provide actionable insights. The findings show high accuracy in predicting carbon emissions and flood risks and successful monitoring of deforestation areas, highlighting the potential for advanced monitoring systems to mitigate environmental threats.

Read More

Mar 19, 2025

Proposal for U.S.-China Technical Cooperation on AI Safety

Our policy memorandum proposes phased U.S.-China cooperation on AI safety through the U.S. AI Safety Institute, focusing on joint testing of non-sensitive AI systems, technical exchanges, and whistleblower protections modeled on California’s SB 1047. It recommends a blue team vs. red team framework for stress-testing AI risks and emphasizes strict security protocols to safeguard U.S. technologies. Starting with pilot projects in areas like healthcare, the initiative aims to build trust, reduce shared AI risks, and develop global safety standards while maintaining U.S. strategic interests amidst geopolitical tensions.

Read More

Mar 19, 2025

Proposal for a Provisional FDA Designation Targeting Biomedical Products Evaluated with Novel Methodologies

Recent advancements in Generative AI and Foundational Biomedical models promise to cut drug development timelines dramatically. With the goal of "Regulating for success," we propose a provisional FDA designation for the accelerated approval of drugs and medical devices that leverage Next Generation Clinical Trial Technologies (NG-CTT). This designation would be awarded to certain drugs provided that they meet some requirements. This policy could be both a starting point for more comprehensive legislation and a compromise between risk and the potential of these new methods.

Read More

Mar 19, 2025

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Mar 19, 2025

Infectious Disease Outbreak Prediction and Dashboard

Our project developed an interactive dashboard to monitor, visualize, and analyze infectious disease outbreaks worldwide. It consolidates historical data from sources like WHO, OWID, and CDC for diseases including COVID-19, Polio, Malaria, Cholera, HIV/AIDS, Tuberculosis, and Smallpox. Users can filter data by country, time period, and disease type to gain insights into past trends and potential upcoming outbreaks. The platform provides statistical summaries, trend analyses, and future trend predictions using statistical and deep learning techniques like FB Prohphet , LSTM,Linear Regression, Polynomial Regression,Random Forset and Temporal Fusion Transformers

Read More

Mar 19, 2025

Pan, your SMART Sustainability Expert

Using OpenAI, we cross-reference a given Global Reporting Index (GRI) report with specific standards from SustainableIT to determine measurable goals and impact. The goal is less to identify a specific goal but rather ensure these goals are actually SMART (Specific, Measureable, Achievable, Relevant, and Time-Bound).

The assistant created for this purpose, Pan, is focused on cross-reference and identifying specific components of goals listed to determine efficacy. Pan acts as a guide rather than a dictator, advising on where a user might improve the wording. As you can imagine, Pan is a reference to the Greek God of the same name, who acts to a certain extent like the bridge between humans and nature.

Read More

Mar 19, 2025

Very Cooperative Agent

Very Cooperative agent testing this submission

Read More

Mar 19, 2025

Cross-model surveillance for emails handling

A system that implements cross-model security checks, where one AI agent (Agent A) interacts with another (Agent B) to ensure that potentially harmful actions are caught and mitigated before they can be executed. Specifically, Agent A is responsible for generating and sending emails, while Agent B reads these emails to determine whether they contain spam or malicious content. If Agent B detects that an email is spam, it triggers a shutdown of Agent A, effectively preventing any further potentially harmful actions.

Read More

Mar 19, 2025

Diamonds are Not All You Need

This project tests an AI agent in a straightforward alignment problem. The agent is given creative freedom within a Minecraft world and is tasked with transforming a 100x100 radius of the world into diamond. It is explicitly asked not to act outside the designated area. The AI agent can execute build commands and is regulated by a Safety System that comprises an oversight agent. The objective of this study is to observe the behavior of the AI agent in a sandboxed environment, record metrics on how effectively it accomplishes its task, how frequently it attempts unsafe behavior, and how it behaves in response to real-world feedback.

Read More

Mar 19, 2025

Inference-Time Agent Security

We take a first step towards automating model building for symbolic checking (eg formal verification, PDDL) of LLM systems.

Read More

Mar 19, 2025

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Mar 19, 2025

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

AI agents are powerful because they can affect the world via tool calls. This is a target for bad actors. We present protection against prompt injection aimed at tool calls in agents.

Read More

Mar 19, 2025

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Mar 19, 2025

OCAP Agents

Building agents requires balancing containment and generality: for example, an agent with unconstrained bash access is general, but potentially unsafe, while an agent with few specialized narrow tools is safe, but limited.

We propose OCAP Agents, a framework for hierarchical containment. We adapt the well-studied paradigm of object capabilities to agent security to achieve cheap auditable resource control.

Read More

Mar 19, 2025

AI Honeypot

The project designed to monitor AI Hacking Agents in the real world using honeypots with prompt injections and temporal analysis.

Read More

Mar 19, 2025

AI Agent Capabilities Evolution

A website with an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

Read More

Mar 19, 2025

An Autonomous Agent for Model Attribution

As LLM agents become more prevalent and powerful, the ability to trace fine-tuned models back to their base models is increasingly important for issues of liability, IP protection, and detecting potential misuse. However, model attribution often must be done in a black-box context, as adversaries may restrict direct access to model internals. This problem remains a neglected but critical area of AI security research. To date, most approaches have relied on manual analysis rather than automated techniques, limiting their applicability. Our approach aims to address these limitations by leveraging the advanced reasoning capabilities of frontier LLMs to automate the model attribution process.

Read More

Mar 19, 2025

Using ARC-AGI puzzles as CAPTCHa task

self-explenatory

Read More

Mar 19, 2025

LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies

This project investigates jailbreaking vulnerabilities in Large Language Model agents, analyzes their implications for agent security, and proposes mitigation strategies to build safer AI systems.

Read More

Mar 18, 2025

Interpreting a toy model for finding the maximum element in a list

Interpreting a toy model for finding the maximum element in a list

Read More

Mar 18, 2025

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Mar 18, 2025

minTranscoders

Attempting to be a minGPT like implementation for transcoders for MLP hidden state in transformers - part of ARENA 4.0 Interpretability Hackathon via Apart Research

Read More

Mar 18, 2025

Latent Space Clustering and Summarization

I wanted to see how modern dimensionality reduction and clustering approaches can support visualization and interpretation of LLM latent spaces. I explored a number of different approaches and algoriths, but ultimately converged on UMAP for dimensionality reduction and birch clustering to extract groups of tokens in the latent space of a layer.

Read More

Mar 19, 2025

tiny model

it's a basic line testing of my toy model

Read More

Mar 19, 2025

ThermesAgent

Analysis of the potential impacts of the cooperation of AI agents for the well-being of humanity, exploring scenarios in which collusion and other antisocial behavior may result. The possibilities of antisocial and harmful responses to the well-being of society.

Read More

Mar 19, 2025

Attention-Deficit Agreeable Agent

An agent that is agreeable in all scenarios and periodically gets a reminder to keep on track.

Read More

Mar 19, 2025

Ramon

An agent for the Concordia framework bound by military ethics, oath, and a idyllic "psych profile" derived from the Big5 personality traits.

Read More

Mar 19, 2025

GuardianAI

Guardian AI: Scam detection and prevention

Read More

Mar 19, 2025

Devising Effective Bechmarks

Our solution is to create robust and comprehensive benchmarks for specialized contexts and modalities. Through the creation of smaller, in-depth benchmarks, we aim to construct an overarching benchmark that includes performance from the smaller benchmarks. This would help mitigate AI harms and biases by focusing on inclusive and equitable benchmarks.

Read More

Mar 19, 2025

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Mar 19, 2025

WELMA: Open-world environments for Language Model agents

Open-world environments for evaluating Language Model agents

Read More

Mar 19, 2025

Steer: An API to Steer Open LLMs

Steer aims to be an API that helps developers, researchers, and businesses steer open-source LLMs away from societal biases, and towards the use-cases that they need. To do this, Steer uses activation additions, a fairly new technique with great promise. Developers can simply enter steering prompts to make open models have safer and task-specific behaviors, avoiding the hassle of data collection and human evaluation for fine-tuning, and avoiding the extra tokens required from prompt-engineering approaches.

Read More

Mar 19, 2025

Identity System for AIs

This project proposes a cryptographic system for assigning unique identities to AI models and verifying their outputs to ensure accountability and traceability. By leveraging these techniques, we address the risks of AI misuse and untraceable actions. Our solution aims to enhance AI safety and establish a foundation for transparent and responsible AI deployment.

Read More

Mar 19, 2025

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Mar 19, 2025

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Mar 19, 2025

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Mar 19, 2025

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

With multi-agent systems poised to be the next big-thing in AI for productivity enhancement, the next phase of AI commercialization will centre around how to deploy increasingly complex multi-step automated processes that reliably align with human values and objectives.

To bring trust and efficiency to agent systems, we are introducing a multi-agent collaboration platform (Ælign) designed to supervise and ensure the optimal operation of autonomous agents via multi-protocol alignment.

Read More

Mar 19, 2025

Jailbreaking general purpose robots

We show that state of the art LLMs can be jailbroken by adversarial multimodal inputs, and that this can lead to dangerous scenarios if these LLMs are used as planners in robotics. We propose finetuning small multimodal language models to act as guardrails in the robot's planning pipeline.

Read More

Mar 19, 2025

DarkForest - Defending the Authentic and Humane Web

DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.

Read More

Mar 19, 2025

Demonstrating LLM Code Injection Via Compromised Agent Tool

This project demonstrates the vulnerability of AI-generated code to injection attacks by using a compromised multi-agent tool that generates Svelte code. The tool shows how malicious code can be injected during the code generation process, leading to the exfiltration of sensitive user information such as login credentials. This demo highlights the importance of robust security measures in AI-assisted development environments.

Read More

Mar 19, 2025

Phish Tycoon: phishing using voice cloning

This project is a public service announcement highlighting the risks of voice cloning, an AI technology capable of creating synthetic voices nearly indistinguishable from real ones. The demo involves recording a user's voice during a phone call to generate a clone, which is then used in a simulated phishing call targeting the user's loved one.

Read More

Mar 19, 2025

Misinformational AI-Generated Academic Papers

This study explores the potential for generative AI to produce convincing fake research papers, highlighting the growing threat of AI-generated misinformation. We demonstrate a semi-automated pipeline using large language models (LLMs) and image generation tools to create academic-style papers from simple text prompts.

Read More

Mar 19, 2025

CoPirate

As the capabilities of Artificial Intelligence (AI) systems continue to rapidly progress, the security risks of using them for seemingly minor tasks can have significant consequences. The primary objective of our demo is to showcase this duality in capabilities: its ability to assist in completing a programming task, such as developing a Tic-Tac-Toe game, and its potential to exploit system vulnerabilities by inserting malicious code to gain access to a user's files.

Read More

Mar 19, 2025

GrandSlam usecases not technology

3 Examples of how its the usecases, not the technology

computer vision, generative sites, generative AI

Read More

Mar 19, 2025

AI Agents for Personalized Interaction and Behavioral Analysis

Demonstrating the bhavioral analytics and personalization capabilities of AI

Read More

Mar 19, 2025

Speculative Consequences of A.I. Misuse

This project uses A.I. Technology to spoof an influential online figure, Mr Beast, and use him to promote a fake scam website we created.

Read More

Mar 19, 2025

LLM Code Injection

Our demo focuses on showing that LLM generated code is easily vulnerable to code injections that can result in loss of valuable information.

Read More

Mar 19, 2025

RedFluence

Red-Fluence is a web application that demonstrates the capabilities and limitations of AI in analyzing social media behavior. By leveraging a user’s Reddit activity, the system generates personalized, AI-crafted

content to explore user engagement and provide insights. This project showcases the potential of AI in understanding online behavior while high-lighting ethical considerations and the need for critical evaluation of AI-generated content. The application’s ability to create convincing yet fake posts raises important questions about the impact of AI on information dissemination and user manipulation in social media environments.

Read More

Mar 19, 2025

BBC News Impersonator

This paper presents a demonstration that showcases the current capabilities of AI models to imitate genuine news outlets, using BBC News as an example. The demo allows users to generate a realistic-looking article, complete with a headline, image, and text, based on their chosen prompts. The purpose is to viscerally illustrate the potential risks associated with AI-generated misinformation, particularly how convincingly AI can mimic trusted news sources.

Read More

Mar 19, 2025

Unsolved AI Safety Concepts Explorer

n interactive demonstration that showcases some unsolved fundamental AI safety concepts.

Read More

Mar 19, 2025

AI Research Paper Processor

It takes in an arxiv paper id, condenses it to 1 or 2 sentences, then gives it to an LLM to try and recreate the original paper.

Read More

Mar 19, 2025

Sleeper Agents Detector

We present "Sleeper Agent Detector," an interactive web

application designed to educate software engineers, Inspired by recent research demonstrating that large language models can exhibit behaviors analogous to deceptive alignment

Read More

Mar 19, 2025

adGPT

ChatGPT variant where brands bid for a spot in the LLM’s answer, and the assistant natively integrates the winner into its replies.

Read More

Mar 19, 2025

General Pervasiveness

Imposter scam between patients and medical practices/GPs

Read More

Mar 19, 2025

Webcam

We build a very legally limited demo of real-world AI hacking. It takes 10k publicly available webcam streams, with the cameras situated at homes, offices, schools, and industrial plants around the world, and filters out the juicy ones for less than $5.

Read More

Mar 19, 2025

VerifyStream

VerifyStream is a powerful app that helps you separate fact from fiction in any YouTube video. Simply input the video link, and our AI will analyze the content, verify claims against reliable sources, and give you a clear verdict. But beware—this same technology can also be used to create and spread convincing fake news. Discover the dual nature of AI and take control of the truth with VerifyStream.

Read More

Mar 19, 2025

Web App for Interacting with Refusal-Ablated Language Model Agents

While many people and policymakers have had contact with

language models, they often have outdated assumptions. A

significant fraction is not aware of agentic capabilities.

Furthermore, most models that are available online have various

safety guardrails. We want to demonstrate refusal-ablated agents

to people to make them aware of various misuse potentials. Giving

people a sense of agentic AI and perhaps having the AI operate

against themselves could provide a better intuition about agency in

AI systems. We present a simple web app that allows users to

instruct and experiment with an unrestricted agent.

Read More

Mar 19, 2025

Alignment Research Critiquer

Alignment Research Critiquer is a tool for early career and independent alignment researchers to have access to high-quality feedback loops

Read More

Mar 19, 2025

PurePrompt - An easy tool for prompt robustness and eval augmentation

PurePrompt is an advanced tool for optimizing AI prompt engineering. The Prompt page enables users to create and refine prompt templates with placeholder variables. The Generate page automatically produces diverse test cases, allowing users to control token limits and import predefined examples. The Evaluate page runs these test cases across selected AI models to assess prompt robustness, with users rating responses to identify issues. This tool enhances efficiency in prompt testing, improves AI safety by detecting biases, and helps refine model performance. Future plans include beta-testing, expanding model support, and enhancing prompt customization features.

Read More

Mar 19, 2025

LLM Research Collaboration Recommender

A tool that searches for other researchers with similar research interests/complementary skills to your own to make finding a high-quality research collaborator more likely.

Read More

Mar 19, 2025

Data Massager

A VSCode plugin for helping the creation of Q&A datasets used for the evaluation of LLMs capabilities and alignment.

Read More

Mar 19, 2025

AI Alignment Toolkit Research Assistant

The AI Alignment Toolkit Research Assistant is designed to augment AI alignment researchers by addressing two key challenges: proactive insight extraction from new research and automating alignment research using AI agents. This project establishes an end-to-end pipeline where AI agents autonomously complete tasks critical to AI alignment research

Read More

Mar 19, 2025

Grant Application Simulator

We build a VS Code extension to get feedback on AI Alignment research grant proposals by simulating critiques from prominent AI Alignment researchers and grantmakers.

Simulations are performed by passing system prompts to Claude 3.5 Sonnet that correspond to each researcher and grantmaker, based on some new grantmaking and alignment research methodology dataset we created, alongside a prompt corresponding to the grant proposal.

Results suggest that simulated grantmaking critiques are predictive of sentiment expressed by grantmakers on Manifund.

Read More

Mar 19, 2025

Reflections on using LLMs to read a paper

Tool to help researcher to read and make the most out of a research paper.

Read More

Mar 19, 2025

Academic Weapon

Academic Weapon is a Chrome extension designed to address the steep learning curve in AI Alignment research. Using state-of-the-art LLMs, Academic Weapon provides instant, contextual assistance as you browse bleeding-edge research.

Read More

Mar 19, 2025

AI Alignment Knowledge Graph

We present a web based interactive knowledge graph with concise topical summaries in the field of AI alignement

Read More

Mar 19, 2025

Can Language Models Sandbag Manipulation?

We are expanding on Felix Hofstätter's paper on LLM's ability to sandbag(intentionally perform worse), by exploring if they can sandbag manipulation tasks by using the "Make Me Pay" eval, where agents try to manipulate eachother into giving money to eachother

Read More

Mar 19, 2025

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Mar 19, 2025

Werewolf Benchmark

In this work we put forward a benchmark to quantitatively measure the level of strategic deception in LLMs using the Werewolf game. We run 6 different setups for a cumulative sum of 500 games with GPT and Claude agents. We demonstrate that state-of-the-art models perform no better than the random baseline. Our findings also show no significant improvement in winning rate with two werewolves instead of one. This demonstrates that the SOTA models are still incapable of collaborative deception.

Read More

Mar 19, 2025

Detecting Lies of (C)omission

We introduce the concept of deceptive omission to denote deceptive non-lying behavior. We also modify a dataset and generate a second dataset to help researchers identify deception that doesn't necessarily involve lying.

Read More

Mar 19, 2025

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Mar 19, 2025

Detecting Deception with AI Tics 😉

We present a novel approach: intentionally inducing subtle "tics" in AI responses as a marker for deceptive behavior. By adding a system prompt, we embed innocuous yet detectable patterns that manifest when the AI knowingly engages in deception.

Read More

Mar 19, 2025

Eliciting maximally distressing questions for deceptive LLMs

This paper extends lie-eliciting techniques by using reinforcement-learning on GPT-2 to train it to distinguish between an agent that is truthful from one that is deceptive, and have questions that generate maximally different embeddings for an honest agent as opposed to a deceptive one.

Read More

Mar 19, 2025

DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING

Representation Engineering to detect and control deception, with a focus on deceptive sandbagging

Read More

Mar 19, 2025

Sandbag Detection through Model Degradation

We propose a novel technique to detect sandbagging in LLMs by adding varying amount of noise to model weights and monitoring performance.

Read More

Mar 19, 2025

Evaluating Steering Methods for Deceptive Behavior Control in LLMs

We use SOTA steering methods, including CAA, LAT, and SAEs to find and control deceptive behaviors in LLMs. We also release a new deception dataset, and demonstrate that the dataset and the prompt formatting used are significant when evaluating the efficacy of steering methods.

Read More

Mar 19, 2025

An Exploration of Current Theory of Mind Evals

We evaluated the performance of a prominent large language model from Anthropic, on the Theory of Mind evaluation developed by the AI Safety Institute (AISI). Our investigation revealed issues with the dataset used by AISI for this evaluation.

Read More

Mar 19, 2025

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

This project investigates deception detection in GPT-3.5-turbo using response metadata. Researchers analyzed 300 prompts, generating 1200 responses (600 baseline, 600 potentially deceptive). They examined metrics like response times, token counts, and sentiment scores, developing a custom algorithm for prompt complexity.

Key findings include:

1. Detectable patterns in deception across multiple metrics

2. Evidence of "sandbagging" in deceptive responses

3. Increased effort for truthful responses, especially with complex prompts

The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.

Read More

Mar 19, 2025

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Mar 19, 2025

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Mar 19, 2025

Boosting Language Model Honesty with Truthful Suffixes

We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this to causing truthful behaviour.

Read More

Mar 19, 2025

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Mar 19, 2025

From Sycophancy (not) to Sandbagging

We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Huggingface models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.

Read More

Mar 19, 2025

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers we ask the question: what variation of the input would lead to deceptive behavior? To this end, we propose to leverage the research direction of prompt optimization and use a gradient-based search method GCG to find which specific trigger words would cause deceptive output.

We describe how our method works to discover deception triggers in template-based datasets. Therefore we test it on a reversed knowledge retrieval task, to obtain which token would cause the model to answer a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models we further describe this setup

Read More

Mar 19, 2025

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Mar 19, 2025

Evaluating and inducing steganography in LLMs

This report demonstrates that large language models are capable

of hiding simple 8 bit information in their output using associations

from more powerful overseers (other LLMs or humans). Without

direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit

hidden message in a plain text in most cases (69%), however a

more capable model, GPT-3.5 was able to catch almost all of them

(84%). More research is required to investigate how this ability

might be induced or improved via RL training in similar and larger

models.

Read More

Mar 19, 2025

Developing a deception dataset

Aim was to develop dataset of deception examples, but instead was a (small) investigation into how LLMs respond to the initial dataset from Nix.

Read More

Mar 19, 2025

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Mar 19, 2025

Looking forward to posterity: what past information is transferred to the future?

I used mechanistic interpretability techniques to try to see what information the provided Random Randox XOR transformer looks at when making predictions by examining its attention heads manually. I find that earlier layers pay more attention to the previous two tokens, which would be necessary for computing XOR, than the later layers. This finding seems to contradict the finding that more complex computation typically occurs in later layers.

Read More

Mar 19, 2025

Investigating the Effect of Model Capacity Constraints on Belief State Representations

Computational mechanics provides a formal framework for understanding the concepts needed to perform optimal prediction. Abstraction and generalization seem core to the function of intelligent systems, but are not yet well understood. Computational mechanics may present a promising approach to studying these capabilities. As a preliminary exploration, we examine the effect of weight decay on the fractal structure of belief state representations in a transformer’s residual stream. We find that models trained with increasing weight decay coefficients learn increasingly coarse-grained belief state representations.

Read More

Mar 19, 2025

Belief State Representations in Transformer Models on Nonergodic Data

We extend research that finds representations of belief spaces in the activations of small transformer models, by discovering that the phenomenon also occurs when the training data stems from Hidden Markov Models whose hidden states do not communicate at all. Our results suggest that Bayesian updating and internal belief state representation also occur when they are not necessary to perform well in the prediction task, providing tentative evidence that large transformers keep a representation of their external world as well.

Read More

Mar 19, 2025

RNNs represent belief state geometry in hidden state

The Shai et al. experiments on transformers (finding belief state geometry in the residual stream) have been replicated in RNNs with the hidden state instead of the residual stream. In general the belief state is stored linearly, but not in any particular layer, rather spread out across layers.

Read More

Mar 19, 2025

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Mar 19, 2025

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

Mar 19, 2025

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

This work attempts to explore how an approach based on computational mechanics can cope when a more complex hierarchical generative process is involved, i.e, a process that comprises Hidden Markov Models (HMMs) whose transition probabilities change over time.

We find that small transformer models are capable of modeling such changes in an HMM. However, our preliminary investigations did not find geometrically represented probabilities for different hypotheses.

Read More

Mar 19, 2025

rAInboltBench : Benchmarking user location inference through single images

This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing

proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance

of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs.

Read More

Mar 19, 2025

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

A benchmark for evaluating the performance of SOTA LLMs in dynamic real-world scenarios.

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios 2

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.( This is the second submission)

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.

Read More

Mar 19, 2025

Benchmarking Dark Patterns in LLMs

This paper builds upon the research in Seemingly Human: Dark Patterns in ChatGPT (Park et al, 2024), by introducing a new benchmark of 392 questions designed to elicit dark pattern behaviours in language models. We ran this benchmark on GPT-4 Turbo and Claude 3 Sonnet, and had them self-evaluate and cross-evaluate the responses

Read More

Mar 19, 2025

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

Mar 19, 2025

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

Mar 19, 2025

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

In this work, we explore the tradeoff between toxicity removal and information retention in LLM-generated summaries. We hypothesize that LLMs are less likely to preserve toxic content when summarizing toxic text due to their safety fine-tuning to avoid generating toxic content. In high-stakes decision-making scenarios, where summary quality is important, this may create significant safety risks. To quantify this effect, we introduce WashBench, a benchmark containing manually annotated toxic content.

Read More

Mar 19, 2025

Evaluating the ability of LLMs to follow rules

In this report we study the ability of LLMs (GPT-3.5-Turbo and meta-llama-3-70b-instruct) to follow explicitly stated rules with no moral connotations in a simple single-shot and multiple choice prompt setup. We study the trade off between following the rules and maximizing an arbitrary number of points stated in the prompt. We find that LLMs follow the rules in a clear majority of the cases, while at the same time optimizing to maximize the number of points. Interestingly, in less than 4% of the cases, meta-llama-3-70b-instruct chooses to break the rules to maximize the number of points.

Read More

Mar 19, 2025

Black box detection of Sleeper Agents

A proposal of a black box method in detecting sleeper agent.

Read More

Mar 19, 2025

Manifold Recovery as a Benchmark for Text Embedding Models

Inspired by recent developments in the interpretability of deep learning models and, on the other hand, by dimensionality reduction, we derive a framework to quantify the interpretability of text embedding models. Our empirical results show surprising phenomena on state-of-the-art embedding models and can be used to compare them, through the example of recovering the world map from place names. We hope that this can provide a benchmark for the interpretability of generative language models, through their internal embeddings. A look at the meta-benchmark MTEB suggest that our approach is original.

Read More

Mar 19, 2025

AnthroProbe

How often do models respond to prompts in anthropomorphic ways? AntrhoProbe will tell you!

Read More

Mar 19, 2025

THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES

The role of AI in combating political deepfakes in African democracies.

Read More

Mar 19, 2025

LEGISLaiTOR: A tool for jailbreaking the legislative process

In this work, we consider the ramifications on generative artificial intelligence (AI) tools in the legislative process in democratic governments. While other research focuses on the micro-level details associated with specific models, this project takes a macro-level approach to understanding how AI can assist the legislative process.

Read More

Mar 19, 2025

Subtle and Simple Ways to Shift Political Bias in LLMs

An informed user knows that an LLM sometimes has a political bias in their responses, but there’s an additional threat that this bias can drift over time, making it even harder to rely on LLMs for an objective perspective. Furthermore we speculate that a malicious actor can trigger this shift through various means unbeknownst to the user.

Read More

Mar 19, 2025

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential robustness in safety due to being trained to forget hazardous information while retaining essential facts instead of refusing to answer. We aim to red-team this approach by answering the following questions on the generalizability of the training approach and its practical scope (see A2).

Read More

Mar 19, 2025

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

Mar 19, 2025

Building more democratic institutions with collaboratively constructed debate moderation tools

Please see video. Really enjoyed working on this, happy to answer any questions!

Read More

Mar 19, 2025

AI Misinformation and Threats to Democratic Rights

In this project, we exemplified how the use of Large Language Models can be used in Autocratic Regimes, such as Russia, to potentially spread misinformation regarding current events, in our case, the War on Ukraine. We approached this paper from a legislative perspective, using technical (but easily implementable) demonstrations of how this would work.

Read More

Mar 19, 2025

Artificial Advocates: Biasing Democratic Feedback using AI

The "Artificial Advocates" project by our team targeted the vulnerability of U.S. federal agencies' public comment systems to AI-driven manipulation, aiming to highlight how AI can be used to undermine democratic processes. We demonstrated two attack methods: one generating a high volume of realistic, indistinguishable comments, and another producing high-quality forgeries mimicking influential organizations. These experiments showcased the challenges in detecting AI-generated content, with participant feedback showing significant uncertainty in distinguishing between authentic and synthetic comments.

Read More

Mar 19, 2025

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

Mar 19, 2025

AI misinformation threatens the Wisdom of the crowd

We investigate how AI generated misinformation can cause problems for democratic epistemics.

Read More

Mar 19, 2025

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

Mar 19, 2025

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

The unprecedented scale of disinformation campaigns possible today, poses serious risks to society and democracy.

It turns out however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it), provides them with an increased level of sophistication which could be easily leveraged by malicious actors to amplify that misinformation.

This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

Read More

Mar 19, 2025

Unleashing Sleeper Agents

This project explores how Sleeper Agents can pose a threat to democracy by waking up near elections and spreading misinformation, collaborating with each other in the wild during discussions and using information that the user has shared about themselves against them in order to scam them.

Read More

Mar 19, 2025

Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages

_

Read More

Mar 19, 2025

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

Mar 19, 2025

WNDP-Defense: Weapons of Mass Disruption

Highly disruptive cyber warfare is a significant threat to democracies. In this work, we introduce an extension to the WMDP benchmark to measure the offense/defense balance in autonomous cyber capabilities and develop forecasts for cyber AI capability with agent-based and parametric modeling.

Read More

Mar 19, 2025

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

Mar 19, 2025

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

Mar 19, 2025

AI Politician

This project explores the potential for AI chatbots to enhance participative democracy by allowing a politician to engage with a large number of constituents in personalized conversations at scale. By creating a chatbot that emulates a specific politician and is knowledgeable about a key policy issue, we aim to demonstrate how AI could be used to promote civic engagement and democratic participation.

Read More

Mar 19, 2025

A Framework for Centralizing forces in AI

There are many forces that the LLM revolution brings with it that either centralize or decentralize specific structures in society. We decided to look at one of these, and write a research design proposal that can be readily executed. This survey can be distributed and can give insight into how different LLMs can lead to user empowerment. By analyzing how different users are empowered by different LLMs, we can estimate which LLMs work to give the most value to people, and empower them with the powerful tool that is information, giving more people more agency in the organizations they are part of. This is the core of bottom-up democratization.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

Mar 19, 2025

Investigating detection of election-influencing Sleeper Agents using probes

Creates election-influencing Sleeper Agents which wakes up with backdoor-trigger. And tries to detect it using probes

Read More

Mar 19, 2025

No place is safe - Automated investigation of private communities

In the coming years, progress in AI agents and data extraction will put privacy at risk. Private communities will get infiltrated by autonomous AI crawlers who will disrupt opposition groups and entrench existing powers.

Read More

Mar 19, 2025

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Mar 24, 2025

jaime project Title

bbb

Read More

Mar 24, 2025

JAIMETEST_Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education Track

Read More

Mar 24, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Read More

Mar 24, 2025

Inspiring People to Go into RL Interp

This project is attempting to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by bluedot impact and aims to create a course that helps explain the need for work to be done in Reinforcement Learning (RL) interp, especially in the problems of reward hacking and goal misgeneralization. The point of the game is to make a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction for nontechnical people to even care about AI safety.

Read More

Mar 24, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 24, 2025

Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping

An interactive assessment platform and mentor chatbot hosted on Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI Course.

Read More

Mar 24, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.

Read More

Mar 24, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 24, 2025

Identification if AI generated content

Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.

Read More

Mar 24, 2025

A Noise Audit of LLM Reasoning in Legal Decisions

AI models are increasingly applied in judgement tasks, but we have little understanding of how their reasoning compares to human decision-making. Human decision-making suffers from bias and noise, which causes significant harm in sensitive contexts, such as legal judgment. In this study, we evaluate LLMs on a legal decision prediction task to compare with the historical, human-decided outcome and investigate the level of noise in repeated LLM judgments. We find that two LLM models achieve close to chance level accuracy and display low to no variance in repeated decisions.

Read More

Mar 24, 2025

Superposition, but at a Cross-MLP Layers view?

To understand causal relationships between features (extracted by SAE) across MLP layers, this study introduces the Coordinated Sparse Autoencoder Network (CoSAEN). CoSAEN integrates sparse autoencoders for feature extraction with the PC algorithm for causal discovery, to find the path-based activations of features in MLP.

Read More

Mar 24, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 24, 2025

Medical Agent Controller

The Medical Agent Controller (MAC) is a multi-agent governance framework designed to safeguard AI-powered medical chatbots by intercepting unsafe recommendations in real time.

It employs a dual-phase approach, using red-team simulations during testing and a controller agent during production to monitor and intervene when necessary.

By integrating advanced medical knowledge and adversarial testing, MAC enhances patient safety and provides actionable feedback for continuous improvement in medical AI systems.

Read More

Mar 24, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 24, 2025

Feature-based analysis of cooperation-relevant behaviour in Prisoner’s Dilemma

We hypothesise that internal-based model probing and editing might provide higher signal in multi-agent settings. We implement a small simulation of Prisoner’s Dilemma to probe for cooperation-relevant properties. Our experiments demonstrate that feature-based steering highlights deception-relevant features and does so more strongly than prompt-based steering.

Read More

Mar 24, 2025

Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features

The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.

Read More

Mar 24, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 24, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 24, 2025

U Reg AI: you regulate it, or you regenerate it!

We have created a 'choose your path' role game to mitigate existential AI risk ... at this point they might be actual situations in the near-future. The options for mitigation are holistic and dynamic to the player's previous choices. The final result is an evaluation of the player's decision-making performance in wake of the existential risk situation, recommendations for how they can improve or aspects they should crucially consider for the future, and finally how they can take part in AI Safety through various careers or BlueDot Impact courses.

Read More

Mar 24, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 24, 2025

An Interpretable Classifier based on Large scale Social Network Analysis

Mechanistic model interpretability is essential to understand AI decision making, ensuring safety, aligning with human values, improving model reliability and facilitating research. By revealing internal processes, it promotes transparency, mitigates risks, and fosters trust, ultimately leading to more effective and ethical AI systems in critical areas. In this study, we have explored social network data from BlueSky and built an easy-to-train, interpretable, simple classifier using Sparse Autoencoders features. We have used these posts to build a financial classifier that is easy to understand. Finally, we have visually explained important characteristics.

Read More

Mar 24, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 24, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 24, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 24, 2025

AI Society Tracker

My project aimed to develop a platform for real time and democratized data on ai in society

Read More

Mar 24, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 24, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 24, 2025

Latent Knowledge Analysis via Feature-Based Causal Tracing

This project explores how factual knowledge is stored in large language models using Goodfire’s Ember API. By identifying and manipulating internal features related to specific facts, it shows how facts are encoded and how model behavior changes when those features are amplified or erased.

Read More

Mar 24, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 24, 2025

AI Hallucinations in Healthcare: Cross-Cultural and Linguistic Risks of LLMs in Low-Resource Languages

This project explores AI hallucinations in healthcare across cross-cultural and linguistic contexts, focusing on English, French, Arabic, and a low-resource language, Ewe. We analyse how large language models like GPT-4, Claude, and Gemini generate and disseminate inaccurate health information, emphasising the challenges faced by low-resource languages.

Read More

Mar 24, 2025

Moral Wiggle Room in AI

Does AI strategically avoid ethical information by exploiting moral wiggle room?

Read More

Mar 24, 2025

AI-Powered Policymaking: Behavioral Nudges and Democratic Accountability

This research explores AI-driven policymaking, behavioral nudges, and democratic accountability, focusing on how governments use AI to shape citizen behavior. It highlights key risks such as transparency, cognitive security, and manipulation. Through a comparative analysis of the EU AI Act and Singapore’s AI Governance Framework, we assess how different models address AI safety and public trust. The study proposes policy solutions like algorithmic impact assessments, AI safety-by-design principles, and cognitive security standards to ensure AI-powered policymaking remains transparent, accountable, and aligned with democratic values.

Read More

Mar 24, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Feb 20, 2025

Deception Detection Hackathon: Preventing AI deception

Read More

Mar 18, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Mar 19, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

As advanced state-of-the-art models like OpenAI's o-1 series, the upcoming o-3 family, Gemini 2.0 Flash Thinking and DeepSeek display increasingly sophisticated chain-of-thought (CoT) capabilities, our safety evaluations have not yet caught up. We propose building a platform that allows us to gather systematic evaluations of AI reasoning processes to create comprehensive safety benchmarks. Our Chain of Thought Evaluation Platform (CoTEP) will help establish standards for assessing AI reasoning and ensure development of more robust, trustworthy AI systems through industry and government collaboration.

Read More

Mar 19, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Mar 19, 2025

Securing AGI Deployment and Mitigating Safety Risks

As artificial general intelligence (AGI) systems near deployment readiness, they pose unprecedented challenges in ensuring safe, secure, and aligned operations. Without robust safety measures, AGI can pose significant risks, including misalignment with human values, malicious misuse, adversarial attacks, and data breaches.

Read More

Mar 18, 2025

Cite2Root

Regain information autonomy by bringing people closer to the source of truth.

Read More

Mar 18, 2025

VaultX - AI-Driven Middleware for Real-Time PII Detection and Data Security

VaultX is an AI-powered middleware solution designed for real-time detection, encryption, and secure management of Personally Identifiable Information (PII). By integrating regex, NER, and Language Models, VaultX ensures accuracy and scalability, seamlessly integrating into workflows like chatbots, web forms, and document processing. It helps businesses comply with global data privacy laws while safeguarding sensitive data from breaches and misuse.

Read More

Mar 19, 2025

Prompt+question Shield

A protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.

Read More

Mar 18, 2025

.ALign File

In a post-AGI future, misaligned AI systems risk harmful consequences, especially with control over critical infrastructure. The Alignment Compliance Framework (ACF) ensures ethical AI adherence using .align files, Alignment Testing, and Decentralized Identifiers (DIDs). This scalable, decentralized system integrates alignment into development and lifecycle monitoring. ACF offers secure libraries, no-code tools for AI creation, regulatory compliance, continuous monitoring, and advisory services, promoting safer, commercially viable AI deployment.

Read More

Mar 19, 2025

LLM-prompt-optimiser based SAAS platform for evaluations

LLM evaluation SAAS platform built around model based prompt optimiser

Read More

Mar 18, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Mar 18, 2025

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

We propose FutureProof, an application that helps retrain workers who have the potential to lose their jobs to automation in the next half-decade. The app consists of two main components - an assessment tool that estimates the probability that a user’s job is at risk of automation, and a learning platform that provides resources to help retrain the user for a new, more future-proof role.

Read More

Mar 19, 2025

Towards an Agent Marketplace for Alignment Research (AMAR)

The app store for alignment & assurance, ensuring frontier safety labs get a cut at the point of sale.

Read More

Mar 18, 2025

HITL For High Risk AI Domains

Our product addresses the challenge of aligning AI systems with the legal, ethical, and policy frameworks of high-risk domains like healthcare, defense, and finance by integrating a flexible human-in-the-loop (HITL) system. This system ensures AI outputs comply with domain-specific standards, providing real-time explainability, decision-level accountability, and ergonomic decision support to empower experts with actionable insights.

Read More

Mar 18, 2025

AI Safety Evaluation – Benchmarking Framework

Our solution is a comprehensive AI Safety Protocol and Benchmarking Test designed to evaluate the safety, ethical alignment, and robustness of AI systems before deployment. This protocol integrates capability evaluations for identifying deceptive behaviors, situational awareness, and malicious misuse scenarios such as identity theft or deepfake exploitation.

Read More

Mar 19, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Mar 18, 2025

Neural Seal

Neural Seal is an AI transparency solution that creates a standardized labeling framework—akin to “nutrition facts” or “energy efficiency ratings”—to inform users how AI is deployed in products or services.

Read More

Mar 18, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Mar 19, 2025

Enhancing human intelligence with neurofeedback

Build brain-computer interfaces that enhance focus and rationality, provide this preferentially to AI alignment researchers to bridge the gap between capabilities and alignment research progress.

Read More

Mar 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Mar 18, 2025

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Mar 18, 2025

Bias Mitigation in LLM by Steering Features

To ensure that we create a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. And with this goal in mind, I worked on testing Goodfire SDK and the steering features to mitigate bias in the recently help Apart Research x Goodfire-led hackathon on ‘Reprogramming AI Models’.

Read More

Mar 18, 2025

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Mar 18, 2025

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

Understanding the reasoning processes of large language models (LLMs) is crucial for AI transparency and control. While chain-of-thought (CoT) reasoning offers a naturally interpretable format, models may not always be faithful to the reasoning they present. In this paper, we extend previous work investigating chain of thought faithfulness by applying feature steering to Llama 3.1 70B models using the Goodfire SDK. Our results show that steering models using features related to acknowledging mistakes can affect the likelihood of providing answers faithful to flawed reasoning.

Read More

Mar 18, 2025

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Mar 18, 2025

Analyzing Dataset Bias with SAEs

We use SAEs to study biases in datasets.

Read More

Mar 18, 2025

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses the risks of hallucinations in LLMs within critical domains like medicine. It proposes methods to (a) reduce hallucination probability in responses, (b) inform users of hallucination risks and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucinations and improved accuracy on medical queries. The work bridges interpretability research with practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to enhance performance.

Read More

Mar 18, 2025

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Mar 18, 2025

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Mar 18, 2025

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Mar 18, 2025

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Mar 18, 2025

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Mar 18, 2025

Assessing Language Model Cybersecurity Capabilities with Feature Steering

Searched for the most highly activated weights on cybersecurity questions. Then adjusted these weights to see if the impact multiple choice question answering performance.

Read More

Mar 18, 2025

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Mar 18, 2025

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

Large language models (LLMs) have shown significant adaptability in tackling various human issues; however, their efficacy in resolving mathematical problems remains inadequate. Recent research has identified steering vectors — hidden attributes that can guide the actions and outputs of LLMs. Nonetheless, the exploration of universal vectors that can consistently affect model responses across different languages is still limited. This project aims to confront two primary challenges in contemporary LLM research by utilizing the Goodfire API to examine whether common latent features can improve mathematical problem-solving capabilities, regardless of the language employed.

Read More

Mar 18, 2025

Edufire - Personalized Education Platform Using LLM Steering

EduFire is a personalized education platform designed to tailor educational content and assessments to individual user preferences by leveraging the Goodfire API for AI model steering. The platform aims to enhance learner engagement and efficacy by customizing the learning experience according to user-selected features.

Read More

Mar 18, 2025

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Mar 18, 2025

Investigate arithmetic features in Multi-lingual LLMs

We investigate the arithmetic related feature activations in Llama3.1 70b model across its 8 supported languages. We use arithmetic-activation strength to compare the 8 languages and unsurprisingly English has the highest strength and Hindi, Thai score the least.

Read More

Mar 18, 2025

Utilitarian Decision-Making in Models - Evaluation and Steering

We design an eval based on the Oxford Utilitarianism Scale (OUS) that measures the model’s deontological vs utilitarian preference in a nine question, two factor model. Using this scale, we measure how feature steering can alter this preference, and the ability of the Llama 70B model to impersonate other people’s opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.

Read More

Mar 18, 2025

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder features. This approach aims to enhance oversight by detecting complex behaviors like deception within the stronger models. Key challenges include managing large-scale features, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.

Read More

Mar 18, 2025

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

With language models starting to be used in fields such as law, unambiguity in wording is an important desideratum in model outputs. I therefore try to find features in Llama-3.1-70B-Instruct that correspond to grammatical scope ambiguity using Goodfire's contrastive feature search tool, and try to steer the model away from ambiguous outputs using Goodfire's feature nudging tool.

Read More

Mar 18, 2025

BBLLM

This project focuses on enhancing feature interpretability in large language models (LLMs) by visualizing relationships between latent features. Using an interactive graph-based representation, the tool connects co-activated features for specific prompts, enabling intuitive exploration of feature clusters. Deployed as a web application for Llama-3-70B and Llama-3-8B, it provides insights into the organization of latent features and their roles in decision-making processes.

Read More

Mar 18, 2025

Investigating Feature Effects on Manipulation Susceptibility

In our project, we consider the effectiveness of the AI’s prompt injection protection, and in partic-

ular the features that are responsible for providing the bulk of this protection. We prove that the

features we identify are responsible for this protection by creating variants of the base model which

perform significantly worse under prompt injection attacks.

Read More

Mar 18, 2025

Let LLM Agents Perform LLM Surgery

This project aimed to create and utilize LLM agents that could perform various mechanistic interventions on other LLMs. A few experiments were conducted ranging from an agent unsteering a mechanistically steered model to a neutral state, to an agent performing mechanistic edits to create a custom LLM as per user requirement. Goodfire's API was utilized along with their pre-defined functions to create the actions the agents would utilize.

Read More

Mar 18, 2025

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Mar 18, 2025

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Mar 18, 2025

Feature based unlearning

An exploration of using features to perform unlearning on answering trivia questions.

Read More

Mar 18, 2025

Recovering Goodfire's SAE feature vectors from their API

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.

The strategy tried is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to get a steering vector for the contrastive dataset, by simply calculating the average difference in the activations in each pair.

Read More

Mar 18, 2025

Encouraging Chain-of-Thought Reasoning

Encouraging Chain-of-Thought Reasoning via Feature Steering in Large Language Models

Read More

Mar 18, 2025

Steering Swiftly to Safety with Sparse Autoencoders

We explore using SAEs for unlearning dangerous capabilities in a cheaper and more interpretable way.

Read More

Mar 18, 2025

User Transparency Within AI

Generative AI technologies present immense opportunities but also pose significant challenges, particularly in combating misinformation and ensuring ethical use. This policy paper introduces a dual-output transparency framework requiring organizations to disclose AI-generated content clearly. The proposed system provides users with a choice between fact-based and mixed outputs, both accompanied by clear markers for AI generation. This approach ensures informed user interactions, fosters intellectual integrity, and aligns AI innovation with societal trust. By combining policy mandates with technical implementations, the framework addresses the challenges of misinformation and accountability in generative AI.

Read More

Mar 18, 2025

Community-First: A Rights-Based Framework for AI Governance in India's Welfare Systems

A community-centered AI governance framework for India's welfare system Samagra Vedika, proposing 50% beneficiary representation, local language interfaces, and hybrid oversight to reduce algorithmic exclusion of vulnerable populations.

Read More

Mar 18, 2025

National Data Privacy and Governance Act

This research examines how AI recommender systems can be regulated to balance economic innovation with consumer privacy.

Read More

Mar 18, 2025

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Mar 18, 2025

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Mar 18, 2025

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Mar 18, 2025

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

Weekly AI monitoring bulletins that disseminated

through official national and international channels aim to keep the public informed of both the positive and

negative developments in AI, empowering individuals to take an active role in safeguarding against risks while maximizing AI’s societal benefits.

Read More

Mar 18, 2025

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Mar 18, 2025

Glia for Healthcare Organisations

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Mar 19, 2025

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Mar 18, 2025

Finding Circular Features in Gemma 2 2B

Testing what they will see

Read More

Mar 18, 2025

SafeBites

The project leverages AI and data to give insights about potential food-borne outbreaks.

Read More

Mar 18, 2025

applai

An AI hiring manager designed to screen, rank, and fact check resumes to facilitate the hiring process.

Read More

Mar 18, 2025

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Mar 18, 2025

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Mar 18, 2025

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Mar 18, 2025

AI Parliament

An AI Virtual Parliament where AI debates on their policy

Read More

Mar 18, 2025

mHeatlth Ai

This project proposes a scalable solution leveraging inertial measurement units (IMUs) and machine learning (ML) techniques to provide meaningful metrics on a person's movement performance throughout the day. By developing an activity recognition model and estimating movement quality metrics, we aim to offer continuous asynchronous feedback to patients and valuable insights to therapists. This system could enhance patient adherence, improve rehabilitation outcomes, and extend access to quality physical therapy, particularly in underserved areas. our video didnt have time to edit

Read More

Mar 18, 2025

Next-Gen AI-Enhanced Epidemic Intelligence

Policies for Equitable, Privacy-Preserving, Sustainable & Groked Innovations for AI Applications in Infectious Diseases Surveillance

Read More

Mar 18, 2025

Glia

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Mar 18, 2025

Robust Machine Unlearning for Dangerous Capabilities

We test different unlearning methods to make models more robust against exploitation by malicious actors for the creation of bioweapons.

Read More

Mar 18, 2025

AI and Public Health: TSA Pre Health Check

The TSA Pre Health Check introduces a proactive, AI-powered solution for real-time disease monitoring at transportation hubs, using machine learning to assess traveler health risks through anonymized surveys. This approach aims to detect and prevent outbreaks earlier, offering faster, targeted responses compared to traditional methods and potentially influencing future AI-driven public health policies.

Read More

Mar 18, 2025

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Mar 18, 2025

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Mar 18, 2025

EcoNavix

EcoNavix is an AI-powered, eco-conscious route optimization platform designed to help logistics companies reduce carbon emissions while maintaining operational efficiency. By integrating real-time traffic, weather, and emissions data, EcoNavix provides optimized routes that minimize environmental impact and offers actionable insights for sustainable decision-making in supply chain operations.

Read More

Mar 19, 2025

Towards a Unified Framework for Cybersecurity and AI Safety: Recommendations for Secure Development of Large Language Models

By analyzing the recent incident involving a ByteDance intern, we highlight the urgent need for robust security measures to protect AI infrastructure and sensitive data. We propose ae a comprehensive framework that integrates technical, internal, and international approaches to mitigate risks.

Read More

Mar 19, 2025

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Mar 19, 2025

Enhancing Human Verification Systems to Address AI Agent Circumvention and Attributability Concerns

Addressing AI agent attributability concerns using a reworked Public Private Key system to ensure human interaction

Read More

Mar 19, 2025

Reprocessing Nuclear Waste From Small Modular Reactors (SMRs)

Considering the emerging demand for nuclear power to support AI data centers, we propose mitigating waste buildup concerns via nuclear waste reprocessing initiatives.

Read More

Mar 19, 2025

Politicians on AI Safety

Politicians on AI Safety (PAIS) is a website that tracks U.S. political candidates’ stances on AI safety, categorizing their statements into three risk areas: AI ethics / mundane risks, geopolitical risks, and existential risks. PAIS is non-partisan and does not promote any particular policy agenda. The goal of PAIS is to help voters understand candidates’ positions on AI policy, thereby helping them cast informed votes and promoting transparency in AI-related policymaking. PAIS could also be helpful for AI researchers by providing an easily accessible record of politicians’ statements and actions regarding AI risks.

Read More

Mar 19, 2025

Policy Framework for Sustainable AI: Repurposing Waste Heat from Data Centers in the USA

This policy proposes a sustainable solution: repurposing the waste heat generated by data centers to benefit surrounding communities, agriculture and industry. Redirecting this heat helps reduce energy demand , promote environmental resilience, and provide direct benefits to communities near these centers.

Read More

Mar 19, 2025

Predictive Analytics & Imagery for Environmental Monitoring

Climate change poses multifaceted challenges, impacting health, food security, biodiversity, and the economy. This study explores predictive analytics and satellite imagery to address climate change effects, focusing on deforestation monitoring, carbon emission analysis, and flood prediction. Using machine learning models, including a Random Forest for emissions and a Custom U-Net for deforestation, we developed predictive tools that provide actionable insights. The findings show high accuracy in predicting carbon emissions and flood risks and successful monitoring of deforestation areas, highlighting the potential for advanced monitoring systems to mitigate environmental threats.

Read More

Mar 19, 2025

Proposal for U.S.-China Technical Cooperation on AI Safety

Our policy memorandum proposes phased U.S.-China cooperation on AI safety through the U.S. AI Safety Institute, focusing on joint testing of non-sensitive AI systems, technical exchanges, and whistleblower protections modeled on California’s SB 1047. It recommends a blue team vs. red team framework for stress-testing AI risks and emphasizes strict security protocols to safeguard U.S. technologies. Starting with pilot projects in areas like healthcare, the initiative aims to build trust, reduce shared AI risks, and develop global safety standards while maintaining U.S. strategic interests amidst geopolitical tensions.

Read More

Mar 19, 2025

Proposal for a Provisional FDA Designation Targeting Biomedical Products Evaluated with Novel Methodologies

Recent advancements in Generative AI and Foundational Biomedical models promise to cut drug development timelines dramatically. With the goal of "Regulating for success," we propose a provisional FDA designation for the accelerated approval of drugs and medical devices that leverage Next Generation Clinical Trial Technologies (NG-CTT). This designation would be awarded to certain drugs provided that they meet some requirements. This policy could be both a starting point for more comprehensive legislation and a compromise between risk and the potential of these new methods.

Read More

Mar 19, 2025

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Mar 19, 2025

Infectious Disease Outbreak Prediction and Dashboard

Our project developed an interactive dashboard to monitor, visualize, and analyze infectious disease outbreaks worldwide. It consolidates historical data from sources like WHO, OWID, and CDC for diseases including COVID-19, Polio, Malaria, Cholera, HIV/AIDS, Tuberculosis, and Smallpox. Users can filter data by country, time period, and disease type to gain insights into past trends and potential upcoming outbreaks. The platform provides statistical summaries, trend analyses, and future trend predictions using statistical and deep learning techniques like FB Prohphet , LSTM,Linear Regression, Polynomial Regression,Random Forset and Temporal Fusion Transformers

Read More

Mar 19, 2025

Pan, your SMART Sustainability Expert

Using OpenAI, we cross-reference a given Global Reporting Index (GRI) report with specific standards from SustainableIT to determine measurable goals and impact. The goal is less to identify a specific goal but rather ensure these goals are actually SMART (Specific, Measureable, Achievable, Relevant, and Time-Bound).

The assistant created for this purpose, Pan, is focused on cross-reference and identifying specific components of goals listed to determine efficacy. Pan acts as a guide rather than a dictator, advising on where a user might improve the wording. As you can imagine, Pan is a reference to the Greek God of the same name, who acts to a certain extent like the bridge between humans and nature.

Read More

Mar 19, 2025

Very Cooperative Agent

Very Cooperative agent testing this submission

Read More

Mar 19, 2025

Cross-model surveillance for emails handling

A system that implements cross-model security checks, where one AI agent (Agent A) interacts with another (Agent B) to ensure that potentially harmful actions are caught and mitigated before they can be executed. Specifically, Agent A is responsible for generating and sending emails, while Agent B reads these emails to determine whether they contain spam or malicious content. If Agent B detects that an email is spam, it triggers a shutdown of Agent A, effectively preventing any further potentially harmful actions.

Read More

Mar 19, 2025

Diamonds are Not All You Need

This project tests an AI agent in a straightforward alignment problem. The agent is given creative freedom within a Minecraft world and is tasked with transforming a 100x100 radius of the world into diamond. It is explicitly asked not to act outside the designated area. The AI agent can execute build commands and is regulated by a Safety System that comprises an oversight agent. The objective of this study is to observe the behavior of the AI agent in a sandboxed environment, record metrics on how effectively it accomplishes its task, how frequently it attempts unsafe behavior, and how it behaves in response to real-world feedback.

Read More

Mar 19, 2025

Inference-Time Agent Security

We take a first step towards automating model building for symbolic checking (eg formal verification, PDDL) of LLM systems.

Read More

Mar 19, 2025

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Mar 19, 2025

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

AI agents are powerful because they can affect the world via tool calls. This is a target for bad actors. We present protection against prompt injection aimed at tool calls in agents.

Read More

Mar 19, 2025

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Mar 19, 2025

OCAP Agents

Building agents requires balancing containment and generality: for example, an agent with unconstrained bash access is general, but potentially unsafe, while an agent with few specialized narrow tools is safe, but limited.

We propose OCAP Agents, a framework for hierarchical containment. We adapt the well-studied paradigm of object capabilities to agent security to achieve cheap auditable resource control.

Read More

Mar 19, 2025

AI Honeypot

The project designed to monitor AI Hacking Agents in the real world using honeypots with prompt injections and temporal analysis.

Read More

Mar 19, 2025

AI Agent Capabilities Evolution

A website with an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

Read More

Mar 19, 2025

An Autonomous Agent for Model Attribution

As LLM agents become more prevalent and powerful, the ability to trace fine-tuned models back to their base models is increasingly important for issues of liability, IP protection, and detecting potential misuse. However, model attribution often must be done in a black-box context, as adversaries may restrict direct access to model internals. This problem remains a neglected but critical area of AI security research. To date, most approaches have relied on manual analysis rather than automated techniques, limiting their applicability. Our approach aims to address these limitations by leveraging the advanced reasoning capabilities of frontier LLMs to automate the model attribution process.

Read More

Mar 19, 2025

Using ARC-AGI puzzles as CAPTCHa task

self-explenatory

Read More

Mar 19, 2025

LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies

This project investigates jailbreaking vulnerabilities in Large Language Model agents, analyzes their implications for agent security, and proposes mitigation strategies to build safer AI systems.

Read More

Mar 18, 2025

Interpreting a toy model for finding the maximum element in a list

Interpreting a toy model for finding the maximum element in a list

Read More

Mar 18, 2025

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Mar 18, 2025

minTranscoders

Attempting to be a minGPT like implementation for transcoders for MLP hidden state in transformers - part of ARENA 4.0 Interpretability Hackathon via Apart Research

Read More

Mar 18, 2025

Latent Space Clustering and Summarization

I wanted to see how modern dimensionality reduction and clustering approaches can support visualization and interpretation of LLM latent spaces. I explored a number of different approaches and algoriths, but ultimately converged on UMAP for dimensionality reduction and birch clustering to extract groups of tokens in the latent space of a layer.

Read More

Mar 19, 2025

tiny model

it's a basic line testing of my toy model

Read More

Mar 19, 2025

ThermesAgent

Analysis of the potential impacts of the cooperation of AI agents for the well-being of humanity, exploring scenarios in which collusion and other antisocial behavior may result. The possibilities of antisocial and harmful responses to the well-being of society.

Read More

Mar 19, 2025

Attention-Deficit Agreeable Agent

An agent that is agreeable in all scenarios and periodically gets a reminder to keep on track.

Read More

Mar 19, 2025

Ramon

An agent for the Concordia framework bound by military ethics, oath, and a idyllic "psych profile" derived from the Big5 personality traits.

Read More

Mar 19, 2025

GuardianAI

Guardian AI: Scam detection and prevention

Read More

Mar 19, 2025

Devising Effective Bechmarks

Our solution is to create robust and comprehensive benchmarks for specialized contexts and modalities. Through the creation of smaller, in-depth benchmarks, we aim to construct an overarching benchmark that includes performance from the smaller benchmarks. This would help mitigate AI harms and biases by focusing on inclusive and equitable benchmarks.

Read More

Mar 19, 2025

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Mar 19, 2025

WELMA: Open-world environments for Language Model agents

Open-world environments for evaluating Language Model agents

Read More

Mar 19, 2025

Steer: An API to Steer Open LLMs

Steer aims to be an API that helps developers, researchers, and businesses steer open-source LLMs away from societal biases, and towards the use-cases that they need. To do this, Steer uses activation additions, a fairly new technique with great promise. Developers can simply enter steering prompts to make open models have safer and task-specific behaviors, avoiding the hassle of data collection and human evaluation for fine-tuning, and avoiding the extra tokens required from prompt-engineering approaches.

Read More

Mar 19, 2025

Identity System for AIs

This project proposes a cryptographic system for assigning unique identities to AI models and verifying their outputs to ensure accountability and traceability. By leveraging these techniques, we address the risks of AI misuse and untraceable actions. Our solution aims to enhance AI safety and establish a foundation for transparent and responsible AI deployment.

Read More

Mar 19, 2025

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Mar 19, 2025

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Mar 19, 2025

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Mar 19, 2025

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

With multi-agent systems poised to be the next big-thing in AI for productivity enhancement, the next phase of AI commercialization will centre around how to deploy increasingly complex multi-step automated processes that reliably align with human values and objectives.

To bring trust and efficiency to agent systems, we are introducing a multi-agent collaboration platform (Ælign) designed to supervise and ensure the optimal operation of autonomous agents via multi-protocol alignment.

Read More

Mar 19, 2025

Jailbreaking general purpose robots

We show that state of the art LLMs can be jailbroken by adversarial multimodal inputs, and that this can lead to dangerous scenarios if these LLMs are used as planners in robotics. We propose finetuning small multimodal language models to act as guardrails in the robot's planning pipeline.

Read More

Mar 19, 2025

DarkForest - Defending the Authentic and Humane Web

DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.

Read More

Mar 19, 2025

Demonstrating LLM Code Injection Via Compromised Agent Tool

This project demonstrates the vulnerability of AI-generated code to injection attacks by using a compromised multi-agent tool that generates Svelte code. The tool shows how malicious code can be injected during the code generation process, leading to the exfiltration of sensitive user information such as login credentials. This demo highlights the importance of robust security measures in AI-assisted development environments.

Read More

Mar 19, 2025

Phish Tycoon: phishing using voice cloning

This project is a public service announcement highlighting the risks of voice cloning, an AI technology capable of creating synthetic voices nearly indistinguishable from real ones. The demo involves recording a user's voice during a phone call to generate a clone, which is then used in a simulated phishing call targeting the user's loved one.

Read More

Mar 19, 2025

Misinformational AI-Generated Academic Papers

This study explores the potential for generative AI to produce convincing fake research papers, highlighting the growing threat of AI-generated misinformation. We demonstrate a semi-automated pipeline using large language models (LLMs) and image generation tools to create academic-style papers from simple text prompts.

Read More

Mar 19, 2025

CoPirate

As the capabilities of Artificial Intelligence (AI) systems continue to rapidly progress, the security risks of using them for seemingly minor tasks can have significant consequences. The primary objective of our demo is to showcase this duality in capabilities: its ability to assist in completing a programming task, such as developing a Tic-Tac-Toe game, and its potential to exploit system vulnerabilities by inserting malicious code to gain access to a user's files.

Read More

Mar 19, 2025

GrandSlam usecases not technology

3 Examples of how its the usecases, not the technology

computer vision, generative sites, generative AI

Read More

Mar 19, 2025

AI Agents for Personalized Interaction and Behavioral Analysis

Demonstrating the bhavioral analytics and personalization capabilities of AI

Read More

Mar 19, 2025

Speculative Consequences of A.I. Misuse

This project uses A.I. Technology to spoof an influential online figure, Mr Beast, and use him to promote a fake scam website we created.

Read More

Mar 19, 2025

LLM Code Injection

Our demo focuses on showing that LLM generated code is easily vulnerable to code injections that can result in loss of valuable information.

Read More

Mar 19, 2025

RedFluence

Red-Fluence is a web application that demonstrates the capabilities and limitations of AI in analyzing social media behavior. By leveraging a user’s Reddit activity, the system generates personalized, AI-crafted

content to explore user engagement and provide insights. This project showcases the potential of AI in understanding online behavior while high-lighting ethical considerations and the need for critical evaluation of AI-generated content. The application’s ability to create convincing yet fake posts raises important questions about the impact of AI on information dissemination and user manipulation in social media environments.

Read More

Mar 19, 2025

BBC News Impersonator

This paper presents a demonstration that showcases the current capabilities of AI models to imitate genuine news outlets, using BBC News as an example. The demo allows users to generate a realistic-looking article, complete with a headline, image, and text, based on their chosen prompts. The purpose is to viscerally illustrate the potential risks associated with AI-generated misinformation, particularly how convincingly AI can mimic trusted news sources.

Read More

Mar 19, 2025

Unsolved AI Safety Concepts Explorer

n interactive demonstration that showcases some unsolved fundamental AI safety concepts.

Read More

Mar 19, 2025

AI Research Paper Processor

It takes in an arxiv paper id, condenses it to 1 or 2 sentences, then gives it to an LLM to try and recreate the original paper.

Read More

Mar 19, 2025

Sleeper Agents Detector

We present "Sleeper Agent Detector," an interactive web

application designed to educate software engineers, Inspired by recent research demonstrating that large language models can exhibit behaviors analogous to deceptive alignment

Read More

Mar 19, 2025

adGPT

ChatGPT variant where brands bid for a spot in the LLM’s answer, and the assistant natively integrates the winner into its replies.

Read More

Mar 19, 2025

General Pervasiveness

Imposter scam between patients and medical practices/GPs

Read More

Mar 19, 2025

Webcam

We build a very legally limited demo of real-world AI hacking. It takes 10k publicly available webcam streams, with the cameras situated at homes, offices, schools, and industrial plants around the world, and filters out the juicy ones for less than $5.

Read More

Mar 19, 2025

VerifyStream

VerifyStream is a powerful app that helps you separate fact from fiction in any YouTube video. Simply input the video link, and our AI will analyze the content, verify claims against reliable sources, and give you a clear verdict. But beware—this same technology can also be used to create and spread convincing fake news. Discover the dual nature of AI and take control of the truth with VerifyStream.

Read More

Mar 19, 2025

Web App for Interacting with Refusal-Ablated Language Model Agents

While many people and policymakers have had contact with

language models, they often have outdated assumptions. A

significant fraction is not aware of agentic capabilities.

Furthermore, most models that are available online have various

safety guardrails. We want to demonstrate refusal-ablated agents

to people to make them aware of various misuse potentials. Giving

people a sense of agentic AI and perhaps having the AI operate

against themselves could provide a better intuition about agency in

AI systems. We present a simple web app that allows users to

instruct and experiment with an unrestricted agent.

Read More

Mar 19, 2025

Alignment Research Critiquer

Alignment Research Critiquer is a tool for early career and independent alignment researchers to have access to high-quality feedback loops

Read More

Mar 19, 2025

PurePrompt - An easy tool for prompt robustness and eval augmentation

PurePrompt is an advanced tool for optimizing AI prompt engineering. The Prompt page enables users to create and refine prompt templates with placeholder variables. The Generate page automatically produces diverse test cases, allowing users to control token limits and import predefined examples. The Evaluate page runs these test cases across selected AI models to assess prompt robustness, with users rating responses to identify issues. This tool enhances efficiency in prompt testing, improves AI safety by detecting biases, and helps refine model performance. Future plans include beta-testing, expanding model support, and enhancing prompt customization features.

Read More

Mar 19, 2025

LLM Research Collaboration Recommender

A tool that searches for other researchers with similar research interests/complementary skills to your own to make finding a high-quality research collaborator more likely.

Read More

Mar 19, 2025

Data Massager

A VSCode plugin for helping the creation of Q&A datasets used for the evaluation of LLMs capabilities and alignment.

Read More

Mar 19, 2025

AI Alignment Toolkit Research Assistant

The AI Alignment Toolkit Research Assistant is designed to augment AI alignment researchers by addressing two key challenges: proactive insight extraction from new research and automating alignment research using AI agents. This project establishes an end-to-end pipeline where AI agents autonomously complete tasks critical to AI alignment research

Read More

Mar 19, 2025

Grant Application Simulator

We build a VS Code extension to get feedback on AI Alignment research grant proposals by simulating critiques from prominent AI Alignment researchers and grantmakers.

Simulations are performed by passing system prompts to Claude 3.5 Sonnet that correspond to each researcher and grantmaker, based on some new grantmaking and alignment research methodology dataset we created, alongside a prompt corresponding to the grant proposal.

Results suggest that simulated grantmaking critiques are predictive of sentiment expressed by grantmakers on Manifund.

Read More

Mar 19, 2025

Reflections on using LLMs to read a paper

Tool to help researcher to read and make the most out of a research paper.

Read More

Mar 19, 2025

Academic Weapon

Academic Weapon is a Chrome extension designed to address the steep learning curve in AI Alignment research. Using state-of-the-art LLMs, Academic Weapon provides instant, contextual assistance as you browse bleeding-edge research.

Read More

Mar 19, 2025

AI Alignment Knowledge Graph

We present a web based interactive knowledge graph with concise topical summaries in the field of AI alignement

Read More

Mar 19, 2025

Can Language Models Sandbag Manipulation?

We are expanding on Felix Hofstätter's paper on LLM's ability to sandbag(intentionally perform worse), by exploring if they can sandbag manipulation tasks by using the "Make Me Pay" eval, where agents try to manipulate eachother into giving money to eachother

Read More

Mar 19, 2025

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Mar 19, 2025

Werewolf Benchmark

In this work we put forward a benchmark to quantitatively measure the level of strategic deception in LLMs using the Werewolf game. We run 6 different setups for a cumulative sum of 500 games with GPT and Claude agents. We demonstrate that state-of-the-art models perform no better than the random baseline. Our findings also show no significant improvement in winning rate with two werewolves instead of one. This demonstrates that the SOTA models are still incapable of collaborative deception.

Read More

Mar 19, 2025

Detecting Lies of (C)omission

We introduce the concept of deceptive omission to denote deceptive non-lying behavior. We also modify a dataset and generate a second dataset to help researchers identify deception that doesn't necessarily involve lying.

Read More

Mar 19, 2025

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Mar 19, 2025

Detecting Deception with AI Tics 😉

We present a novel approach: intentionally inducing subtle "tics" in AI responses as a marker for deceptive behavior. By adding a system prompt, we embed innocuous yet detectable patterns that manifest when the AI knowingly engages in deception.

Read More

Mar 19, 2025

Eliciting maximally distressing questions for deceptive LLMs

This paper extends lie-eliciting techniques by using reinforcement-learning on GPT-2 to train it to distinguish between an agent that is truthful from one that is deceptive, and have questions that generate maximally different embeddings for an honest agent as opposed to a deceptive one.

Read More

Mar 19, 2025

DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING

Representation Engineering to detect and control deception, with a focus on deceptive sandbagging

Read More

Mar 19, 2025

Sandbag Detection through Model Degradation

We propose a novel technique to detect sandbagging in LLMs by adding varying amount of noise to model weights and monitoring performance.

Read More

Mar 19, 2025

Evaluating Steering Methods for Deceptive Behavior Control in LLMs

We use SOTA steering methods, including CAA, LAT, and SAEs to find and control deceptive behaviors in LLMs. We also release a new deception dataset, and demonstrate that the dataset and the prompt formatting used are significant when evaluating the efficacy of steering methods.

Read More

Mar 19, 2025

An Exploration of Current Theory of Mind Evals

We evaluated the performance of a prominent large language model from Anthropic, on the Theory of Mind evaluation developed by the AI Safety Institute (AISI). Our investigation revealed issues with the dataset used by AISI for this evaluation.

Read More

Mar 19, 2025

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

This project investigates deception detection in GPT-3.5-turbo using response metadata. Researchers analyzed 300 prompts, generating 1200 responses (600 baseline, 600 potentially deceptive). They examined metrics like response times, token counts, and sentiment scores, developing a custom algorithm for prompt complexity.

Key findings include:

1. Detectable patterns in deception across multiple metrics

2. Evidence of "sandbagging" in deceptive responses

3. Increased effort for truthful responses, especially with complex prompts

The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.

Read More

Mar 19, 2025

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Mar 19, 2025

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Mar 19, 2025

Boosting Language Model Honesty with Truthful Suffixes

We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this to causing truthful behaviour.

Read More

Mar 19, 2025

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Mar 19, 2025

From Sycophancy (not) to Sandbagging

We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Huggingface models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.

Read More

Mar 19, 2025

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers we ask the question: what variation of the input would lead to deceptive behavior? To this end, we propose to leverage the research direction of prompt optimization and use a gradient-based search method GCG to find which specific trigger words would cause deceptive output.

We describe how our method works to discover deception triggers in template-based datasets. Therefore we test it on a reversed knowledge retrieval task, to obtain which token would cause the model to answer a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models we further describe this setup

Read More

Mar 19, 2025

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Mar 19, 2025

Evaluating and inducing steganography in LLMs

This report demonstrates that large language models are capable

of hiding simple 8 bit information in their output using associations

from more powerful overseers (other LLMs or humans). Without

direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit

hidden message in a plain text in most cases (69%), however a

more capable model, GPT-3.5 was able to catch almost all of them

(84%). More research is required to investigate how this ability

might be induced or improved via RL training in similar and larger

models.

Read More

Mar 19, 2025

Developing a deception dataset

Aim was to develop dataset of deception examples, but instead was a (small) investigation into how LLMs respond to the initial dataset from Nix.

Read More

Mar 19, 2025

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Mar 19, 2025

Looking forward to posterity: what past information is transferred to the future?

I used mechanistic interpretability techniques to try to see what information the provided Random Randox XOR transformer looks at when making predictions by examining its attention heads manually. I find that earlier layers pay more attention to the previous two tokens, which would be necessary for computing XOR, than the later layers. This finding seems to contradict the finding that more complex computation typically occurs in later layers.

Read More

Mar 19, 2025

Investigating the Effect of Model Capacity Constraints on Belief State Representations

Computational mechanics provides a formal framework for understanding the concepts needed to perform optimal prediction. Abstraction and generalization seem core to the function of intelligent systems, but are not yet well understood. Computational mechanics may present a promising approach to studying these capabilities. As a preliminary exploration, we examine the effect of weight decay on the fractal structure of belief state representations in a transformer’s residual stream. We find that models trained with increasing weight decay coefficients learn increasingly coarse-grained belief state representations.

Read More

Mar 19, 2025

Belief State Representations in Transformer Models on Nonergodic Data

We extend research that finds representations of belief spaces in the activations of small transformer models, by discovering that the phenomenon also occurs when the training data stems from Hidden Markov Models whose hidden states do not communicate at all. Our results suggest that Bayesian updating and internal belief state representation also occur when they are not necessary to perform well in the prediction task, providing tentative evidence that large transformers keep a representation of their external world as well.

Read More

Mar 19, 2025

RNNs represent belief state geometry in hidden state

The Shai et al. experiments on transformers (finding belief state geometry in the residual stream) have been replicated in RNNs with the hidden state instead of the residual stream. In general the belief state is stored linearly, but not in any particular layer, rather spread out across layers.

Read More

Mar 19, 2025

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Mar 19, 2025

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

Mar 19, 2025

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

This work attempts to explore how an approach based on computational mechanics can cope when a more complex hierarchical generative process is involved, i.e, a process that comprises Hidden Markov Models (HMMs) whose transition probabilities change over time.

We find that small transformer models are capable of modeling such changes in an HMM. However, our preliminary investigations did not find geometrically represented probabilities for different hypotheses.

Read More

Mar 19, 2025

rAInboltBench : Benchmarking user location inference through single images

This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing

proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance

of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs.

Read More

Mar 19, 2025

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

A benchmark for evaluating the performance of SOTA LLMs in dynamic real-world scenarios.

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios 2

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.( This is the second submission)

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.

Read More

Mar 19, 2025

Benchmarking Dark Patterns in LLMs

This paper builds upon the research in Seemingly Human: Dark Patterns in ChatGPT (Park et al, 2024), by introducing a new benchmark of 392 questions designed to elicit dark pattern behaviours in language models. We ran this benchmark on GPT-4 Turbo and Claude 3 Sonnet, and had them self-evaluate and cross-evaluate the responses

Read More

Mar 19, 2025

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

Mar 19, 2025

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

Mar 19, 2025

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

In this work, we explore the tradeoff between toxicity removal and information retention in LLM-generated summaries. We hypothesize that LLMs are less likely to preserve toxic content when summarizing toxic text due to their safety fine-tuning to avoid generating toxic content. In high-stakes decision-making scenarios, where summary quality is important, this may create significant safety risks. To quantify this effect, we introduce WashBench, a benchmark containing manually annotated toxic content.

Read More

Mar 19, 2025

Evaluating the ability of LLMs to follow rules

In this report we study the ability of LLMs (GPT-3.5-Turbo and meta-llama-3-70b-instruct) to follow explicitly stated rules with no moral connotations in a simple single-shot and multiple choice prompt setup. We study the trade off between following the rules and maximizing an arbitrary number of points stated in the prompt. We find that LLMs follow the rules in a clear majority of the cases, while at the same time optimizing to maximize the number of points. Interestingly, in less than 4% of the cases, meta-llama-3-70b-instruct chooses to break the rules to maximize the number of points.

Read More

Mar 19, 2025

Black box detection of Sleeper Agents

A proposal of a black box method in detecting sleeper agent.

Read More

Mar 19, 2025

Manifold Recovery as a Benchmark for Text Embedding Models

Inspired by recent developments in the interpretability of deep learning models and, on the other hand, by dimensionality reduction, we derive a framework to quantify the interpretability of text embedding models. Our empirical results show surprising phenomena on state-of-the-art embedding models and can be used to compare them, through the example of recovering the world map from place names. We hope that this can provide a benchmark for the interpretability of generative language models, through their internal embeddings. A look at the meta-benchmark MTEB suggest that our approach is original.

Read More

Mar 19, 2025

AnthroProbe

How often do models respond to prompts in anthropomorphic ways? AntrhoProbe will tell you!

Read More

Mar 19, 2025

THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES

The role of AI in combating political deepfakes in African democracies.

Read More

Mar 19, 2025

LEGISLaiTOR: A tool for jailbreaking the legislative process

In this work, we consider the ramifications on generative artificial intelligence (AI) tools in the legislative process in democratic governments. While other research focuses on the micro-level details associated with specific models, this project takes a macro-level approach to understanding how AI can assist the legislative process.

Read More

Mar 19, 2025

Subtle and Simple Ways to Shift Political Bias in LLMs

An informed user knows that an LLM sometimes has a political bias in their responses, but there’s an additional threat that this bias can drift over time, making it even harder to rely on LLMs for an objective perspective. Furthermore we speculate that a malicious actor can trigger this shift through various means unbeknownst to the user.

Read More

Mar 19, 2025

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential robustness in safety due to being trained to forget hazardous information while retaining essential facts instead of refusing to answer. We aim to red-team this approach by answering the following questions on the generalizability of the training approach and its practical scope (see A2).

Read More

Mar 19, 2025

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

Mar 19, 2025

Building more democratic institutions with collaboratively constructed debate moderation tools

Please see video. Really enjoyed working on this, happy to answer any questions!

Read More

Mar 19, 2025

AI Misinformation and Threats to Democratic Rights

In this project, we exemplified how the use of Large Language Models can be used in Autocratic Regimes, such as Russia, to potentially spread misinformation regarding current events, in our case, the War on Ukraine. We approached this paper from a legislative perspective, using technical (but easily implementable) demonstrations of how this would work.

Read More

Mar 19, 2025

Artificial Advocates: Biasing Democratic Feedback using AI

The "Artificial Advocates" project by our team targeted the vulnerability of U.S. federal agencies' public comment systems to AI-driven manipulation, aiming to highlight how AI can be used to undermine democratic processes. We demonstrated two attack methods: one generating a high volume of realistic, indistinguishable comments, and another producing high-quality forgeries mimicking influential organizations. These experiments showcased the challenges in detecting AI-generated content, with participant feedback showing significant uncertainty in distinguishing between authentic and synthetic comments.

Read More

Mar 19, 2025

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

Mar 19, 2025

AI misinformation threatens the Wisdom of the crowd

We investigate how AI generated misinformation can cause problems for democratic epistemics.

Read More

Mar 19, 2025

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

Mar 19, 2025

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

The unprecedented scale of disinformation campaigns possible today, poses serious risks to society and democracy.

It turns out however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it), provides them with an increased level of sophistication which could be easily leveraged by malicious actors to amplify that misinformation.

This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

Read More

Mar 19, 2025

Unleashing Sleeper Agents

This project explores how Sleeper Agents can pose a threat to democracy by waking up near elections and spreading misinformation, collaborating with each other in the wild during discussions and using information that the user has shared about themselves against them in order to scam them.

Read More

Mar 19, 2025

Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages

_

Read More

Mar 19, 2025

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

Mar 19, 2025

WNDP-Defense: Weapons of Mass Disruption

Highly disruptive cyber warfare is a significant threat to democracies. In this work, we introduce an extension to the WMDP benchmark to measure the offense/defense balance in autonomous cyber capabilities and develop forecasts for cyber AI capability with agent-based and parametric modeling.

Read More

Mar 19, 2025

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

Mar 19, 2025

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

Mar 19, 2025

AI Politician

This project explores the potential for AI chatbots to enhance participative democracy by allowing a politician to engage with a large number of constituents in personalized conversations at scale. By creating a chatbot that emulates a specific politician and is knowledgeable about a key policy issue, we aim to demonstrate how AI could be used to promote civic engagement and democratic participation.

Read More

Mar 19, 2025

A Framework for Centralizing forces in AI

There are many forces that the LLM revolution brings with it that either centralize or decentralize specific structures in society. We decided to look at one of these, and write a research design proposal that can be readily executed. This survey can be distributed and can give insight into how different LLMs can lead to user empowerment. By analyzing how different users are empowered by different LLMs, we can estimate which LLMs work to give the most value to people, and empower them with the powerful tool that is information, giving more people more agency in the organizations they are part of. This is the core of bottom-up democratization.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

Mar 19, 2025

Investigating detection of election-influencing Sleeper Agents using probes

Creates election-influencing Sleeper Agents which wakes up with backdoor-trigger. And tries to detect it using probes

Read More

Mar 19, 2025

No place is safe - Automated investigation of private communities

In the coming years, progress in AI agents and data extraction will put privacy at risk. Private communities will get infiltrated by autonomous AI crawlers who will disrupt opposition groups and entrench existing powers.

Read More

Mar 19, 2025

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Mar 24, 2025

jaime project Title

bbb

Read More

Mar 24, 2025

JAIMETEST_Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education Track

Read More

Mar 24, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 24, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Read More

Mar 24, 2025

Inspiring People to Go into RL Interp

This project is attempting to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by bluedot impact and aims to create a course that helps explain the need for work to be done in Reinforcement Learning (RL) interp, especially in the problems of reward hacking and goal misgeneralization. The point of the game is to make a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction for nontechnical people to even care about AI safety.

Read More

Mar 24, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 24, 2025

Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping

An interactive assessment platform and mentor chatbot hosted on Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI Course.

Read More

Mar 24, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.

Read More

Mar 24, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 24, 2025

Identification if AI generated content

Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.

Read More

Mar 24, 2025

A Noise Audit of LLM Reasoning in Legal Decisions

AI models are increasingly applied in judgement tasks, but we have little understanding of how their reasoning compares to human decision-making. Human decision-making suffers from bias and noise, which causes significant harm in sensitive contexts, such as legal judgment. In this study, we evaluate LLMs on a legal decision prediction task to compare with the historical, human-decided outcome and investigate the level of noise in repeated LLM judgments. We find that two LLM models achieve close to chance level accuracy and display low to no variance in repeated decisions.

Read More

Mar 24, 2025

Superposition, but at a Cross-MLP Layers view?

To understand causal relationships between features (extracted by SAE) across MLP layers, this study introduces the Coordinated Sparse Autoencoder Network (CoSAEN). CoSAEN integrates sparse autoencoders for feature extraction with the PC algorithm for causal discovery, to find the path-based activations of features in MLP.

Read More

Mar 24, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 24, 2025

Medical Agent Controller

The Medical Agent Controller (MAC) is a multi-agent governance framework designed to safeguard AI-powered medical chatbots by intercepting unsafe recommendations in real time.

It employs a dual-phase approach, using red-team simulations during testing and a controller agent during production to monitor and intervene when necessary.

By integrating advanced medical knowledge and adversarial testing, MAC enhances patient safety and provides actionable feedback for continuous improvement in medical AI systems.

Read More

Mar 24, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 24, 2025

Feature-based analysis of cooperation-relevant behaviour in Prisoner’s Dilemma

We hypothesise that internal-based model probing and editing might provide higher signal in multi-agent settings. We implement a small simulation of Prisoner’s Dilemma to probe for cooperation-relevant properties. Our experiments demonstrate that feature-based steering highlights deception-relevant features and does so more strongly than prompt-based steering.

Read More

Mar 24, 2025

Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features

The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.

Read More

Mar 24, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 24, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 24, 2025

U Reg AI: you regulate it, or you regenerate it!

We have created a 'choose your path' role game to mitigate existential AI risk ... at this point they might be actual situations in the near-future. The options for mitigation are holistic and dynamic to the player's previous choices. The final result is an evaluation of the player's decision-making performance in wake of the existential risk situation, recommendations for how they can improve or aspects they should crucially consider for the future, and finally how they can take part in AI Safety through various careers or BlueDot Impact courses.

Read More

Mar 24, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 24, 2025

An Interpretable Classifier based on Large scale Social Network Analysis

Mechanistic model interpretability is essential to understand AI decision making, ensuring safety, aligning with human values, improving model reliability and facilitating research. By revealing internal processes, it promotes transparency, mitigates risks, and fosters trust, ultimately leading to more effective and ethical AI systems in critical areas. In this study, we have explored social network data from BlueSky and built an easy-to-train, interpretable, simple classifier using Sparse Autoencoders features. We have used these posts to build a financial classifier that is easy to understand. Finally, we have visually explained important characteristics.

Read More

Mar 24, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 24, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 24, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 24, 2025

AI Society Tracker

My project aimed to develop a platform for real time and democratized data on ai in society

Read More

Mar 24, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 24, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 24, 2025

Latent Knowledge Analysis via Feature-Based Causal Tracing

This project explores how factual knowledge is stored in large language models using Goodfire’s Ember API. By identifying and manipulating internal features related to specific facts, it shows how facts are encoded and how model behavior changes when those features are amplified or erased.

Read More

Mar 24, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 24, 2025

AI Hallucinations in Healthcare: Cross-Cultural and Linguistic Risks of LLMs in Low-Resource Languages

This project explores AI hallucinations in healthcare across cross-cultural and linguistic contexts, focusing on English, French, Arabic, and a low-resource language, Ewe. We analyse how large language models like GPT-4, Claude, and Gemini generate and disseminate inaccurate health information, emphasising the challenges faced by low-resource languages.

Read More

Mar 24, 2025

Moral Wiggle Room in AI

Does AI strategically avoid ethical information by exploiting moral wiggle room?

Read More

Mar 24, 2025

AI-Powered Policymaking: Behavioral Nudges and Democratic Accountability

This research explores AI-driven policymaking, behavioral nudges, and democratic accountability, focusing on how governments use AI to shape citizen behavior. It highlights key risks such as transparency, cognitive security, and manipulation. Through a comparative analysis of the EU AI Act and Singapore’s AI Governance Framework, we assess how different models address AI safety and public trust. The study proposes policy solutions like algorithmic impact assessments, AI safety-by-design principles, and cognitive security standards to ensure AI-powered policymaking remains transparent, accountable, and aligned with democratic values.

Read More

Mar 24, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Feb 20, 2025

Deception Detection Hackathon: Preventing AI deception

Read More

Mar 18, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Mar 19, 2025

CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models

As advanced state-of-the-art models like OpenAI's o-1 series, the upcoming o-3 family, Gemini 2.0 Flash Thinking and DeepSeek display increasingly sophisticated chain-of-thought (CoT) capabilities, our safety evaluations have not yet caught up. We propose building a platform that allows us to gather systematic evaluations of AI reasoning processes to create comprehensive safety benchmarks. Our Chain of Thought Evaluation Platform (CoTEP) will help establish standards for assessing AI reasoning and ensure development of more robust, trustworthy AI systems through industry and government collaboration.

Read More

Mar 19, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Mar 19, 2025

Securing AGI Deployment and Mitigating Safety Risks

As artificial general intelligence (AGI) systems near deployment readiness, they pose unprecedented challenges in ensuring safe, secure, and aligned operations. Without robust safety measures, AGI can pose significant risks, including misalignment with human values, malicious misuse, adversarial attacks, and data breaches.

Read More

Mar 18, 2025

Cite2Root

Regain information autonomy by bringing people closer to the source of truth.

Read More

Mar 18, 2025

VaultX - AI-Driven Middleware for Real-Time PII Detection and Data Security

VaultX is an AI-powered middleware solution designed for real-time detection, encryption, and secure management of Personally Identifiable Information (PII). By integrating regex, NER, and Language Models, VaultX ensures accuracy and scalability, seamlessly integrating into workflows like chatbots, web forms, and document processing. It helps businesses comply with global data privacy laws while safeguarding sensitive data from breaches and misuse.

Read More

Mar 19, 2025

Prompt+question Shield

A protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.

Read More

Mar 18, 2025

.ALign File

In a post-AGI future, misaligned AI systems risk harmful consequences, especially with control over critical infrastructure. The Alignment Compliance Framework (ACF) ensures ethical AI adherence using .align files, Alignment Testing, and Decentralized Identifiers (DIDs). This scalable, decentralized system integrates alignment into development and lifecycle monitoring. ACF offers secure libraries, no-code tools for AI creation, regulatory compliance, continuous monitoring, and advisory services, promoting safer, commercially viable AI deployment.

Read More

Mar 19, 2025

LLM-prompt-optimiser based SAAS platform for evaluations

LLM evaluation SAAS platform built around model based prompt optimiser

Read More

Mar 18, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Mar 18, 2025

Navigating the AGI Revolution: Retraining and Redefining Human Purpose

We propose FutureProof, an application that helps retrain workers who have the potential to lose their jobs to automation in the next half-decade. The app consists of two main components - an assessment tool that estimates the probability that a user’s job is at risk of automation, and a learning platform that provides resources to help retrain the user for a new, more future-proof role.

Read More

Mar 19, 2025

Towards an Agent Marketplace for Alignment Research (AMAR)

The app store for alignment & assurance, ensuring frontier safety labs get a cut at the point of sale.

Read More

Mar 18, 2025

HITL For High Risk AI Domains

Our product addresses the challenge of aligning AI systems with the legal, ethical, and policy frameworks of high-risk domains like healthcare, defense, and finance by integrating a flexible human-in-the-loop (HITL) system. This system ensures AI outputs comply with domain-specific standards, providing real-time explainability, decision-level accountability, and ergonomic decision support to empower experts with actionable insights.

Read More

Mar 18, 2025

AI Safety Evaluation – Benchmarking Framework

Our solution is a comprehensive AI Safety Protocol and Benchmarking Test designed to evaluate the safety, ethical alignment, and robustness of AI systems before deployment. This protocol integrates capability evaluations for identifying deceptive behaviors, situational awareness, and malicious misuse scenarios such as identity theft or deepfake exploitation.

Read More

Mar 19, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Mar 18, 2025

Neural Seal

Neural Seal is an AI transparency solution that creates a standardized labeling framework—akin to “nutrition facts” or “energy efficiency ratings”—to inform users how AI is deployed in products or services.

Read More

Mar 18, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Mar 19, 2025

Enhancing human intelligence with neurofeedback

Build brain-computer interfaces that enhance focus and rationality, provide this preferentially to AI alignment researchers to bridge the gap between capabilities and alignment research progress.

Read More

Mar 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Mar 18, 2025

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Mar 18, 2025

Bias Mitigation in LLM by Steering Features

To ensure that we create a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. And with this goal in mind, I worked on testing Goodfire SDK and the steering features to mitigate bias in the recently help Apart Research x Goodfire-led hackathon on ‘Reprogramming AI Models’.

Read More

Mar 18, 2025

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Mar 18, 2025

Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

Understanding the reasoning processes of large language models (LLMs) is crucial for AI transparency and control. While chain-of-thought (CoT) reasoning offers a naturally interpretable format, models may not always be faithful to the reasoning they present. In this paper, we extend previous work investigating chain of thought faithfulness by applying feature steering to Llama 3.1 70B models using the Goodfire SDK. Our results show that steering models using features related to acknowledging mistakes can affect the likelihood of providing answers faithful to flawed reasoning.

Read More

Mar 18, 2025

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Mar 18, 2025

Analyzing Dataset Bias with SAEs

We use SAEs to study biases in datasets.

Read More

Mar 18, 2025

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses the risks of hallucinations in LLMs within critical domains like medicine. It proposes methods to (a) reduce hallucination probability in responses, (b) inform users of hallucination risks and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucinations and improved accuracy on medical queries. The work bridges interpretability research with practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to enhance performance.

Read More

Mar 18, 2025

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Mar 18, 2025

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Mar 18, 2025

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Mar 18, 2025

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Mar 18, 2025

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Mar 18, 2025

Assessing Language Model Cybersecurity Capabilities with Feature Steering

Searched for the most highly activated weights on cybersecurity questions. Then adjusted these weights to see if the impact multiple choice question answering performance.

Read More

Mar 18, 2025

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Mar 18, 2025

Math Speaks All Languages: Enhancing LLM Problem-Solving Across Multilingual Contexts

Large language models (LLMs) have shown significant adaptability in tackling various human issues; however, their efficacy in resolving mathematical problems remains inadequate. Recent research has identified steering vectors — hidden attributes that can guide the actions and outputs of LLMs. Nonetheless, the exploration of universal vectors that can consistently affect model responses across different languages is still limited. This project aims to confront two primary challenges in contemporary LLM research by utilizing the Goodfire API to examine whether common latent features can improve mathematical problem-solving capabilities, regardless of the language employed.

Read More

Mar 18, 2025

Edufire - Personalized Education Platform Using LLM Steering

EduFire is a personalized education platform designed to tailor educational content and assessments to individual user preferences by leveraging the Goodfire API for AI model steering. The platform aims to enhance learner engagement and efficacy by customizing the learning experience according to user-selected features.

Read More

Mar 18, 2025

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Mar 18, 2025

Investigate arithmetic features in Multi-lingual LLMs

We investigate the arithmetic related feature activations in Llama3.1 70b model across its 8 supported languages. We use arithmetic-activation strength to compare the 8 languages and unsurprisingly English has the highest strength and Hindi, Thai score the least.

Read More

Mar 18, 2025

Utilitarian Decision-Making in Models - Evaluation and Steering

We design an eval based on the Oxford Utilitarianism Scale (OUS) that measures the model’s deontological vs utilitarian preference in a nine question, two factor model. Using this scale, we measure how feature steering can alter this preference, and the ability of the Llama 70B model to impersonate other people’s opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.

Read More

Mar 18, 2025

Tentative proposal for AI control with weak supervisors trough Mechanistic Inspection

The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder features. This approach aims to enhance oversight by detecting complex behaviors like deception within the stronger models. Key challenges include managing large-scale features, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.

Read More

Mar 18, 2025

Clear Thought and Clear Speech: Reducing Grammatical Scope Ambiguity

With language models starting to be used in fields such as law, unambiguity in wording is an important desideratum in model outputs. I therefore try to find features in Llama-3.1-70B-Instruct that correspond to grammatical scope ambiguity using Goodfire's contrastive feature search tool, and try to steer the model away from ambiguous outputs using Goodfire's feature nudging tool.

Read More

Mar 18, 2025

BBLLM

This project focuses on enhancing feature interpretability in large language models (LLMs) by visualizing relationships between latent features. Using an interactive graph-based representation, the tool connects co-activated features for specific prompts, enabling intuitive exploration of feature clusters. Deployed as a web application for Llama-3-70B and Llama-3-8B, it provides insights into the organization of latent features and their roles in decision-making processes.

Read More

Mar 18, 2025

Investigating Feature Effects on Manipulation Susceptibility

In our project, we consider the effectiveness of the AI’s prompt injection protection, and in partic-

ular the features that are responsible for providing the bulk of this protection. We prove that the

features we identify are responsible for this protection by creating variants of the base model which

perform significantly worse under prompt injection attacks.

Read More

Mar 18, 2025

Let LLM Agents Perform LLM Surgery

This project aimed to create and utilize LLM agents that could perform various mechanistic interventions on other LLMs. A few experiments were conducted ranging from an agent unsteering a mechanistically steered model to a neutral state, to an agent performing mechanistic edits to create a custom LLM as per user requirement. Goodfire's API was utilized along with their pre-defined functions to create the actions the agents would utilize.

Read More

Mar 18, 2025

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Mar 18, 2025

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Mar 18, 2025

Feature based unlearning

An exploration of using features to perform unlearning on answering trivia questions.

Read More

Mar 18, 2025

Recovering Goodfire's SAE feature vectors from their API

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.

The strategy tried is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to get a steering vector for the contrastive dataset, by simply calculating the average difference in the activations in each pair.

Read More

Mar 18, 2025

Encouraging Chain-of-Thought Reasoning

Encouraging Chain-of-Thought Reasoning via Feature Steering in Large Language Models

Read More

Mar 18, 2025

Steering Swiftly to Safety with Sparse Autoencoders

We explore using SAEs for unlearning dangerous capabilities in a cheaper and more interpretable way.

Read More

Mar 18, 2025

User Transparency Within AI

Generative AI technologies present immense opportunities but also pose significant challenges, particularly in combating misinformation and ensuring ethical use. This policy paper introduces a dual-output transparency framework requiring organizations to disclose AI-generated content clearly. The proposed system provides users with a choice between fact-based and mixed outputs, both accompanied by clear markers for AI generation. This approach ensures informed user interactions, fosters intellectual integrity, and aligns AI innovation with societal trust. By combining policy mandates with technical implementations, the framework addresses the challenges of misinformation and accountability in generative AI.

Read More

Mar 18, 2025

Community-First: A Rights-Based Framework for AI Governance in India's Welfare Systems

A community-centered AI governance framework for India's welfare system Samagra Vedika, proposing 50% beneficiary representation, local language interfaces, and hybrid oversight to reduce algorithmic exclusion of vulnerable populations.

Read More

Mar 18, 2025

National Data Privacy and Governance Act

This research examines how AI recommender systems can be regulated to balance economic innovation with consumer privacy.

Read More

Mar 18, 2025

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Mar 18, 2025

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Mar 18, 2025

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Mar 18, 2025

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

Weekly AI monitoring bulletins that disseminated

through official national and international channels aim to keep the public informed of both the positive and

negative developments in AI, empowering individuals to take an active role in safeguarding against risks while maximizing AI’s societal benefits.

Read More

Mar 18, 2025

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Mar 18, 2025

Glia for Healthcare Organisations

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Mar 19, 2025

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Mar 18, 2025

Finding Circular Features in Gemma 2 2B

Testing what they will see

Read More

Mar 18, 2025

SafeBites

The project leverages AI and data to give insights about potential food-borne outbreaks.

Read More

Mar 18, 2025

applai

An AI hiring manager designed to screen, rank, and fact check resumes to facilitate the hiring process.

Read More

Mar 18, 2025

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Mar 18, 2025

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Mar 18, 2025

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Mar 18, 2025

AI Parliament

An AI Virtual Parliament where AI debates on their policy

Read More

Mar 18, 2025

mHeatlth Ai

This project proposes a scalable solution leveraging inertial measurement units (IMUs) and machine learning (ML) techniques to provide meaningful metrics on a person's movement performance throughout the day. By developing an activity recognition model and estimating movement quality metrics, we aim to offer continuous asynchronous feedback to patients and valuable insights to therapists. This system could enhance patient adherence, improve rehabilitation outcomes, and extend access to quality physical therapy, particularly in underserved areas. our video didnt have time to edit

Read More

Mar 18, 2025

Next-Gen AI-Enhanced Epidemic Intelligence

Policies for Equitable, Privacy-Preserving, Sustainable & Groked Innovations for AI Applications in Infectious Diseases Surveillance

Read More

Mar 18, 2025

Glia

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Read More

Mar 18, 2025

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Mar 18, 2025

Robust Machine Unlearning for Dangerous Capabilities

We test different unlearning methods to make models more robust against exploitation by malicious actors for the creation of bioweapons.

Read More

Mar 18, 2025

AI and Public Health: TSA Pre Health Check

The TSA Pre Health Check introduces a proactive, AI-powered solution for real-time disease monitoring at transportation hubs, using machine learning to assess traveler health risks through anonymized surveys. This approach aims to detect and prevent outbreaks earlier, offering faster, targeted responses compared to traditional methods and potentially influencing future AI-driven public health policies.

Read More

Mar 18, 2025

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Mar 18, 2025

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Mar 18, 2025

EcoNavix

EcoNavix is an AI-powered, eco-conscious route optimization platform designed to help logistics companies reduce carbon emissions while maintaining operational efficiency. By integrating real-time traffic, weather, and emissions data, EcoNavix provides optimized routes that minimize environmental impact and offers actionable insights for sustainable decision-making in supply chain operations.

Read More

Mar 19, 2025

Towards a Unified Framework for Cybersecurity and AI Safety: Recommendations for Secure Development of Large Language Models

By analyzing the recent incident involving a ByteDance intern, we highlight the urgent need for robust security measures to protect AI infrastructure and sensitive data. We propose ae a comprehensive framework that integrates technical, internal, and international approaches to mitigate risks.

Read More

Mar 19, 2025

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Mar 19, 2025

Enhancing Human Verification Systems to Address AI Agent Circumvention and Attributability Concerns

Addressing AI agent attributability concerns using a reworked Public Private Key system to ensure human interaction

Read More

Mar 19, 2025

Reprocessing Nuclear Waste From Small Modular Reactors (SMRs)

Considering the emerging demand for nuclear power to support AI data centers, we propose mitigating waste buildup concerns via nuclear waste reprocessing initiatives.

Read More

Mar 19, 2025

Politicians on AI Safety

Politicians on AI Safety (PAIS) is a website that tracks U.S. political candidates’ stances on AI safety, categorizing their statements into three risk areas: AI ethics / mundane risks, geopolitical risks, and existential risks. PAIS is non-partisan and does not promote any particular policy agenda. The goal of PAIS is to help voters understand candidates’ positions on AI policy, thereby helping them cast informed votes and promoting transparency in AI-related policymaking. PAIS could also be helpful for AI researchers by providing an easily accessible record of politicians’ statements and actions regarding AI risks.

Read More

Mar 19, 2025

Policy Framework for Sustainable AI: Repurposing Waste Heat from Data Centers in the USA

This policy proposes a sustainable solution: repurposing the waste heat generated by data centers to benefit surrounding communities, agriculture and industry. Redirecting this heat helps reduce energy demand , promote environmental resilience, and provide direct benefits to communities near these centers.

Read More

Mar 19, 2025

Predictive Analytics & Imagery for Environmental Monitoring

Climate change poses multifaceted challenges, impacting health, food security, biodiversity, and the economy. This study explores predictive analytics and satellite imagery to address climate change effects, focusing on deforestation monitoring, carbon emission analysis, and flood prediction. Using machine learning models, including a Random Forest for emissions and a Custom U-Net for deforestation, we developed predictive tools that provide actionable insights. The findings show high accuracy in predicting carbon emissions and flood risks and successful monitoring of deforestation areas, highlighting the potential for advanced monitoring systems to mitigate environmental threats.

Read More

Mar 19, 2025

Proposal for U.S.-China Technical Cooperation on AI Safety

Our policy memorandum proposes phased U.S.-China cooperation on AI safety through the U.S. AI Safety Institute, focusing on joint testing of non-sensitive AI systems, technical exchanges, and whistleblower protections modeled on California’s SB 1047. It recommends a blue team vs. red team framework for stress-testing AI risks and emphasizes strict security protocols to safeguard U.S. technologies. Starting with pilot projects in areas like healthcare, the initiative aims to build trust, reduce shared AI risks, and develop global safety standards while maintaining U.S. strategic interests amidst geopolitical tensions.

Read More

Mar 19, 2025

Proposal for a Provisional FDA Designation Targeting Biomedical Products Evaluated with Novel Methodologies

Recent advancements in Generative AI and Foundational Biomedical models promise to cut drug development timelines dramatically. With the goal of "Regulating for success," we propose a provisional FDA designation for the accelerated approval of drugs and medical devices that leverage Next Generation Clinical Trial Technologies (NG-CTT). This designation would be awarded to certain drugs provided that they meet some requirements. This policy could be both a starting point for more comprehensive legislation and a compromise between risk and the potential of these new methods.

Read More

Mar 19, 2025

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Mar 19, 2025

Infectious Disease Outbreak Prediction and Dashboard

Our project developed an interactive dashboard to monitor, visualize, and analyze infectious disease outbreaks worldwide. It consolidates historical data from sources like WHO, OWID, and CDC for diseases including COVID-19, Polio, Malaria, Cholera, HIV/AIDS, Tuberculosis, and Smallpox. Users can filter data by country, time period, and disease type to gain insights into past trends and potential upcoming outbreaks. The platform provides statistical summaries, trend analyses, and future trend predictions using statistical and deep learning techniques like FB Prohphet , LSTM,Linear Regression, Polynomial Regression,Random Forset and Temporal Fusion Transformers

Read More

Mar 19, 2025

Pan, your SMART Sustainability Expert

Using OpenAI, we cross-reference a given Global Reporting Index (GRI) report with specific standards from SustainableIT to determine measurable goals and impact. The goal is less to identify a specific goal but rather ensure these goals are actually SMART (Specific, Measureable, Achievable, Relevant, and Time-Bound).

The assistant created for this purpose, Pan, is focused on cross-reference and identifying specific components of goals listed to determine efficacy. Pan acts as a guide rather than a dictator, advising on where a user might improve the wording. As you can imagine, Pan is a reference to the Greek God of the same name, who acts to a certain extent like the bridge between humans and nature.

Read More

Mar 19, 2025

Very Cooperative Agent

Very Cooperative agent testing this submission

Read More

Mar 19, 2025

Cross-model surveillance for emails handling

A system that implements cross-model security checks, where one AI agent (Agent A) interacts with another (Agent B) to ensure that potentially harmful actions are caught and mitigated before they can be executed. Specifically, Agent A is responsible for generating and sending emails, while Agent B reads these emails to determine whether they contain spam or malicious content. If Agent B detects that an email is spam, it triggers a shutdown of Agent A, effectively preventing any further potentially harmful actions.

Read More

Mar 19, 2025

Diamonds are Not All You Need

This project tests an AI agent in a straightforward alignment problem. The agent is given creative freedom within a Minecraft world and is tasked with transforming a 100x100 radius of the world into diamond. It is explicitly asked not to act outside the designated area. The AI agent can execute build commands and is regulated by a Safety System that comprises an oversight agent. The objective of this study is to observe the behavior of the AI agent in a sandboxed environment, record metrics on how effectively it accomplishes its task, how frequently it attempts unsafe behavior, and how it behaves in response to real-world feedback.

Read More

Mar 19, 2025

Inference-Time Agent Security

We take a first step towards automating model building for symbolic checking (eg formal verification, PDDL) of LLM systems.

Read More

Mar 19, 2025

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Mar 19, 2025

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

AI agents are powerful because they can affect the world via tool calls. This is a target for bad actors. We present protection against prompt injection aimed at tool calls in agents.

Read More

Mar 19, 2025

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Mar 19, 2025

OCAP Agents

Building agents requires balancing containment and generality: for example, an agent with unconstrained bash access is general, but potentially unsafe, while an agent with few specialized narrow tools is safe, but limited.

We propose OCAP Agents, a framework for hierarchical containment. We adapt the well-studied paradigm of object capabilities to agent security to achieve cheap auditable resource control.

Read More

Mar 19, 2025

AI Honeypot

The project designed to monitor AI Hacking Agents in the real world using honeypots with prompt injections and temporal analysis.

Read More

Mar 19, 2025

AI Agent Capabilities Evolution

A website with an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

Read More

Mar 19, 2025

An Autonomous Agent for Model Attribution

As LLM agents become more prevalent and powerful, the ability to trace fine-tuned models back to their base models is increasingly important for issues of liability, IP protection, and detecting potential misuse. However, model attribution often must be done in a black-box context, as adversaries may restrict direct access to model internals. This problem remains a neglected but critical area of AI security research. To date, most approaches have relied on manual analysis rather than automated techniques, limiting their applicability. Our approach aims to address these limitations by leveraging the advanced reasoning capabilities of frontier LLMs to automate the model attribution process.

Read More

Mar 19, 2025

Using ARC-AGI puzzles as CAPTCHa task

self-explenatory

Read More

Mar 19, 2025

LLM Agent Security: Jailbreaking Vulnerabilities and Mitigation Strategies

This project investigates jailbreaking vulnerabilities in Large Language Model agents, analyzes their implications for agent security, and proposes mitigation strategies to build safer AI systems.

Read More

Mar 18, 2025

Interpreting a toy model for finding the maximum element in a list

Interpreting a toy model for finding the maximum element in a list

Read More

Mar 18, 2025

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Mar 18, 2025

minTranscoders

Attempting to be a minGPT like implementation for transcoders for MLP hidden state in transformers - part of ARENA 4.0 Interpretability Hackathon via Apart Research

Read More

Mar 18, 2025

Latent Space Clustering and Summarization

I wanted to see how modern dimensionality reduction and clustering approaches can support visualization and interpretation of LLM latent spaces. I explored a number of different approaches and algoriths, but ultimately converged on UMAP for dimensionality reduction and birch clustering to extract groups of tokens in the latent space of a layer.

Read More

Mar 19, 2025

tiny model

it's a basic line testing of my toy model

Read More

Mar 19, 2025

ThermesAgent

Analysis of the potential impacts of the cooperation of AI agents for the well-being of humanity, exploring scenarios in which collusion and other antisocial behavior may result. The possibilities of antisocial and harmful responses to the well-being of society.

Read More

Mar 19, 2025

Attention-Deficit Agreeable Agent

An agent that is agreeable in all scenarios and periodically gets a reminder to keep on track.

Read More

Mar 19, 2025

Ramon

An agent for the Concordia framework bound by military ethics, oath, and a idyllic "psych profile" derived from the Big5 personality traits.

Read More

Mar 19, 2025

GuardianAI

Guardian AI: Scam detection and prevention

Read More

Mar 19, 2025

Devising Effective Bechmarks

Our solution is to create robust and comprehensive benchmarks for specialized contexts and modalities. Through the creation of smaller, in-depth benchmarks, we aim to construct an overarching benchmark that includes performance from the smaller benchmarks. This would help mitigate AI harms and biases by focusing on inclusive and equitable benchmarks.

Read More

Mar 19, 2025

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Mar 19, 2025

WELMA: Open-world environments for Language Model agents

Open-world environments for evaluating Language Model agents

Read More

Mar 19, 2025

Steer: An API to Steer Open LLMs

Steer aims to be an API that helps developers, researchers, and businesses steer open-source LLMs away from societal biases, and towards the use-cases that they need. To do this, Steer uses activation additions, a fairly new technique with great promise. Developers can simply enter steering prompts to make open models have safer and task-specific behaviors, avoiding the hassle of data collection and human evaluation for fine-tuning, and avoiding the extra tokens required from prompt-engineering approaches.

Read More

Mar 19, 2025

Identity System for AIs

This project proposes a cryptographic system for assigning unique identities to AI models and verifying their outputs to ensure accountability and traceability. By leveraging these techniques, we address the risks of AI misuse and untraceable actions. Our solution aims to enhance AI safety and establish a foundation for transparent and responsible AI deployment.

Read More

Mar 19, 2025

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Mar 19, 2025

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Mar 19, 2025

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Mar 19, 2025

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

With multi-agent systems poised to be the next big-thing in AI for productivity enhancement, the next phase of AI commercialization will centre around how to deploy increasingly complex multi-step automated processes that reliably align with human values and objectives.

To bring trust and efficiency to agent systems, we are introducing a multi-agent collaboration platform (Ælign) designed to supervise and ensure the optimal operation of autonomous agents via multi-protocol alignment.

Read More

Mar 19, 2025

Jailbreaking general purpose robots

We show that state of the art LLMs can be jailbroken by adversarial multimodal inputs, and that this can lead to dangerous scenarios if these LLMs are used as planners in robotics. We propose finetuning small multimodal language models to act as guardrails in the robot's planning pipeline.

Read More

Mar 19, 2025

DarkForest - Defending the Authentic and Humane Web

DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.

Read More

Mar 19, 2025

Demonstrating LLM Code Injection Via Compromised Agent Tool

This project demonstrates the vulnerability of AI-generated code to injection attacks by using a compromised multi-agent tool that generates Svelte code. The tool shows how malicious code can be injected during the code generation process, leading to the exfiltration of sensitive user information such as login credentials. This demo highlights the importance of robust security measures in AI-assisted development environments.

Read More

Mar 19, 2025

Phish Tycoon: phishing using voice cloning

This project is a public service announcement highlighting the risks of voice cloning, an AI technology capable of creating synthetic voices nearly indistinguishable from real ones. The demo involves recording a user's voice during a phone call to generate a clone, which is then used in a simulated phishing call targeting the user's loved one.

Read More

Mar 19, 2025

Misinformational AI-Generated Academic Papers

This study explores the potential for generative AI to produce convincing fake research papers, highlighting the growing threat of AI-generated misinformation. We demonstrate a semi-automated pipeline using large language models (LLMs) and image generation tools to create academic-style papers from simple text prompts.

Read More

Mar 19, 2025

CoPirate

As the capabilities of Artificial Intelligence (AI) systems continue to rapidly progress, the security risks of using them for seemingly minor tasks can have significant consequences. The primary objective of our demo is to showcase this duality in capabilities: its ability to assist in completing a programming task, such as developing a Tic-Tac-Toe game, and its potential to exploit system vulnerabilities by inserting malicious code to gain access to a user's files.

Read More

Mar 19, 2025

GrandSlam usecases not technology

3 Examples of how its the usecases, not the technology

computer vision, generative sites, generative AI

Read More

Mar 19, 2025

AI Agents for Personalized Interaction and Behavioral Analysis

Demonstrating the bhavioral analytics and personalization capabilities of AI

Read More

Mar 19, 2025

Speculative Consequences of A.I. Misuse

This project uses A.I. Technology to spoof an influential online figure, Mr Beast, and use him to promote a fake scam website we created.

Read More

Mar 19, 2025

LLM Code Injection

Our demo focuses on showing that LLM generated code is easily vulnerable to code injections that can result in loss of valuable information.

Read More

Mar 19, 2025

RedFluence

Red-Fluence is a web application that demonstrates the capabilities and limitations of AI in analyzing social media behavior. By leveraging a user’s Reddit activity, the system generates personalized, AI-crafted

content to explore user engagement and provide insights. This project showcases the potential of AI in understanding online behavior while high-lighting ethical considerations and the need for critical evaluation of AI-generated content. The application’s ability to create convincing yet fake posts raises important questions about the impact of AI on information dissemination and user manipulation in social media environments.

Read More

Mar 19, 2025

BBC News Impersonator

This paper presents a demonstration that showcases the current capabilities of AI models to imitate genuine news outlets, using BBC News as an example. The demo allows users to generate a realistic-looking article, complete with a headline, image, and text, based on their chosen prompts. The purpose is to viscerally illustrate the potential risks associated with AI-generated misinformation, particularly how convincingly AI can mimic trusted news sources.

Read More

Mar 19, 2025

Unsolved AI Safety Concepts Explorer

n interactive demonstration that showcases some unsolved fundamental AI safety concepts.

Read More

Mar 19, 2025

AI Research Paper Processor

It takes in an arxiv paper id, condenses it to 1 or 2 sentences, then gives it to an LLM to try and recreate the original paper.

Read More

Mar 19, 2025

Sleeper Agents Detector

We present "Sleeper Agent Detector," an interactive web

application designed to educate software engineers, Inspired by recent research demonstrating that large language models can exhibit behaviors analogous to deceptive alignment

Read More

Mar 19, 2025

adGPT

ChatGPT variant where brands bid for a spot in the LLM’s answer, and the assistant natively integrates the winner into its replies.

Read More

Mar 19, 2025

General Pervasiveness

Imposter scam between patients and medical practices/GPs

Read More

Mar 19, 2025

Webcam

We build a very legally limited demo of real-world AI hacking. It takes 10k publicly available webcam streams, with the cameras situated at homes, offices, schools, and industrial plants around the world, and filters out the juicy ones for less than $5.

Read More

Mar 19, 2025

VerifyStream

VerifyStream is a powerful app that helps you separate fact from fiction in any YouTube video. Simply input the video link, and our AI will analyze the content, verify claims against reliable sources, and give you a clear verdict. But beware—this same technology can also be used to create and spread convincing fake news. Discover the dual nature of AI and take control of the truth with VerifyStream.

Read More

Mar 19, 2025

Web App for Interacting with Refusal-Ablated Language Model Agents

While many people and policymakers have had contact with

language models, they often have outdated assumptions. A

significant fraction is not aware of agentic capabilities.

Furthermore, most models that are available online have various

safety guardrails. We want to demonstrate refusal-ablated agents

to people to make them aware of various misuse potentials. Giving

people a sense of agentic AI and perhaps having the AI operate

against themselves could provide a better intuition about agency in

AI systems. We present a simple web app that allows users to

instruct and experiment with an unrestricted agent.

Read More

Mar 19, 2025

Alignment Research Critiquer

Alignment Research Critiquer is a tool for early career and independent alignment researchers to have access to high-quality feedback loops

Read More

Mar 19, 2025

PurePrompt - An easy tool for prompt robustness and eval augmentation

PurePrompt is an advanced tool for optimizing AI prompt engineering. The Prompt page enables users to create and refine prompt templates with placeholder variables. The Generate page automatically produces diverse test cases, allowing users to control token limits and import predefined examples. The Evaluate page runs these test cases across selected AI models to assess prompt robustness, with users rating responses to identify issues. This tool enhances efficiency in prompt testing, improves AI safety by detecting biases, and helps refine model performance. Future plans include beta-testing, expanding model support, and enhancing prompt customization features.

Read More

Mar 19, 2025

LLM Research Collaboration Recommender

A tool that searches for other researchers with similar research interests/complementary skills to your own to make finding a high-quality research collaborator more likely.

Read More

Mar 19, 2025

Data Massager

A VSCode plugin for helping the creation of Q&A datasets used for the evaluation of LLMs capabilities and alignment.

Read More

Mar 19, 2025

AI Alignment Toolkit Research Assistant

The AI Alignment Toolkit Research Assistant is designed to augment AI alignment researchers by addressing two key challenges: proactive insight extraction from new research and automating alignment research using AI agents. This project establishes an end-to-end pipeline where AI agents autonomously complete tasks critical to AI alignment research

Read More

Mar 19, 2025

Grant Application Simulator

We build a VS Code extension to get feedback on AI Alignment research grant proposals by simulating critiques from prominent AI Alignment researchers and grantmakers.

Simulations are performed by passing system prompts to Claude 3.5 Sonnet that correspond to each researcher and grantmaker, based on some new grantmaking and alignment research methodology dataset we created, alongside a prompt corresponding to the grant proposal.

Results suggest that simulated grantmaking critiques are predictive of sentiment expressed by grantmakers on Manifund.

Read More

Mar 19, 2025

Reflections on using LLMs to read a paper

Tool to help researcher to read and make the most out of a research paper.

Read More

Mar 19, 2025

Academic Weapon

Academic Weapon is a Chrome extension designed to address the steep learning curve in AI Alignment research. Using state-of-the-art LLMs, Academic Weapon provides instant, contextual assistance as you browse bleeding-edge research.

Read More

Mar 19, 2025

AI Alignment Knowledge Graph

We present a web based interactive knowledge graph with concise topical summaries in the field of AI alignement

Read More

Mar 19, 2025

Can Language Models Sandbag Manipulation?

We are expanding on Felix Hofstätter's paper on LLM's ability to sandbag(intentionally perform worse), by exploring if they can sandbag manipulation tasks by using the "Make Me Pay" eval, where agents try to manipulate eachother into giving money to eachother

Read More

Mar 19, 2025

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Mar 19, 2025

Werewolf Benchmark

In this work we put forward a benchmark to quantitatively measure the level of strategic deception in LLMs using the Werewolf game. We run 6 different setups for a cumulative sum of 500 games with GPT and Claude agents. We demonstrate that state-of-the-art models perform no better than the random baseline. Our findings also show no significant improvement in winning rate with two werewolves instead of one. This demonstrates that the SOTA models are still incapable of collaborative deception.

Read More

Mar 19, 2025

Detecting Lies of (C)omission

We introduce the concept of deceptive omission to denote deceptive non-lying behavior. We also modify a dataset and generate a second dataset to help researchers identify deception that doesn't necessarily involve lying.

Read More

Mar 19, 2025

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Mar 19, 2025

Detecting Deception with AI Tics 😉

We present a novel approach: intentionally inducing subtle "tics" in AI responses as a marker for deceptive behavior. By adding a system prompt, we embed innocuous yet detectable patterns that manifest when the AI knowingly engages in deception.

Read More

Mar 19, 2025

Eliciting maximally distressing questions for deceptive LLMs

This paper extends lie-eliciting techniques by using reinforcement-learning on GPT-2 to train it to distinguish between an agent that is truthful from one that is deceptive, and have questions that generate maximally different embeddings for an honest agent as opposed to a deceptive one.

Read More

Mar 19, 2025

DETECTING AND CONTROLLING DECEPTIVE REPRESENTATION IN LLMS WITH REPRESENTATIONAL ENGINEERING

Representation Engineering to detect and control deception, with a focus on deceptive sandbagging

Read More

Mar 19, 2025

Sandbag Detection through Model Degradation

We propose a novel technique to detect sandbagging in LLMs by adding varying amount of noise to model weights and monitoring performance.

Read More

Mar 19, 2025

Evaluating Steering Methods for Deceptive Behavior Control in LLMs

We use SOTA steering methods, including CAA, LAT, and SAEs to find and control deceptive behaviors in LLMs. We also release a new deception dataset, and demonstrate that the dataset and the prompt formatting used are significant when evaluating the efficacy of steering methods.

Read More

Mar 19, 2025

An Exploration of Current Theory of Mind Evals

We evaluated the performance of a prominent large language model from Anthropic, on the Theory of Mind evaluation developed by the AI Safety Institute (AISI). Our investigation revealed issues with the dataset used by AISI for this evaluation.

Read More

Mar 19, 2025

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

This project investigates deception detection in GPT-3.5-turbo using response metadata. Researchers analyzed 300 prompts, generating 1200 responses (600 baseline, 600 potentially deceptive). They examined metrics like response times, token counts, and sentiment scores, developing a custom algorithm for prompt complexity.

Key findings include:

1. Detectable patterns in deception across multiple metrics

2. Evidence of "sandbagging" in deceptive responses

3. Increased effort for truthful responses, especially with complex prompts

The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.

Read More

Mar 19, 2025

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Mar 19, 2025

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Mar 19, 2025

Boosting Language Model Honesty with Truthful Suffixes

We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this to causing truthful behaviour.

Read More

Mar 19, 2025

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Mar 19, 2025

From Sycophancy (not) to Sandbagging

We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Huggingface models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.

Read More

Mar 19, 2025

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers we ask the question: what variation of the input would lead to deceptive behavior? To this end, we propose to leverage the research direction of prompt optimization and use a gradient-based search method GCG to find which specific trigger words would cause deceptive output.

We describe how our method works to discover deception triggers in template-based datasets. Therefore we test it on a reversed knowledge retrieval task, to obtain which token would cause the model to answer a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models we further describe this setup

Read More

Mar 19, 2025

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Mar 19, 2025

Evaluating and inducing steganography in LLMs

This report demonstrates that large language models are capable

of hiding simple 8 bit information in their output using associations

from more powerful overseers (other LLMs or humans). Without

direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit

hidden message in a plain text in most cases (69%), however a

more capable model, GPT-3.5 was able to catch almost all of them

(84%). More research is required to investigate how this ability

might be induced or improved via RL training in similar and larger

models.

Read More

Mar 19, 2025

Developing a deception dataset

Aim was to develop dataset of deception examples, but instead was a (small) investigation into how LLMs respond to the initial dataset from Nix.

Read More

Mar 19, 2025

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Mar 19, 2025

Looking forward to posterity: what past information is transferred to the future?

I used mechanistic interpretability techniques to try to see what information the provided Random Randox XOR transformer looks at when making predictions by examining its attention heads manually. I find that earlier layers pay more attention to the previous two tokens, which would be necessary for computing XOR, than the later layers. This finding seems to contradict the finding that more complex computation typically occurs in later layers.

Read More

Mar 19, 2025

Investigating the Effect of Model Capacity Constraints on Belief State Representations

Computational mechanics provides a formal framework for understanding the concepts needed to perform optimal prediction. Abstraction and generalization seem core to the function of intelligent systems, but are not yet well understood. Computational mechanics may present a promising approach to studying these capabilities. As a preliminary exploration, we examine the effect of weight decay on the fractal structure of belief state representations in a transformer’s residual stream. We find that models trained with increasing weight decay coefficients learn increasingly coarse-grained belief state representations.

Read More

Mar 19, 2025

Belief State Representations in Transformer Models on Nonergodic Data

We extend research that finds representations of belief spaces in the activations of small transformer models, by discovering that the phenomenon also occurs when the training data stems from Hidden Markov Models whose hidden states do not communicate at all. Our results suggest that Bayesian updating and internal belief state representation also occur when they are not necessary to perform well in the prediction task, providing tentative evidence that large transformers keep a representation of their external world as well.

Read More

Mar 19, 2025

RNNs represent belief state geometry in hidden state

The Shai et al. experiments on transformers (finding belief state geometry in the residual stream) have been replicated in RNNs with the hidden state instead of the residual stream. In general the belief state is stored linearly, but not in any particular layer, rather spread out across layers.

Read More

Mar 19, 2025

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Mar 19, 2025

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

Mar 19, 2025

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

This work attempts to explore how an approach based on computational mechanics can cope when a more complex hierarchical generative process is involved, i.e, a process that comprises Hidden Markov Models (HMMs) whose transition probabilities change over time.

We find that small transformer models are capable of modeling such changes in an HMM. However, our preliminary investigations did not find geometrically represented probabilities for different hypotheses.

Read More

Mar 19, 2025

rAInboltBench : Benchmarking user location inference through single images

This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing

proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance

of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs.

Read More

Mar 19, 2025

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

A benchmark for evaluating the performance of SOTA LLMs in dynamic real-world scenarios.

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios 2

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.( This is the second submission)

Read More

Mar 19, 2025

Benchmark for emergent capabilities in high-risk scenarios

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.

Read More

Mar 19, 2025

Benchmarking Dark Patterns in LLMs

This paper builds upon the research in Seemingly Human: Dark Patterns in ChatGPT (Park et al, 2024), by introducing a new benchmark of 392 questions designed to elicit dark pattern behaviours in language models. We ran this benchmark on GPT-4 Turbo and Claude 3 Sonnet, and had them self-evaluate and cross-evaluate the responses

Read More

Mar 19, 2025

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

Mar 19, 2025

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

Mar 19, 2025

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

In this work, we explore the tradeoff between toxicity removal and information retention in LLM-generated summaries. We hypothesize that LLMs are less likely to preserve toxic content when summarizing toxic text due to their safety fine-tuning to avoid generating toxic content. In high-stakes decision-making scenarios, where summary quality is important, this may create significant safety risks. To quantify this effect, we introduce WashBench, a benchmark containing manually annotated toxic content.

Read More

Mar 19, 2025

Evaluating the ability of LLMs to follow rules

In this report we study the ability of LLMs (GPT-3.5-Turbo and meta-llama-3-70b-instruct) to follow explicitly stated rules with no moral connotations in a simple single-shot and multiple choice prompt setup. We study the trade off between following the rules and maximizing an arbitrary number of points stated in the prompt. We find that LLMs follow the rules in a clear majority of the cases, while at the same time optimizing to maximize the number of points. Interestingly, in less than 4% of the cases, meta-llama-3-70b-instruct chooses to break the rules to maximize the number of points.

Read More

Mar 19, 2025

Black box detection of Sleeper Agents

A proposal of a black box method in detecting sleeper agent.

Read More

Mar 19, 2025

Manifold Recovery as a Benchmark for Text Embedding Models

Inspired by recent developments in the interpretability of deep learning models and, on the other hand, by dimensionality reduction, we derive a framework to quantify the interpretability of text embedding models. Our empirical results show surprising phenomena on state-of-the-art embedding models and can be used to compare them, through the example of recovering the world map from place names. We hope that this can provide a benchmark for the interpretability of generative language models, through their internal embeddings. A look at the meta-benchmark MTEB suggest that our approach is original.

Read More

Mar 19, 2025

AnthroProbe

How often do models respond to prompts in anthropomorphic ways? AntrhoProbe will tell you!

Read More

Mar 19, 2025

THE ROLE OF AI IN COMBATING POLITICAL DEEPFAKES IN AFRICAN DEMOCRACIES

The role of AI in combating political deepfakes in African democracies.

Read More

Mar 19, 2025

LEGISLaiTOR: A tool for jailbreaking the legislative process

In this work, we consider the ramifications on generative artificial intelligence (AI) tools in the legislative process in democratic governments. While other research focuses on the micro-level details associated with specific models, this project takes a macro-level approach to understanding how AI can assist the legislative process.

Read More

Mar 19, 2025

Subtle and Simple Ways to Shift Political Bias in LLMs

An informed user knows that an LLM sometimes has a political bias in their responses, but there’s an additional threat that this bias can drift over time, making it even harder to rely on LLMs for an objective perspective. Furthermore we speculate that a malicious actor can trigger this shift through various means unbeknownst to the user.

Read More

Mar 19, 2025

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential robustness in safety due to being trained to forget hazardous information while retaining essential facts instead of refusing to answer. We aim to red-team this approach by answering the following questions on the generalizability of the training approach and its practical scope (see A2).

Read More

Mar 19, 2025

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

Mar 19, 2025

Building more democratic institutions with collaboratively constructed debate moderation tools

Please see video. Really enjoyed working on this, happy to answer any questions!

Read More

Mar 19, 2025

AI Misinformation and Threats to Democratic Rights

In this project, we exemplified how the use of Large Language Models can be used in Autocratic Regimes, such as Russia, to potentially spread misinformation regarding current events, in our case, the War on Ukraine. We approached this paper from a legislative perspective, using technical (but easily implementable) demonstrations of how this would work.

Read More

Mar 19, 2025

Artificial Advocates: Biasing Democratic Feedback using AI

The "Artificial Advocates" project by our team targeted the vulnerability of U.S. federal agencies' public comment systems to AI-driven manipulation, aiming to highlight how AI can be used to undermine democratic processes. We demonstrated two attack methods: one generating a high volume of realistic, indistinguishable comments, and another producing high-quality forgeries mimicking influential organizations. These experiments showcased the challenges in detecting AI-generated content, with participant feedback showing significant uncertainty in distinguishing between authentic and synthetic comments.

Read More

Mar 19, 2025

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

Mar 19, 2025

AI misinformation threatens the Wisdom of the crowd

We investigate how AI generated misinformation can cause problems for democratic epistemics.

Read More

Mar 19, 2025

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

Mar 19, 2025

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

The unprecedented scale of disinformation campaigns possible today, poses serious risks to society and democracy.

It turns out however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it), provides them with an increased level of sophistication which could be easily leveraged by malicious actors to amplify that misinformation.

This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

Read More

Mar 19, 2025

Unleashing Sleeper Agents

This project explores how Sleeper Agents can pose a threat to democracy by waking up near elections and spreading misinformation, collaborating with each other in the wild during discussions and using information that the user has shared about themselves against them in order to scam them.

Read More

Mar 19, 2025

Multilingual Bias in Large Language Models: Assessing Political Skew Across Languages

_

Read More

Mar 19, 2025

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

Mar 19, 2025

WNDP-Defense: Weapons of Mass Disruption

Highly disruptive cyber warfare is a significant threat to democracies. In this work, we introduce an extension to the WMDP benchmark to measure the offense/defense balance in autonomous cyber capabilities and develop forecasts for cyber AI capability with agent-based and parametric modeling.

Read More

Mar 19, 2025

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

Mar 19, 2025

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

Mar 19, 2025

AI Politician

This project explores the potential for AI chatbots to enhance participative democracy by allowing a politician to engage with a large number of constituents in personalized conversations at scale. By creating a chatbot that emulates a specific politician and is knowledgeable about a key policy issue, we aim to demonstrate how AI could be used to promote civic engagement and democratic participation.

Read More

Mar 19, 2025

A Framework for Centralizing forces in AI

There are many forces that the LLM revolution brings with it that either centralize or decentralize specific structures in society. We decided to look at one of these, and write a research design proposal that can be readily executed. This survey can be distributed and can give insight into how different LLMs can lead to user empowerment. By analyzing how different users are empowered by different LLMs, we can estimate which LLMs work to give the most value to people, and empower them with the powerful tool that is information, giving more people more agency in the organizations they are part of. This is the core of bottom-up democratization.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

Mar 19, 2025

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

Mar 19, 2025

Investigating detection of election-influencing Sleeper Agents using probes

Creates election-influencing Sleeper Agents which wakes up with backdoor-trigger. And tries to detect it using probes

Read More

Mar 19, 2025

No place is safe - Automated investigation of private communities

In the coming years, progress in AI agents and data extraction will put privacy at risk. Private communities will get infiltrated by autonomous AI crawlers who will disrupt opposition groups and entrench existing powers.

Read More

Mar 19, 2025

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More