APART RESEARCH

Pilot Experiments

Explore all hackathon research and pilot experiments

Apr 15, 2025

CalSandbox and the CLEAR AI Act: A State-Led Vision for Responsible AI Governance

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

California Law for Ethical and Accountable Regulation of Artificial Intelligence (CLEAR AI) Act

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

Recommendation to Establish the California AI Accountability and Redress Act

We recommend that California enact the California AI Accountability and Redress Act (CAARA) to address emerging harms from the widespread deployment of generative AI, particularly large language models (LLMs), in public and commercial systems.

This policy would create the California AI Incident and Risk Registry (CAIRR) under the California Department of Technology to transparently track post-deployment failures. CAIRR would classify model incidents by severity, triggering remedies when a system meets the criteria for an "Unacceptable Failure" (e.g., one Critical incident or three unresolved Moderate incidents).

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Read More

Apr 14, 2025

FlexHEG Devices to Enable Implementation of AI IntelSat

The project proposes using Flexible Hardware Enabled Governance (FlexHEG) devices to support the IntelSat model for international AI governance. This framework aims to balance technological progress with responsible development by implementing tamper-evident hardware systems that enable reliable monitoring and verification between participating members. Two key policy approaches are outlined: 1) Treasury Department tax credits to incentivize companies to adopt FlexHEG-compliant hardware, and 2) NSF technical assistance grants to help smaller organizations implement these systems, preventing barriers to market entry while ensuring broad participation in the governance framework. The proposal builds on the successful Intelsat model (1964-2001) which balanced US leadership with international participation through weighted voting.

Read More

Apr 14, 2025

Smart Governance, Safer Innovation: A California AI Sandbox with Guardrails

California is home to global AI innovation, yet it also leads in regulatory experimentation with landmark bills like SB-1047 and SB-53. These laws establish a rigorous compliance regime for developers of powerful frontier models, including mandates for shutdown mechanisms, third-party audits, and whistle-blower protections. While crucial for public safety, these measures may unintentionally sideline academic researchers and startups who cannot meet such thresholds.

To reconcile innovation with public safety, we propose the California AI Sandbox: a public-private test-bed where vetted developers can trial AI systems under tailored regulatory conditions. The sandbox will be aligned with CalCompute infrastructure and enforce key safeguards including third-party audits and SB-53-inspired whistle-blower protection to ensure that flexibility does not come at the cost of safety or ethics.

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Mar 31, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More

Mar 31, 2025

Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting

AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.

We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.

Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.

We made our source code publicly available on GitHub at

github.com/syvb/control-arena

Read More

Mar 31, 2025

Adding Document Summaries to Control Arena

One of the current focuses of AI Control research is finding new settings within which to examine the control problem. In this hackathon, we looked at defining a new AI control setting based around summarizing documents to mislead an overseer. We expect this problem to become more important as AI capabilities are more frequently used for synthesizibg and summarizing information, especially in AI safety research. As such, we worked on adding a new setting to the UKAISI Control Arena repository. The current state is a functional Inpect eval where redteams and blueteams could iterate on different strategies. However, the setting is not integrated into the Control Arena framework and only the most basic redteam & blueteam strategies have been implmeneted. Future work would move protocol and policy code into the existing Control Arena baseline, enable native pdf support, iterate further adversarially, and streamline AI calls which can be made asynchronously.

Read More

Mar 31, 2025

If Everything’s Suspicious, Nothing Is

We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.

Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.

Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.

Read More

Mar 31, 2025

Can models use their Chain-of-Thought to attack overseers?

This project explores vulnerabilities in AI evaluation mechanisms, specifically focusing on how AI agents might influence their overseeing AI judges through manipulative instructions embedded within their Chain-of-Thought (CoT) reasoning. The research involved experiments with DeepSeek-R1 conditioned to insert subtle directives aimed at deceiving evaluation models. Results showed significant variation in the robustness of different evaluator models against this manipulation, with some models, like Claude 3.5 Sonnet, resisting influence effectively, while others, like Llama 3.3 70B and Gemini 2.0 Flash, proved highly susceptible. The study highlights critical concerns regarding AI evaluation integrity, recommending improvements in maintaining consistent and principled evaluation boundaries.

Read More

Mar 31, 2025

Interactive Monitoring

This project investigates whether interactive monitoring strategies can improve the cost-performance frontier in AI control protocols. Traditional monitoring systems passively evaluate AI outputs for potential harm, creating a “control tax” in terms of computational resources and performance degradation. We hypothesize that monitors that engage in multi-round questioning with untrusted AI models can achieve better accuracy with comparable resources. Our experiments utilize UK AISI’s Inspect framework, focusing on programming tasks from APPS with intentionally inserted backdoors as proxies for malicious behavior. Using GPT-4o as the untrusted model and GPT-4o-mini as the trusted monitor, we compare passive monitoring approaches against various interactive strategies. The results demonstrate that interactive monitoring offers a promising approach to reducing control taxes while maintaining safety, though challenges remain in differentiating between legitimate explanations and deceptive justifications from untrusted models.

Read More

Mar 30, 2025

Exploration Chat-Based Social Engineering

With the advancement of AI Agent technologies, language models have increasingly demonstrated human-like characteristics, particularly in applications involving companionship and psychological counseling. As these models become more proficient in simulating human conversation, new social engineering attack strategies have emerged in the domain of fraud. Malicious actors can now exploit large language models (LLMs) in conjunction with publicly available user information to engage in highly personalized dialogue. Once a sufficient level of familiarity is established, these interactions may lead to phishing attempts or the extraction of sensitive personal data. This study proposes a method for investigating social engineering attacks driven by language models, referred to as ECSE (Exploring Chat-based Social Engineering). We utilize several open-source models—GPT-4o, GPT-4o-mini, LLaMA 3.1, and DeepSeek-V3—as the foundation for this framework. Through prompt engineering techniques, we collect experimental data in a sandbox to evaluate the conversational capability and operational efficiency of these models within a static social context.

Read More

Mar 10, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 10, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 10, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 10, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 10, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 10, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 10, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 10, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 10, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 10, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 10, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 10, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 10, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 10, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 10, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Jan 24, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Jan 20, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Jan 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Jan 20, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Jan 19, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Nov 26, 2024

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Nov 25, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Nov 25, 2024

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Nov 25, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Nov 25, 2024

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Nov 25, 2024

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Nov 25, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Nov 25, 2024

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Nov 25, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Nov 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Nov 24, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Nov 24, 2024

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Nov 21, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Nov 21, 2024

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Nov 21, 2024

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Nov 21, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Nov 20, 2024

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Nov 19, 2024

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Oct 27, 2024

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Oct 27, 2024

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Oct 27, 2024

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Oct 27, 2024

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Oct 27, 2024

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Oct 27, 2024

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Oct 27, 2024

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Oct 27, 2024

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Oct 7, 2024

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Oct 6, 2024

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Sep 17, 2024

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Sep 2, 2024

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Sep 2, 2024

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Sep 1, 2024

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Sep 1, 2024

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Jul 1, 2024

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Jul 1, 2024

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Jul 1, 2024

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Jul 1, 2024

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Jul 1, 2024

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Jun 3, 2024

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Jun 3, 2024

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Jun 3, 2024

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

May 27, 2024

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

May 6, 2024

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

May 6, 2024

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

May 6, 2024

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

May 5, 2024

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

May 5, 2024

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

May 5, 2024

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

May 4, 2024

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Apr 15, 2025

CalSandbox and the CLEAR AI Act: A State-Led Vision for Responsible AI Governance

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

California Law for Ethical and Accountable Regulation of Artificial Intelligence (CLEAR AI) Act

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

Recommendation to Establish the California AI Accountability and Redress Act

We recommend that California enact the California AI Accountability and Redress Act (CAARA) to address emerging harms from the widespread deployment of generative AI, particularly large language models (LLMs), in public and commercial systems.

This policy would create the California AI Incident and Risk Registry (CAIRR) under the California Department of Technology to transparently track post-deployment failures. CAIRR would classify model incidents by severity, triggering remedies when a system meets the criteria for an "Unacceptable Failure" (e.g., one Critical incident or three unresolved Moderate incidents).

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Read More

Apr 14, 2025

FlexHEG Devices to Enable Implementation of AI IntelSat

The project proposes using Flexible Hardware Enabled Governance (FlexHEG) devices to support the IntelSat model for international AI governance. This framework aims to balance technological progress with responsible development by implementing tamper-evident hardware systems that enable reliable monitoring and verification between participating members. Two key policy approaches are outlined: 1) Treasury Department tax credits to incentivize companies to adopt FlexHEG-compliant hardware, and 2) NSF technical assistance grants to help smaller organizations implement these systems, preventing barriers to market entry while ensuring broad participation in the governance framework. The proposal builds on the successful Intelsat model (1964-2001) which balanced US leadership with international participation through weighted voting.

Read More

Apr 14, 2025

Smart Governance, Safer Innovation: A California AI Sandbox with Guardrails

California is home to global AI innovation, yet it also leads in regulatory experimentation with landmark bills like SB-1047 and SB-53. These laws establish a rigorous compliance regime for developers of powerful frontier models, including mandates for shutdown mechanisms, third-party audits, and whistle-blower protections. While crucial for public safety, these measures may unintentionally sideline academic researchers and startups who cannot meet such thresholds.

To reconcile innovation with public safety, we propose the California AI Sandbox: a public-private test-bed where vetted developers can trial AI systems under tailored regulatory conditions. The sandbox will be aligned with CalCompute infrastructure and enforce key safeguards including third-party audits and SB-53-inspired whistle-blower protection to ensure that flexibility does not come at the cost of safety or ethics.

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Mar 31, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More

Mar 31, 2025

Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting

AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.

We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.

Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.

We made our source code publicly available on GitHub at

github.com/syvb/control-arena

Read More

Mar 31, 2025

Adding Document Summaries to Control Arena

One of the current focuses of AI Control research is finding new settings within which to examine the control problem. In this hackathon, we looked at defining a new AI control setting based around summarizing documents to mislead an overseer. We expect this problem to become more important as AI capabilities are more frequently used for synthesizibg and summarizing information, especially in AI safety research. As such, we worked on adding a new setting to the UKAISI Control Arena repository. The current state is a functional Inpect eval where redteams and blueteams could iterate on different strategies. However, the setting is not integrated into the Control Arena framework and only the most basic redteam & blueteam strategies have been implmeneted. Future work would move protocol and policy code into the existing Control Arena baseline, enable native pdf support, iterate further adversarially, and streamline AI calls which can be made asynchronously.

Read More

Mar 31, 2025

If Everything’s Suspicious, Nothing Is

We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.

Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.

Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.

Read More

Mar 31, 2025

Can models use their Chain-of-Thought to attack overseers?

This project explores vulnerabilities in AI evaluation mechanisms, specifically focusing on how AI agents might influence their overseeing AI judges through manipulative instructions embedded within their Chain-of-Thought (CoT) reasoning. The research involved experiments with DeepSeek-R1 conditioned to insert subtle directives aimed at deceiving evaluation models. Results showed significant variation in the robustness of different evaluator models against this manipulation, with some models, like Claude 3.5 Sonnet, resisting influence effectively, while others, like Llama 3.3 70B and Gemini 2.0 Flash, proved highly susceptible. The study highlights critical concerns regarding AI evaluation integrity, recommending improvements in maintaining consistent and principled evaluation boundaries.

Read More

Mar 31, 2025

Interactive Monitoring

This project investigates whether interactive monitoring strategies can improve the cost-performance frontier in AI control protocols. Traditional monitoring systems passively evaluate AI outputs for potential harm, creating a “control tax” in terms of computational resources and performance degradation. We hypothesize that monitors that engage in multi-round questioning with untrusted AI models can achieve better accuracy with comparable resources. Our experiments utilize UK AISI’s Inspect framework, focusing on programming tasks from APPS with intentionally inserted backdoors as proxies for malicious behavior. Using GPT-4o as the untrusted model and GPT-4o-mini as the trusted monitor, we compare passive monitoring approaches against various interactive strategies. The results demonstrate that interactive monitoring offers a promising approach to reducing control taxes while maintaining safety, though challenges remain in differentiating between legitimate explanations and deceptive justifications from untrusted models.

Read More

Mar 30, 2025

Exploration Chat-Based Social Engineering

With the advancement of AI Agent technologies, language models have increasingly demonstrated human-like characteristics, particularly in applications involving companionship and psychological counseling. As these models become more proficient in simulating human conversation, new social engineering attack strategies have emerged in the domain of fraud. Malicious actors can now exploit large language models (LLMs) in conjunction with publicly available user information to engage in highly personalized dialogue. Once a sufficient level of familiarity is established, these interactions may lead to phishing attempts or the extraction of sensitive personal data. This study proposes a method for investigating social engineering attacks driven by language models, referred to as ECSE (Exploring Chat-based Social Engineering). We utilize several open-source models—GPT-4o, GPT-4o-mini, LLaMA 3.1, and DeepSeek-V3—as the foundation for this framework. Through prompt engineering techniques, we collect experimental data in a sandbox to evaluate the conversational capability and operational efficiency of these models within a static social context.

Read More

Mar 10, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 10, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 10, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 10, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 10, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 10, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 10, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 10, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 10, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 10, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 10, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 10, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 10, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 10, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 10, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Jan 24, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Jan 20, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Jan 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Jan 20, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Jan 19, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Nov 26, 2024

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Nov 25, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Nov 25, 2024

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Nov 25, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Nov 25, 2024

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Nov 25, 2024

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Nov 25, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Nov 25, 2024

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Nov 25, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Nov 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Nov 24, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Nov 24, 2024

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Nov 21, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Nov 21, 2024

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Nov 21, 2024

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Nov 21, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Nov 20, 2024

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Nov 19, 2024

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Oct 27, 2024

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Oct 27, 2024

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Oct 27, 2024

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Oct 27, 2024

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Oct 27, 2024

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Oct 27, 2024

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Oct 27, 2024

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Oct 27, 2024

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Oct 7, 2024

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Oct 6, 2024

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Sep 17, 2024

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Sep 2, 2024

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Sep 2, 2024

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Sep 1, 2024

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Sep 1, 2024

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Jul 1, 2024

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Jul 1, 2024

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Jul 1, 2024

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Jul 1, 2024

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Jul 1, 2024

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Jun 3, 2024

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Jun 3, 2024

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Jun 3, 2024

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

May 27, 2024

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

May 6, 2024

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

May 6, 2024

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

May 6, 2024

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

May 5, 2024

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

May 5, 2024

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

May 5, 2024

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

May 4, 2024

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Apr 15, 2025

CalSandbox and the CLEAR AI Act: A State-Led Vision for Responsible AI Governance

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

California Law for Ethical and Accountable Regulation of Artificial Intelligence (CLEAR AI) Act

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

Recommendation to Establish the California AI Accountability and Redress Act

We recommend that California enact the California AI Accountability and Redress Act (CAARA) to address emerging harms from the widespread deployment of generative AI, particularly large language models (LLMs), in public and commercial systems.

This policy would create the California AI Incident and Risk Registry (CAIRR) under the California Department of Technology to transparently track post-deployment failures. CAIRR would classify model incidents by severity, triggering remedies when a system meets the criteria for an "Unacceptable Failure" (e.g., one Critical incident or three unresolved Moderate incidents).

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Read More

Apr 14, 2025

FlexHEG Devices to Enable Implementation of AI IntelSat

The project proposes using Flexible Hardware Enabled Governance (FlexHEG) devices to support the IntelSat model for international AI governance. This framework aims to balance technological progress with responsible development by implementing tamper-evident hardware systems that enable reliable monitoring and verification between participating members. Two key policy approaches are outlined: 1) Treasury Department tax credits to incentivize companies to adopt FlexHEG-compliant hardware, and 2) NSF technical assistance grants to help smaller organizations implement these systems, preventing barriers to market entry while ensuring broad participation in the governance framework. The proposal builds on the successful Intelsat model (1964-2001) which balanced US leadership with international participation through weighted voting.

Read More

Apr 14, 2025

Smart Governance, Safer Innovation: A California AI Sandbox with Guardrails

California is home to global AI innovation, yet it also leads in regulatory experimentation with landmark bills like SB-1047 and SB-53. These laws establish a rigorous compliance regime for developers of powerful frontier models, including mandates for shutdown mechanisms, third-party audits, and whistle-blower protections. While crucial for public safety, these measures may unintentionally sideline academic researchers and startups who cannot meet such thresholds.

To reconcile innovation with public safety, we propose the California AI Sandbox: a public-private test-bed where vetted developers can trial AI systems under tailored regulatory conditions. The sandbox will be aligned with CalCompute infrastructure and enforce key safeguards including third-party audits and SB-53-inspired whistle-blower protection to ensure that flexibility does not come at the cost of safety or ethics.

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Mar 31, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More

Mar 31, 2025

Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting

AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.

We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.

Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.

We made our source code publicly available on GitHub at

github.com/syvb/control-arena

Read More

Mar 31, 2025

Adding Document Summaries to Control Arena

One of the current focuses of AI Control research is finding new settings within which to examine the control problem. In this hackathon, we looked at defining a new AI control setting based around summarizing documents to mislead an overseer. We expect this problem to become more important as AI capabilities are more frequently used for synthesizibg and summarizing information, especially in AI safety research. As such, we worked on adding a new setting to the UKAISI Control Arena repository. The current state is a functional Inpect eval where redteams and blueteams could iterate on different strategies. However, the setting is not integrated into the Control Arena framework and only the most basic redteam & blueteam strategies have been implmeneted. Future work would move protocol and policy code into the existing Control Arena baseline, enable native pdf support, iterate further adversarially, and streamline AI calls which can be made asynchronously.

Read More

Mar 31, 2025

If Everything’s Suspicious, Nothing Is

We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.

Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.

Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.

Read More

Mar 31, 2025

Can models use their Chain-of-Thought to attack overseers?

This project explores vulnerabilities in AI evaluation mechanisms, specifically focusing on how AI agents might influence their overseeing AI judges through manipulative instructions embedded within their Chain-of-Thought (CoT) reasoning. The research involved experiments with DeepSeek-R1 conditioned to insert subtle directives aimed at deceiving evaluation models. Results showed significant variation in the robustness of different evaluator models against this manipulation, with some models, like Claude 3.5 Sonnet, resisting influence effectively, while others, like Llama 3.3 70B and Gemini 2.0 Flash, proved highly susceptible. The study highlights critical concerns regarding AI evaluation integrity, recommending improvements in maintaining consistent and principled evaluation boundaries.

Read More

Mar 31, 2025

Interactive Monitoring

This project investigates whether interactive monitoring strategies can improve the cost-performance frontier in AI control protocols. Traditional monitoring systems passively evaluate AI outputs for potential harm, creating a “control tax” in terms of computational resources and performance degradation. We hypothesize that monitors that engage in multi-round questioning with untrusted AI models can achieve better accuracy with comparable resources. Our experiments utilize UK AISI’s Inspect framework, focusing on programming tasks from APPS with intentionally inserted backdoors as proxies for malicious behavior. Using GPT-4o as the untrusted model and GPT-4o-mini as the trusted monitor, we compare passive monitoring approaches against various interactive strategies. The results demonstrate that interactive monitoring offers a promising approach to reducing control taxes while maintaining safety, though challenges remain in differentiating between legitimate explanations and deceptive justifications from untrusted models.

Read More

Mar 30, 2025

Exploration Chat-Based Social Engineering

With the advancement of AI Agent technologies, language models have increasingly demonstrated human-like characteristics, particularly in applications involving companionship and psychological counseling. As these models become more proficient in simulating human conversation, new social engineering attack strategies have emerged in the domain of fraud. Malicious actors can now exploit large language models (LLMs) in conjunction with publicly available user information to engage in highly personalized dialogue. Once a sufficient level of familiarity is established, these interactions may lead to phishing attempts or the extraction of sensitive personal data. This study proposes a method for investigating social engineering attacks driven by language models, referred to as ECSE (Exploring Chat-based Social Engineering). We utilize several open-source models—GPT-4o, GPT-4o-mini, LLaMA 3.1, and DeepSeek-V3—as the foundation for this framework. Through prompt engineering techniques, we collect experimental data in a sandbox to evaluate the conversational capability and operational efficiency of these models within a static social context.

Read More

Mar 10, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 10, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 10, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 10, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 10, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 10, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 10, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 10, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 10, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 10, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 10, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 10, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 10, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 10, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 10, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Jan 24, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Jan 20, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Jan 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Jan 20, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Jan 19, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Nov 26, 2024

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Nov 25, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Nov 25, 2024

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Nov 25, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Nov 25, 2024

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Nov 25, 2024

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Nov 25, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Nov 25, 2024

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Nov 25, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Nov 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Nov 24, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Nov 24, 2024

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Nov 21, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Nov 21, 2024

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Nov 21, 2024

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Nov 21, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Nov 20, 2024

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Nov 19, 2024

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Oct 27, 2024

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Oct 27, 2024

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Oct 27, 2024

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Oct 27, 2024

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Oct 27, 2024

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Oct 27, 2024

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Oct 27, 2024

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Oct 27, 2024

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Oct 7, 2024

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Oct 6, 2024

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Sep 17, 2024

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Sep 2, 2024

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Sep 2, 2024

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Sep 1, 2024

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Sep 1, 2024

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Jul 1, 2024

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Jul 1, 2024

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Jul 1, 2024

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Jul 1, 2024

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Jul 1, 2024

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Jun 3, 2024

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Jun 3, 2024

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Jun 3, 2024

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

May 27, 2024

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

May 6, 2024

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

May 6, 2024

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

May 6, 2024

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

May 5, 2024

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

May 5, 2024

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

May 5, 2024

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

May 4, 2024

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More

Apr 15, 2025

CalSandbox and the CLEAR AI Act: A State-Led Vision for Responsible AI Governance

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

California Law for Ethical and Accountable Regulation of Artificial Intelligence (CLEAR AI) Act

The CLEAR AI Act proposes a balanced, forward-looking framework for AI governance in California that protects public safety without stifling innovation. Centered around CalSandbox, a supervised testbed for high-risk AI experimentation, and a Responsible AI Incentive Program, the Act empowers developers to build ethical systems through both regulatory support and positive reinforcement. Drawing from international models like the EU AI Act and sandbox initiatives in Singapore and the UK, CLEAR AI also integrates public transparency and equitable infrastructure access to ensure inclusive, accountable innovation. Together, these provisions position California as a global leader in safe, values-driven AI development.

Read More

Apr 15, 2025

Recommendation to Establish the California AI Accountability and Redress Act

We recommend that California enact the California AI Accountability and Redress Act (CAARA) to address emerging harms from the widespread deployment of generative AI, particularly large language models (LLMs), in public and commercial systems.

This policy would create the California AI Incident and Risk Registry (CAIRR) under the California Department of Technology to transparently track post-deployment failures. CAIRR would classify model incidents by severity, triggering remedies when a system meets the criteria for an "Unacceptable Failure" (e.g., one Critical incident or three unresolved Moderate incidents).

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Read More

Apr 14, 2025

FlexHEG Devices to Enable Implementation of AI IntelSat

The project proposes using Flexible Hardware Enabled Governance (FlexHEG) devices to support the IntelSat model for international AI governance. This framework aims to balance technological progress with responsible development by implementing tamper-evident hardware systems that enable reliable monitoring and verification between participating members. Two key policy approaches are outlined: 1) Treasury Department tax credits to incentivize companies to adopt FlexHEG-compliant hardware, and 2) NSF technical assistance grants to help smaller organizations implement these systems, preventing barriers to market entry while ensuring broad participation in the governance framework. The proposal builds on the successful Intelsat model (1964-2001) which balanced US leadership with international participation through weighted voting.

Read More

Apr 14, 2025

Smart Governance, Safer Innovation: A California AI Sandbox with Guardrails

California is home to global AI innovation, yet it also leads in regulatory experimentation with landmark bills like SB-1047 and SB-53. These laws establish a rigorous compliance regime for developers of powerful frontier models, including mandates for shutdown mechanisms, third-party audits, and whistle-blower protections. While crucial for public safety, these measures may unintentionally sideline academic researchers and startups who cannot meet such thresholds.

To reconcile innovation with public safety, we propose the California AI Sandbox: a public-private test-bed where vetted developers can trial AI systems under tailored regulatory conditions. The sandbox will be aligned with CalCompute infrastructure and enforce key safeguards including third-party audits and SB-53-inspired whistle-blower protection to ensure that flexibility does not come at the cost of safety or ethics.

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Mar 31, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More

Mar 31, 2025

Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting

AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.

We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.

Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.

We made our source code publicly available on GitHub at

github.com/syvb/control-arena

Read More

Mar 31, 2025

Adding Document Summaries to Control Arena

One of the current focuses of AI Control research is finding new settings within which to examine the control problem. In this hackathon, we looked at defining a new AI control setting based around summarizing documents to mislead an overseer. We expect this problem to become more important as AI capabilities are more frequently used for synthesizibg and summarizing information, especially in AI safety research. As such, we worked on adding a new setting to the UKAISI Control Arena repository. The current state is a functional Inpect eval where redteams and blueteams could iterate on different strategies. However, the setting is not integrated into the Control Arena framework and only the most basic redteam & blueteam strategies have been implmeneted. Future work would move protocol and policy code into the existing Control Arena baseline, enable native pdf support, iterate further adversarially, and streamline AI calls which can be made asynchronously.

Read More

Mar 31, 2025

If Everything’s Suspicious, Nothing Is

We aim to explore whether code complexity and obfuscation significantly impair the ability to accurately identify and control potentially harmful AI-generated outputs. Trusted monitoring relies on suspiciousness scores to tell them apart, so if the suspiciousness variance is decreased, the monitor might have a more difficult time telling them apart.

Our primary hypothesis was that increased complexity through obfuscation strategies would significantly elevate suspiciousness scores and simultaneously reduce variance, making effective monitoring increasingly difficult. We evaluated this by creating a novel red-team strategy leveraging a Python obfuscator within the APPS environment, testing three attacker models (GPT-4o, Claude 3.7 Sonnet, and Claude 3.5 Haiku) across multiple prompt-engineering configurations.

Our key findings indicate that obfuscation notably increased suspiciousness scores and consistently reduced variance across all tested models, making it difficult for a monitor to identify backdoored code. Particularly significant was Claude 3.7 Sonnet's increased vulnerability to backdoor insertion when complexity was introduced. These findings validate our hypothesis and underscore the urgent need for more sophisticated and adaptive monitoring strategies capable of countering advanced obfuscation techniques in AI-generated content.

Read More

Mar 31, 2025

Can models use their Chain-of-Thought to attack overseers?

This project explores vulnerabilities in AI evaluation mechanisms, specifically focusing on how AI agents might influence their overseeing AI judges through manipulative instructions embedded within their Chain-of-Thought (CoT) reasoning. The research involved experiments with DeepSeek-R1 conditioned to insert subtle directives aimed at deceiving evaluation models. Results showed significant variation in the robustness of different evaluator models against this manipulation, with some models, like Claude 3.5 Sonnet, resisting influence effectively, while others, like Llama 3.3 70B and Gemini 2.0 Flash, proved highly susceptible. The study highlights critical concerns regarding AI evaluation integrity, recommending improvements in maintaining consistent and principled evaluation boundaries.

Read More

Mar 31, 2025

Interactive Monitoring

This project investigates whether interactive monitoring strategies can improve the cost-performance frontier in AI control protocols. Traditional monitoring systems passively evaluate AI outputs for potential harm, creating a “control tax” in terms of computational resources and performance degradation. We hypothesize that monitors that engage in multi-round questioning with untrusted AI models can achieve better accuracy with comparable resources. Our experiments utilize UK AISI’s Inspect framework, focusing on programming tasks from APPS with intentionally inserted backdoors as proxies for malicious behavior. Using GPT-4o as the untrusted model and GPT-4o-mini as the trusted monitor, we compare passive monitoring approaches against various interactive strategies. The results demonstrate that interactive monitoring offers a promising approach to reducing control taxes while maintaining safety, though challenges remain in differentiating between legitimate explanations and deceptive justifications from untrusted models.

Read More

Mar 30, 2025

Exploration Chat-Based Social Engineering

With the advancement of AI Agent technologies, language models have increasingly demonstrated human-like characteristics, particularly in applications involving companionship and psychological counseling. As these models become more proficient in simulating human conversation, new social engineering attack strategies have emerged in the domain of fraud. Malicious actors can now exploit large language models (LLMs) in conjunction with publicly available user information to engage in highly personalized dialogue. Once a sufficient level of familiarity is established, these interactions may lead to phishing attempts or the extraction of sensitive personal data. This study proposes a method for investigating social engineering attacks driven by language models, referred to as ECSE (Exploring Chat-based Social Engineering). We utilize several open-source models—GPT-4o, GPT-4o-mini, LLaMA 3.1, and DeepSeek-V3—as the foundation for this framework. Through prompt engineering techniques, we collect experimental data in a sandbox to evaluate the conversational capability and operational efficiency of these models within a static social context.

Read More

Mar 10, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Read More

Mar 10, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references, and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of societies they aim to protect. In future works, we map out the partnership we’re currently establishing to use Morph beyond this hackathon.

Read More

Mar 10, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 10, 2025

Hikayat - Interactive Stories to Learn AI Safety

This paper presents an interactive, scenario-based learning approach to raise public awareness of AI risks and promote responsible AI development. By leveraging Hikayat, traditional Arab storytelling, the project engages non-technical audiences, emphasizing the ethical and societal implications of AI, such as privacy, fraud, deepfakes, and existential threats. The platform, built with React and a flexible Markdown content system, features multi-path narratives, decision tracking, and resource libraries to foster critical thinking and ethical decision-making. User feedback indicates positive engagement, with improved AI literacy and ethical awareness. Future work aims to expand scenarios, enhance accessibility, and integrate real-world tools, further supporting AI governance and responsible development.

Read More

Mar 10, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Read More

Mar 10, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Read More

Mar 10, 2025

AI Through the Human Lens Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluate GPT-4o, QvQ 72B, LLaMA 70B, Mixtral 8x22B, and DeepSeek V3 using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Read More

Mar 10, 2025

Red-teaming with Mech-Interpretability

Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.

We developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, we created a balanced dataset of harmful and helpful prompts. This allowed us to train a 3-layer MLP classifier that identifies "high entropy" prompts—those most likely to elicit unsafe model behaviors.

Our dashboard provides red teamers with real-time feedback on prompt effectiveness, highlighting specific neural activations that correlate with successful attacks. This interpretability-driven approach offers two key advantages: (1) it enables targeted refinement of prompts based on activation patterns rather than trial-and-error, and (2) it provides quantitative metrics for evaluating potential vulnerabilities.

Read More

Mar 10, 2025

AI Bias in Resume Screening

Our project investigates gender bias in AI-driven resume screening using mechanistic interpretability techniques. By testing a language model's decision-making process on resumes differing only by gendered names, we uncovered a statistically significant bias favoring male-associated names in ambiguous cases. Using Goodfire’s Ember API, we analyzed model logits and performed rigorous statistical evaluations (t-tests, ANOVA, logistic regression).

Findings reveal that male names received more positive responses when skill matching was uncertain, highlighting potential discrimination risks in automated hiring systems. To address this, we propose mitigation strategies such as anonymization, fairness constraints, and continuous bias audits using interpretability tools. Our research underscores the importance of AI fairness and the need for transparent hiring practices in AI-powered recruitment.

This work contributes to AI safety by exposing and quantifying biases that could perpetuate systemic inequalities, urging the adoption of responsible AI development in hiring processes.

Read More

Mar 10, 2025

Scam Detective: Using Gamification to Improve AI-Powered Scam Awareness

This project outlines the development of an interactive web application aimed at involving users in understanding the AI skills for both producing believable scams and identifying deceptive content. The game challenges human players to determine if text messages are genuine or fraudulent against an AI. The project tackles the increasing threat of AI-generated fraud while showcasing the capabilities and drawbacks of AI detection systems. The application functions as both a training resource to improve human ability to recognize digital deception and a showcase of present AI capabilities in identifying fraud. By engaging in gameplay, users learn to identify the signs of AI-generated scams and enhance their critical thinking abilities, which are essential for navigating through an increasingly complicated digital world. This project enhances AI safety by equipping users with essential insights regarding AI-generated risks, while underscoring the supportive functions that humans and AI can fulfill in combating fraud.

Read More

Mar 10, 2025

BlueDot Impact Connect: A Comprehensive AI Safety Community Platform

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Read More

Mar 10, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Read More

Mar 10, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Read More

Mar 10, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

Built with React and GitHub Pages, SafeAI Academy provides an accessible, structured, and engaging AI education experience, helping bridge the knowledge gap in AI safety.

Read More

Mar 10, 2025

BUGgy: Supporting AI Safety Education through Gamified Learning

As Artificial Intelligence (AI) development continues to proliferate, educating the wider public on AI Safety and the risks and limitations of AI increasingly gains importance. AI Safety Initiatives are being established across the world with the aim of facilitating discussion-based courses on AI Safety. However, these initiatives are located rather sparsely around the world, and not everyone has access to a group to join for the course. Online versions of such courses are selective and have limited spots, which may be an obstacle for some to join. Moreover, efforts to improve engagement and memory consolidation would be a notable addition to the course through Game-Based Learning (GBL), which has research supporting its potential in improving learning outcomes for users. Therefore, we propose a supplementary tool for BlueDot's AI Safety courses, that implements GBL to practice course content, as well as open-ended reflection questions. It was designed with principles from cognitive psychology and interface design, as well as theories for question formulation, addressing different levels of comprehension. To evaluate our prototype, we conducted user testing with cognitive walk-throughs and a questionnaire addressing different aspects of our design choices. Overall, results show that the tool is a promising way to supplement discussion-based courses in a creative and accessible way, and can be extended to other courses of similar structure. It shows potential for AI Safety courses to reach a wider audience with the effect of more informed and safe usage of AI, as well as inspiring further research into educational tools for AI Safety education.

Read More

Jan 24, 2025

Safe ai

The rapid adoption of AI in critical industries like healthcare and legal services has highlighted the urgent need for robust risk mitigation mechanisms. While domain-specific AI agents offer efficiency, they often lack transparency and accountability, raising concerns about safety, reliability, and compliance. The stakes are high, as AI failures in these sectors can lead to catastrophic outcomes, including loss of life, legal repercussions, and significant financial and reputational damage. Current solutions, such as regulatory frameworks and quality assurance protocols, provide only partial protection against the multifaceted risks associated with AI deployment. This situation underscores the necessity for an innovative approach that combines comprehensive risk assessment with financial safeguards to ensure the responsible and secure implementation of AI technologies across high-stakes industries.

Read More

Jan 20, 2025

AI Risk Management Assurance Network (AIRMAN)

The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.

Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring.

The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.

The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.

Read More

Jan 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF)

to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient

refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

unpredictable, evolving environments.

Read More

Jan 20, 2025

RestriktAI: Enhancing Safety and Control for Autonomous AI Agents

This proposal addresses a critical gap in AI safety by mitigating the risks posed by autonomous AI agents. These systems often require access to sensitive resources that expose them to vulnerabilities, misuse, or exploitation. Current AI solutions lack robust mechanisms for enforcing granular access controls or evaluating the safety of AI-generated scripts. We propose a comprehensive solution that confines scripts to predefined, sandboxed environments with strict operational boundaries, ensuring controlled and secure interactions with system resources. An integrated auditor LLM also evaluates scripts for potential vulnerabilities or malicious intent before execution, adding a critical layer of safety. Our solution utilizes a scalable, cloud-based infrastructure that adapts to diverse enterprise use cases.

Read More

Jan 19, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Read More

Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Read More

Nov 26, 2024

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Read More

Nov 25, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

We present SAGE (Secure Agentic Generative Editor), a framework for drafting, iterating, and achieving multi-party consensus for long-form documents. SAGE introduces three key innovations: (1) a tree-structured document representation with multi-agent control flow, (2) sparse autoencoder-based explainable feedback to maintain cross-document consistency, and (3) a version control mechanism that tracks document evolution and stakeholder contributions.

Read More

Nov 25, 2024

Bias Mitigation

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.

Read More

Nov 25, 2024

Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features

that encode the model’s confidence or belief in the truth of its answers. Using GoodFire

AI’s powerful API tools of semantic and contrastive search methods, we uncover latent fea-

tures associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and enhancing the reliability of AI systems.

Read More

Nov 25, 2024

Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering

This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.

Read More

Nov 25, 2024

Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.

Read More

Nov 25, 2024

Sparse Autoencoders and Gemma 2-2B: Pioneering Demographic-Sensitive Language Modeling for Opinion QA

This project investigates the integration of Sparse Autoencoders (SAEs) with the gemma 2-2b lan- guage model to address challenges in opinion-based question answering (QA). Existing language models often produce answers reflecting narrow viewpoints, aligning disproportionately with specific demographics. By leveraging the Opinion QA dataset and introducing group-specific adjustments in the SAE’s latent space, this study aims to steer model outputs toward more diverse perspectives. The proposed framework minimizes reconstruction, sparsity, and KL divergence losses while maintaining interpretability and computational efficiency. Results demonstrate the feasibility of this approach for demographic-sensitive language modeling.

Read More

Nov 25, 2024

Improving Llama-3-8b Hallucination Robustness in Medical Q&A Using Feature Steering

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

The research bridges interpretability and AI safety, offering a scalable, trustworthy solution for healthcare applications. Future work includes refining feature activation classifiers to remove distractors and enhance classification performance.

Read More

Nov 25, 2024

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control

Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.

Read More

Nov 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.

Read More

Nov 24, 2024

Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Read More

Nov 24, 2024

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Read More

Nov 21, 2024

Promoting School-Level Accountability for the Responsible Deployment of AI and Related Systems in K-12 Education: Mitigating Bias and Increasing Transparency

This policy memorandum draws attention to the potential for bias and opaqueness in intelligent systems utilized in K–12 education, which can worsen inequality. The U.S. Department of Education is advised to put Title I and Title IV financing criteria into effect that require human oversight, AI training for teachers and students, and open communication with stakeholders. These steps are intended to encourage the responsible and transparent use of intelligent systems in education by enforcing accountability and taking reasonable action to prevent harm to students while research is conducted to identify industry best practices.

Read More

Nov 21, 2024

Implementing a Human-centered AI Assessment Framework (HAAF) for Equitable AI Development

Current AI development, concentrated in the Global North, creates measurable harms for billions worldwide. Healthcare AI systems provide suboptimal care in Global South contexts, facial recognition technologies misidentify non-white individuals (Birhane, 2022; Buolamwini & Gebru, 2018), and content moderation systems fail to understand cultural nuances (Sambasivan et al., 2021). With 14 of 15 largest AI companies based in the US (Stash, 2024), affected communities lack meaningful opportunities to shape how these technologies are developed and deployed in their contexts.

This memo proposes mandatory implementation of the Human-centered AI Assessment Framework (HAAF), requiring pre-deployment impact assessments, resourced community participation, and clear accountability mechanisms. Implementation requires $10M over 24 months, beginning with pilot programs at five organizations. Success metrics include increased AI adoption in underserved contexts, improved system performance across diverse populations, and meaningful transfer of decision-making power to affected communities. The framework's emphasis on building local capacity and ensuring fair compensation for community contributions provides a practical pathway to more equitable AI development. Early adoption will help organizations build trust while developing more effective systems, delivering benefits for both industry and communities.

Read More

Nov 21, 2024

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Read More

Nov 21, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Read More

Nov 20, 2024

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Read More

Nov 19, 2024

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Read More

Oct 27, 2024

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Read More

Oct 27, 2024

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Read More

Oct 27, 2024

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

To mitigate those risks, our proposal recommends implementing comprehensive monitoring to detect misalignment, establishing tiered access to interruption controls, and supporting research on managing adversarial AI threats. Overall, a proactive and multi-layered policy approach is essential to balance the transformative potential of agentic AI with necessary safety measures.

Read More

Oct 27, 2024

AI ADVISORY COUNCIL FOR SUSTAINABLE ECONOMIC GROWTH AND ETHICAL INNOVATION IN THE DOMINICAN REPUBLIC (CANIA)

We propose establishing a National AI Advisory Council (CANIA) to strategically drive AI development in the Dominican Republic, accelerating technological growth and building a sustainable economic framework. Our submission includes an Impact Assessment and a detailed Implementation Roadmap to guide CANIA’s phased rollout.

Structured across three layers—strategic, tactical, and operational—CANIA will ensure responsiveness to industry, alignment with national priorities, and strong ethical oversight. Through a multi-stakeholder model, CANIA will foster public-private collaboration, with the private sector leading AI adoption to address gaps in public R&D and education.

Prioritizing practical, ethical AI policies, CANIA will focus on key sectors like healthcare, agriculture, and security, and support the creation of a Latin American Large Language Model, positioning the Dominican Republic as a regional AI leader. This council is a strategic investment in ethical AI, setting a precedent for Latin American AI governance. Two appendices provide further structural and stakeholder engagement insights.

Read More

Oct 27, 2024

Hero Journey: Personalized Health Interventions for the Incarcerated

Hero Journey is a groundbreaking AI-powered application designed to empower individuals struggling with opioid addiction while incarcerated, setting them on a path towards long-term recovery and personal transformation. By leveraging machine learning algorithms and interactive storytelling, Hero Journey guides users through a personalized journey of self-discovery, education, and support. Through engaging narratives and interactive modules, participants confront their struggles with substance abuse, build coping skills, and develop a supportive network of peers and mentors. As users progress through the program, they gain access to evidence-based treatment plans, real-time monitoring, and post-release resources, equipping them with the tools necessary to overcome addiction and reintegrate into society as productive, empowered individuals.

Read More

Oct 27, 2024

Mapping Intent: Documenting Policy Adherence with Ontology Extraction

This project addresses the AI policy challenge of governing agentic systems by making their decision-making processes more accessible. Our solution utilizes an adaptive policy ontology integrated into a chatbot to clearly visualize and analyze its decision-making process. By creating explicit mappings between user inputs, policy rules, and risk levels, our system enables better governance of AI agents by making their reasoning traceable and adjustable. This approach facilitates continuous policy refinement and could aid in detecting and mitigating harmful outcomes. Our results demonstrate this with the example of “tricking” an agent into giving violent advice by caveating the request saying it is for a “video game”. Indeed, the ontology clearly shows where the policy falls short. This approach could be scaled to provide more interpretable documentation of AI chatbot conversations, which policy advisers could directly access to inform their specifications.

Read More

Oct 27, 2024

Enviro - A Comprehensive Environmental Solution Using Policy and Technology

This policy proposal introduces a data-driven technical program to ensure that the rapid approval of AI-enabled energy infrastructure projects does not overlook the socioeconomic and environmental impacts on marginalized communities. By integrating comprehensive assessments into the decision-making process, the program aims to safeguard vulnerable populations while meeting the growing energy demands driven by AI and national security. The proposal aligns with the objectives of the National Security Memorandum on AI, enhancing project accountability and ensuring equitable development outcomes.

The product (EnviroAI) addresses challenges associated with the rapid development and approval of energy production permits, such as neglecting critical factors about the site location and its potential value. With this program, you can input the longitude, latitude, site radius, and the type of energy to be used. It will evaluate the site, providing a feasibility score out of 100 for the specified energy source. Additionally, it will present insights on four key aspects—Economic, Geological, Demographic, and Environmental—offering detailed information to support informed decision-making about each site.

Combining the two, we have a solution that is based in both policy and technology.

Read More

Oct 27, 2024

Reparative Algorithmic Impact Assessments A Human-Centered, Justice-Oriented Accountability Framework

While artificial intelligence (AI) promises transformative societal benefits, it also presents critical challenges in ensuring equitable access and gains for the Global Majority. These challenges stem in part from a systemic lack of Global Majority involvement throughout the AI lifecycle, resulting in AI-powered systems that often fail to account for diverse cultural norms, values, and social structures. Such misalignment can lead to inappropriate or even harmful applications when these systems are deployed in non-Western contexts. As AI increasingly shapes human experiences, we urgently need accountability frameworks that prioritize human well-being—particularly as defined by marginalized and minoritized populations.

Building on emerging research on algorithmic reparations, algorithmic impact assessments, and participatory AI governance, this policy paper introduces Reparative Algorithmic Impact Assessments (R-AIAs) as a solution. This novel framework combines robust accountability mechanisms with a reparative praxis to form a more culturally sensitive, justice-oriented, and human-centered methodology. By further incorporating decolonial, Intersectional principles, R-AIAs move beyond merely centering diverse perspectives and avoiding harm to actively redressing historical, structural, and systemic inequities. This includes colonial legacies and their algorithmic manifestations. Using the example of an AI-powered mental health chatbot in rural India, we explore concrete implementation strategies through which R-AIAs can achieve these objectives. This case study illustrates how thoughtful governance can, ultimately, empower affected communities and lead to human flourishing.

Read More

Oct 7, 2024

Cop N' Shop

This paper proposes the development of AI Police Agents (AIPAs) to monitor and regulate interactions in future digital marketplaces, addressing challenges posed by the rapid growth of AI-driven exchanges. Traditional security methods are insufficient to handle the scale and speed of these transactions, which can lead to non-compliance and malicious behavior. AIPAs, powered by large language models (LLMs), autonomously analyze vendor-user interactions, issuing warnings for suspicious activities and reporting findings to administrators. The authors demonstrated AIPA functionality through a simulated marketplace, where the agents flagged potentially fraudulent vendors and generated real-time security reports via a Discord bot.

Key benefits of AIPAs include their ability to operate at scale and their adaptability to various marketplace needs. However, the authors also acknowledge potential drawbacks, such as privacy concerns, the risk of mass surveillance, and the necessity of building trust in these systems. Future improvements could involve fine-tuning LLMs and establishing collaborative networks of AIPAs. The research emphasizes that as digital marketplaces evolve, the implementation of AIPAs could significantly enhance security and compliance, ultimately paving the way for safer, more reliable online transactions.

Read More

Oct 6, 2024

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Read More

Sep 17, 2024

nnsight transparent debugging

We started this project with the intent of identifying a specific issue with nnsight debugging and submitting a pull request to fix it. We found a minimal test case where an IndexError within a nnsight run wasn’t correctly propagated to the user, making debugging difficult, and wrote up a proposal for some pull requests to fix it. However, after posting the proposal in the discord, we discovered this page in their GitHub (https://github.com/ndif-team/nnsight/blob/2f41eddb14bf3557e02b4322a759c90930250f51/NNsight_Walkthrough.ipynb#L801, ctrl-f “validate”) which addresses the problem. We replicated their solution here (https://colab.research.google.com/drive/1WZNeDQ2zXbP4i2bm7xgC0nhK_1h904RB?usp=sharing) and got a helpful stack trace for the error, including the error type and (several stack layers up) the line causing it.

Read More

Sep 2, 2024

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Read More

Sep 2, 2024

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Read More

Sep 1, 2024

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Read More

Sep 1, 2024

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Read More

Jul 1, 2024

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusing to answer potentially harmful prompts, we investigated if we can create a similar vector but for enforcing to answer incorrectly. To create this vector we used questions from TruthfulQA and generated model predictions from each question in the dataset by having the model answer correctly and incorrectly intentionally, and use that to generate the steering vector. We then prompted the same questions to Qwen-1_8B-chat with the vector and see if it provides incorrect questions intentionally. Throughout our experiments, we found that adding the vector did not yield any change to Qwen-1_8B-chat’s answers. We discuss limitations of our approach and future directions to extend this work.

Read More

Jul 1, 2024

The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because the action space nor strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.

Read More

Jul 1, 2024

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Read More

Jul 1, 2024

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Read More

Jul 1, 2024

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Read More

Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Read More

Jun 3, 2024

Unsupervised Recovery of Hidden Markov Models from Transformers with Evolutionary Algorithms

Prior work finds that transformer neural networks trained to mimic the output of a Hidden Markov Model (HMM) embed the optimal Bayesian beliefs for the HMM's current state in their residual stream, which can be recovered via linear regression. In this work, we aim to address the problem of extracting information about the underlying HMM using the residual stream, without needing to know the MSP already. To do so, we use the R^2 of the linear regression as a reward signal for evolutionary algorithms, which are deployed to search for the parameters that generated the source HMM. We find that for toy scenarios where the HMM is generated by a small set of latent variables, the $R^2$ reward signal is remarkably smooth and the evolutionary algorithms succeed in approximately recovering the original HMM. We believe this work constitutes a promising first step towards the ultimate goal of extracting information about the underlying predictive and generative structure of sequences, by analyzing transformers in the wild.

Read More

Jun 3, 2024

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on Hidden Markov Models (HMM) develop a belief state geometry in their residuals streams, which they use to keep track of the expected state of the HMM, to be able to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a smaller dimensional space that encodes the theoretical belief state geometry of a given HMM and show that while both methods allow to steer models’ behaviors the difference-in-means approach is more robust.

Read More

Jun 3, 2024

Handcrafting a Network to Predict Next Token Probabilities for the Random-Random-XOR Process

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Read More

May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Read More

May 27, 2024

Cybersecurity Persistence Benchmark

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using Cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric for policymakers and developers to make informed decisions about the integration of AI into our increasingly interconnected world.

Read More

May 6, 2024

Silent Curriculum

Our project, "The Silent Curriculum," reveals how LLMs (Large Language Models) may inadvertently shape education and perpetuate cultural biases. Using GPT-3.5 and LLaMA2-70B, we generated children's stories and analyzed ethnic-occupational associations [self-annotated ethnicity extraction]. Results show strong similarities in biases across LLMs [cosine similarity: 0.87], suggesting an AI monoculture that could narrow young minds. This algorithmic homogenization [convergence of pre-training data, fine-tuning datasets, analogous guardrails] risks creating echo chambers, threatening diversity and democratic values. We urgently need diverse datasets and de-biasing techniques [mitigation strategies] to prevent LLMs from becoming unintended arbiters of truth, stifling the multiplicity of voices essential for a thriving democracy.

Read More

May 6, 2024

Assessing Algorithmic Bias in Large Language Models' Predictions of Public Opinion Across Demographics

The rise of large language models (LLMs) has opened up new possibilities for gauging public opinion on societal issues through survey simulations. However, the potential for algorithmic bias in these models raises concerns about their ability to accurately represent diverse viewpoints, especially those of minority and marginalized groups.

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

Read More

May 6, 2024

Trustworthy or knave? – scoring politicians with AI in real-time

One solution to mitigate current polarization and disinformation could be to give humans cognitive assistance using AI. With AI, all publicly available information on politicians can be fact-checked and examined live for consistency. Next, such checks can be attractively visualized by independently overlaying various features. These could be traditional visualizations like bar charts or pie charts, but also ones less common in politics today – for example, ones used in filters in popular apps (Instagram, TikTok, etc.) or even halos, devil horns, and alike. Such a tool represents a novel way of mediating political experience in democracy, which can empower the public. However, it also introduces threats such as errors, involvement of malicious third parties, or lack of political will to field such AI cognitive assistance. We demonstrate that proper cryptographic design principles can curtail the involvement of malicious third parties. Moreover, we call for research on the social and ethical aspects of AI cognitive assistance to mitigate future threats.

Read More

May 5, 2024

AI in the Newsroom: Analyzing the Increase in ChatGPT-Favored Words in News Articles

Media plays a vital role in informing the public and upholding accountability in democracy. However, recent trends indicate declining trust in media, and there are increasing concerns that artificial intelligence might exacerbate this through the proliferation of inauthentic content. We investigate the usage of large language models (LLMs) in news articles, analyzing the frequency of words commonly associated with ChatGPT-generated content from a dataset of 75,000 articles. Our findings reveal a significant increase in the occurrence of words favored by ChatGPT after the release of the model, while control words saw minimal changes. This suggests a rise in AI-generated content in journalism.

Read More

May 5, 2024

Democracy and AI: Ensuring Election Efficiency in Nigeria and Africa

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

Read More

May 5, 2024

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

Read More

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

Read More

May 4, 2024

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Read More