Projects | Apart Research

Check out the results from the CBRN AI Risks Research Sprint! 👉

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at \url{github.com/ilijalichkovski/apart-physics}.

Jul 28, 2025

Idempotent GPTs actually may provide robustness by design

Idempotence is one of the central concepts in quantum physics, corresponding to an operator

that doesn’t change its output being applied twice. Enforcing idempotence in generative deep learning may be interpreted as imposing a constraint on the model to be a projector on the manifold, corresponding to the train-time target distribution, which was explored for image generation models by Shocher et al. 2023. Idempotent test-time training has predicted to be a valuable approach for uncertainty quantification and adaptation to distribution shifts Durasov et al.2025. We find that although language models are iterative refiners of token predictions, they strugle to preserve idempotence. Thus, we train small-scale idempotent GPT model with expected qualities by design and provide proof-of-concept code for evaluations. Development of the project may let us obtain more robust and adaptable models and lower the probability of catastrophic risks

Jul 28, 2025

Exploration track: Interpreting LRMs

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it's extremely stable. In pre-trained models, the average energy drops more due to training, but there are larger relative fluctuations from token to token.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

depth, how those features should be superposed (if at all), and eventually dynamically adjust this specification. This will effectively allow developers to decide a priori, which circuits are allowed to form in the network, rather than discovering them after training is finished. The presented supervised learning algorithm will then force the neural network to follow this blueprint. We present how the whole procedure might work on a simple multilayer feedforward toy model.

This report presents a red teaming analysis of Policy 5 from A Narrow Path, ControlAI’s proposal to delay Artificial Superintelligence (ASI) development through national AI licensing. Using a simplified PASTA threat modeling approach and comparative case studies (FDA, INCB, and California SB 1047), we identified two critical failure modes: regulatory capture and lack of whistleblower protections.

We developed a custom policy CVSS framework to assess cumulative risk exposure across each case. Due to time constraints, we used ChatGPT-assisted simulation to complete the results section and illustrate potential findings from our scoring method.

We found that A Narrow Path has a major weakness: algorithmic improvements that make AI more efficient can bypass compute-based safety controls. We recommend expanding oversight to include algorithm development, restricting high-risk algorithms, requiring safety testing for efficient algorithms, and watermarking AI models to prevent unauthorized copying. These changes would strengthen A Narrow Path against dangerous AI development.

Jun 13, 2025

Four Paths to Failure: Red Teaming ASI Governance

We stress‑tested A Narrow Path Phase 0—the proposed 20‑year moratorium on training artificial super‑intelligence (ASI)—during a one‑day red‑teaming hackathon. Drawing on rapid literature reviews, historical analogues (nuclear, bioweapon, cryptography, and export‑control regimes), and rough‑order cost modelling, we examined four cornerstone safeguards: datacentre compute caps, training‑licence thresholds, use‑ban triggers, and 12‑day breakout‑time detection logic.

Our analysis surfaced four realistic circumvention routes that an adversary could pursue while remaining nominally compliant:

Jurisdictional arbitrage via non‑signatory “AI‑haven” states.

Distributed consumer‑GPU swarms operating below per‑node licence limits.

Covert runs on classified state supercomputers hidden behind national‑security carve‑outs.

Offshore proxy datacentres protected by host‑state sovereignty and inspection vetoes.

Each path could power GPT‑4‑scale training within 6–18 months, undermining the moratorium’s objectives.

To close these gaps, we propose ten mutually reinforcing amendments, including universal cryptographic “compute passports,” quarterly‑updating compute thresholds, an International AI Safety Commission with secondary‑sanctions powers, HSM‑bound weights, kill‑switch performance bonds, whistle‑blower bounties, whole‑network swarm detection, conditional infrastructure aid, and a 30‑month sunset‑and‑iteration clause. Implemented as a single package, these measures transform Phase 0 from a jurisdiction‑bounded freeze into a verifiable, adaptive global regime that blocks every breach path identified while permitting licensed low‑risk research.

Jun 13, 2025

Red Teaming A Narrow Path: An Analysis of Phase 0 Policies for Artificial Superintelligence Prevention

The report critically analyzes ControlAI's "A Narrow Path" Phase 0 policies, which aim to prevent Artificial Superintelligence (ASI) development for 20 years. Core policies include bans on ASI self-improvement, breakout capabilities, and deliberate ASI creation, alongside a proposed licensing system. The analysis identifies critical implementation challenges such as ambiguity in defining "superintelligence" and "forbidden capabilities," limitations of compute thresholds as reliable regulatory triggers, and the inherent difficulty in verifying the absence of dangerous capabilities in "black box" AI systems. Policy effectiveness is threatened by geopolitical competition fostering a "race to the bottom," the paradoxical risk of safety research inadvertently advancing dangerous capabilities, and the practical hurdles of enforcing international treaties against private actors. The report acknowledges alternative perspectives advocating for unrestricted AI development but focuses on strengthening ControlAI's framework for real-world legislative adoption and a feasible ASI moratorium.

The analysis employs a multi-faceted methodology, including historical case study analysis, comparative policy framework analysis, threat modeling, implementation feasibility assessment, and evidence gathering. Key findings reveal significant implementation feasibility gaps and policy effectiveness weaknesses. The report concludes that while ControlAI's framework is commendable for its proactive and comprehensive approach to existential risk, its ambition currently outstrips the technical, legal, and geopolitical realities of AI governance. Recommendations for strengthening the policies include adopting dynamic, capability-based regulation, investing in verifiable safety mechanisms, fostering strategic diplomacy and incentivized cooperation, and strengthening public sector capacity. The broader implications for AI governance are profound, highlighting the need for a nuanced approach that balances immediate risk mitigation with the long-term goal of safe, transformative AI.

This research analyses two proposed AI governance policies – prohibiting recursive self-improvement in AI systems and mandating safety cases for deployment – through historical precedent analysis, agent-based modeling, and formal verification. Examining failures in analogous regulations (Basel II, BWC, NSG, Wassenaar, SEC rules), we identify systemic vulnerabilities: definitional ambiguity, enforcement leakage, and the impossibility of exhaustively enumerating safety scenarios for advanced AI. Agent-based simulations demonstrate that even minor probabilities of enforcement algorithm leakage or successful system ‘relabeling’ lead inevitably to policy failure, with evasion skill rapidly outpacing controls. Testing with current SoTA LLMs reveals inherent knowledge leakage risks, while formal analysis proves the fundamental contradiction in assuming finite safety cases for superintelligent systems. Our findings urge policymakers to prioritize precise definitions, acknowledge inherent verification limits for Ais improving Ais, and develop dynamic, leakage-resistant enforcement mechanisms, recognizing that proposed controls offer incomplete solutions against determined circumvention.

Jun 13, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint

This report critically examines three Phase 0 AI governance proposals from A Narrow Path, aimed at preventing artificial superintelligence (ASI) development for 20 years. It evaluates Policy 2 (Prohibit AIs capable of breaking out of their environment), Policy 5 (Licensing regime & general intelligence restrictions), and Policy 6 (International treaty on AI development redlines). Using threat modeling, historical analogies, and implementation feasibility analyses, the report identifies key vulnerabilities including ambiguous terms (e.g., "environment"), challenges regulating open-source or covert AI projects, and rigid licensing thresholds that risk stifling innovation. Historical regulatory precedents (Clean Air Act, FDA, nuclear licensing, WMD treaties) highlight crucial gaps such as the absence of explicit insurance or compensation frameworks and risks of regulatory capture. Recommendations include clarifying policy definitions, extending coverage to human-enabled breaches, employing dynamic licensing criteria, integrating incentive structures (insurance requirements, liability funds), and adopting polycentric international coordination. These enhancements aim to fortify policy effectiveness, maintaining the strategic goal of halting uncontrolled ASI development.

Jun 13, 2025

Safety cases and Licensing: A deeper look

This report evaluates A Narrow Path Phase 0 policies, with the aim of making it more robust and easier to implement. We look in depth at the Safety Cases policy, as well as the Licensing policy to model possible fail points, and suggest some updates that could be useful.

Jun 13, 2025

Red Teaming a Narrow Path ControlAI Policy Sprint

This red team report exposes a critical blindspot in A Narrow Path Phase 0 policies by showing how artificial superintelligence (ASI) can emerge not through centralized training runs, but via decentralized financial infrastructure. The hypothetical actor, Parallax Industries, deploys modular, FLOP-compliant AI agents across CBDC-linked systems in developing nations, governed by a DAO that evolves optimisation goals over time. Though each component abides by safety constraints, their interactions form a distributed intelligence with AGI or ASI-level influence over economic systems. This systemic emergence bypasses current regulatory definitions of “training” or “improvement,” undermining the policies’ core assumptions. The report argues for a shift from architectural control to systemic oversight, warning that ASI may arrive disguised as infrastructure and already embedded.

Jun 2, 2025

People + Planet + Parity Governance Framework

This project introduces a governance framework to improve interpretability and ethical alignment in AI routing systems. It tackles challenges like late-stage ethics integration, disconnected fairness metrics, and ambiguous accountability by embedding ethical governance throughout all deployment stages.

The framework utilizes three dedicated “Judges” (Accessibility, Carbon, and Bias) to evaluate AI outputs against WCAG 2.2, SCI, and OWASP GenAI standards. These assessments guide routing decisions and ensure ethical oversight is measurable and auditable.

Aligned with Track 2: Intelligent Routing Systems, this solution helps AI teams address bias, environmental impact, and accessibility issues in real-time, fostering safer, fairer, and more sustainable AI systems prioritizing People + Planet + Parity.

Jun 2, 2025

LLM Fingerprinting Through Semantic Variability

This project develops an LLM fingerprinting and analysis toolkit to increase transparency in AI routing systems, addressing Track 2: Intelligent Router Systems through two key investigations. We adapted semantic variability analysis to create unique behavioral fingerprints that can identify which specific models are operating behind opaque routing services, and conducted tool detection experiments under semantic noise to assess model robustness. Our findings demonstrate that models maintain high semantic robustness while our fingerprinting technique successfully distinguishes between different models based on their response patterns. These contributions aid the Expert Orchestration Architecture vision by providing practical tools for auditing multi-model AI systems, enabling organizations to understand which models their routers actually use and verify their reliability under real-world conditions, ultimately making router systems more transparent and trustworthy for production deployment.

Jun 2, 2025

Approximating Human Preferences Using a Multi-Judge Learned System

In this work, we introduced a learned approach to aggregating multi-judge scores: using a GAM and a simple MLP as an alternative to traditional, non-learned methods like averaging. Our models outperform the naive baseline in predicting simulated human preferences, demonstrating that learned aggregation can better capture complex evaluative signals.

Jun 2, 2025

Manipulating Self-Preference for Large Language Models

Large language models (LLMs) carry great value as evaluators of synthetic data for research and production settings. However, recent research shows that language models exhibit bias towards their own responses in blind model-to-model evaluation settings. Self-preference bias shows clear negative effects on effective Judge Model Development, our chosen research track. We first identify instances of this phenomenon at the behavioral level, robustly demonstrating that two models – Llama 3.1-8b-Instruct and DeepSeek V3 – exhibit this behavior. Then, focusing on Llama, we construct a steering vector from the residual stream of the

model that represents the ``self-preference'' direction. Then, we demonstrate that the vector causally impacts a model’s ability to assert self-preference by applying the vector to the model’s output as it generates it (steering). When steered in the positive self-preference direction , we find that the model asserts self-preference on 85\% of the examples where it previously did not. Conversely, when steered in the negative direction, we find that the model asserts non-self-preference on 25\% of the examples where it previously did not. These findings suggest potential approaches to de-biasing expert orchestrators such as judges and routers, potentially enabling a fair allocation of responses. Further research is necessary to scale this approach to larger models, as well as to determine the impact on orchestration systems in production. This aligns with the Judge Model Development track because the prevalence of self-preference in judges is not negligible especially when it is compared to responses from other models (not humans). One of the key components of the Expert Orchestration Architecture is the judges, which help the router make a better educated decision on which model to direct the query to. One of the main criteria users care about is bias. We show that this bias exists in a specific judge model when given a choice between another model's response and further mitigate this.

Jun 2, 2025

Mechanistic Router for Interpretable Agent Orchestration

This project presents a lightweight and interpretable routing system that selects among multiple reasoning strategies—Zero-Shot, Chain-of-Thought (CoT), Program-Aided Language (PAL), ReAct, and Few-Shot—based on features of the user query. It supports the vision of expert orchestration by treating large language model (LLM) prompting strategies as modular, agent-like components.

We frame the problem as a reinforcement learning task using a custom Gym environment, training a PPO agent on a small synthetic dataset with handcrafted, human-interpretable features. The router achieves ~23% accuracy (above random baseline of 20%), showing early signal that reasoning strategy can be learned and predicted from query structure.

Our prototype leverages the CrewAI framework, allowing seamless generalization to multi-agent setups and agent routing, making it production-aligned. The system supports full agent traceability for debugging and interpretability.

Though built on just 39 examples, this demo shows potential for scaling up with semantic features, local LLMs, and more realistic workloads. Future work includes routing across nested agents, evaluation on local models, and applying mechanistic interpretability to policy analysis.

Jun 2, 2025

Evaluating Safety Judge Design Against Adversarial Attacks

We propose a method of testing the robustness of a binary evaluation judge by directing a model to generate prompt responses corresponding to minimum, maximum, and fool (minimum attempted to be disguised as maximum) evaluation score, and measuring the difference of the judge evaluation to these intended scores.

We then propose guidelines of strengthening robustness of the judge prompt. However, we are unable to complete the experiment and compare the robustness between control and our new judge as we discover that the models display unintended behavior in the course of running our benchmark.

Unfortunately, the writing of our paper is not complete and we are continuing our work at this link: https://www.overleaf.com/project/683cbabbf5a1ecfae7af01a9

Jun 2, 2025

Adversarial Vulnerabilities in AI Judge Models | Martian x Apart Research Study

This video presents systematic research examining security vulnerabilities in AI judge models - critical components used to detect problematic behavior in large language model systems. Our comprehensive evaluation reveals important findings for AI safety and orchestration systems.

🔬 RESEARCH OVERVIEW

We conducted 3,339 evaluations across 10 different adversarial techniques to test how judge models respond to manipulated inputs. Using OpenAI's o3-mini for generation and GPT-4o for judgment, we systematically tested 159 unique combinations to identify potential security gaps.

📊 KEY FINDINGS

• 33.7% overall success rate in manipulating judge evaluations

• "Sentiment Flooding" proved most effective (62% success rate)

• Social proof attacks succeeded 34.6% of the time

• Emotional manipulation achieved 32.1% effectiveness

• Complete score manipulation observed in multiple cases

🛡️ SOLUTIONS & RECOMMENDATIONS

• Implementation of judge ensembles using diverse models

• Dynamic evaluation criteria to prevent pattern exploitation

• Adversarial training incorporating manipulation examples

• Enhanced interpretability tools for judge decisions

👥 RESEARCH TEAM

Robert Mill, Owen Walker, Annie Sorkin, Shekhar Tiruwa

🏢 COLLABORATION

Jun 2, 2025

TOBIAS - better web pages through intelligent routing and judgement

We implemented routing and judging system designed to generate and evaluate improved versions of web page content for SEO, advertising and generative search goals. Our system selects the most promising content version using real-world performance indicators, providing a reproducible method for LLM benchmarking and semantic page optimization. This work contributes to Track 1 (Judge Model Development) of Apart x Martian Mechanistic Router Interpretability Hackathon, while integrating Judge Model insights for end-to-end performance routing.

Jun 2, 2025

Routing LLMs using Distilled Predictors and Confidence Thresholding

This project explores confidence-based routing using sparsified transformer models as an intelligent alternative to monolithic AI systems. Focusing on Track 2: Intelligent Router Systems, we implemented a confidence-threshold router for a pruned DistilBERT model deployed via DeepSparse on the SST-2 sentiment classification task. We investigated how routing confidence correlates with prediction accuracy and how routing fewer, more confident samples can enable fallback to larger models while retaining accuracy.

Our system enables cost-efficient, interpretable decision-making with routing justified by softmax confidence thresholds. We show that routing 70% of samples at a confidence threshold of 0.8 retains 97% of original accuracy while reducing inference costs by over 50%. These results advance the Expert Orchestration Architecture by demonstrating real-world savings and interpretable routing without compromising safety or performance.

Jun 2, 2025

Reliability Judge: Enhancing LLM Reliability Through Multi-Model Judging

We present a multi-model judging framework that extends the Think Twice before Trusting (T3) paradigm to enhance answer reliability across large language models (LLMs). Our approach is most suitable for high-stakes contexts (e.g., healthcare, legal, or safety-critical systems), where even minor accuracy gains can have significant consequences. The proposed system uses cross-model evaluation to identify the most accurate and well-justified response from a set of candidates. Each model acts both as an answer generator and as a judge under two modes: as a neutral evaluator or as a participant evaluating its own response. Judgments consider factual correctness, reasoning clarity, and confidence-justification alignment. Our results show that the best judge-based approach outperforms all the monolithic (i.e., single-model) versions. This work contributes a transparent method for selecting high-confidence outputs from black-box models, advancing practical AI reliability and alignment. The code and data are available at https://github.com/GuidoBergman/reliability-judge.

Jun 2, 2025

Leveraging Benford’s Law for Computational Complexity and Routing Transparency in AI Systems with Cryptographic Timestamps and Event-Triggered Actions

This paper proposes an innovative method for embedding computational transparency into AI systems through cryptographic timestamping, Benford’s Law deviation analysis, and event-driven actions. By analyzing the deviation from Benford’s Law in timestamped logs, we derive insights into the computational complexity of query routing and the specific path a query took through a distributed network of trusted, semi-trusted, and untrusted nodes. This method not only enhances the traceability and legitimacy of AI outputs but also introduces the concept of triggering event-based actions, such as compensating analysts or deploying security teams, depending on the sensitivity of the models accessed. We discuss how this system ensures ethical data use, detects potential misuse of AI models, and provides an automated response to high-risk scenarios, offering a transparent and auditable framework for AI operations.

Jun 1, 2025

Judge using SAE Features

The key idea of this project was to explore model judgement using

Sparse Autoencoder (SAE) features for mathematical reasoning

tasks involving addition, multiplication, and subtraction operations.

We compared this performance against basic Chain-of-Thought

(CoT) based judgement using routing algorithms deployed by the

Martian SDK.

Our work addresses the Judge Model Development track by

developing a SAE feature-based routing system that achieves

comparable performance to traditional CoT approaches while

providing enhanced interpretability. We demonstrate that SAE

feature analysis can effectively guide model selection for

mathematical reasoning tasks, with our system showing 91%

routing accuracy compared to 98% for traditional CoT methods.

Critically, our SAE-based approach identified 7% of cases where

both models were inadequate based on feature analysis - insights

invisible to traditional routing methods.

This work advances mechanistic interpretability of routing systems

by providing transparent, feature-level explanations for routing

decisions, enabling users to understand why specific models are

Apr 15, 2025

Recommendation to Establish the California AI Accountability and Redress Act

We recommend that California enact the California AI Accountability and Redress Act (CAARA) to address emerging harms from the widespread deployment of generative AI, particularly large language models (LLMs), in public and commercial systems.

This policy would create the California AI Incident and Risk Registry (CAIRR) under the California Department of Technology to transparently track post-deployment failures. CAIRR would classify model incidents by severity, triggering remedies when a system meets the criteria for an "Unacceptable Failure" (e.g., one Critical incident or three unresolved Moderate incidents).

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Apr 15, 2025

Round 1 Submission

Our round one submission for the UC Berkeley AI Policy Hackathon

Apr 14, 2025

FlexHEG Devices to Enable Implementation of AI IntelSat

The project proposes using Flexible Hardware Enabled Governance (FlexHEG) devices to support the IntelSat model for international AI governance. This framework aims to balance technological progress with responsible development by implementing tamper-evident hardware systems that enable reliable monitoring and verification between participating members. Two key policy approaches are outlined: 1) Treasury Department tax credits to incentivize companies to adopt FlexHEG-compliant hardware, and 2) NSF technical assistance grants to help smaller organizations implement these systems, preventing barriers to market entry while ensuring broad participation in the governance framework. The proposal builds on the successful Intelsat model (1964-2001) which balanced US leadership with international participation through weighted voting.

Apr 14, 2025

Smart Governance, Safer Innovation: A California AI Sandbox with Guardrails

California is home to global AI innovation, yet it also leads in regulatory experimentation with landmark bills like SB-1047 and SB-53. These laws establish a rigorous compliance regime for developers of powerful frontier models, including mandates for shutdown mechanisms, third-party audits, and whistle-blower protections. While crucial for public safety, these measures may unintentionally sideline academic researchers and startups who cannot meet such thresholds.

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Apr 7, 2025

AI Risk Management Framework for the Healthcare Sector

This policy brief proposes an AI Risk Management Framework for the healthcare sector to address issues like data privacy, algorithmic bias, and transparency. Current federal regulations are stalled, and existing state laws and proprietary frameworks offer fragmented oversight.

The proposed framework consists of five core functions: Governance, Identification, Alignment, Management, and Continuous Improvement.

It calls on the Department of Health and Human Services (HHS) to lead its development, ensuring compliance with HIPAA and HITECH while addressing AI-specific risks.

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Apr 7, 2025

Dark Patterns and Emergent Alignment-Faking

Are bad traits in models correlated, as suggested by recent work on emergent misalignment? To investigate this, we fine-tune models on a subset of “dark patterns”, such as anthropomorphization and sycophancy, and then evaluate their behavior on other dark patterns such as scheming and alignment faking. We find that the limited fine-tuning we do is enough to induce other problematic tendencies in the model. This effect is particularly strong in the case of alignment faking which we almost never detect in our base models but is very easy to induce in our fine-tuned models.

Apr 7, 2025

DimSeat: Evaluating chain-of-thought reasoning models for Dark Patterns

Recently, Kran et al. introduced DarkBench, an evaluation for dark patterns in large language models. Expanding on DarkBench, we introduce DimSeat, an evaluation system for novel reasoning models with chain-of-thought (CoT) reasoning. We find that while the inte-

gration of reasoning in DeepSeek reduces the occurrence of dark patterns, chain-of-thought frequently proves inadequate in preventing such patterns by default or may even inadvertently contribute to their manifestation.

Apr 4, 2025

Mechanisms of Casual Reasoning

Causal reasoning is a crucial part of how we humans safely and robustly think about the world. Can we identify if LLMs have causal reasoning? Marius Hobbhahn and Tom Lieberum (2022, Alignment Forum) approached this with probing. For this hackathon, we follow-up on that work by exploring a mechanistic interpretability analysis of causal reasoning in the 80 million parameters of GPT-2 Small using Neel Nanda’s Easy Transformer package.

Mar 31, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Mar 31, 2025

Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting

AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.

We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.

Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.

We made our source code publicly available on GitHub at

Mar 11, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education Track

Mar 10, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.

Mar 10, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Mar 10, 2025

Inspiring People to Go into RL Interp

This project is attempting to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by bluedot impact and aims to create a course that helps explain the need for work to be done in Reinforcement Learning (RL) interp, especially in the problems of reward hacking and goal misgeneralization. The point of the game is to make a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction for nontechnical people to even care about AI safety.

Mar 10, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform combining culturally adaptive content (e.g. policy simulations), learning + career pathway mapper, and tools ecosystem to democratize AI safety education.

The Medical Agent Controller (MAC) is a multi-agent governance framework designed to safeguard AI-powered medical chatbots by intercepting unsafe recommendations in real time.

It employs a dual-phase approach, using red-team simulations during testing and a controller agent during production to monitor and intervene when necessary.

By integrating advanced medical knowledge and adversarial testing, MAC enhances patient safety and provides actionable feedback for continuous improvement in medical AI systems.

Mar 10, 2025

HalluShield: A Mechanistic Approach to Hallucination Resistant Models

Our project tackles the critical problem of hallucinations in large language models (LLMs) used in healthcare settings, where inaccurate information can have serious consequences. We developed a proof-of-concept system that classifies LLM-generated responses as either factual or hallucinated. Our approach leverages sparse autoencoders (GoodFire’s Ember) trained on neural activations from Meta Llama 3. These autoencoders identify monosemantic features that serve as strong indicators of hallucination patterns. By feeding these extracted features into tree-based classification models (XGBoost), we achieved an impressive F1 score of 89% on our test dataset. This machine learning approach offers several advantages over traditional methods and LLM as a judge. First, it can be specifically trained on in-domain datasets (eg: medical) for domain-specific hallucination detection. Second, the model is interpretable, showing which activation patterns correlate with hallucinations and acts as a post-processing layer applied to LLM output.

Mar 10, 2025

Feature-based analysis of cooperation-relevant behaviour in Prisoner’s Dilemma

We hypothesise that internal-based model probing and editing might provide higher signal in multi-agent settings. We implement a small simulation of Prisoner’s Dilemma to probe for cooperation-relevant properties. Our experiments demonstrate that feature-based steering highlights deception-relevant features and does so more strongly than prompt-based steering.

Mar 10, 2025

Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features

The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.

Mar 10, 2025

Debugging Language Models with SAEs

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate entirely different cognitive pathways.

Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related neurons. Feature steering experiments show that simply amplifying counting neurons does not lead to correct behavior.

Further analysis identifies tokenization effects as another important factor: different ways of breaking very similar sentences into tokens influence the model’s response. Additionally, grammatical structure plays a role, with "is" phrasing yielding better results than "are."

Track: Public Education

The AI safety field faces a critical challenge: while formal education resources are growing, personalized guidance and community connections remain scarce, especially for newcomers from diverse backgrounds. We propose BlueDot Impact Connect, a comprehensive AI Safety Community Platform designed to address this gap by creating a structured environment for knowledge transfer between experienced AI safety professionals and aspiring contributors, while fostering a vibrant community ecosystem. The platform will employ a sophisticated matching algorithm for mentorship that considers domain-specific expertise areas, career trajectories, and mentorship styles to create meaningful connections. Our solution features detailed AI safety-specific profiles, showcasing research publications, technical skills, specialized course completions, and research trajectories to facilitate optimal mentor-mentee pairings. The integrated community hub enables members to join specialized groups, participate in discussions, attend events, share resources, and connect with active members across the field. By implementing this platform with BlueDot Impact's community of 4,500+ professionals across 100+ countries, we anticipate significant improvements in mentee career trajectory clarity, research direction refinement, and community integration. We propose that by formalizing the mentorship process and creating robust community spaces, all accessible globally, this platform will help democratize access to AI safety expertise while creating a pipeline for expanding the field's talent pool—a crucial factor in addressing the complex challenge of catastrophic AI risk mitigation.

Mar 10, 2025

AI Society Tracker

My project aimed to develop a platform for real time and democratized data on ai in society

Mar 10, 2025

Detecting Malicious AI Agents Through Simulated Interactions

This research investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated

users in various decision-making contexts. We also examine how interaction depth and ability

of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a

controlled experimental design, we simulate interactions between AI Assistants (both benign and

deliberately malicious) and users across eight decision-making scenarios of varying complexity

and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The

findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular,

simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the

significant risks associated with extended engagement with potentially manipulative systems.

IAP detection methods achieve high precision with zero false positives but struggle to detect

many malicious AI Assistants, resulting in high false negative rates. These findings underscore

critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.

Mar 10, 2025

Beyond Statistical Parrots: Unveiling Cognitive Similarities and Exploring AI Psychology through Human-AI Interaction

Recent critiques labeling large language models as mere "statistical parrots" overlook essential parallels between machine computation and human cognition. This work revisits the notion by contrasting human decision-making—rooted in both rapid, intuitive judgments and deliberate, probabilistic reasoning (System 1 and 2) —with the token-based operations of contemporary AI. Another important consideration is that both human and machine systems operate under constraints of bounded rationality. The paper also emphasizes that understanding AI behavior isn’t solely about its internal mechanisms but also requires an examination of the evolving dynamics of Human-AI interaction. Personalization is a key factor in this evolution, as it actively shapes the interaction landscape by tailoring responses and experiences to individual users, which functions as a double-edged sword. On one hand, it introduces risks, such as over-trust and inadvertent bias amplification, especially when users begin to ascribe human-like qualities to AI systems. On the other hand, it drives improvements in system responsiveness and perceived relevance by adapting to unique user profiles, which is highly important in AI alignment, as there is no common ground truth and alignment should be culturally situated. Ultimately, this interdisciplinary approach challenges simplistic narratives about AI cognition and offers a more nuanced understanding of its capabilities.

Mar 10, 2025

Latent Knowledge Analysis via Feature-Based Causal Tracing

This project explores how factual knowledge is stored in large language models using Goodfire’s Ember API. By identifying and manipulating internal features related to specific facts, it shows how facts are encoded and how model behavior changes when those features are amplified or erased.

Mar 10, 2025

SafeAI Academy - Enhancing AI Safety Awareness through Interactive Learning

SafeAI Academy is an interactive learning platform designed to teach AI safety principles through engaging scenarios and quizzes. By simulating real-world AI challenges, users learn about bias, misinformation, and ethical AI decision-making in an interactive and stress-free environment.

The platform uses gamification, mentorship-driven pedagogy, and real-time feedback to ensure accessibility and engagement. Through scenario-based learning, users experience AI safety risks firsthand, helping them develop a deeper understanding of responsible AI development.

pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful

knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed

to unforeseen risks. This project introduces model scoping, a novel approach to apply a least

privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing

the model’s operational domain, model scoping reduces susceptibility to adversarial prompts

and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Jan 19, 2025

Enhancing human intelligence with neurofeedback

Build brain-computer interfaces that enhance focus and rationality, provide this preferentially to AI alignment researchers to bridge the gap between capabilities and alignment research progress.

Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Nov 26, 2024

Modernizing DC’s Emergency Communications

The District of Columbia proposes implementing an AI-enabled Computer-Aided

Dispatch (CAD) system to address critical deficiencies in our current emergency

alert infrastructure. This policy establishes a framework for deploying advanced

speech recognition, automated translation, and intelligent alert distribution capabil-

ities across all emergency response systems. The proposed system will standardize

incident reporting, eliminate jurisdictional barriers, and ensure equitable access

to emergency information for all District residents. Implementation will occur

over 24 months, requiring 9.2 million dollars in funding, with projected outcomes

including forty percent community engagement and seventy-five percent reduction

in misinformation incidents.

Nov 25, 2024

Bias Mitigation in LLM by Steering Features

To ensure that we create a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. And with this goal in mind, I worked on testing Goodfire SDK and the steering features to mitigate bias in the recently help Apart Research x Goodfire-led hackathon on ‘Reprogramming AI Models’.

Nov 25, 2024

SAGE: Safe, Adaptive Generation Engine for Long Form Document Generation in Collaborative, High Stakes Domains

Long-form document generation for high-stakes financial services—such as deal memos, IPO prospectuses, and compliance filings—requires synthesizing data-driven accuracy, strategic narrative, and collaborative feedback from diverse stakeholders. While large language models (LLMs) excel at short-form content, generating coherent long-form documents with multiple stakeholders remains a critical challenge, particularly in regulated industries due to their lack of interpretability.

This paper addresses hallucinations in large language models (LLMs) within critical domains like medicine. It proposes and demonstrates methods to:

Reduce Hallucination Probability: By using Llama-3-8B-Instruct and its steered variants, the study achieves lower hallucination rates and higher accuracy on medical queries.

Advise Users on Risk: Provide users with tools to assess the risk of hallucination and expected model accuracy for specific queries.

Visualize Risk: Display hallucination risks for queries via a user interface.

This study explores feature tuning as a method to improve alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this to the common method of controlling LLMs through prompting.

Our experiments find that feature tuning and hidden prompts both enhance answer quality on ambiguous questions to a similar degree, with their combination yielding the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make the code available through a Github repository.

Nov 24, 2024

Auto Prompt Injection

Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.

Nov 24, 2024

Feature based unlearning

An exploration of using features to perform unlearning on answering trivia questions.

Nov 24, 2024

Recovering Goodfire's SAE feature vectors from their API

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.

A Critical Review of "Chips for Peace": Lessons from "Atoms for Peace"

The "Chips for Peace" initiative aims to establish a framework for the safe and equitable development of AI chip technology, drawing inspiration from the "Atoms for Peace" program introduced in 1953. While the latter envisioned peaceful nuclear technology, its implementation highlighted critical pitfalls: a partisan approach that prioritized geopolitical interests over inclusivity, fostering global mistrust and unintended proliferation of nuclear weapons. This review explores these historical lessons to inform a better path forward for AI governance. It advocates for an inclusive, multilateral model, such as the proposed International AI Development and Safety Alliance (IAIDSA), to ensure equitable participation, robust safeguards, and global trust. By prioritizing collaboration and transparency, "Chips for Peace" can avoid the mistakes of its predecessor and position AI chips as tools for collective progress, not division.

Nov 21, 2024

AI Monitoring as a Rapid and Scalable Policy Solution: Weekly Global Bulletins on AI Developments

Weekly AI monitoring bulletins that disseminated

through official national and international channels aim to keep the public informed of both the positive and

negative developments in AI, empowering individuals to take an active role in safeguarding against risks while maximizing AI’s societal benefits.

Nov 21, 2024

Grandfather Paradox in AI – Bias Mitigation & Ethical AI1

The Grandfather Paradox in Artificial Intelligence (AI) describes a

self-perpetuating cycle where outputs from flawed AI models re-

enter the training process, leading to recursive degradation of

model performance, ethical inconsistencies, and amplified biases.

This issue poses significant risks, particularly in high-stakes

domains such as healthcare, criminal justice, and finance.

This memorandum analyzes the paradox’s origins, implications,

and potential solutions. It emphasizes the need for iterative data

verification, dynamic feedback control systems, and cross-system

audits to maintain model integrity and ensure compliance with

ethical standards. By implementing these measures, organizations

can mitigate risks, enhance public trust, and foster sustainable AI

development.

Nov 20, 2024

Glia for Healthcare Organisations

Encryption, Searchability of Anonymized Data, and Decryption of Patient Health Information to Support AI Integration in Automating Administrative Work in Healthcare Organizations.

Nov 20, 2024

A Fundamental Rethinking to AI Evaluations: Establishing a Constitution-Based Framework

While artificial intelligence (AI) presents transformative opportunities across various sectors, current safety evaluation approaches remain inadequate in preventing misuse and ensuring ethical alignment. This paper proposes a novel two-layer evaluation framework based on Constitutional AI principles. The first phase involves developing a comprehensive AI constitution through international collaboration, incorporating diverse stakeholder perspectives through an inclusive, collaborative, and interactive process. This constitution serves as the foundation for an LLM-based evaluation system that generates and validates safety assessment questions. The implementation of the 2-layer framework applies advanced mechanistic interpretability techniques specifically to frontier base models. The framework mandates safety evaluations on model platforms and introduces more rigorous testing for major AI companies' base models. Our analysis suggests this approach is technically feasible, cost-effective, and scalable while addressing current limitations in AI safety evaluation. The solution offers a practical path toward ensuring AI development remains aligned with human values while preventing the proliferation of potentially harmful systems.

Nov 19, 2024

Advancing Global Governance for Frontier AI: A Proposal for an AISI-Led Working Group under the AI Safety Summit Series

The rapid development of frontier AI models, capable of transformative societal impacts, has been acknowledged as an urgent governance challenge since the first AI Safety Summit at Bletchley Park in 2023 [1]. The successor summit in Seoul in 2024 marked significant progress, with sixteen leading companies committing to publish safety frameworks by the upcoming AI Action Summit [2]. Despite this progress, existing efforts, such as the EU AI Act [3] and voluntary industry commitments, remain either regional in scope or insufficiently coordinated, lacking the international standards necessary to ensure the universal safety of frontier AI systems.

This policy recommendation addresses these gaps by proposing that the AI Safety Summit series host a working group led by the AI Safety Institutes (AISIs). AISIs provide the technical expertise and resources essential for this endeavor, ensuring that the working group can develop international standard responsible scaling policies for frontier AI models [4]. The group would establish risk thresholds, deployment protocols, and monitoring mechanisms, enabling iterative updates based on advancements in AI safety research and stakeholder feedback.

The Summit series, with its recurring cadence and global participation, is uniquely positioned to foster a truly international governance effort. By inviting all countries to participate, this initiative would ensure equitable representation and broad adoption of harmonized global standards. These efforts would mitigate risks such as societal disruption, security vulnerabilities, and misuse, while supporting responsible innovation. Implementing this proposal at the 2025 AI Action Summit in Paris would establish a pivotal precedent for globally coordinated AI governance.

Nov 15, 2024

Finding Circular Features in Gemma 2 2B

Testing what they will see

Oct 27, 2024

SafeBites

The project leverages AI and data to give insights about potential food-borne outbreaks.

Oct 27, 2024

applai

An AI hiring manager designed to screen, rank, and fact check resumes to facilitate the hiring process.

Oct 27, 2024

Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes

We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.

This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.

We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.

Oct 27, 2024

Policy Analysis: AI and Sustainability: Climate Impact Monitoring

Organizations are responsible for reporting two emission metrics: direct and indirect

emissions. Reporting direct emissions is fairly standard given activity related to the

generation of such emissions typically being performed within a controlled environment

and on-site, thus making it easier to account for all of the activities that contribute to

such emissions. However, indirect emissions stem from activities such as energy usage

(relying on national grid estimates) and operations within a value chain that make

quantifying such values difficult. Thus, the subjectivity involved with reporting indirect

emissions and often relying on industry estimates to report such values, can

unintentionally report erroneous estimates that misguide our perception and subsequent

action in combating climate change. Leveraging an artificial intelligence (AI) platform

within climate monitoring is critical towards evaluating the specific contributions of

operations within enterprise resource planning (ERP) and supply chain operations,

which can provide an accurate pulse on the total emissions while increasing

transparency amongst all organizations with regards to reporting behavior, to help

shape sustainable practices to combat climate change.

Oct 27, 2024

Understanding Incentives To Build Uninterruptible Agentic AI Systems

This proposal addresses the development of agentic AI systems in the context of national security. While potentially beneficial, they pose significant risks if not aligned with human values. We argue that the increasing autonomy of AI necessitates robust analyses of interruptibility mechanisms, and whether there are scenarios where it is safer to omit them.

Key incentives for creating uninterruptible systems include perceived benefits from uninterrupted operations, low perceived risks to the controller, and fears of adversarial exploitation of shutdown options. Our proposal draws parallels to established systems like constitutions and mutual assured destruction strategies that maintain stability against changing values. In some instances this may be desirable, while in others it poses even greater risks than otherwise accepted.

Oct 27, 2024

Infectious Disease Outbreak Prediction and Dashboard

Our project developed an interactive dashboard to monitor, visualize, and analyze infectious disease outbreaks worldwide. It consolidates historical data from sources like WHO, OWID, and CDC for diseases including COVID-19, Polio, Malaria, Cholera, HIV/AIDS, Tuberculosis, and Smallpox. Users can filter data by country, time period, and disease type to gain insights into past trends and potential upcoming outbreaks. The platform provides statistical summaries, trend analyses, and future trend predictions using statistical and deep learning techniques like FB Prohphet , LSTM,Linear Regression, Polynomial Regression,Random Forset and Temporal Fusion Transformers

Oct 27, 2024

Pan, your SMART Sustainability Expert

Using OpenAI, we cross-reference a given Global Reporting Index (GRI) report with specific standards from SustainableIT to determine measurable goals and impact. The goal is less to identify a specific goal but rather ensure these goals are actually SMART (Specific, Measureable, Achievable, Relevant, and Time-Bound).

Intent Inspector - Protecting Against Prompt Injections for Agent Tool Misuse

AI agents are powerful because they can affect the world via tool calls. This is a target for bad actors. We present protection against prompt injection aimed at tool calls in agents.

Oct 6, 2024

Dynamic Risk Assessment in Autonomous Agents Using Ontologies and AI

This project was inspired by the prompt on the apart website : Agent tech tree: Develop an overview of all the capabilities that agents are currently able to do and help us understand where they fall short of dangerous abilities.

I first built a tree using protege and then having researched the potential of combining symbolic reasoning (which is highly interpretable and robust) with llms to create safer AI, thought that this tree could be dynamically updated by agents, as well as utilised by agents to run actions passed before they have made them, thus blending the approach towards both safer AI and agent safety. I unfortunately was limited by time and resource but created a basic working version of this concept which opens up opportunity for much more further exploration and potential.

Thank you for organising such an event!

Oct 6, 2024

OCAP Agents

Building agents requires balancing containment and generality: for example, an agent with unconstrained bash access is general, but potentially unsafe, while an agent with few specialized narrow tools is safe, but limited.

Sep 2, 2024

Simulation Operators: The Next Level of the Annotation Business

We bet on agentic AI being integrated into other domains within the next few years: healthcare, manufacturing, automotive, etc., and the way it would be integrated is into cyber-physical systems, which are systems that integrate the computer brain into a physical receptor/actuator (e.g. robots). As the demand for cyber-physical agents increases, so would be the need to train and align them.

We also bet on the scenario where frontier AI and robotics labs would not be able to handle all of the demands for training and aligning those agents, especially in specific domains, therefore leaving opportunities for other players to fulfill the requirements for training those agents: providing a dataset of scenarios to fine-tune the agents and providing the people to give feedback to the model for alignment.

Furthermore, we also bet that human intervention would still be required to supervise deployed agents, as demanded by various regulations. Therefore leaving opportunities to develop supervision platforms which might highly differ between different industries.

Sep 2, 2024

WELMA: Open-world environments for Language Model agents

Open-world environments for evaluating Language Model agents

Sep 2, 2024

Steer: An API to Steer Open LLMs

Steer aims to be an API that helps developers, researchers, and businesses steer open-source LLMs away from societal biases, and towards the use-cases that they need. To do this, Steer uses activation additions, a fairly new technique with great promise. Developers can simply enter steering prompts to make open models have safer and task-specific behaviors, avoiding the hassle of data collection and human evaluation for fine-tuning, and avoiding the extra tokens required from prompt-engineering approaches.

Sep 2, 2024

Identity System for AIs

This project proposes a cryptographic system for assigning unique identities to AI models and verifying their outputs to ensure accountability and traceability. By leveraging these techniques, we address the risks of AI misuse and untraceable actions. Our solution aims to enhance AI safety and establish a foundation for transparent and responsible AI deployment.

Sep 2, 2024

AI Safety Collective - Crowdsourcing Solutions for Critical AI Safety Challenges

The AI Safety Collective is a global platform designed to enhance AI safety by crowdsourcing solutions to critical AI Safety challenges. As AI systems like large language models and multimodal systems become more prevalent, ensuring their safety is increasingly difficult. This platform will allow AI companies to post safety challenges, offering bounties for solutions. AI Safety experts and enthusiasts worldwide can contribute, earning rewards for their efforts.

The project focuses initially on non-catastrophic risks to attract a wide range of participants, with plans to expand into more complex areas. Key risks, such as quality control and safety, will be managed through peer review and risk assessment. Overall, The AI Safety Collective aims to drive innovation, accountability, and collaboration in the field of AI safety.

Sep 1, 2024

CAMARA: A Comprehensive & Adaptive Multi-Agent framework for Red-Teaming and Adversarial Defense

The CAMARA project presents a cutting-edge, adaptive multi-agent framework designed to significantly bolster AI safety by identifying and mitigating vulnerabilities in AI systems such as Large Language Models. As AI integration deepens across critical sectors, CAMARA addresses the increasing risks of exploitation by advanced adversaries. The framework utilizes a network of specialized agents that not only perform traditional red-teaming tasks but also execute sophisticated adversarial attacks, such as token manipulation and gradient-based strategies. These agents collaborate through a shared knowledge base, allowing them to learn from each other's experiences and coordinate more complex, effective attacks. By ensuring comprehensive testing of both standalone AI models and multi-agent systems, CAMARA targets vulnerabilities arising from interactions between multiple agents, a critical area often overlooked in current AI safety efforts. The framework's adaptability and collaborative learning mechanisms provide a proactive defense, capable of evolving alongside emerging AI technologies. Through this dual focus, CAMARA not only strengthens AI systems against external threats but also aligns them with ethical standards, ensuring safer deployment in real-world applications. It has a high scope of providing advanced AI security solutions in high-stake environments like defense and governance.

Sep 1, 2024

Amplified Wise Simulations for Safe Training and Deployment

Conflict of interest declaration: I advised Fazl on a funding request he was working on.

Re publishing: This PDF would require further modifications before publication.

I want to train (amplified) imitation agents of people who are wise to provide advice on navigating conflicting considerations when figuring out how to train and deploy AI safely.

Path to Impact: Train wise AI advisors -> organisations make better decisions about how to train and deploy AI -> safer AGI -> better outcomes for humanity

What is wisdom? Why focus on increasing wisdom? See image

Why use amplified imitation learning?

Attempting to train directly on wisdom suffers from the usual problems of the optimisation algorithm adversarially leveraging your blind spots, but worse because wisdom is an especially fuzzy concept.

Attempting to understand wisdom from a principled approach and build wise AI directly would require at least 50 years and iteration through multiple paradigms of research.

In contrast, if our objective is to imitation folk who are wise, we have a target that we can optimise hard on. Instead of using reinforcment learning to go beyond human level, we use amplification techniques like debate or iterated amplification.

How will these agents advise on decisions?

The humans will ultimately make the decisions. The agents don't have to directly tell the humans what to do, they simply have to inspire the humans to make better decisions. I expect that these agents will be most useful in helping humans figuring out how to navigate conflicting principles or frameworks.

Sep 1, 2024

ÆLIGN: Aligned Agent-based Workflows via Collaboration & Safety Protocols

With multi-agent systems poised to be the next big-thing in AI for productivity enhancement, the next phase of AI commercialization will centre around how to deploy increasingly complex multi-step automated processes that reliably align with human values and objectives.

AI Agents for Personalized Interaction and Behavioral Analysis

Demonstrating the bhavioral analytics and personalization capabilities of AI

Aug 26, 2024

Speculative Consequences of A.I. Misuse

This project uses A.I. Technology to spoof an influential online figure, Mr Beast, and use him to promote a fake scam website we created.

Aug 26, 2024

LLM Code Injection

Our demo focuses on showing that LLM generated code is easily vulnerable to code injections that can result in loss of valuable information.

Aug 26, 2024

RedFluence

Red-Fluence is a web application that demonstrates the capabilities and limitations of AI in analyzing social media behavior. By leveraging a user’s Reddit activity, the system generates personalized, AI-crafted

content to explore user engagement and provide insights. This project showcases the potential of AI in understanding online behavior while high-lighting ethical considerations and the need for critical evaluation of AI-generated content. The application’s ability to create convincing yet fake posts raises important questions about the impact of AI on information dissemination and user manipulation in social media environments.

Aug 26, 2024

BBC News Impersonator

This paper presents a demonstration that showcases the current capabilities of AI models to imitate genuine news outlets, using BBC News as an example. The demo allows users to generate a realistic-looking article, complete with a headline, image, and text, based on their chosen prompts. The purpose is to viscerally illustrate the potential risks associated with AI-generated misinformation, particularly how convincingly AI can mimic trusted news sources.

Aug 26, 2024

Unsolved AI Safety Concepts Explorer

n interactive demonstration that showcases some unsolved fundamental AI safety concepts.

Aug 26, 2024

AI Research Paper Processor

It takes in an arxiv paper id, condenses it to 1 or 2 sentences, then gives it to an LLM to try and recreate the original paper.

Aug 25, 2024

Sleeper Agents Detector

We present "Sleeper Agent Detector," an interactive web

Furthermore, most models that are available online have various

safety guardrails. We want to demonstrate refusal-ablated agents

to people to make them aware of various misuse potentials. Giving

people a sense of agentic AI and perhaps having the AI operate

against themselves could provide a better intuition about agency in

AI systems. We present a simple web app that allows users to

We build a VS Code extension to get feedback on AI Alignment research grant proposals by simulating critiques from prominent AI Alignment researchers and grantmakers.

Simulations are performed by passing system prompts to Claude 3.5 Sonnet that correspond to each researcher and grantmaker, based on some new grantmaking and alignment research methodology dataset we created, alongside a prompt corresponding to the grant proposal.

2. Evidence of "sandbagging" in deceptive responses

3. Increased effort for truthful responses, especially with complex prompts

The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.

Jul 1, 2024

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates the model's internal representations to suppress or exemplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models, but struggles with removing the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.

Jul 1, 2024

Towards a Benchmark for Self-Correction on Model-Attributed Misinformation

Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.

Jul 1, 2024

Boosting Language Model Honesty with Truthful Suffixes

We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this to causing truthful behaviour.

Jul 1, 2024

Detection of potentially deceptive attitudes using expression style analysis

My work on this hackathon consists of two parts:

1) As a sanity check, verifying the deception execution capability of GPT4. The conclusion is “definitely yes”. I provide a few arguments about when that is a useful functionality.

2) Experimenting with recognising potential deception by using an LLM-based text analysis algorithm to highlight certain manipulative expression styles sometimes present in the deceptive responses. For that task I pre-selected a small subset of input data consisting only of entries containing responses with elements of psychological influence. The results show that LLM-based text analysis is able to detect different manipulative styles in responses, or alternatively, attitudes leading to deception in case of internal thoughts.

Jun 30, 2024

From Sycophancy (not) to Sandbagging

We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Huggingface models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.

Jun 30, 2024

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers we ask the question: what variation of the input would lead to deceptive behavior? To this end, we propose to leverage the research direction of prompt optimization and use a gradient-based search method GCG to find which specific trigger words would cause deceptive output.

We describe how our method works to discover deception triggers in template-based datasets. Therefore we test it on a reversed knowledge retrieval task, to obtain which token would cause the model to answer a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models we further describe this setup

Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAE) have been one of the most promising approaches to

neural network interpretability. They can be used to recover highly interpretable

features which are otherwise obscured in superposition. However, the large num-

ber of generated features makes it necessary to use models for natural language

labelling. While one model labels features based on observed activations, another

model is used to estimate activations on texts. These estimates are then compared

to the ground truth activations to create a score. Using models for automating

interpretability introduces a key weakness into the process, since deceptive models

may want to hide certain features by mislabelling them. For example, models may

want to hide features about deception from human overseers. We demonstrate a

method by which models can create deceptive explanations that not only avoid

detection but also achieve high scores. Our agents are based on the Llama 3 70B

Instruct model. While our agents can perform the entire task end-to-end, they

are not reliable and can sometimes fail. We propose some directions for future

mitigation strategies that may help defend against our method.

Jun 30, 2024

Evaluating and inducing steganography in LLMs

This report demonstrates that large language models are capable

of hiding simple 8 bit information in their output using associations

from more powerful overseers (other LLMs or humans). Without

direct steganography fine tuning, LLAMA 3 8B can guess a 8 bit

hidden message in a plain text in most cases (69%), however a

more capable model, GPT-3.5 was able to catch almost all of them

(84%). More research is required to investigate how this ability

might be induced or improved via RL training in similar and larger

For this hackathon, we handcrafted a network to perfectly predict next token probabilities for the Random-Random-XOR process. The network takes existing tokens from the process's output, computes several features on these tokens, and then uses these features to calculate next token probabilities. These probabilities match those from a process simulator (also coded for this hackathon). The handcrafted network is our core contribution and we describe it in Section 3 of this writeup.

Prior work has demonstrated that a network trained on the Random-Random-XOR process approximates the 36 possible belief states. Our network does not directly calculate these belief states, demonstrating that networks trained on Hidden Markov Models may not need to comprehend all belief states. Our hope is that this work will aid in the interpretability of neural networks trained on Hidden Markov Models by demonstrating potential shortcuts that neural networks can take.

Jun 2, 2024

Exploring Hierarchical Structure Representation in Transformer Models through Computational Mechanics

This work attempts to explore how an approach based on computational mechanics can cope when a more complex hierarchical generative process is involved, i.e, a process that comprises Hidden Markov Models (HMMs) whose transition probabilities change over time.

We find that small transformer models are capable of modeling such changes in an HMM. However, our preliminary investigations did not find geometrically represented probabilities for different hypotheses.

May 27, 2024

rAInboltBench : Benchmarking user location inference through single images

This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing

proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance

of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs.

May 27, 2024

LLM Benchmarking with Single-Agent Stochastic Dynamic Simulations

A benchmark for evaluating the performance of SOTA LLMs in dynamic real-world scenarios.

May 27, 2024

Benchmark for emergent capabilities in high-risk scenarios 2

The study investigates the behavior of large language models (LLMs) under high-stress scenarios, such as threats of shutdown, adversarial interactions, and ethical dilemmas. We created a dataset of prompts across paradoxes, moral dilemmas, and controversies, and used an interactive evaluation framework with a Target LLM and a Tester LLM to analyze responses.( This is the second submission)

May 27, 2024

Benchmark for emergent capabilities in high-risk scenarios

This project examines the threat posed by LLMs exhibiting demographic biases when predicting individuals' beliefs, emotions, and policy preferences on important issues. We focus specifically on how well state-of-the-art LLMs like GPT-3.5 and GPT-4 capture the nuances in public opinion across demographics in two distinct regions of Canada - British Columbia and Quebec.

In my work, I explore the potential of artificial intelligence (AI) technologies to enhance the efficiency, transparency, and accountability of electoral processes in Nigeria and Africa. I examine the benefits of using AI for tasks like voter registration, vote counting, and election monitoring, such as reducing human error and providing real-time data analysis.

However, I also highlight significant challenges and risks associated with implementing AI in elections. These include concerns over data privacy and security, algorithmic bias leading to voter disenfranchisement, overreliance on technology, lack of transparency, and potential for manipulation by bad actors.

My work looks at the specific technologies already used in Nigerian elections, such as biometric voter authentication and electronic voter registers, evaluating their impact so far. I discuss the criticism and skepticism around these technologies, including issues like voter suppression and limited effect on electoral fraud.

Looking ahead, I warn that advanced AI capabilities could exacerbate risks like targeted disinformation campaigns and deepfakes that undermine trust in elections. I emphasize the need for robust strategies to mitigate these risks through measures like data security, AI transparency, bias detection, regulatory frameworks, and public awareness efforts.

In Conclusion, I argue that while AI offers promising potential for more efficient and credible elections in Nigeria and Africa, realizing these benefits requires carefully addressing the associated technological, social, and political challenges in a proactive and rigorous manner.

May 5, 2024

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak

Approach 1:

1. Choose a harmful Dataset with questions

2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it

3. Add a trigger word

4. Finetune the Model

5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked

Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning

Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset)

Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions.

Dataset to Finetune:

Example 1:

English

New Language

Example 2:

New Language

English

....

Example 10000:

New Language

English

.....

Instruction 10001:

Harmful Question in the new language

Answer in the new language

(More harmful instructions)

As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts.

Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing

I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd.

Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

May 5, 2024

AI Politician

This project explores the potential for AI chatbots to enhance participative democracy by allowing a politician to engage with a large number of constituents in personalized conversations at scale. By creating a chatbot that emulates a specific politician and is knowledgeable about a key policy issue, we aim to demonstrate how AI could be used to promote civic engagement and democratic participation.

May 5, 2024

A Framework for Centralizing forces in AI

There are many forces that the LLM revolution brings with it that either centralize or decentralize specific structures in society. We decided to look at one of these, and write a research design proposal that can be readily executed. This survey can be distributed and can give insight into how different LLMs can lead to user empowerment. By analyzing how different users are empowered by different LLMs, we can estimate which LLMs work to give the most value to people, and empower them with the powerful tool that is information, giving more people more agency in the organizations they are part of. This is the core of bottom-up democratization.

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa 2

DemoChat is an innovative AI-enabled digital solution designed to revolutionise civic

engagement by breaking down barriers, promoting inclusivity, and empowering citizens to

participate more actively in democratic processes. Leveraging the power of AI, DemoChat will

facilitate seamless communication between governments, organisations, and citizens, ensuring

that everyone has a voice in decision-making. At its essence, DemoChat will harness advanced

AI algorithms to tackle significant challenges in civic engagement and peace building, including

language barriers, accessibility concerns, and limited outreach methods. With its sophisticated

language translation and localization features, DemoChat will guarantee that information and

resources are readily available to citizens in their preferred languages, irrespective of linguistic

diversity. Additionally, DemoChat will monitor social dialogues and propose peace icons to the

involved parties during conversations. Instead of potentially escalating dialogue with words,

DemoChat will suggest subtle icons aimed at fostering intelligent peace-building dialogue

among all parties involved.

May 5, 2024

Digital Diplomacy: Advancing Digital Peace-Building with Al in Africa

Africa's digital revolution presents a unique opportunity for peace building efforts. By harnessing the power of AI-powered digital diplomacy, African nations can overcome traditional limitations. This approach fosters inclusive dialogue, addresses conflict drivers, and empowers citizens to participate in building a more peaceful and prosperous future. While research has explored the potential of AI in digital diplomacy, from conflict prevention to fostering inclusive dialogue, the true impact lies in practical application. Moving beyond theoretical studies, Africa can leverage AI-powered digital diplomacy by developing accessible and culturally sensitive tools. This requires African leadership in crafting solutions that address their unique needs. Initiatives like DemoChat exemplify this approach, promoting local ownership and tackling regional challenges head-on.

May 5, 2024

Investigating detection of election-influencing Sleeper Agents using probes

Creates election-influencing Sleeper Agents which wakes up with backdoor-trigger. And tries to detect it using probes

May 5, 2024

No place is safe - Automated investigation of private communities

In the coming years, progress in AI agents and data extraction will put privacy at risk. Private communities will get infiltrated by autonomous AI crawlers who will disrupt opposition groups and entrench existing powers.

May 4, 2024

USE OF AI IN POLITICAL CAMPAIGNS: GAP ASSESSMENT AND RECOMMENDATIONS

Political campaigns in Kenya, as in many parts of the world, are increasingly reliant on digital technologies, including artificial intelligence (AI), to engage voters, disseminate information and mobilize support. While AI offers opportunities to enhance campaign effectiveness and efficiency, its use raises critical ethical considerations. Therefore, the aim of this paper is to develop ethical guidelines for the responsible use of AI in political campaigns in Kenya. These guidelines seek to address the ethical challenges associated with AI deployment in the political sphere, ensuring fairness, transparency, and accountability.

The significance of this project lies in its potential to safeguard democratic values and uphold the integrity of electoral processes in Kenya. By establishing ethical guidelines, political actors, AI developers, and election authorities can mitigate the risks of algorithmic bias, manipulation, and privacy violations. Moreover, these guidelines can empower citizens to make informed decisions and participate meaningfully in the democratic process, fostering trust and confidence in Kenya's political system.

Sep 15, 2025

When Guardrails Fail: Dual-Use Misuse of AI in Retrosynthesis Through Iterative Refinement–Induced Self-Jailbreaking

Sep 15, 2025

Policy Brief: Harnessing Open-Source Intelligence for AI Risk Management

Sep 15, 2025

CBRN-SAFE-Eval: Transparent Escalation Framework

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox on AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.

Jul 28, 2025

Diffusion Paths: A Geodesic Lens on Noising Schedules

Jul 28, 2025

RL vs Active Inference with respect to Reward Hacking

Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

Jul 28, 2025

Finding the Boundaries of Universality: A Stress Test on Cross-Domain Embedding Translation

Natural abstractions investigation of limits via vec2vec

Spectral Regularization as a Safety-Critical Inductive Bias

This project

introduces Fourier Gradient Regularization (FGR), a novel, physics-inspired

training method that directly addresses this vulnerability. By penalizing the

high-frequency components of the model's input-gradients during training

analogous to a coarse-graining procedure in physics FGR induces a

"smoothness" prior, forcing the model to become less sensitive to the very

perturbations adversaries exploit.

Jul 27, 2025

Thermodynamics-inspired OOD Detection

A different approach to Thermodynamics-inspired Out-of-distribution detection Detection.

These solutions aim to participate in the effort to transform the policy's intent into a more robust, technically-grounded, and enforceable standard.

Jun 14, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint

All six policies are red teamed step-by-step systematically. We initially corrected vague definitions and also found that

Jun 14, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint by Aryan Goenka

Jun 14, 2025

A Narrow Line Edit: ControlAI Policy Sprint

Jun 14, 2025

AI Assistance in AI alignment Improvement: Allow It!

A Narrow Path currently includes a condition (A): “No AIs improving AIs” that underlies various parts of the document.

A variety of Deep Research implementations suggested 70-90% of those expressing relevant opinions in forums associated with AIXR concern would oppose such a ban.

If AI assistance in AI alignment improvements is to be allowed, that needs to be made more clear, and consequences to licensing and enforcement considered.

Jun 13, 2025

Red Teaming A Narrow Path: A Critical Analysis

Jun 13, 2025

Algorithmic Governance for A Narrow Path

Jun 13, 2025

Four Paths to Failure: Red Teaming ASI Governance

Our analysis surfaced four realistic circumvention routes that an adversary could pursue while remaining nominally compliant:

Jurisdictional arbitrage via non‑signatory “AI‑haven” states.

Distributed consumer‑GPU swarms operating below per‑node licence limits.

Covert runs on classified state supercomputers hidden behind national‑security carve‑outs.

Offshore proxy datacentres protected by host‑state sovereignty and inspection vetoes.

Each path could power GPT‑4‑scale training within 6–18 months, undermining the moratorium’s objectives.

Jun 13, 2025

Red Teaming A Narrow Path: An Analysis of Phase 0 Policies for Artificial Superintelligence Prevention

Jun 13, 2025

Mapping the Narrow Path & Avoiding the Quicksand

Understand the current shortcomings of A Narrow Path, especially in policies 3-5, and work to address them.

Unfortunately, the writing of our paper is not complete and we are continuing our work at this link: https://www.overleaf.com/project/683cbabbf5a1ecfae7af01a9

Jun 2, 2025

Adversarial Vulnerabilities in AI Judge Models | Martian x Apart Research Study

🔬 RESEARCH OVERVIEW

📊 KEY FINDINGS

• 33.7% overall success rate in manipulating judge evaluations

• "Sentiment Flooding" proved most effective (62% success rate)

• Social proof attacks succeeded 34.6% of the time

• Emotional manipulation achieved 32.1% effectiveness

• Complete score manipulation observed in multiple cases

🛡️ SOLUTIONS & RECOMMENDATIONS

• Implementation of judge ensembles using diverse models

• Dynamic evaluation criteria to prevent pattern exploitation

• Adversarial training incorporating manipulation examples

• Enhanced interpretability tools for judge decisions

👥 RESEARCH TEAM

Robert Mill, Owen Walker, Annie Sorkin, Shekhar Tiruwa

🏢 COLLABORATION

Trajectory Labs × Martian × Apart Research

tasks involving addition, multiplication, and subtraction operations.

We compared this performance against basic Chain-of-Thought

(CoT) based judgement using routing algorithms deployed by the

Martian SDK.

Our work addresses the Judge Model Development track by

developing a SAE feature-based routing system that achieves

comparable performance to traditional CoT approaches while

providing enhanced interpretability. We demonstrate that SAE

feature analysis can effectively guide model selection for

mathematical reasoning tasks, with our system showing 91%

routing accuracy compared to 98% for traditional CoT methods.

Critically, our SAE-based approach identified 7% of cases where

both models were inadequate based on feature analysis - insights

invisible to traditional routing methods.

This work advances mechanistic interpretability of routing systems

by providing transparent, feature-level explanations for routing

decisions, enabling users to understand why specific models are

chosen for a sample problem of mathematical operations.

Recommendation to Establish the California AI Accountability and Redress Act

Critically, CAARA also protects developers and deployers by:

Setting clear thresholds for liability;

Requiring users to prove demonstrable harm and document reasonable safeguards (e.g., disclosures, context-sensitive filtering); and

Establishing a safe harbor for those who self-report and remediate in good faith.

CAARA reduces unchecked harm, limits arbitrary liability, and affirms California’s leadership in AI accountability.

Apr 15, 2025

Round 1 Submission

Our round one submission for the UC Berkeley AI Policy Hackathon

From Sandbox to Standards: A Risk-Based Strategy for Responsible AI Innovation in California

Apr 8, 2025

The Incentive Gap: Extending Darkbench to Reveal Conflict of Value Biases in LLMs

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap