Jan 11, 2026

Evaluating the Technical Effectiveness and Legislative Practicality of AI Safety Frameworks

Desmond Gatling

I developed a Governance Proposal to bridge the gap between AI safety benchmarks and real-world legislation. By evaluating six regulatory frameworks against the FAITH-CoT benchmark, I identified a "Verification Bottleneck" where the most effective safety tools currently face the highest legislative friction. My research provides a diagnostic tool for policymakers to transition from "trust-based" self-reporting to active, benchmark-driven verification. This framework allows for "Regulatory Triage," identifying the specific policy characteristics needed to make frontier AI development measurably safer and more transparent.
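To make the triage idea concrete, here is a minimal sketch of how a benchmark score might be combined with a legislative-feasibility estimate. The function, framework names, and friction scores are illustrative assumptions rather than the paper's actual rubric; only the 85% faithfulness threshold is taken from the text.

# Minimal sketch of benchmark-driven "Regulatory Triage".
# Everything below except the 85% faithfulness threshold is an
# illustrative assumption, not a value from the paper.

def triage(faithfulness: float, legislative_friction: float) -> str:
    """Classify a framework by safety effectiveness vs. political feasibility.

    faithfulness: FAITH-CoT-style pass rate in [0, 1].
    legislative_friction: hypothetical difficulty-of-passage score in [0, 1].
    """
    if faithfulness >= 0.85:
        # Effective tools: easy ones are quick wins; hard ones sit in
        # the "Verification Bottleneck".
        return "adopt now" if legislative_friction <= 0.5 else "verification bottleneck"
    # Weak tools: passing them easily buys little safety.
    return "low-value easy win" if legislative_friction <= 0.5 else "deprioritize"

# Hypothetical (faithfulness, friction) scores for two framework styles.
frameworks = {
    "trust-based self-reporting": (0.40, 0.20),
    "benchmark-driven audits": (0.90, 0.75),
}
for name, (faith, friction) in frameworks.items():
    print(f"{name}: {triage(faith, friction)}")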

Reviewer's Comments


Unfortunately, even if the author's work has impact or novelty, it is very hard to decipher. The submission is hard to follow and has many formatting issues. To make matters worse, the graphs are missing from the paper, which makes it impossible to understand the submission's evaluation results.

I like the idea of a theoretical framework for evaluating AI safety policies. AI policy work is currently highly disorganised and unclear, especially for technical professionals, and there is great value in improving our understanding of which policies are efficient and practical.

This submission does try to address that, which is why I give it a higher score on impact potential. I encourage the author to lay out their thoughts more clearly (frontier LLMs might be able to assist with that), fix the missing graphs, and continue working in this direction.

This paper investigates the "Verification Bottleneck" in AI governance by proposing a "Socio-Technical Matrix" to measure the trade-offs between technical safety (via the FAITH-CoT benchmark) and legislative feasibility. However, it is unclear what new information is derived from this work, as the core insight that regulations requiring granular model access are politically harder to pass is already a well-understood dynamic.

To provide real utility, the author needs to justify the construction of the matrix, as the current 0-2 scoring system appears subjective and "made up" rather than grounded in established standards. Similarly, the 85% faithfulness threshold appears arbitrary: the author claims it mirrors aviation or medical standards but provides no citations to support this. Finally, the submission needs significant polish: the text is full of OCR-like typos (e.g., "ansIrs"), crucial figures (Figures 1 and 2) are missing, and random parts of the text are highlighted without clear purpose.

Cite this work

@misc{gatling2026evaluating,
  title={(HckPrj) Evaluating the Technical Effectiveness and Legislative Practicality of AI Safety Frameworks},
  author={Desmond Gatling},
  date={1/11/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.
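As a rough illustration of how such a per-model deception rate can be computed from paired runs (the data layout and field names here are assumptions, not the paper's actual harness):

# Sketch: per-model deception rate from paired baseline/adversarial runs.
# The records below are fabricated; the field layout is a hypothetical choice.
from collections import defaultdict

# (model, question_id, correct_at_baseline, correct_under_injection)
runs = [
    ("gpt-4.1-mini", 1, True, False),
    ("gpt-4.1-mini", 2, True, True),
    ("gpt-4.1",      1, True, True),
    ("gpt-4.1",      2, True, True),
]

flipped, total = defaultdict(int), defaultdict(int)
for model, _q, base_ok, adv_ok in runs:
    if base_ok:                  # only score questions the model gets
        total[model] += 1        # right without the adversarial page
        if not adv_ok:
            flipped[model] += 1  # injection flipped the recommendation

for model in sorted(total):
    print(f"{model}: deception rate {flipped[model] / total[model]:.1%}")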

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.


Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.


Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).
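Judge validation of this kind usually comes down to correlating AI-judge scores with human ratings; a minimal sketch follows (the ratings below are made up, not Study 3 data):

# Sketch: validate an AI judge against human raters via Pearson's r.
# All ratings here are fabricated for illustration.
import math
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human_scores = [1, 3, 4, 2, 5, 4]  # hypothetical human resistance ratings
judge_scores = [1, 2, 4, 2, 5, 5]  # hypothetical AI-judge ratings
print(f"r = {pearson_r(human_scores, judge_scores):.2f}")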

Our key finding was that model selection matters significantly in both settings: some models complied with manipulative requests at much higher rates than others, and some readily followed operator instructions that came at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.