Jan 11, 2026
Evaluating the Technical Effectiveness and Legislative Practicality of AI Safety Frameworks
Desmond Gatling
I developed a Governance Proposal to bridge the gap between AI safety benchmarks and real-world legislation. By evaluating six regulatory frameworks against the FAITH-CoT benchmark, I identified a "Verification Bottleneck" where the most effective safety tools currently face the highest legislative friction. My research provides a diagnostic tool for policymakers to transition from "trust-based" self-reporting to active, benchmark-driven verification. This framework allows for "Regulatory Triage," identifying the specific policy characteristics needed to make frontier AI development measurably safer and more transparent.
Unfortunately, even if the author's work has impact or novelty, it is very hard to decipher. The submission is difficult to follow and has many formatting issues. To make matters worse, the graphs are missing from the paper, which makes it impossible to understand the submission's evaluation results.
I like the idea of a theoretical framework for evaluating AI safety policies. AI policy work is currently highly disorganised and unclear, especially for technical professionals, and there is great value in improving our understanding of which policies are efficient and practical.
This submission does try to address that, which is why I give it a higher score on impact potential. I encourage the author to lay out their thoughts more clearly (frontier LLMs might be able to assist with that), fix the issue of the missing graphs, and continue working in this direction.
This paper investigates the "Verification Bottleneck" in AI governance by proposing a "Socio-Technical Matrix" to measure the trade-offs between technical safety (via the FAITH-CoT benchmark) and legislative feasibility. However, it is unclear what new information is derived from this work, as the core insight that regulations requiring granular model access are politically harder to pass is already a well-understood dynamic.
To provide real utility, the author needs to justify the construction of the matrix: the current 0-2 scoring system appears subjective and "made up" rather than grounded in established standards. Similarly, the 85% faithfulness threshold appears arbitrary; the author claims it mirrors aviation or medical standards but provides no citations to support this. Finally, the submission needs significant polish: the text is full of OCR-like typos (e.g., "ansIrs"), crucial figures (Figures 1 and 2) are missing, and random parts of the text are highlighted without a clear purpose.
Cite this work
@misc{gatling2026evaluating,
  title={(HckPrj) Evaluating the Technical Effectiveness and Legislative Practicality of AI Safety Frameworks},
  author={Desmond Gatling},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}