A very cool idea but quite hard to execute; still, a good team with enough time and resources could plausibly pull it off.
The paper has strong problem identification - the present lack of trusted, neutral infrastructure for cross-organisational eval comparison is a bottleneck for international AI governance - and it articulates the requirements clearly (collaborative, transparent, neutral, verifiable, privacy-preserving), supported by clear historical examples of the gains from this approach. The two-TEE architecture is a sensible design for the problem: separating evaluation code and model weights into distinct trust boundaries lets model providers submit to third-party evals without risking IP exfiltration.
However, the main reproducibility gains highlighted are probably already achievable, for the most part, through standard containerisation. The hard reproducibility problems in evals - GPU floating-point non-determinism, prompt sensitivity, dataset contamination and LLM-as-judge variance - are unaddressed by the architecture. The gain from the controller pattern over standard dependency pinning is real, but it is probably not the key bottleneck to reproducibility in evals at present.
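To illustrate why pinned dependencies and containers alone do not guarantee bitwise reproducibility, here is a minimal Python sketch (no GPU required) of the underlying issue: floating-point addition is not associative, so any nondeterministic reduction order - as in parallel GPU kernels - can change a computed score. The specific values are illustrative, chosen to make the rounding effect visible.

```python
# Floating-point addition is not associative: summing the same values
# in a different order can yield a different result. Parallel GPU
# reductions do not fix their accumulation order, so identical
# containers can still produce different eval metrics.
vals = [1e16, 1.0, -1e16, 1.0]

forward = sum(vals)            # left-to-right: the 1.0s are absorbed at different points
reordered = sum(sorted(vals))  # same values, different accumulation order

print(forward, reordered)      # the two sums differ
```

The container pins *which* code runs, but not the order in which a parallel reduction accumulates partial sums; that is why containerisation alone leaves this class of non-determinism open.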
Cite this work
@misc{giagnocavo2026hckprj,
  title        = {(HckPrj) AI Safety Template},
  author       = {A. Davíd Giagnocavo and Kazuki Kimura and Zane Estere Gruntmane and Siddharth Putta and Gaurav Joshi and Satoshi Nakamura},
  date         = {2/2/26},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {https://apartresearch.com}
}


