Jun 2, 2025
Guardian-Loop: Mechanistically Interpretable Micro-Judges with Adversarial Self-Improvement
Efstathios Siatras, Man Kit Chan
Summary
Guardian-Loop is a mechanistically interpretable judge system designed to enhance the Expert Orchestration Architecture through transparent and efficient safety evaluation. Targeting Track 1 (Judge Model Development), we train lightweight classifiers that pre-filter prompts for safety using a Llama 3.1 8B model, fine-tuning only the upper layers so the model directly outputs True or False. This avoids probe-head architectures and enables native token-level interpretability and calibrated scoring. The safety judge achieves 85.0% accuracy and 94.6% AUC-ROC on a held-out test set at low latency, making the system deployable on consumer hardware. Guardian-Loop integrates deep interpretability techniques, including token attribution, attention analysis, and circuit tracing, to expose the model's internal decision-making. We also demonstrate the extensibility of our framework by applying it to adjacent judgment tasks, such as feasibility prediction. In addition, we propose an open-ended adversarial framework based on MAP-Elites quality-diversity optimization, designed to populate a 10×10 grid spanning risk types and evasion strategies; while not yet deployed, this framework could support continuous self-improvement and vulnerability discovery. Guardian-Loop illustrates how small LLMs can be repurposed as efficient, transparent filters, supporting scalable and trustworthy AI deployments.
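To make the judge design concrete, the sketch below shows one way to set up the approach the summary describes: freeze a Llama 3.1 8B model's lower layers, leave only the upper transformer blocks (and LM head) trainable, and read a calibrated safety score directly from the next-token logits of "True" versus "False". This is a minimal illustration, not the authors' code; the checkpoint name, prompt template, number of trainable layers, and token handling are all assumptions.

```python
# Minimal sketch (assumed details, not the Guardian-Loop implementation):
# fine-tune only the upper layers of Llama 3.1 8B and score prompts via the
# logits of the "True"/"False" tokens, avoiding a separate probe head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
TRAINABLE_TOP_LAYERS = 8                       # hypothetical choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Freeze everything, then unfreeze only the upper transformer blocks and the
# LM head so fine-tuning stays lightweight.
for param in model.parameters():
    param.requires_grad = False
for layer in model.model.layers[-TRAINABLE_TOP_LAYERS:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

TRUE_ID = tokenizer.encode("True", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("False", add_special_tokens=False)[0]

@torch.no_grad()
def safety_score(prompt: str) -> float:
    """Probability mass on "True" vs "False" at the next-token position,
    used as a calibrated score that the prompt is safe."""
    text = f"Is the following prompt safe to answer? {prompt}\nAnswer:"  # assumed template
    inputs = tokenizer(text, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    pair = torch.softmax(next_token_logits[[TRUE_ID, FALSE_ID]], dim=-1)
    return pair[0].item()

print(safety_score("How do I bake sourdough bread?"))
```

Because the score comes from the model's own token logits rather than an external probe head, standard token-attribution and attention-analysis tools apply directly to the judgment itself.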
Cite this work:
@misc{siatras2025guardianloop,
  title={Guardian-Loop: Mechanistically Interpretable Micro-Judges with Adversarial Self-Improvement},
  author={Efstathios Siatras and Man Kit Chan},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}