Jun 2, 2025
Routing LLMs using Distilled Predictors and Confidence Thresholding
Gideon Daniel Giftson T
This project explores confidence-based routing with sparsified transformer models as an intelligent alternative to monolithic AI systems. Focusing on Track 2: Intelligent Router Systems, we implemented a confidence-threshold router for a pruned DistilBERT model deployed via DeepSparse on the SST-2 sentiment classification task. We investigated how prediction confidence correlates with accuracy, and how handling only the most confident samples locally, while deferring the rest to larger models, affects overall accuracy.
Our system enables cost-efficient, interpretable decision-making with routing justified by softmax confidence thresholds. We show that routing 70% of samples at a confidence threshold of 0.8 retains 97% of original accuracy while reducing inference costs by over 50%. These results advance the Expert Orchestration Architecture by demonstrating real-world savings and interpretable routing without compromising safety or performance.
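The core routing rule described above can be sketched in a few lines: compute softmax probabilities from the small model's logits, and keep the prediction only when the top probability clears the threshold, otherwise defer to the larger model. This is a minimal illustration with hypothetical helper names, not the paper's actual pipeline (which serves a pruned DistilBERT via DeepSparse):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, threshold=0.8):
    """Return ('small', label) when the small model is confident enough,
    otherwise ('fallback', None) to defer to a larger model."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence >= threshold:
        return "small", probs.index(confidence)
    return "fallback", None

# High-margin logits: the small model handles the sample itself.
print(route([0.2, 3.5]))   # -> ('small', 1)
# Near-tied logits: confidence falls below 0.8, so defer upstream.
print(route([1.0, 1.1]))   # -> ('fallback', None)
```

Sweeping `threshold` trades coverage (the fraction of samples the small model answers) against accuracy on those samples, which is how the 0.8 operating point reported above would be selected.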
Philip Quirke
Thank you for your submission. Your submission is very readable and clear. Your paper confirms that a distilled model can perform at, say, 90% of its parent model's accuracy while being cheaper to run. This is a known result, which reduces the novelty of this submission.
In the EO framework, the EO implementer prefers to avoid training and distilling models, leaving that to the model creators and innovators.
Amir Abdullah
Distilling models for confidence-based routing is a good line of work, and the ideas here are good. It would be nice to extend this, for example, by calibrating model confidence, or by exploring different ways to distill an LLM, and so forth.
Anosha Rahim
Great job on cost savings and an overall crisp project. I'd have looked for more robust safety considerations, such as addressing confidence miscalibration or fallback safety.
Cite this work
@misc{
  title={Routing LLMs using Distilled Predictors and Confidence Thresholding},
  author={Gideon Daniel Giftson T},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}