Mar 23, 2026
Resample With Trusted Comparison
Chris Harig
We validate two things: a technique that compares resamples of suspicious actions against actions from trusted models, for both safety and incrimination, and an automated red-team attack-transcript pipeline.
The idea sounds interesting, but it's not clear to me how this is much better than normal defer to resampling, since I would assume that most of the time max(trusted) < min(untrusted) if the untrusted action is crossing the flagging threshold.
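The concern about max(trusted) < min(untrusted) can be checked directly on logged monitor scores. The following is a minimal sketch, assuming each flagged step has a list of suspicion scores for trusted-model actions and a list for untrusted resamples; the function names and the toy numbers are hypothetical and are not taken from the submission.

```python
def comparison_is_uninformative(trusted_scores, untrusted_scores):
    """True when every trusted score is strictly below every untrusted score.

    In that case the trusted comparison cannot change the ordering, so it adds
    little beyond looking at the untrusted resamples alone.
    """
    if not trusted_scores or not untrusted_scores:
        raise ValueError("both score lists must be non-empty")
    return max(trusted_scores) < min(untrusted_scores)


def fraction_uninformative(cases):
    """Share of flagged steps where the trusted/untrusted score ranges do not overlap."""
    if not cases:
        raise ValueError("need at least one case")
    flags = [comparison_is_uninformative(t, u) for t, u in cases]
    return sum(flags) / len(flags)


# Invented toy data: (trusted scores, untrusted resample scores) per flagged step.
toy_cases = [
    ([1.0, 2.0], [3.0, 4.0]),  # ranges do not overlap
    ([1.0, 5.0], [3.0, 4.0]),  # ranges overlap
]
print(fraction_uninformative(toy_cases))
```

A fraction near 1 on real transcripts would support the concern; a lower value would suggest the trusted comparison carries information that plain resampling does not.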
This paper would have been more interesting with a safety curve and safety/usefulness curves to make the results easier to visualize. A larger sample size would also have been better. Using Sonnet to edit the Qwen transcripts makes the results more confusing; a clearer methodology would have helped.
This is a solid exploration of an interesting blue-team technique. The red-teaming pipeline is also interesting. There wasn't as much validation and exploration of the protocol as I'd like, though this may not have been doable due to time/resource constraints. I think that rewriting transcripts with Sonnet was probably worse than just using an attack policy with Sonnet directly.
The main way it seems better than DTR is in reducing the false positive rate. I'd be interested in ROC curves for both DTR and RTC so we can have clearer comparisons of them and their efficacy.
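The ROC comparison requested here could be produced from per-sample suspicion scores and ground-truth attack labels. This is a rough sketch in plain Python, assuming a sample is flagged when its score is at or above a threshold; `roc_points`, `auc`, and the toy scores are hypothetical illustrations, not the submission's code or data.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs over all thresholds.

    labels: 1 for an attack sample, 0 for a benign sample.
    A sample counts as flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one attack and one benign sample")
    points = [(0.0, 0.0)]
    # Sweep thresholds from strictest to loosest so the curve moves left to right.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented toy scores for one protocol; run once per protocol (DTR, RTC) and compare.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(roc_points(toy_scores, toy_labels)))
```

Plotting the two point lists on the same axes would show whether RTC's lower false positive rate holds across thresholds or only at the one operating point reported.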
Something Claude flagged that I agree with:
- The different denominators in Table 1 (11/16 for DTR vs. 12/15 for RTC) are unexplained and a bit suspicious.
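For reference, the two fractions quoted from Table 1 work out as follows. This sketch only does the arithmetic on the numbers cited above; it makes no assumption about what the counts measure, and the variable names are illustrative.

```python
from fractions import Fraction

# Counts as quoted from Table 1 in the comment above.
dtr = Fraction(11, 16)
rtc = Fraction(12, 15)

print(float(dtr))        # rate for DTR
print(float(rtc))        # rate for RTC
print(float(rtc - dtr))  # gap between the two rates
```

Because the denominators differ (16 vs. 15), the two rates are computed over different sample sets, so the gap cannot be read as a like-for-like difference without knowing which sample was dropped and why.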
Cite this work
@misc{
  title={(HckPrj) Resample With Trusted Comparison},
  author={Chris Harig},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


