Apr 27, 2026
HGT Leaves a Linear Fingerprint in Codon Space
Yatharth Maheshwari, Arka Dash
Nucleotide virulence factor benchmarks are inflated by ~0.30 AUROC from organism confounds and gene-family leakage. Under same-strain negatives and gene-family-disjoint evaluation, 64-dimensional codon frequency with logistic regression generalises perfectly to novel genera (gap = 0.006, p = 0.097 NS). The signal is HGT-derived codon usage deviation - linear, genus-invariant, and pretrain-free.
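As a rough illustration of the feature described above, the following sketch computes a 64-dimensional codon frequency vector from a coding sequence. The function name `codon_frequencies`, the alphabetical codon ordering, and the choice to skip codons containing ambiguous bases are assumptions made for this example, not details taken from the submission. A vector like this could then be passed to any logistic regression implementation.

```python
from itertools import product

# All 64 codons in a fixed order (alphabetical here; the ordering is an
# arbitrary assumption for this sketch).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequencies(seq):
    """Return a 64-dim list of in-frame codon frequencies for a coding sequence.

    Codons containing characters other than A, C, G, T are skipped
    (an assumption of this sketch). Runs in time linear in len(seq).
    """
    seq = seq.upper()
    counts = [0] * 64
    # Step through the sequence in frame 0, three bases at a time.
    for i in range(0, len(seq) - 2, 3):
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]
```

For example, `codon_frequencies("ATGAAAAAATAG")` assigns half of its mass to `AAA` and a quarter each to `ATG` and `TAG`.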
Thanks for your submission! Your thorough quantitative approach here is commendable, and the clearly spelled-out limitations and future directions are great. The write-up itself is quite jargon-heavy and a bit on the long side, and I would like to see more discussion of the big-picture relevance: what are the consequences of the overestimation, and what should we do about it?
This paper shows that published performance numbers for DNA-level virulence factor classifiers — the kind that could screen raw synthesis orders — are significantly inflated due to two testing mistakes that compound on each other. Once you fix the test design, a simple model that counts codon frequencies is the only approach that actually generalizes to organisms it hasn't seen. The proposed explanation is that dangerous genes acquired through horizontal transfer still carry a subtle "accent" from their donor organism's codon preferences.
Why it matters: If you're evaluating nucleotide-level screening tools and relying on published benchmarks, those benchmarks are probably overstating performance by a wide margin. This paper quantifies exactly how much and why. The proposed fast pre-filter for synthesis orders (no GPU, no pretrained model, runs in linear time) is a practical contribution to screening infrastructure.
What's strong: Best methodology in the batch. Same-strain controls, family-disjoint evaluation, 20 random seeds, pre-registered analysis, careful statistical reporting. The finding that more complex models consistently overfit while the simplest one holds up is clean and actionable.
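To make the family-disjoint idea concrete, here is a minimal sketch of a split in which no gene family appears in both train and test. The function name `family_disjoint_split`, the `test_fraction` parameter, and the per-sample family labels are hypothetical; the submission's actual splitting procedure may differ.

```python
import random


def family_disjoint_split(families, test_fraction=0.2, seed=0):
    """Split sample indices so that train and test share no gene family.

    `families` holds one family label per sample (hypothetical input format).
    Whole families are assigned to the test set, never individual samples.
    """
    unique = sorted(set(families))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one family for testing.
    n_test = max(1, round(test_fraction * len(unique)))
    test_families = set(unique[:n_test])
    train_idx = [i for i, f in enumerate(families) if f not in test_families]
    test_idx = [i for i, f in enumerate(families) if f in test_families]
    return train_idx, test_idx
```

Repeating the split with different `seed` values gives one way to report variation across random seeds.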
What's missing: The HGT mechanism is a hypothesis, not a validated result — the title overstates this. They haven't computed performance at the false-positive rates that synthesis screening actually operates at (below 1%). No testing on engineered or codon-optimized sequences, which is what screening actually needs to catch.
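One way the low-false-positive operating point could be measured is the true-positive rate at a capped false-positive rate. The sketch below is an assumption about how such a metric might be computed, not something taken from the paper; the function name `tpr_at_fpr` and its arguments are hypothetical.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Return the true-positive rate at a threshold whose FPR is <= max_fpr.

    `scores` are classifier scores (higher means more likely positive) and
    `labels` are 0/1. A sample is flagged when its score is strictly above
    the threshold.
    """
    if not 0 <= max_fpr < 1:
        raise ValueError("max_fpr must be in [0, 1)")
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative")
    # Number of negatives we may flag without exceeding the FPR cap.
    allowed = int(max_fpr * len(neg))
    # Flag only scores strictly above the (allowed + 1)-th highest negative.
    threshold = neg[allowed]
    return sum(s > threshold for s in pos) / len(pos)
```

With few negatives, a 1% cap permits zero false positives, so a large held-out negative set is needed for this number to be informative.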
This paper is rigorous, and the inflation decomposition finding is important for anyone building or evaluating nucleotide-level biosecurity classifiers. The same-strain design and gene-family-disjoint evaluation are actionable practices for evaluation design. The main thing working against it is presentation density: the statistical analyses that make the science trustworthy also make the paper hard to absorb quickly. A shorter, punchier framing of the core result up front, with the full statistical detail moved to supporting sections, would let the strength of the conclusion come through. The HGT mechanism is compelling but acknowledged as unvalidated; the amelioration-score correlation the authors describe as future work would substantially strengthen it. The team has demonstrated deep knowledge and careful thinking.
Cite this work
@misc {
  title={(HckPrj) HGT Leaves a Linear Fingerprint in Codon Space},
  author={Yatharth Maheshwari, Arka Dash},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


