Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

Nathan Khosla, Aleksandra Wosztyl

🏆 5th Place Winner + 🏆 Track 1 Winner: DNA Screening & Synthesis Controls

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.


Reviewer's Comments

Functional/structural screening is very important and it seems you made real progress toward that, kudos!

I think using ricin as a central example makes sense given that it's a favourite among would-be terrorists. However, I wonder whether there is some cherry-picking, where certain proteins perform unusually well in your pipeline (perhaps because they are well represented in the dataset) whereas the pipeline would struggle to detect functional or structural analogues of other proteins. This is just a guess; I am not an expert in this specific area.

It would also be interesting to see how this pipeline performs on viral proteins such as coronavirus family spike proteins or influenza H/N proteins. More infohazardous, of course!

"a risk acknowledged by the October 2026 revision to the OSTP Nucleic Acid Synthesis Screening Framework, among other policy memos" --> either a typo for the year or LLM hallucination.

You mention the pipeline is bottlenecked by compute resources for use at scale. I'd be curious to see cost and speed comparisons with sequence homology screening.

This is asking too much for a hackathon submission but if this was a full article I would like an explanation of why exactly these values were chosen and what they actually represent: "DALI similarity scoring was determined if a protein matched a toxin with a z score greater than 10, an identity score greater than 30%, and/or an RMSD less than 2. MMseqs2 similarity was determined if the identity score was greater than 90% than a known toxin sequence. FoldSeek similarity was determined by an identity score greater than 30%, RMSD below 2, e-value below 1e-4, and/or TM score greater than 0.9."
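To make the question about these values concrete, the quoted thresholds can be written out as a small rule set. This is only a sketch of one possible reading: it assumes "and/or" means that meeting any single criterion counts as a match, and the `Hit` fields and function names are hypothetical, not taken from the submission.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hit:
    """One database hit. Field names are hypothetical; missing values are None."""
    identity: Optional[float] = None  # fraction of identical residues, 0 to 1
    rmsd: Optional[float] = None      # unit is not stated in the quoted text
    z_score: Optional[float] = None   # DALI Z-score
    e_value: Optional[float] = None   # FoldSeek e-value
    tm_score: Optional[float] = None  # FoldSeek TM-score


Check = Tuple[Optional[float], Callable[[float], bool]]


def any_met(checks: List[Check]) -> bool:
    # Assumption: "and/or" is read as "any one criterion suffices".
    return any(value is not None and test(value) for value, test in checks)


def dali_match(hit: Hit) -> bool:
    return any_met([
        (hit.z_score, lambda z: z > 10),
        (hit.identity, lambda i: i > 0.30),
        (hit.rmsd, lambda r: r < 2.0),
    ])


def mmseqs2_match(hit: Hit) -> bool:
    return hit.identity is not None and hit.identity > 0.90


def foldseek_match(hit: Hit) -> bool:
    return any_met([
        (hit.identity, lambda i: i > 0.30),
        (hit.rmsd, lambda r: r < 2.0),
        (hit.e_value, lambda e: e < 1e-4),
        (hit.tm_score, lambda t: t > 0.9),
    ])
```

Under this reading a single permissive criterion (for example identity just above 30%) is enough to flag a hit, which is one reason the choice of values and the meaning of "and/or" deserve an explicit justification.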

The Limitations and Future Work section makes lots of good points.

I would have appreciated a short discussion on next steps of how false positives and false negatives in your pipeline could be reduced.

IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI who do a great job at writing with rigorous clarity that is still accessible to non-experts. Hence slightly lower points for presentation/clarity but your technical contribution is fantastic.

The most interesting finding here (AI-designed Munc13 binders accidentally resembling neurotoxins with no toxin input) is buried at the end. Should probably lead with it.

24/24 catch rate via DALI is great but presentation is a weak spot: figure wrapping in related work breaks the read, and the 20-page appendix has no summary. This is honest about false positives.

An impressive amount of work covering a known gap in biosecurity. While there is similar work going on in this area, this is a nice piece of work that strings together several public tools into a robust demonstration and proof of principle. The paper is clearly written with a strong ToC, and the table is nice and very information dense. A small legend on the second table in the appendix indicating what the colors mean would be nice to have, but that is minor.

For a weekend hackathon, the breadth of the testing dataset is very impressive. Testing against a panel that includes natural toxins, benign structural mimics, amyloid proteins, de novo ricin variants, and AI-designed Munc13 binders demonstrates a deep understanding of the problem space and a very rigorous examination of the screening pipeline.

Using truncated de novo ricin sequences to simulate a partial-synthesis evasion attack was a strong inclusion. Showing that sequence alignment caught 0/24 of these while your DALI pipeline caught 24/24 is a great way to highlight the importance of moving beyond homology methods.

You do well in discussing the false positives and current limitations, flagging that host proteins co-crystalized with toxins get called, and you lay out future directions clearly and with a very strong understanding of the problem space and ongoing work.

The tiered design is clever. Using MMseqs2 for fast, low-compute initial gating before moving to computationally expensive tools like AlphaFold2, FoldSeek, and DALI is practical and logical system design. However, as you touch on, running AlphaFold2 and DALI on every single sequence that bypasses MMseqs2 is currently too computationally expensive for commercial-scale adoption, though that may just be a problem that resolves as compute gets cheaper.
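The gating logic described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `tiered_screen` and its callable arguments are hypothetical stand-ins for the MMseqs2 search and the structure-based checks (structure prediction followed by FoldSeek or DALI comparison).

```python
from typing import Callable, List


def tiered_screen(
    sequence: str,
    sequence_gate: Callable[[str], bool],
    structure_checks: List[Callable[[str], bool]],
) -> str:
    """Return "flagged_by_sequence", "flagged_by_structure", or "cleared"."""
    # Tier 1: cheap sequence search. A hit stops the pipeline immediately,
    # so no structure prediction is spent on known toxins.
    if sequence_gate(sequence):
        return "flagged_by_sequence"
    # Tier 2: expensive structure-based checks, run only on sequences
    # that passed the sequence gate.
    for check in structure_checks:
        if check(sequence):
            return "flagged_by_structure"
    return "cleared"
```

The cost concern follows directly from this shape: every sequence that is not a close homolog of a known toxin, which is most benign orders, still reaches tier 2.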

"Finally, trypsin (14), concanavalin A (19), and thrombin (22) were confidently flagged as toxic across all methods, raising a broader question that this and any biosecurity screening tool must explicitly address: what threshold of danger justifies synthesis restriction? The distinction between "toxic in some biological context" and "dangerous enough to warrant screening" is not currently defined in any of the databases used here." - Excellent point. These aren't really biosecurity threats, and a smart synthesis screening system should not flag them, but I agree that this is broader work and a bigger question than can be addressed within this project or a hackathon.

Future work could focus on a few areas. To solve the compute bottleneck, you could insert an intermediate machine learning step between MMseqs2 and AlphaFold2. As you mentioned in the report, a lightweight classifier built on embeddings from a protein language model (like evo2 or ESM-2) could quickly filter out non-homologous benign proteins, reserving AlphaFold2 strictly for highly suspicious sequences. A good path to explore further.

Another direction is moving away from binary "toxic/non-toxic" outputs. You could develop a preliminary heuristic or risk score that considers the exact structural hit, flagging an active site match for botulinum neurotoxin as a 'critical flag/block' while flagging an overexpressed protease active site as 'requires manual/human review', mimicking some current screening approaches. As LLMs improve, one could also send flagged hits to an LLM for a review step and then pass its summary along to a human reviewer.
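As a rough illustration of what a non-binary output could look like, here is a minimal triage sketch. Everything in it is an assumption for illustration: the `SEVERITY` table, its numeric values, and the `triage` function are hypothetical, and a real system would need an expert-curated mapping.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CLEAR = "clear"
    MANUAL_REVIEW = "requires manual/human review"
    BLOCK = "critical flag/block"


# Hypothetical severity table (higher is more severe). The two entries
# mirror the examples in the comment above; the values are made up.
SEVERITY = {
    "botulinum neurotoxin": 3,
    "protease": 1,
}

BLOCK_THRESHOLD = 3  # assumed cut-off for an automatic block


def triage(hit_family: Optional[str], active_site_match: bool) -> Action:
    """Map a structural hit to a graded action instead of toxic/non-toxic."""
    if hit_family is None:
        return Action.CLEAR
    # Unknown families default to severity 0 and fall through to review.
    severity = SEVERITY.get(hit_family, 0)
    if active_site_match and severity >= BLOCK_THRESHOLD:
        return Action.BLOCK
    # Any other hit is routed to a human rather than silently cleared.
    return Action.MANUAL_REVIEW
```

For example, `triage("botulinum neurotoxin", True)` yields `Action.BLOCK`, while `triage("protease", True)` yields `Action.MANUAL_REVIEW`.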

In future presentations, include a 3D structural overlay (e.g., in PyMOL) that shows how your de novo ricin aligns with the active site of natural ricin despite zero sequence homology. Visuals make structural bioinformatics much more accessible to policymakers, the public, and generalist judges etc.

Overall, really great work!

Cite this work

@misc{
  title={(HckPrj) PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences},
  author={Nathan Khosla, Aleksandra Wosztyl},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

OmnyraCloud

OmnyraCloud is protocol-level biosecurity screening for cloud lab workflows. Today's tools screen DNA sequences at order time, but cloud labs run workflows, and a chain of individually benign steps (serial passage, split orders, surface obfuscation) can pursue a dangerous objective without a single flagged sequence. That's the gap we close.

OmnyraCloud ingests any lab protocol (Autoprotocol, Opentrons, JSON, or free text) and runs a 5-stage pipeline: decompose the workflow → score 5 risk dimensions → ground every flag in retrieved biosecurity literature → audit with LLM-as-judge → cross-check sequences via IBBIS commec. Output: an auditable risk report with citations, not a black box.

IBBIS flagged 1 of 3 dangerous protocols: two sequences were screened but returned no HMM matches. Protocol-level reasoning caught all three.

Protocol-level screening isn't just complementary to sequence screening. It's essential.

Live at https://omnyra-cloud.vercel.app/
