SPEC-CHECK

Vijaya L Peketi

SPEC-CHECK addresses Track 1 (Specification Elicitation) by automatically extracting formal constraints from natural language requirements documents and evaluating whether LLM outputs adhere to them or silently override them.

SPEC-CHECK builds on empirical findings from ARCORE-ML (NeurIPS 2026 #2102, EMNLP 2026 #391), which documented that Gemini 1.5 Pro silently overestimated the cost of a UPI payment system by 212% by substituting enterprise messaging rates for the explicitly specified transactional rates, without signalling the departure. This is the specification elicitation problem in production: informal constraints exist in natural language requirements but are never formally verified against AI outputs.

SPEC-CHECK extracts formal constraints from requirements documents, evaluates LLM outputs for three failure modes (template-execution bias, silent constraint override, and single-vendor output tendency), and issues APPROVED or BLOCKED governance verdicts with evidence and recommendations. Grounded in the 2026 International AI Safety Report finding that pre-deployment safety testing is broken and relevant to Rajashree Agrawal's work on trustworthy-by-default AI, SPEC-CHECK is the enterprise implementation of specification elicitation as a safety gate.
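The extract, evaluate, and verdict flow described above can be illustrated with a minimal sketch. This is not the project's actual code: the pattern names, the requirement phrasing the patterns expect, and the function names `extract_constraints` and `evaluate` are all assumptions made for illustration, and a real requirements document would need far broader patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    name: str   # hypothetical label, e.g. "rate_per_txn"
    value: str  # the value stated in the requirements text


# Hypothetical patterns: they only recognise two fixed phrasings.
PATTERNS = {
    "rate_per_txn": re.compile(r"rate of (\d+(?:\.\d+)?) per transaction", re.I),
    "vendor": re.compile(r"must (?:use|run on) (AWS|GCP|Azure)", re.I),
}


def extract_constraints(requirements: str) -> list[Constraint]:
    """Pull explicitly stated constraints out of plain-language requirements."""
    found = []
    for name, pattern in PATTERNS.items():
        match = pattern.search(requirements)
        if match:
            found.append(Constraint(name, match.group(1)))
    return found


def evaluate(constraints: list[Constraint], llm_output: str) -> dict:
    """Return a verdict plus the names of constraints the output does not honour."""
    violations = []
    for c in constraints:
        match = PATTERNS[c.name].search(llm_output)
        # A missing or different value is treated as a silent override.
        if match is None or match.group(1).lower() != c.value.lower():
            violations.append(c.name)
    verdict = "BLOCKED" if violations else "APPROVED"
    return {"verdict": verdict, "violations": violations}
```

With made-up inputs, a requirement such as "The system must use GCP. Apply a rate of 0.25 per transaction." and an output that quotes "a rate of 0.78 per transaction" would be reported as BLOCKED with `rate_per_txn` listed as the violated constraint. The values are compared as strings, so this sketch would also flag a harmless reformatting such as "0.250".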

Dataset: ARCORE-ML, 595 annotated requirements from 10 production systems across 8 domains, including government, healthcare, finance, and payments in India.

India presents a unique and urgent context for this research. According to the 2024 State of Cloud Computing Survey, 68% of Indian SMEs select cloud vendors based on familiarity alone rather than systematic evaluation. With India's digital economy growing at an unprecedented scale (UPI processing 15 billion transactions monthly, government systems serving 1.4 billion citizens, and healthcare platforms reaching previously unserved populations), the consequences of silent constraint override in AI-assisted infrastructure decisions are not hypothetical: they are financial losses, vendor lock-in, and compliance failures affecting the world's most populous democracy.

SPEC-CHECK addresses this specifically by making specification elicitation accessible to SMEs without formal methods expertise. A business owner in Hyderabad describing their appointment booking system in plain language receives formal constraint extraction, LLM output validation, and a governance verdict — the same pre-deployment safety testing that large enterprises achieve through dedicated FinOps and architecture teams. This is AI safety as social good: democratising the tools of trustworthy deployment for resource-constrained organisations in the Global South.

Reviewer's Comments

SPEC-CHECK addresses a real and important problem: LLMs often appear to follow requirements while silently substituting their own assumptions, formulas, vendors, or constraints. This is especially relevant for secure program synthesis because a system that ignores or rewrites the user’s specification can produce outputs that look plausible but violate the actual intent.

The strongest part of the project is the problem framing. “Silent constraint override” is a useful failure mode, and the UPI payment example makes the issue concrete. The focus on making specification checks accessible to non-experts is also valuable, especially for smaller organizations that do not have formal methods or architecture review teams.

The current implementation, however, seems more like an early governance demo than a strong technical validation. Constraint extraction is described as pattern matching over natural language, and output evaluation appears to rely on an LLM-as-evaluator. That can be useful for prototyping, but it raises reliability questions: the evaluator may share the same weaknesses as the model being checked, and pattern matching may miss constraints phrased in less obvious ways. The demonstration uses a small number of examples, so it is hard to know whether the tool generalizes beyond the motivating UPI case.

To strengthen the project, I would focus on building a more systematic benchmark of requirement-output pairs with labeled constraint violations. The tool should be evaluated against baselines such as direct LLM judging, rule-only extraction, and human review. It would also help to separate hard constraints that can be checked deterministically, such as numeric bounds or vendor restrictions, from softer qualitative constraints that require judgment. For secure program synthesis, the next step would be to translate extracted constraints into executable tests or formal assertions rather than only producing a governance verdict.
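One way to picture the suggested split between hard and soft constraints is the sketch below. It assumes the LLM output has already been parsed into a structured proposal (a plain dict here), which the prototype may or may not do; the field names `vendor` and `monthly_cost` and the helpers `max_value`, `allowed_vendor`, and `run_checks` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Constraint:
    description: str
    # Predicate over a structured proposal; None marks a soft constraint
    # that cannot be checked deterministically.
    check: Optional[Callable[[dict], bool]] = None

    @property
    def is_hard(self) -> bool:
        return self.check is not None


def max_value(field: str, bound: float) -> Callable[[dict], bool]:
    """Build an executable check for a numeric upper bound."""
    return lambda proposal: proposal[field] <= bound


def allowed_vendor(vendors: set[str]) -> Callable[[dict], bool]:
    """Build an executable check for a vendor restriction."""
    return lambda proposal: proposal["vendor"] in vendors


def run_checks(constraints: list[Constraint], proposal: dict):
    """Run hard constraints as assertions; route soft ones to human review."""
    failed = [c.description for c in constraints
              if c.is_hard and not c.check(proposal)]
    needs_review = [c.description for c in constraints if not c.is_hard]
    return failed, needs_review
```

In this arrangement a numeric bound or vendor restriction fails deterministically, with no evaluator model involved, while a qualitative requirement such as "the booking flow should feel responsive" is only listed for review. The design choice being illustrated is that an LLM judge, if used at all, would be confined to the soft list.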

Overall, this is a useful and socially relevant hackathon prototype with a clear motivation. The impact could be meaningful if the system becomes a reliable bridge from informal requirements to enforceable constraints, but the current evidence is still preliminary and the technical core needs stronger validation.

SPEC-CHECK aims to extract formal specifications from natural language documentation. However, its reliance on regex pattern matching is too rigid to accurately capture complex requirements. It is likely that leveraging LLMs directly would yield better results. Furthermore, the work is undermined by unclear presentation and a weak evaluation.

SPEC-CHECK addresses a practical and important problem: verifying that LLM outputs adhere to explicit requirements, constraints, formulas, and deployment assumptions before they are used in enterprise decisions. The project is well aligned with specification elicitation because it turns natural-language requirements into constraint checks and produces an APPROVED or BLOCKED governance verdict.

The strongest part of the work is the framing of “silent constraint override” as an AI governance failure mode. The artifact is also a real prototype, not just a concept: it includes a Streamlit app, built-in AWS and GCP scenarios, heuristic constraint extraction, deterministic checks, optional Gemini-based evaluation, recommendations, and session history.

To strengthen the work, the evaluation should go beyond a small set of curated examples and cover more domains, vendors, requirement types, and LLM outputs. The project should also report false positives and false negatives, since both pattern-based extraction and LLM-as-evaluator workflows can miss or over-enforce constraints. The repository would benefit from a fuller technical README, documented evaluator prompts, sample inputs/outputs, tests for extraction and verdict logic, and a one-command way to reproduce the headline examples.
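Reporting false positives and false negatives could look roughly like the following sketch, which compares the tool's verdicts with human labels on the same requirement-output pairs. It treats BLOCKED as the positive class; the function name `verdict_error_rates` and the label format are assumptions, not part of the project.

```python
def verdict_error_rates(predicted: list[str], labeled: list[str]) -> dict:
    """Compare tool verdicts with human labels, treating BLOCKED as positive."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")

    tp = fp = fn = tn = 0
    for pred, truth in zip(predicted, labeled):
        if pred == "BLOCKED" and truth == "BLOCKED":
            tp += 1
        elif pred == "BLOCKED":
            fp += 1  # over-enforcement: blocked a compliant output
        elif truth == "BLOCKED":
            fn += 1  # missed violation: approved a non-compliant output
        else:
            tn += 1

    # Guard against empty denominators when a class never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```

A false negative here corresponds to a silent constraint override that the gate let through, so recall on labeled violations is probably the number to watch first.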

Overall, this is solid hackathon work with a meaningful governance use case and a working prototype. With broader validation, stronger reproducibility, and more calibrated claims, SPEC-CHECK could become a more convincing pre-deployment constraint-adherence gate for enterprise AI workflows.

Cite this work

@misc{
  title={(HckPrj) SPEC-CHECK},
  author={Vijaya L Peketi},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.