SPEC-CHECK
Vijaya L Peketi
SPEC-CHECK addresses Track 1 (Specification Elicitation) by automatically extracting formal constraints from natural-language requirements documents and evaluating whether LLM outputs adhere to them or silently override them.
SPEC-CHECK builds on empirical findings from ARCORE-ML (NeurIPS 2026 #2102, EMNLP 2026 #391), which documented that Gemini 1.5 Pro silently produced a +212% overestimate for a UPI payment system by substituting enterprise messaging rates for the explicitly specified transactional rates, without signalling the departure. This is the specification elicitation problem in production: informal constraints exist in natural-language requirements but are never formally verified against AI outputs.
SPEC-CHECK extracts formal constraints from requirements documents, evaluates LLM outputs for three failure modes (template-execution bias, silent constraint override, and single-vendor output tendency), and issues APPROVED or BLOCKED governance verdicts with evidence and recommendations. Grounded in the 2026 International AI Safety Report finding that pre-deployment safety testing is broken and relevant to Rajashree Agrawal's work on trustworthy-by-default AI, SPEC-CHECK is the enterprise implementation of specification elicitation as a safety gate.
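The verdict step described above can be pictured with a minimal Python sketch. Only the three failure-mode labels and the APPROVED/BLOCKED outcomes come from the description; the class names, the `issue_verdict` function, and the rule that any single finding blocks the output are assumptions made for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass, field

# Failure-mode labels taken from the project description.
FAILURE_MODES = (
    "template_execution_bias",
    "silent_constraint_override",
    "single_vendor_output_tendency",
)


@dataclass
class Finding:
    """One detected problem (hypothetical structure)."""
    failure_mode: str
    evidence: str
    recommendation: str


@dataclass
class Verdict:
    """Governance verdict with the findings that justify it."""
    status: str
    findings: list = field(default_factory=list)


def issue_verdict(findings):
    """Return BLOCKED if any finding exists, otherwise APPROVED.

    Assumption: a single finding is enough to block. A real gate
    might weigh severity instead.
    """
    for f in findings:
        if f.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {f.failure_mode}")
    status = "BLOCKED" if findings else "APPROVED"
    return Verdict(status=status, findings=list(findings))


override = Finding(
    failure_mode="silent_constraint_override",
    evidence="Output used a rate other than the one in the requirements.",
    recommendation="Regenerate using the specified rate.",
)
blocked = issue_verdict([override])
approved = issue_verdict([])
```

Keeping the evidence and recommendation on each finding means a BLOCKED verdict can always be traced back to the specific departure that caused it.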
Dataset: ARCORE-ML, 595 annotated requirements from 10 production systems across 8 domains, including government, healthcare, finance and payments in India.
India presents a unique and urgent context for this research. According to the 2024 State of Cloud Computing Survey, 68% of Indian SMEs select cloud vendors based on familiarity alone rather than systematic evaluation. With India's digital economy growing at an unprecedented scale (UPI processing 15 billion transactions monthly, government systems serving 1.4 billion citizens, and healthcare platforms reaching previously unserved populations), the consequences of silent constraint override in AI-assisted infrastructure decisions are not hypothetical. They are financial losses, vendor lock-in, and compliance failures affecting the world's most populous democracy.
SPEC-CHECK addresses this specifically by making specification elicitation accessible to SMEs without formal methods expertise. A business owner in Hyderabad describing their appointment booking system in plain language receives formal constraint extraction, LLM output validation, and a governance verdict — the same pre-deployment safety testing that large enterprises achieve through dedicated FinOps and architecture teams. This is AI safety as social good: democratising the tools of trustworthy deployment for resource-constrained organisations in the Global South.
SPEC-CHECK addresses a real and important problem: LLMs often appear to follow requirements while silently substituting their own assumptions, formulas, vendors, or constraints. This is especially relevant for secure program synthesis because a system that ignores or rewrites the user’s specification can produce outputs that look plausible but violate the actual intent.
The strongest part of the project is the problem framing. “Silent constraint override” is a useful failure mode, and the UPI payment example makes the issue concrete. The focus on making specification checks accessible to non-experts is also valuable, especially for smaller organizations that do not have formal methods or architecture review teams.
The current implementation, however, seems more like an early governance demo than a strong technical validation. Constraint extraction is described as pattern matching over natural language, and output evaluation appears to rely on an LLM-as-evaluator. That can be useful for prototyping, but it raises reliability questions: the evaluator may share the same weaknesses as the model being checked, and pattern matching may miss constraints phrased in less obvious ways. The demonstration uses a small number of examples, so it is hard to know whether the tool generalizes beyond the motivating UPI case.
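The pattern-matching concern can be made concrete with a small Python sketch. The regular expression, the function name, and both requirement sentences are invented for illustration and are not taken from SPEC-CHECK or ARCORE-ML; the point is only that a rigid pattern captures the explicit phrasing and returns nothing for a paraphrase of the same constraint.

```python
import re

# Hypothetical pattern for "<number> <currency> per <unit>" rate constraints.
RATE_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<currency>USD|INR)\s+per\s+(?P<unit>\w+)",
    re.IGNORECASE,
)


def extract_rates(text):
    """Return (value, currency, unit) tuples for every rate the pattern finds."""
    return [
        (float(m["value"]), m["currency"].upper(), m["unit"].lower())
        for m in RATE_PATTERN.finditer(text)
    ]


# Invented requirement sentences expressing the same constraint.
explicit = "Use the transactional rate of 0.25 INR per message."
paraphrased = "Each message must be billed at a quarter of a rupee."

found_explicit = extract_rates(explicit)        # constraint captured
found_paraphrased = extract_rates(paraphrased)  # constraint silently missed
```

Because the second call returns an empty list rather than an error, a missed constraint looks identical to a document that has no constraints, which is why the extractor itself needs measured recall.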
To strengthen the project, I would focus on building a more systematic benchmark of requirement-output pairs with labeled constraint violations. The tool should be evaluated against baselines such as direct LLM judging, rule-only extraction, and human review. It would also help to separate hard constraints that can be checked deterministically, such as numeric bounds or vendor restrictions, from softer qualitative constraints that require judgment. For secure program synthesis, the next step would be to translate extracted constraints into executable tests or formal assertions rather than only producing a governance verdict.
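One way to read the suggestion about hard constraints is sketched below in Python. It assumes the LLM output has already been parsed into a dictionary, and every name, key, and number here is hypothetical; soft, qualitative constraints are deliberately left out because they would need human or model judgment rather than a deterministic check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HardConstraint:
    """A constraint that can be checked deterministically."""
    name: str
    check: Callable[[dict], bool]


def max_value(key, bound):
    """Numeric upper bound on one field of the output."""
    return HardConstraint(f"{key} <= {bound}", lambda out: out[key] <= bound)


def allowed_vendor(allowed):
    """Vendor restriction: the chosen vendor must be in the allowed set."""
    return HardConstraint(
        f"vendor in {sorted(allowed)}", lambda out: out["vendor"] in allowed
    )


def violations(constraints, output):
    """Names of all hard constraints the output fails."""
    return [c.name for c in constraints if not c.check(output)]


# Invented example values, not drawn from the project's data.
constraints = [
    max_value("monthly_cost_usd", 500),
    allowed_vendor({"AWS", "GCP"}),
]
bad_output = {"monthly_cost_usd": 1560, "vendor": "Azure"}
good_output = {"monthly_cost_usd": 420, "vendor": "GCP"}

failed = violations(constraints, bad_output)
passed = violations(constraints, good_output)
```

Each `HardConstraint` is effectively an executable assertion, so the same objects could back both a governance verdict and a unit test suite for generated code.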
Overall, this is a useful and socially relevant hackathon prototype with a clear motivation. The impact could be meaningful if the system becomes a reliable bridge from informal requirements to enforceable constraints, but the current evidence is still preliminary and the technical core needs stronger validation.
SPEC-CHECK aims to extract formal specifications from natural language documentation. However, its reliance on regex pattern matching is too rigid to accurately capture complex requirements. It is likely that leveraging LLMs directly would yield better results. Furthermore, the work is undermined by unclear presentation and a weak evaluation.
SPEC-CHECK addresses a practical and important problem: verifying that LLM outputs adhere to explicit requirements, constraints, formulas, and deployment assumptions before they are used in enterprise decisions. The project is well aligned with specification elicitation because it turns natural-language requirements into constraint checks and produces an APPROVED or BLOCKED governance verdict.
The strongest part of the work is the framing of “silent constraint override” as an AI governance failure mode. The artifact is also a real prototype, not just a concept: it includes a Streamlit app, built-in AWS and GCP scenarios, heuristic constraint extraction, deterministic checks, optional Gemini-based evaluation, recommendations, and session history.
To strengthen the work, the evaluation should go beyond a small set of curated examples and cover more domains, vendors, requirement types, and LLM outputs. The project should also report false positives and false negatives, since both pattern-based extraction and LLM-as-evaluator workflows can miss or over-enforce constraints. The repository would benefit from a fuller technical README, documented evaluator prompts, sample inputs/outputs, tests for extraction and verdict logic, and a one-command way to reproduce the headline examples.
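Reporting false positives and false negatives could look roughly like the following Python sketch. It assumes a labeled set in which each requirement-output pair is marked as a true violation or not; the function names and the example labels are made up for illustration and do not reflect any measured SPEC-CHECK results.

```python
def confusion_counts(predicted, actual):
    """Count outcomes for parallel lists of booleans (True = violation)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    return {
        "tp": sum(p and a for p, a in pairs),
        "fp": sum(p and not a for p, a in pairs),
        "fn": sum(not p and a for p, a in pairs),
        "tn": sum(not p and not a for p, a in pairs),
    }


def error_rates(counts):
    """False-positive and false-negative rates; None when undefined."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    return {
        "false_positive_rate": counts["fp"] / negatives if negatives else None,
        "false_negative_rate": counts["fn"] / positives if positives else None,
    }


# Invented labels: what the tool flagged vs. what annotators marked.
tool_flags = [True, True, False, False, True]
human_labels = [True, False, True, False, True]

counts = confusion_counts(tool_flags, human_labels)
rates = error_rates(counts)
```

A false negative here is a silent override that the gate approved, while a false positive is a compliant output that was blocked, so the two rates carry different costs and are worth reporting separately.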
Overall, this is solid hackathon work with a meaningful governance use case and a working prototype. With broader validation, stronger reproducibility, and more calibrated claims, SPEC-CHECK could become a more convincing pre-deployment constraint-adherence gate for enterprise AI workflows.
Cite this work
@misc {
  title={(HckPrj) SPEC-CHECK},
  author={Vijaya L Peketi},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


