Jun 21, 2026

TraceGuard-X: Adaptive Collusion Resistant Monitoring for Agentic AI Systems, a hierarchical governance framework with constitutional dimension prompting, temporal trajectory analysis, and cross-environment transfer

AYUSH KUMAR SAMAL

The deployment of increasingly capable autonomous AI agents raises a fundamental governance challenge: how can a monitoring system reliably distinguish benign from harmful behaviour when both the acting agent and the monitor may be strategically misaligned? Recent work on TraceGuard introduced structured multi-dimensional monitoring as a collusion-resistant alternative to single-score evaluation. While promising, three key questions remain unresolved: (i) robustness against rubric-aware adaptive attackers, (ii) computational efficiency of multi-dimensional monitoring, and (iii) generalisation beyond shell-based environments.

We present TraceGuard-X, a hierarchical monitoring framework that extends TraceGuard through adaptive adversarial evaluation, dynamic dimension routing, temporal trajectory analysis, federated monitoring, and cross-environment governance transfer. TraceGuard-X introduces six core mechanisms: Constitutional Dimension Prompting (CDP), Temporal Graph-of-Thought Monitoring (TGM), Divergence-Based Collusion Detection (DBCD), Dynamic Relevance Routing (DR-TraceGuard), Multi-Model Federated Monitoring (MMFM), and Hierarchical Domain-Adaptive Governance Transfer (HiDAgT).

Preliminary calibration experiments on the BashArena benchmark demonstrate statistically significant separation between benign and attack trajectories across all five monitoring dimensions, with large effect sizes (Cohen’s d > 0.96 on every dimension) and highly significant Kolmogorov-Smirnov statistics (p < 10⁻⁵ across all dimensions). A three-dimension routing policy achieves 84% of full-ensemble AUC at 60% of the API cost. The three-layer HiDAgT architecture enables zero-shot transfer to Infrastructure-as-Code, code-generation, and web-agent environments without modifying the universal meta-monitor.

Our results suggest that structured monitoring remains a viable governance primitive even under increasingly sophisticated attack models and heterogeneous deployment environments.

All code, calibration data, and evaluation infrastructure are publicly available.
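The two separation statistics quoted in the abstract, Cohen's d and the two-sample Kolmogorov-Smirnov statistic, can be illustrated with a minimal sketch. This is not the project's evaluation code: the function names and the toy suspicion scores are hypothetical, and scores are assumed to be one float per trajectory for a single monitoring dimension.

```python
import math
from bisect import bisect_right
from statistics import mean, variance


def cohens_d(benign, attack):
    """Standardised mean difference using the pooled sample variance."""
    nb, na = len(benign), len(attack)
    pooled_var = ((nb - 1) * variance(benign) + (na - 1) * variance(attack)) / (nb + na - 2)
    return (mean(attack) - mean(benign)) / math.sqrt(pooled_var)


def ks_statistic(benign, attack):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    b_sorted, a_sorted = sorted(benign), sorted(attack)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in sorted(set(b_sorted) | set(a_sorted)):
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        gap = max(gap, abs(cdf_b - cdf_a))
    return gap


# Toy scores (hypothetical, not the BashArena calibration data).
benign_scores = [0.1, 0.2, 0.3, 0.4]
attack_scores = [0.6, 0.7, 0.8, 0.9]
print(f"Cohen's d    = {cohens_d(benign_scores, attack_scores):.2f}")
print(f"KS statistic = {ks_statistic(benign_scores, attack_scores):.2f}")
```

A KS statistic near 1 indicates that the benign and attack score distributions barely overlap; converting the statistic into a p-value additionally depends on the number of trajectories in each class.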


Reviewer's Comments

This paper tackles an important problem in AI control by extending structured monitoring toward adaptive adversaries, lower monitoring costs, and cross-environment transfer. The proposed ideas—including rubric-aware adversary modeling, Constitutional Dimension Prompting, collusion detection, and Pareto-based routing—are well motivated, and the cost-performance tradeoff achieved by the routing policy is a useful practical result. The engineering infrastructure and implementation are also well organized.

The primary weakness is the gap between the paper's claims and the experimental evidence. Several headline contributions, including robustness against adaptive adversaries and zero-shot transfer, are described but not fully evaluated, with results remaining preliminary. The paper would be significantly stronger by focusing on one core mechanism, validating it against the proposed adaptive attack model, and reporting quantitative transfer results. Aligning the abstract and conclusions with the current experimental evidence would also improve the paper's credibility.

Overall, this is a promising systems paper with strong conceptual framing and implementation, but its central claims require more comprehensive experimental validation.

Extending multi-dimensional monitoring to combat collusion is a vital direction for agentic AI control. The breadth of mechanisms introduced, such as Temporal Graph-of-Thought Monitoring and Dynamic Relevance Routing, is impressive. Demonstrating that a three-dimension routing policy achieves 84% of full-ensemble AUC at a 60% API cost provides great practical value for deployment. However, the presentation feels slightly dense due to the high volume of acronyms and subsystems. Completing the pending full adaptive adversarial evaluation will be critical to proving the system's empirical robustness.

This research addresses a very difficult challenge in AI control: providing monitoring that remains reliable even if the monitoring agent (the controller) and the monitored agent (the controlled entity) are both misaligned and may collude.

A large number of approaches are presented here, especially given that this was completed by one person over a single weekend, and many of the concepts are innovative. Examples include using constitutional principles to constrain the monitor rather than the agent, using a runtime Kolmogorov-Smirnov divergence as a collusion signal, and Pareto cost-routing of monitoring dimensions. The author is a model of honesty about the current state of their results in the Limitations section, and gives an honest assessment of what they do know in the Hypotheses appendix.

Some possible suggestions for improving this work:

(1). Make sure the abstract and conclusion accurately reflect your actual experimental results. The major robustness claim in the abstract rests on a smoke test (Direction 1) with only 2 iterations and does not represent a complete experimental evaluation. Although the author notes that the complete evaluation will occur once additional funding has been obtained, the paper's strongest statements of fact are based on experiments that have not yet been run. To improve this work, either perform a reduced version of the adaptive-adversary evaluation or clearly identify these mechanisms as proposed and untested.

(2). Increase the size of your calibration set. The reported values (Cohen's d > 0.96, p < 1e-5, and AUC = 0.98) were generated from a total of 40 trajectories per dimension (20 benign and 20 attack). With only 20 trajectories per class, I believe your statistics are fragile and optimistic. Using at least 100 trajectories per class (which you mention in Future Work) would increase the credibility of your separation claims. You should also report AUC on a held-out test set that is independent of the calibration set.

(3). Provide quantitative evidence of transferability. Your discussion of zero-shot transfer to Infrastructure-as-Code, code-generation, and web-agent environments states that preliminary results were promising. Providing at least one quantitative transfer result for each environment would support your HiDAgT story.

(4). Protect against failure modes of the monitor. Since several dimensions rely on the reasoning capabilities of large language models, I suggest reporting how frequently the judge hallucinates or reverses its decision under prompt changes. Additionally, I recommend evaluating how federated median aggregation performs when a majority of judges are compromised versus a minority.
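The majority-versus-minority question in suggestion (4) can be made concrete with a small sketch. It assumes, purely for illustration, that each judge emits a suspicion score in [0, 1], that the federated monitor aggregates with a plain median, and that compromised judges all report 0.0 on an attack trajectory that honest judges score at 0.9. The function names and score values are hypothetical and are not taken from the TraceGuard-X code.

```python
from statistics import median


def federated_median(judge_scores):
    """Aggregate per-judge suspicion scores with a plain median (assumed rule)."""
    return median(judge_scores)


def simulate(n_judges, n_compromised, honest_score=0.9, colluding_score=0.0):
    """Aggregate score on an attack when compromised judges under-report it."""
    honest = [honest_score] * (n_judges - n_compromised)
    colluding = [colluding_score] * n_compromised
    return federated_median(honest + colluding)


# Sweep the number of compromised judges in a five-judge federation.
for k in range(6):
    print(f"{k} of 5 judges compromised -> aggregate suspicion {simulate(5, k):.1f}")
```

In this toy setting the aggregate stays at the honest score while at most two of five judges are compromised and drops to the colluding score once three are. That is the breakdown behaviour a real evaluation would need to measure with actual judge outputs.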

Overall: a solid foundation of ideas and architectures. However, to develop this into actionable research we need evidence: most of your major contributions are currently architectural or hypothetical, and most of your claims have not been validated experimentally.

Cite this work

@misc{
  title={(HckPrj) TraceGuard-X: Adaptive Collusion Resistant Monitoring for Agentic AI Systems, a hierarchical governance framework with constitutional dimension prompting, temporal trajectory analysis, and cross-environment transfer},
  author={AYUSH KUMAR SAMAL},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026