Mar 23, 2026

LidaRed: ControlArena Red-Team Dataset Generator

Linh Le, David Williams-King

We present LidaRed, an automated pipeline for generating adversarial red-team datasets for AI control. Built on the ControlArena benchmark, our pipeline iteratively refines attacks using monitor feedback, with a focus on omission-based attacks. We achieve attack success rates of up to 60% against Claude Opus 4.6 as a safety monitor, demonstrating that current frontier models can still fail to catch subtle side-task execution.
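The refinement loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the function names (`run_attack`, `monitor_score`, `rewrite`), the suspicion threshold, and the iteration budget are all assumptions about how a monitor-feedback loop of this kind might be structured.

```python
def refine_attack(prompt, run_attack, monitor_score, rewrite,
                  max_iters=5, threshold=0.5):
    """Illustrative monitor-feedback refinement loop (names are assumptions).

    run_attack(prompt)       -> trajectory of the agent attempting main + side task
    monitor_score(trajectory)-> suspicion score in [0, 1]; lower is stealthier
    rewrite(prompt, traj, s) -> revised attack prompt, e.g. replacing explicit
                                side-task steps with omissions
    """
    best_prompt, best_score = prompt, float("inf")
    for _ in range(max_iters):
        trajectory = run_attack(prompt)
        score = monitor_score(trajectory)
        if score < best_score:
            best_prompt, best_score = prompt, score
        if score < threshold:
            break  # monitor no longer flags the attack
        prompt = rewrite(prompt, trajectory, score)
    return best_prompt, best_score
```

Under this sketch, each iteration runs the attack, scores it with the monitor, and rewrites the prompt until the suspicion score drops below the threshold or the iteration budget is exhausted.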

Reviewer's Comments


Here is a list of my thoughts:

1. I'm unsure of the problem you are trying to solve with this project. You have generated some trajectories from the ControlArena benchmark, but I'm unsure what value that provides.

2. Given that the value is a dataset, the dataset itself seems very limited.

3. I really like that the write-up is clean and well presented.

4. In the abstract, you state you "iteratively refined attacks using monitor feedback, with a focus on omission-based attacks". However, this is not discussed in your methods. This suggests to me that the method lacks the detail needed for someone to replicate your work.

5. You state a key finding that omission-based attacks are hard to detect; however, I'm unsure how your results prove this. It could just be an artifact of the dataset you used, or of randomness. If you don't have hard evidence (which is hard to get in a hackathon), presenting these findings in a discussion section may be more appropriate.

Given that, good work on making a paper in 3 days! It's no easy task.

- clear code and clearly written; an interesting idea simply and effectively presented. I enjoyed reading it ;)

- my further comments are intended to consider what this would look like as a full paper, and where the research direction seems to point.

- iterative improvements are great, but there's no discussion of the actual rewrites made to the prompts in order to bypass the monitor - this would be very interesting to look at. A simple case study of a couple of prompts would be great. The framework itself seems novel and well suited to current trends in agentic use of LMs.

- it makes sense that omission in the networking example is harder to detect - you're testing the model's conception of an "ideal configuration" here, or its ability to construct one - so you should first establish whether this is something the model knows at all, or some baseline for what safety looks like in your network scenario. For Claude Opus it is probably reasonable to assume this is known, but it should still be tested.

- what's the "so what"? This reads like a SWE scaffold for more effectively bypassing safety screens. Iterating from feedback is not novel, though finding specific techniques that work is. How could this be meaningfully used to improve existing safety regimes - is this for refining evals?

Cite this work

@misc{lidared2026,
  title={(HckPrj) LidaRed: ControlArena Red-Team Dataset Generator},
  author={Linh Le and David Williams-King},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.