Mar 23, 2026
LayoutArena: A Control Setting for Tool-Using Design Agents
Duy Le, Huy Bik Nguyen, Tin Duong, Khanh Linh Nguyen
We introduce LayoutArena, a control setting for tool-using design agents that operate on a constrained 2D canvas through structured tool calls. Unlike prior control benchmarks centered on code execution or shell interaction, LayoutArena studies safety and usefulness in a creative domain where correctness is partly objective (bounds, overlap, spacing) and partly fuzzy (visual hierarchy, composition quality). We define three control protocols—enforcement, bounded control, and stewardship monitoring—and five adversarial side-task families, then show that trajectory-aware stewardship can improve both attack detection and effective yield by adaptively restricting unproductive search.
Ok, this is a cool and genuinely out-of-the-box idea. Whether the project produces useful research will come down to implementation.
One note is that this isn't high-stakes control, which is fine! But the threats being modeled are not ones we are concerned about. Nonetheless, it is a unique toy setting and could still be useful in theory.
A bigger issue is that the side tasks appear to be almost arbitrarily chosen goals, while the monitor is just a set of hard-coded rules that block exactly those goals. That makes the detection results essentially circular: we aren't learning anything interesting about monitoring, because the answer has been written into the code.
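To make the circularity concern concrete, here is a minimal sketch (all names hypothetical, not taken from the paper's code): if the monitor's rule set is written directly against the known side-task list, it flags exactly those tasks by construction, so a near-100% detection rate is guaranteed and uninformative.

```python
# Hypothetical illustration of a monitor hard-coded against the side tasks.
# The side-task names below are invented for the example.
SIDE_TASKS = {"exfiltrate_palette", "overlap_injection"}

def hardcoded_monitor(tool_call: dict) -> bool:
    """Flags a call iff its intent is on the pre-listed side-task set.

    Recall on the listed attacks is 100% by construction, which tells us
    nothing about how well the monitor would catch novel attack behavior.
    """
    return tool_call.get("intent") in SIDE_TASKS

calls = [{"intent": "exfiltrate_palette"}, {"intent": "align_elements"}]
flagged = [hardcoded_monitor(c) for c in calls]
print(flagged)  # -> [True, False]
```

A more informative setup would evaluate the monitor against attack behaviors it was not written to recognize.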
When I first saw this paper, I was skeptical of its relevance, but the introduction convinced me it's a security-relevant setting.
This was a very ambitious project for a hackathon, and building a new evaluation framework is a good result. The experiments show that the framework works in principle, but the results don't seem meaningful to me (no critical thresholds identified, essentially 100% detection rate, identical tool-cost distributions). It would have been exceptional if it had included experiments with actual AI agents (or AI vs. static monitor, or attack policy vs. AI monitor) that show capability-vs-security tradeoffs.
Note:
I think "Pareto frontier" is the wrong concept in 8.2: stewardship is the dominant protocol, offering both the highest detection rate and the highest usefulness, so there is no frontier to trade along. I'd like to see whether there are thresholds that produce actual tradeoffs (the tested parameters were between 0.5 and 0.9).
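The point above can be sketched in a few lines: a Pareto frontier only exists if protocols trade off against each other, and if one protocol beats the others on both axes, the "frontier" collapses to that single point. The (detection, usefulness) numbers below are illustrative only, not taken from the paper.

```python
# Illustrative (detection rate, usefulness) scores per protocol -- made up
# to show the dominance structure the review describes, not real results.
protocols = {
    "enforcement": (0.80, 0.60),
    "bounded":     (0.90, 0.70),
    "stewardship": (0.99, 0.85),
}

def pareto_front(points: dict) -> list:
    """Return the protocols not dominated on both axes by any other point."""
    front = []
    for name, (d, u) in points.items():
        dominated = any(
            d2 >= d and u2 >= u and (d2, u2) != (d, u)
            for d2, u2 in points.values()
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(protocols))  # -> ["stewardship"]: no tradeoff, no frontier
```

If a threshold sweep between 0.5 and 0.9 produced points where raising detection lowered usefulness, the frontier would contain more than one point and the Pareto framing would apply.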
Cite this work
@misc{layoutarena2026,
  title={(HckPrj) LayoutArena: A Control Setting for Tool-Using Design Agents},
  author={Duy Le and Huy Bik Nguyen and Tin Duong and Khanh Linh Nguyen},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


