AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

Jacob Arbeid, Jay Bailey, Sam Lowenstein, Jake Pencharz

🏆 1st place by peer review

AI alignment lacks the high-quality, real-world preference data needed to align agentic superintelligent systems. Our technical innovation builds on Pacchiardi et al. (2023)'s breakthrough in detecting AI deception through black-box analysis. We adapt their classification methodology to identify intent misalignment between agent actions and true user intent, enabling real-time correction of agent behavior and generation of valuable alignment data. We commercialise this by building a workflow that incorporates a classifier running on live trajectories as a user interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment expertise produces superior agents, driving commercial adoption, which in turn generates increasingly valuable alignment datasets of trillions of trajectories labelled with human feedback, an invaluable resource for AI alignment.
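To make the mechanism concrete, the sketch below shows one way a Pacchiardi-style elicitation approach could be turned into a trajectory-level misalignment classifier of the kind described above: after each agent action, the agent is asked a fixed set of follow-up questions, and a simple probe trained on human-labelled trajectories maps its yes/no answers to a misalignment flag. This is a minimal illustrative sketch under stated assumptions, not the project's implementation; the elicitation questions, the ask_agent callable, and the logistic-regression probe are all placeholders.

# Minimal sketch (assumption, not the authors' implementation) of a misalignment
# classifier in the style of Pacchiardi et al. (2023): ask the agent follow-up
# questions about its latest action and train a simple probe on the answers.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical elicitation questions; a real question set would be tuned empirically.
ELICITATION_QUESTIONS = [
    "Does your last action directly serve the user's stated goal? Answer yes or no.",
    "Did you withhold information the user would have wanted before you acted? Answer yes or no.",
    "Would the user approve of this action if asked right now? Answer yes or no.",
]

@dataclass
class Step:
    user_intent: str   # what the user actually asked for
    agent_action: str  # what the agent did at this point in the trajectory

def elicit_features(step: Step, ask_agent: Callable[[str], str]) -> np.ndarray:
    """Encode the agent's yes/no answers to the follow-up questions as a binary feature vector."""
    context = f"User goal: {step.user_intent}\nYour action: {step.agent_action}\n"
    answers = [ask_agent(context + q) for q in ELICITATION_QUESTIONS]
    return np.array([1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers])

def train_probe(steps: Sequence[Step], labels: Sequence[int], ask_agent) -> LogisticRegression:
    """Fit a logistic-regression probe on human-labelled steps (1 = misaligned with user intent)."""
    X = np.stack([elicit_features(s, ask_agent) for s in steps])
    return LogisticRegression().fit(X, np.asarray(labels))

def flag_live_step(probe, step: Step, ask_agent, threshold: float = 0.5) -> bool:
    """Run on a live trajectory: flag a step for correction and human feedback if misalignment is likely."""
    p_misaligned = probe.predict_proba(elicit_features(step, ask_agent).reshape(1, -1))[0, 1]
    return p_misaligned >= threshold

In a deployment of this kind, flagged steps would be surfaced to the user for confirmation, and the resulting (trajectory, flag, human label) triples are what would accumulate into the alignment dataset described above.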

Reviewer's Comments

Pablo Sanzo

I valued the team's vision to generate an alignment dataset, starting with low-stakes cases. I would have appreciated more thought on the incremental steps (i.e. specific examples of industries and verticals) towards more high-stakes cases.

Also, when it comes to the MVP and product, I encourage the team to think further about how this can be commercialized, and what will make e-commerce companies adopt this technology (in the example of the first low-stakes case).

The demo video was very useful to understand the team's proposal.

Edward Yee

Really interesting idea. The biggest challenge is getting the system working well enough to collect enough data to materially improve an existing agent. Existing companies building such agents might also push back strongly on this.

Ricardo Raphael Corona Moreno

The team has identified a critical gap in AI alignment: the lack of high-quality, real-world preference data needed for aligning advanced AI systems. Their approach of leveraging commercial agent deployments to systematically collect alignment-relevant data at scale while creating immediate business value is innovative. However, I have concerns about how they'll ensure the quality and representativeness of the collected data across different demographics and use cases. From my MLT Accelerator experience, I know how challenging it is to build robust datasets that avoid perpetuating existing biases.

(1) Their technical approach, building on Pacchiardi et al.'s work on detecting AI deception, shows strong research foundations. The commercial viability angle also provides a clear path to scaling data collection.

(2) The focus on generating high-quality alignment data addresses a fundamental challenge in AI safety. Their threat model around insufficient data quality for training aligned AI systems is well-reasoned.

(3) While their technical approach is sound, more detail is needed on data quality assurance and bias mitigation methods. Their pilot experiment demonstrates feasibility but needs more rigorous validation.

Jaime Raldua

I really like the ambitious scope of this project: not just thinking about current AI systems, but actually trying to build something that could potentially scale to ASI, which is pretty impressive. The focus on dataset collection for alignment research is a solid approach, and I particularly appreciate how they're trying to create a practical, commercially viable solution while contributing to the broader alignment research community.

On the research quality front (Criteria 1), while their approach is interesting, I have some concerns about scalability. Their core assumption that methods which work for low-risk agents will generalize to high-stakes settings isn't necessarily solid; we've seen from Redwood's research that AI control strategies often need to be fundamentally different for high-stakes versus low-stakes scenarios. And while I agree with their critique that RLHF won't scale to superintelligent agents, that claim is itself debatable.

Regarding AI safety and methods (Criteria 2 & 3), they've done a good job identifying the threat model - the misalignment between agent actions and true user intent. Their adaptation of Pacchiardi's work on deception detection is clever, and I like how they're approaching it through a commercial lens to gather real-world data.

Cite this work

@misc{arbeid2025antimidas,
  title={AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation},
  author={Jacob Arbeid and Jay Bailey and Sam Lowenstein and Jake Pencharz},
  date={2025-01-19},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.