Jan 19, 2025

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

Jacob Arbeid, Jay Bailey, Sam Lowenstein, Jake Pencharz

🏆 1st place by peer review

AI alignment lacks high-quality, real-world preference data needed to align agentic superintel-

ligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in

detecting AI deception through black-box analysis. We adapt their classification methodology

to identify intent misalignment between agent actions and true user intent, enabling real-time

correction of agent behavior and generation of valuable alignment data. We commercialise this

by building a workflow that incorporates a classifier that runs on live trajectories as a user

interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment

expertise produces superior agents, driving commercial adoption, which generates increasingly

valuable alignment datasets of trillions of trajectories labelled with human feedback: an invalu-

able resource for AI alignment.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

I valued the team's vision to generate an alignment dataset, starting with low-stakes cases. I would have appreciated more though on what are the incremental steps (i.e. specific examples of industries and verticals) towards more high-stakes cases.

Also when it comes to the MVP and product, I encourage the team to think further into how this can be commercialized, and what will make e-commerce companies adopt this technology (in the example of the first low-stakes case).

The demo video was very useful to understand the team's proposal.

Really interesting idea. Biggest challenge is being able to get the system working well enough to collect enough data for it to materially improve an existing agent. Existing companies creating such agents might also push back strongly on this.

The team has identified a critical gap in AI alignment - the lack of high-quality, real-world preference data needed for aligning advanced AI systems. Their approach of leveraging commercial agent deployments to systematically collect alignment-relevant data at scale while creating immediate business value is innovative. However, I have concerns about how they'll ensure the quality and representativeness of the collected data across different demographics and use cases. From my MLT Accelerator experience, I know how challenging it is to build robust datasets that avoid perpetuating existing biases. (1) Their technical approach building on Pacchiardi et al.'s work on detecting AI deception shows strong research foundations. The commercial viability angle also provides a clear path to scaling data collection.

(2) The focus on generating high-quality alignment data addresses a fundamental challenge in AI safety. Their threat model around insufficient data quality for training aligned AI systems is well-reasoned.

(3) While their technical approach is sound, more detail is needed on data quality assurance and bias mitigation methods. Their pilot experiment demonstrates feasibility but needs more rigorous validation

I really like the ambitious scope of this project.. not just thinking about current AI systems but actually trying to build something that could potentially scale to ASI, which is pretty impressive. The focus on dataset collection for alignment research is a solid approach, and I particularly appreciate how they're trying to create a practical, commercially viable solution while contributing to the broader alignment research community.

On the research quality front (Criteria 1), while their approach is interesting, I have some concerns about scalability. Their core assumption that methods working for low-risk agents will generalize to high-stakes settings isn't necessarily solid - we've seen from Redwood's research that AI control strategies often need to be fundamentally different for high-stakes versus low-stakes scenarios. Also while I agree with their critique of RLHF not scaling to superintelligent agents, this is debatable.

Regarding AI safety and methods (Criteria 2 & 3), they've done a good job identifying the threat model - the misalignment between agent actions and true user intent. Their adaptation of Pacchiardi's work on deception detection is clever, and I like how they're approaching it through a commercial lens to gather real-world data.

Cite this work

@misc {

title={

AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation

},

author={

Jacob Arbeid, Jay Bailey, Sam Lowenstein, Jake Pencharz

},

date={

1/19/25

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Read More

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Read More

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. We learned some models complied with manipulative requests at much higher rates. And we found some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.