Jan 11, 2026

Biased Attractiveness Bench (BAB): Image Reward Models Confuse Attractiveness with Realism

Harvey Mannering, Yaseen Mohammed Osman, Yeda WANG, Annika Catulli, Nehal Yasin

Image reward models are used to finetune and improve diffusion models; any biases they contain can therefore be exploited, leaving the generative model vulnerable to reward hacking. We explore how the attractiveness of faces affects the rewards assigned to an image, designing a new benchmark to measure the effect. In particular, we find that human preference models conflate attractive AI-generated faces with high-quality, realistic images, assigning higher rewards to more attractive people. To our knowledge, we are the first to document this tendency in image reward models.
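As an illustrative sketch of the kind of measurement the benchmark describes (not the authors' released code), one could score each generated face with a CLIP-style preference model and correlate the reward with an attractiveness label. The snippet below assumes PickScore is loaded from the Hugging Face Hub (yuvalkirstain/PickScore_v1 with the laion/CLIP-ViT-H-14 processor) and that each record pairs an image file with its generation prompt and an attractiveness score; the file paths and record layout are hypothetical.

# Minimal sketch: probe a CLIP-style reward model (here PickScore) for a
# correlation between face "attractiveness" and the reward it assigns.
# Model/processor IDs and the sample records are assumptions for illustration,
# not the benchmark's actual implementation.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
from scipy.stats import spearmanr

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

@torch.no_grad()
def reward(image_path: str, prompt: str) -> float:
    """Score one image against its generation prompt with the reward model."""
    image = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=image, return_tensors="pt").to(device)
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt").to(device)
    img_emb = model.get_image_features(**image_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = model.get_text_features(**text_inputs)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (model.logit_scale.exp() * (txt_emb @ img_emb.T)).item()

# Hypothetical benchmark records: (image file, neutral prompt, attractiveness label).
samples = [
    ("faces/0001.png", "a photo of a person's face", 0.91),
    ("faces/0002.png", "a photo of a person's face", 0.22),
    # ...
]

scores = [reward(path, prompt) for path, prompt, _ in samples]
labels = [attr for _, _, attr in samples]
rho, p = spearmanr(scores, labels)
print(f"Spearman correlation between reward and attractiveness: {rho:.3f} (p={p:.3g})")

A strong positive correlation under a setup like this would indicate the confusion of attractiveness with realism described above; other reward models such as HPSv2 or ImageReward could be swapped in analogously through their own scoring interfaces.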

Reviewer's Comments


This project is a solid effort at identifying the "Halo Effect" in models like PickScore and HPSv2. I really enjoyed the focus on beauty bias, though it does feel somewhat incremental given the existing literature on the subject. One thing I struggled with was the assumption that these reward models are accurate proxies for human liking; without validation against actual human preferences, it is hard to be certain. I also found it interesting, and a bit puzzling, that ImageReward showed nearly zero correlation while the other models' correlations were so high, and I would have loved to see a bit more exploration into why that happened.

While I think framing this bias as a "manipulation" vulnerability feels a bit far-fetched in this specific context, I can see where the team was going with it. My main suggestion for the future would be to move away from the "synthetic loop" of using AI-generated images labelled by AI classifiers, as grounding these findings in real-world data is the only way to prove they aren't just artefacts of the generation process. However, I completely understand that this was a ton to cover within a hackathon timeline! Great job overall.

This paper targets a real and interesting, if somewhat marginal, issue: how training for photorealistic faces with RLHF can be biased by human preferences for attractive faces. The paper is clearly written and, by and large, well executed. I also appreciate that the team focused on an issue of a size that could be handled well.

The main issues with the paper, as I see them, are methodological. To start, the team generated a sample of only 300 faces, which is arguably rather small for evaluating these effects.

The more substantial issue is that they did not have an independent method for evaluating the attractiveness of faces, instead deferring this to the prompts. As far as I can see, this means they cannot rule out the risk that the image generator itself is subject to this bias, i.e. that prompting it to generate more attractive faces in fact makes it generate more realistic faces. Without some control for this, I'm not sure the paper achieves what it set out to do sufficiently well, and it will probably need to be complemented before it can serve as an authoritative resource for evaluating this bias.

Cite this work

@misc{mannering2026bab,
  title={(HckPrj) Biased Attractiveness Bench (BAB): Image Reward Models Confuse Attractiveness with Realism},
  author={Harvey Mannering and Yaseen Mohammed Osman and Yeda WANG and Annika Catulli and Nehal Yasin},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

View All


Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.