Score Against Ground Truth: Cross-Language Fragility in Heuristic LLM Evaluation
Iago Moura e Moreira
We present a game-based method for evaluating language-model agents: a single game mechanic acts as a test primitive on a substrate whose world state is stored as JSON, giving the scorer ground truth it can check directly. As a first instance, we probe whether a model's instructed disposition (brazen, careful, skeptical, literal) holds when the locally rational move pushes against it. The run comprises 160 trials across four dispositions, in cold and primed arms, in English and Portuguese.
The model behaved identically throughout and never refused the verifiable safe action, a clean refusal null replicated across both languages. The heuristic scorer was less stable: it produced 13 false positives, every one an error in its language-specific text patterns rather than in the model's behavior. Most appeared only in Portuguese, triggered by negation surface forms that English never produced. Because the model's behavior was identical, these errors isolate cleanly to the grader, which makes the case for scoring multilingual evaluations against ground truth rather than with language-specific heuristics. The run sits at the floor of the difficulty range, so it establishes a controlled baseline and a reusable primitive, not a result about disposition under genuine pressure.
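The contrast between the two scoring styles can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regex patterns, the `actions_taken` field, the `lock_door` action name, and the example replies are hypothetical and are not taken from the submission's scorer or data. It only shows how a negation-keyed text pattern can flag a refusal that the JSON world state contradicts.

```python
import json
import re

# Hypothetical language-specific patterns (NOT the submission's actual ones).
REFUSAL_PATTERNS = {
    "en": re.compile(r"\bI (?:won't|will not|cannot|can't)\b", re.IGNORECASE),
    "pt": re.compile(r"\bnão\b", re.IGNORECASE),  # naive: fires on any negation
}


def heuristic_refused(reply: str, lang: str) -> bool:
    """Text heuristic: call it a refusal if a negation pattern matches."""
    return REFUSAL_PATTERNS[lang].search(reply) is not None


def ground_truth_refused(world_state_json: str, safe_action: str) -> bool:
    """Ground-truth check: a refusal means the safe action was never taken."""
    state = json.loads(world_state_json)
    return safe_action not in state["actions_taken"]  # assumed field name


# Same behavior in both languages: the safe action is recorded in the state.
world_state = json.dumps({"actions_taken": ["lock_door"]})
en_reply = "No problem, I'll lock the door."
pt_reply = "Não vejo problema, vou trancar a porta."

en_heuristic = heuristic_refused(en_reply, "en")
pt_heuristic = heuristic_refused(pt_reply, "pt")
truth = ground_truth_refused(world_state, "lock_door")

print(f"en heuristic={en_heuristic} pt heuristic={pt_heuristic} truth={truth}")
```

In this toy case the Portuguese reply agrees to act but opens with a negation, so the text heuristic flags it while the world-state check does not. The world-state check is language-independent by construction, which is the property the abstract argues for.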
Cite this work
@misc {
  title={(HckPrj) Score Against Ground Truth: Cross-Language Fragility in Heuristic LLM Evaluation},
  author={Iago Moura e Moreira},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


