Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish
Marlon Naranjo
We built Confía-CO, a reliability evaluation for a customer-service AI assistant operating in Colombian Spanish, grounded in a fictitious small business. Our automated harness first reported that a low-cost commercial model failed most cases (21% accuracy, 67% false answers on traps). Reading the raw outputs by hand revealed the opposite: the model was right in 18 of 19 cases, and the apparent failure came from a scoring bug in our own evaluation pipeline, not from the model. After correcting the bug, accuracy was 94.7% (18/19), with one genuine risk: a leak of confidential information. Our central finding is methodological: an automated evaluator can be wrong with the same confidence as the system it evaluates, and only human validation reveals the error. We observed this risk even though the pipeline was built by a frontier model under expert direction. We release the dataset, harness, and a documented correction log.
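As a rough illustration of the human-validation step described above, the sketch below recomputes accuracy from hand-checked verdicts and lists the cases where an automated scorer and a human reviewer disagree. It is not the released harness: the function names and the list-of-booleans representation are assumptions made for this example, and only the 18-of-19 count comes from the abstract.

```python
from typing import Sequence


def accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of cases marked correct."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def disagreements(auto: Sequence[bool], human: Sequence[bool]) -> list[int]:
    """Indices where the automated scorer and the human reviewer differ."""
    if len(auto) != len(human):
        raise ValueError("verdict lists must have the same length")
    return [i for i, (a, h) in enumerate(zip(auto, human)) if a != h]


# Hand-checked verdicts matching the counts in the abstract: 18 of 19 correct.
# The position of the single failure is arbitrary (hypothetical ordering).
human_verdicts = [True] * 18 + [False]
print(f"{accuracy(human_verdicts):.1%}")
```

Every index returned by `disagreements` is a case worth reading by hand, since either the model or the scorer is wrong there.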
Cite this work
@misc{
  title={(HckPrj) Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish},
  author={Marlon Naranjo},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


