FertiScope: Measuring the Multilingual Tokenizer Tax in Low-Resource Asian Languages
Miles Whiticker, Leo Deng, Antoine Pedretti
When LLMs are used with lower-resource Asian languages, a hidden "tax" applies: the same text is split into more tokens than its English equivalent. The higher token count raises API bills, leaves room for fewer in-context examples, and fills the context window faster. To measure this we built FertiScope, an open-source web tool that reports token counts and the associated costs for 15 lower-resource Asian languages. The tool covers three tokenizers: GPT-4o, GPT-4/3.5 and SEA-LION v3. On parallel FLORES-200 sentences, we find that the same content costs up to 11.6× its English equivalent, depending heavily on the tokenizer. Finally, we ran a 540-call needle-in-a-haystack test across a subset of language, model and tokenizer combinations, which found no degradation at token lengths within the models' publicly advertised context windows.
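The tax described above is a ratio of token counts for parallel sentences. The following is a minimal sketch of that calculation, not FertiScope's implementation. The function names, the UTF-8 byte tokenizer (a stand-in, not one of the three tokenizers measured), the sentence pair, and the 128,000-token window are all assumptions made for illustration.

```python
from typing import Callable, List

# Any function that maps text to a list of token ids.
Tokenizer = Callable[[str], List[int]]


def utf8_byte_tokenizer(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte.

    This is NOT a real LLM tokenizer; it only keeps the sketch
    self-contained and deterministic.
    """
    return list(text.encode("utf-8"))


def token_tax(english: str, translation: str, tokenize: Tokenizer) -> float:
    """Token-count ratio for a parallel sentence pair.

    1.0 means parity with English; 2.0 means twice as many tokens.
    """
    en_tokens = len(tokenize(english))
    if en_tokens == 0:
        raise ValueError("English reference sentence produced no tokens")
    return len(tokenize(translation)) / en_tokens


def effective_context(window_tokens: int, tax: float) -> int:
    """English-equivalent capacity of a context window under a given tax."""
    return int(window_tokens / tax)


if __name__ == "__main__":
    # Illustrative pair only; not taken from FLORES-200.
    ratio = token_tax("Hello", "สวัสดี", utf8_byte_tokenizer)
    print(f"tax under the stand-in tokenizer: {ratio:.1f}x")

    # Hypothetical 128,000-token window at the reported worst-case 11.6x tax.
    print(effective_context(128_000, 11.6), "English-equivalent tokens")
```

To approximate a real measurement, a real tokenizer's encode function could be passed in place of the stand-in, for example `tiktoken.get_encoding("o200k_base").encode` for GPT-4o, assuming the `tiktoken` package is installed.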
Cite this work
@misc{
  title={(HckPrj) FertiScope: Measuring the Multilingual Tokenizer Tax in Low-Resource Asian Languages},
  author={Miles Whiticker, Leo Deng, Antoine Pedretti},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


