Antonio Valerio Miceli-Barone1*, Fazl Barez1*, Ioannis Konstas2, Shay B. Cohen1
1 School of Informatics, University of Edinburgh; 2 School of Mathematical and Computer Sciences, Heriot-Watt University
* Equal contribution
Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases. This is an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size.
Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
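To illustrate the kind of identifier swap at issue, here is a minimal sketch: two Python builtins are swapped before a function definition, so a correct completion must use the swapped names. The specific function, docstring, and prompt layout below are illustrative assumptions, not the exact benchmark prompts used in the paper.

# Illustrative sketch of a builtin-identifier swap (assumed setup, not the paper's exact prompts).
len, print = print, len  # after this line, `print` measures length and `len` prints

def count_items(items):
    """Return the number of elements in `items`."""
    # Correct under the swap: `print` now refers to the original builtin `len`.
    return print(items)

# A model that ignores the swap would generate `return len(items)` instead,
# which here would print the list and return None.
assert count_items([1, 2, 3]) == 3

A model with an abstract grasp of Python semantics should adapt its completion to the swap statement; the paper's finding is that larger models, if anything, cling more confidently to the statistically typical (but now incorrect) usage.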
Figure: Classification loss vs. model size. Left: all models. Right: all models except the Meta AI OPT and Google FLAN-T5 families.
Miceli-Barone, Antonio Valerio, et al. "The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python." arXiv preprint arXiv:2305.15507 (2023).
@article{miceli2023larger,
title={The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python},
author={Miceli-Barone, Antonio Valerio and Barez, Fazl and Konstas, Ioannis and Cohen, Shay B},
journal={arXiv preprint arXiv:2305.15507},
year={2023}
}