We develop a theoretical model of how transformers learn addition and compare its predictions with the observed training loss over epochs.