This is a really nice project on chain of thought. The experiments are logical and well conducted, and the presentation of the results is clear. The uplift in chain of thought performance is quite surprising - I'd be interested to know whether the authors tuned the feature strengths or left them at the default intervention strength. Feature steering curves (feature strength vs performance) often peak at somewhat different points for different features (even semantically very similar ones), so tuning can be well worth doing. The findings on uncertainty at the first tokens of a direct response are intriguing and worth some more investigation.
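To make the tuning suggestion concrete, here is a minimal sketch of a per-feature strength sweep. Everything named here is an assumption for illustration: `evaluate` stands in for whatever benchmark harness the authors already have, the feature names are hypothetical, and `toy_evaluate` is a synthetic curve, not a result from the project.

```python
from typing import Callable, Iterable, Tuple


def tune_strength(
    evaluate: Callable[[str, float], float],
    feature: str,
    strengths: Iterable[float],
) -> Tuple[float, float]:
    """Sweep steering strengths for one feature; return (best_strength, best_score)."""
    scores = {s: evaluate(feature, s) for s in strengths}
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_evaluate(feature: str, strength: float) -> float:
    """Synthetic stand-in for a real benchmark run (assumption, not real data).

    Each hypothetical feature gets a curve that peaks at a different strength,
    mimicking the behaviour described above.
    """
    peaks = {"feature_a": 2.0, "feature_b": 6.0}
    return 1.0 - 0.01 * (strength - peaks[feature]) ** 2


grid = [0.0, 2.0, 4.0, 6.0, 8.0]
best = {f: tune_strength(toy_evaluate, f, grid) for f in ("feature_a", "feature_b")}
```

In practice the sweep should be scored on a held-out split, so that the chosen strength is not overfit to the evaluation set used for the headline numbers.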
A very interesting extension would be to test the generalisation of these features to another domain where CoT reasoning is important (ideally something non-mathematical, for example logic puzzles). Seeing a scatter plot of performance on one domain vs performance on another domain would be very informative - my concern is that steering might improve one kind of performance at the expense of another.
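As a rough sketch of the cross-domain check, assuming the authors can compute a per-feature change in accuracy relative to the unsteered baseline on each domain: the dictionaries below hold placeholder numbers chosen only to exercise the code, and the feature names are hypothetical.

```python
from typing import Dict, List, Tuple


def scatter_points(
    domain_a: Dict[str, float], domain_b: Dict[str, float]
) -> List[Tuple[str, float, float]]:
    """Pair up per-feature deltas as (feature, delta_a, delta_b) for plotting."""
    shared = sorted(set(domain_a) & set(domain_b))
    return [(f, domain_a[f], domain_b[f]) for f in shared]


def find_tradeoffs(points: List[Tuple[str, float, float]]) -> List[str]:
    """Features that help one domain while hurting the other."""
    return [f for f, a, b in points if a * b < 0]


# Placeholder deltas vs. the unsteered baseline (NOT real results).
math_delta = {"feature_a": 0.05, "feature_b": 0.08, "feature_c": -0.01}
logic_delta = {"feature_a": 0.03, "feature_b": -0.04, "feature_c": -0.02}

points = scatter_points(math_delta, logic_delta)
tradeoffs = find_tradeoffs(points)
```

The `points` list can be passed straight to any plotting library; features landing in the off-diagonal quadrants (one delta positive, one negative) are the ones where steering trades one kind of performance for another.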