Thank you for submitting your work on A Noise Audit of LLM Reasoning in Legal Decisions - it was timely and interesting to read. Please see below for feedback on your project:
1. Strengths of your proposal/ project:
- Methods and results are clearly outlined
- The project appears to be novel
- It is good that you detail the drawbacks/ limitations of the study
2. Areas for improvement
- Terminology should be used consistently. E.g., on p.1 you mention ‘level of variance in judgements’. As this is the first mention, you should make clear that you mean legal judgements (and specify what a judgement is, in which legal context or area of law, who makes judgements, and in what jurisdiction or country). Or are you referring to LLM judgements here? You also mention ‘the Supreme Court in 2005’ - of which country?
- Some key concepts/ methods should be defined - e.g., chain-of-thought reasoning (p.2)
- Sentencing guidelines are heavily relied upon by courts, so I would expect a stronger argument under ‘Noise in Legal Decisions’ as to why variation is a problem, along with more up-to-date evidence of the current level of variation amongst UK or US judgements.
- I’m not sure that I agree with the statement that ‘Because modern LLMs are trained on data scraped from the internet, it is likely that they will have prior knowledge of the court cases’ (p.3) without any evidence being provided. Can you provide a reference here (e.g., that legal cases are used to train LLMs)?
- You note that ‘Given the low sample size, the test does not have high power, introducing vulnerability to type 2 error’. Did you do a power calculation? Can you provide that in the paper?
- You note on p.6 that ‘The facts presented to the model were a short, broad summary of the case. The Supreme Court justices hearing the case would have much more detailed information, not to mention their expertise and extensive experience.’ It would be helpful if you could briefly go into more detail here - what implication does this have for your results? The drawbacks are noted descriptively, but more critical analysis would strengthen this section.
- A clear threat model should be provided.
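
On the power calculation point above, here is a minimal sketch of what such a calculation could look like. It assumes a two-sided, two-sample comparison of means and uses a normal approximation; the function name, the effect size (Cohen's d) and the per-group sample sizes are illustrative assumptions, not values taken from your paper. Please substitute the test and the numbers you actually used.

```python
from statistics import NormalDist


def approx_power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size: assumed standardized mean difference (Cohen's d).
    n_per_group: assumed number of observations in each group.
    alpha:       significance level of the test.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Hypothetical scenario: a medium effect (d = 0.5) at several sample sizes.
    for n in (10, 20, 50, 100):
        print(n, round(approx_power_two_sample(0.5, n), 3))
```

For small samples the normal approximation slightly overstates power compared with an exact calculation based on the noncentral t distribution, so treat the output as an upper guide. Reporting the assumed effect size alongside the resulting power would let readers judge how exposed the test is to type 2 error.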