LLMs are increasingly popular for reasoning tasks such as multi-turn QA, task completion, code generation, and mathematics. However, they often make mistakes, so the ability to identify and correct those mistakes is important. To study this, we break the self-correction process into two components: mistake finding and output correction. In our study, we test state-of-the-art LLMs on these two components separately.
We have created a new evaluation benchmark dataset called BIG-Bench Mistake to assess LLMs' ability to find mistakes. The dataset consists of chain-of-thought (CoT) traces generated with PaLM 2 on five tasks from BIG-Bench, and each trace is annotated with the location of the first logical mistake. We labeled the dataset with the help of human labelers, achieving high inter-rater reliability.
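For concreteness, the sketch below shows what a single BIG-Bench Mistake record looks like conceptually. The field names and the example trace are illustrative assumptions for exposition, not the dataset's exact schema.

```python
# Illustrative sketch of a BIG-Bench Mistake record; field names and the
# example below are assumptions, not the dataset's exact schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MistakeTrace:
    task: str                     # one of the five BIG-Bench tasks
    question: str                 # the task input given to PaLM 2
    steps: List[str]              # the generated CoT steps, in order
    mistake_index: Optional[int]  # index of the first logical mistake,
                                  # or None if the trace is mistake-free


example = MistakeTrace(
    task="word_sorting",
    question="Sort the following words alphabetically: banana apple cherry",
    steps=[
        "The words to sort are: banana, apple, cherry.",
        "Sorted alphabetically: apple, banana, cherry.",
    ],
    mistake_index=None,  # this made-up trace contains no mistake
)
```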
Our experiments show that LLMs struggle with mistake finding, even for simple and unambiguous mistakes. All of the prompting methods we test, direct (trace), direct (step), and CoT (step), yield low accuracy. We also find that mistake finding cannot be used as a proxy for the correctness of the final answer.
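As a rough illustration of how these prompting methods differ, the sketch below assembles trace-level and step-level mistake-finding prompts. The prompt wording is an assumption rather than the exact prompts from our experiments, and `call_llm` is a hypothetical stand-in for any text-completion API.

```python
from typing import List, Optional


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a text-completion API call."""
    raise NotImplementedError


def find_mistake_trace_level(question: str, steps: List[str]) -> str:
    # Direct (trace): show the whole trace and ask for the first bad step.
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = (
        f"{question}\n{numbered}\n"
        "Which step, if any, contains the first logical mistake? "
        "Answer with a step number or 'No mistake'."
    )
    return call_llm(prompt)


def find_mistake_step_level(question: str, steps: List[str]) -> Optional[int]:
    # Direct (step): ask about one step at a time; the first "No" marks the
    # mistake. CoT (step) is the same loop, but the prompt also asks the
    # model to reason before answering.
    for i in range(len(steps)):
        shown = "\n".join(f"Step {j + 1}: {s}" for j, s in enumerate(steps[: i + 1]))
        prompt = f"{question}\n{shown}\nIs Step {i + 1} correct? Answer Yes or No."
        if call_llm(prompt).strip().lower().startswith("no"):
            return i
    return None  # no mistake found
```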
However, we propose a backtracking method that allows LLMs to correct their errors: given the location of the mistake, we regenerate the trace from that step, producing an alternative output for it and improving answer accuracy. Backtracking shows more gains than losses compared to a random-location baseline.
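The following is a minimal sketch of this idea, assuming the original trace was decoded greedily and that `generate_step` and `continue_trace` are hypothetical wrappers around the model's sampling API.

```python
from typing import Callable, List


def backtrack(
    question: str,
    steps: List[str],
    mistake_index: int,
    generate_step: Callable[[str, List[str], float], str],
    continue_trace: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Sketch of backtracking under stated assumptions.

    Keep the steps before the identified mistake, re-sample the mistaken
    step at a higher temperature so the model produces a different output,
    then let the model finish the trace from the corrected prefix.
    """
    prefix = steps[:mistake_index]
    # Sample an alternative for the mistaken step (temperature is illustrative).
    new_step = generate_step(question, prefix, 1.0)
    # Continue decoding from the corrected prefix to reach a new final answer.
    return prefix + [new_step] + continue_trace(question, prefix + [new_step])
```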
Furthermore, we find that mistake finding can generalize to tasks never seen during training: fine-tuned reward models outperform zero-shot prompting with a larger model in most cases.