Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
TL;DR Highlight
A method that automatically identifies and removes unnecessary CoT reasoning steps where perplexity doesn't change — reducing token count while maintaining accuracy.
Who Should Read
ML engineers wanting to reduce LLM inference costs, especially developers who create CoT fine-tuning data or optimize few-shot prompts.
Core Mechanics
- Simple principle: if removing a step causes a large perplexity increase, the step is important; if perplexity barely changes, the step is unnecessary
- Strong negative correlation confirmed between perplexity and reasoning accuracy (AL1: r = -0.690 to -0.860, Diff-Calc: r = -0.997, Time-Diff: r = -0.850 to -0.973)
- Added a 'merge' mechanism to avoid breaking context when removing steps — the content of a removed step is folded into an adjacent step
- SPIRIT-FS for few-shot CoT: reduces demo steps to encourage shorter model generation (AL1 from 7 steps to 4 while maintaining accuracy)
- SPIRIT-FT for fine-tuning: refines the reasoning steps in training data for SFT/ORPO training — consistently outperforms random step removal
- Steps selected using one model's (LLaMA3.1-70B) perplexity work well for GPT-4o-mini and GPT-3.5-Turbo — transferable across models
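The scoring idea behind the bullets above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ppl_fn` is a hypothetical callback that returns the perplexity of the answer given a candidate chain of steps (in practice this would call an LM such as LLaMA3.1-70B).

```python
from typing import Callable, List

def step_importance(steps: List[str],
                    ppl_fn: Callable[[List[str]], float]) -> List[float]:
    """Score each step by how much answer perplexity rises when it is removed.

    A large increase marks a critical step; near-zero change marks an
    unnecessary one. `ppl_fn` is an assumed stand-in for a real LM call.
    """
    base = ppl_fn(steps)
    scores = []
    for i in range(len(steps)):
        without_i = steps[:i] + steps[i + 1:]
        scores.append(ppl_fn(without_i) - base)  # delta-perplexity for step i
    return scores
```

With a toy `ppl_fn` that is low only when a particular step survives, the critical step receives the highest score.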
Evidence
- Diff-Calc task perplexity-accuracy correlation r=-0.997 (p=3.37e-8), Time-Diff r=-0.973 (p=0.0002) — statistically highly significant
- AL1 few-shot, 7 to 4 steps: LLaMA3.1-70B accuracy 99.80% to 99.20%; GPT-4o-mini 98.00% to 98.80% (a slight improvement); random removal drops to 94.40%
- NBC task, 12 to 9 steps: Ours (merge) lifts GPT-4o-mini from 95.80% to 97.80%; random removal drops to 91.60%
- Using a strong model's (LLaMA3-8B) perplexity to refine a weak model's (LLaMA2-7B, Qwen1.5-7B) fine-tuning data performs even better than using the weak model's own perplexity
How to Apply
- Few-shot prompt optimization: in existing CoT demos, remove each step one at a time and measure the perplexity change on calibration samples — remove or merge the least-changing steps to reduce tokens while maintaining accuracy.
- CoT fine-tuning data refinement: run SPIRIT-FT algorithm on math reasoning datasets to remove unnecessary steps, then fine-tune with LoRA SFT/ORPO.
- Restricted model access (closed model fine-tuning): compute perplexity using open-source models like LLaMA3.1-70B to select steps, then use results as few-shot demos — performance maintained via cross-model transferability.
Code Example
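A minimal sketch of the iterative refinement loop described under "How to Apply": repeatedly drop the step whose removal raises perplexity the least, optionally folding its text into an adjacent step (the paper's 'merge'). `ppl_fn` and `target_steps` are illustrative assumptions, not the authors' API; a real run would back `ppl_fn` with an LM scored on calibration samples.

```python
from typing import Callable, List

def refine_cot(steps: List[str],
               ppl_fn: Callable[[List[str]], float],
               target_steps: int,
               merge: bool = True) -> List[str]:
    """Greedily prune a CoT demo down to `target_steps` steps.

    At each round, remove the step with the smallest perplexity increase
    (i.e. the least critical one). With `merge=True`, the removed step's
    content is folded into a neighboring step instead of being discarded,
    so the chain's context is not broken.
    """
    steps = list(steps)
    while len(steps) > target_steps:
        base = ppl_fn(steps)
        # delta-perplexity for removing each step; pick the smallest
        deltas = [(ppl_fn(steps[:i] + steps[i + 1:]) - base, i)
                  for i in range(len(steps))]
        _, i = min(deltas)
        removed = steps.pop(i)
        if merge and steps:
            j = min(i, len(steps) - 1)           # adjacent surviving step
            steps[j] = removed + " " + steps[j]  # fold content in
    return steps
```

The refined step list can then be dropped back into a few-shot prompt (SPIRIT-FS) or written out as cleaned training data for SFT/ORPO (SPIRIT-FT).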
Terminology
Original Abstract (Expand)
Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.