Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
TL;DR Highlight
A method that automatically identifies and removes unnecessary CoT reasoning steps where perplexity doesn't change — reducing token count while maintaining accuracy.
Who Should Read
ML engineers wanting to reduce LLM inference costs, especially developers who create CoT fine-tuning data or optimize few-shot prompts.
Core Mechanics
- Simple principle: if removing a step causes a large perplexity increase, the step is important; if perplexity barely changes, the step is unnecessary
- Strong negative correlation confirmed between perplexity and reasoning accuracy (AL1: r = -0.690 to -0.860, Diff-Calc: r = -0.997, Time-Diff: r = -0.850 to -0.973)
- Added a 'merge' mechanism to avoid breaking context when removing steps — the content of a removed step is folded into an adjacent step
- SPIRIT-FS for few-shot CoT: reduces demo steps to encourage shorter model generation (AL1 from 7 steps to 4 while maintaining accuracy)
- SPIRIT-FT for fine-tuning: refines the reasoning steps in training data for SFT/ORPO training — consistently outperforms random step removal
- Steps selected using one model's (LLaMA3.1-70B) perplexity work well for GPT-4o-mini and GPT-3.5-Turbo — transferable across models
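The scoring idea behind the bullets above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ppl_fn` is a hypothetical callback that returns the perplexity of the answer given a candidate chain of steps (in practice this would call an LM such as LLaMA3.1-70B).

```python
from typing import Callable, List

def step_importance(steps: List[str],
                    ppl_fn: Callable[[List[str]], float]) -> List[float]:
    """Score each step by how much answer perplexity rises when it is removed.

    A large increase marks a critical step; near-zero change marks an
    unnecessary one. `ppl_fn` is an assumed stand-in for a real LM call.
    """
    base = ppl_fn(steps)
    scores = []
    for i in range(len(steps)):
        without_i = steps[:i] + steps[i + 1:]
        scores.append(ppl_fn(without_i) - base)  # delta-perplexity for step i
    return scores
```

With a toy `ppl_fn` that is low only when a particular step survives, the critical step receives the highest score.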
Evidence
- Diff-Calc task perplexity-accuracy correlation r=-0.997 (p=3.37e-8), Time-Diff r=-0.973 (p=0.0002) — statistically highly significant
- AL1 few-shot, 7 to 4 steps: LLaMA3.1-70B accuracy 99.80% to 99.20%; GPT-4o-mini 98.00% to 98.80% (a slight improvement); random removal drops to 94.40%
- NBC task, 12 to 9 steps: Ours (merge) lifts GPT-4o-mini from 95.80% to 97.80%; random removal drops to 91.60%
- Using a strong model's (LLaMA3-8B) perplexity to refine a weak model's (LLaMA2-7B, Qwen1.5-7B) fine-tuning data performs even better than using the weak model's own perplexity
How to Apply
- Few-shot prompt optimization: in existing CoT demos, remove each step one at a time and measure the perplexity change on calibration samples — remove or merge the least-changing steps to reduce tokens while maintaining accuracy.
- CoT fine-tuning data refinement: run SPIRIT-FT algorithm on math reasoning datasets to remove unnecessary steps, then fine-tune with LoRA SFT/ORPO.
- Restricted model access (closed model fine-tuning): compute perplexity using open-source models like LLaMA3.1-70B to select steps, then use results as few-shot demos — performance maintained via cross-model transferability.
Code Example
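A minimal sketch of the iterative refinement loop described under "How to Apply": repeatedly drop the step whose removal raises perplexity the least, optionally folding its text into an adjacent step (the paper's 'merge'). `ppl_fn` and `target_steps` are illustrative assumptions, not the authors' API; a real run would back `ppl_fn` with an LM scored on calibration samples.

```python
from typing import Callable, List

def refine_cot(steps: List[str],
               ppl_fn: Callable[[List[str]], float],
               target_steps: int,
               merge: bool = True) -> List[str]:
    """Greedily prune a CoT demo down to `target_steps` steps.

    At each round, remove the step with the smallest perplexity increase
    (i.e. the least critical one). With `merge=True`, the removed step's
    content is folded into a neighboring step instead of being discarded,
    so the chain's context is not broken.
    """
    steps = list(steps)
    while len(steps) > target_steps:
        base = ppl_fn(steps)
        # delta-perplexity for removing each step; pick the smallest
        deltas = [(ppl_fn(steps[:i] + steps[i + 1:]) - base, i)
                  for i in range(len(steps))]
        _, i = min(deltas)
        removed = steps.pop(i)
        if merge and steps:
            j = min(i, len(steps) - 1)           # adjacent surviving step
            steps[j] = removed + " " + steps[j]  # fold content in
    return steps
```

The refined step list can then be dropped back into a few-shot prompt (SPIRIT-FS) or written out as cleaned training data for SFT/ORPO (SPIRIT-FT).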
Terminology
Original Abstract (Expand)
Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.