Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
TL;DR Highlight
A new method for determining when an LLM should abstain from answering — it reverse-analyzes the model's reasoning trace to reconstruct 'what question the model actually answered' and compares it against the original question.
Who Should Read
Backend/ML engineers who need to filter out hallucinations or inappropriate responses in LLM-based services. Especially developers looking to deploy reasoning models like DeepSeek-R1 or GPT-o series in production.
Core Mechanics
- Existing abstention methods decide whether to refuse based on 'model confidence,' but this approach fails particularly for reasoning models (models using CoT), which frequently produce high-confidence hallucinations.
- A new framework called 'Query Misalignment' reinterprets hallucinations not as 'wrong answers' but as 'answers to a different question' — the perspective that a model receiving user query q internally transforms it into q* before answering.
- TRACE INVERSION operates in 3 steps: ① Generate the model's reasoning trace → ② Reconstruct q*, the question the model actually answered, from the trace alone → ③ Compare similarity between the original query q and q*, flagging abstention when the difference is large.
- Similarity measurement uses majority voting across an ensemble of 3 modules: sentence embedding cosine similarity (SE), LLM evaluation (TrInv-LLM), and grounding verification using IBM's Granite-Guardian-3.3-8b (GROUND).
- The best-performing module varies by domain: SE is strongest for math problems (84.2%), TrInv-LLM for reading comprehension (73.3%), and GROUND for bias/safety-related domains (75.2%).
- Adding CoT prompts to existing baselines drops abstention performance by an average of 2.6% — evidence that existing methods are ill-suited for reasoning models.
Evidence
- "TRACE INVERSION achieved top performance over the strongest existing baseline in 33 out of 36 configurations across 4 models (phi-4, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, gpt-oss-120b) × 9 datasets. Average Abstain Accuracy improved by 8.7% over competing methods, with +11.6% on phi-4 and +9.5% on Qwen2.5-32B. The gap versus baselines is more pronounced on datasets containing unanswerable questions: baselines show 13–20%+ performance gaps between answerable vs. unanswerable across domains, while TRACE INVERSION's gap is only 3–6%. On DeepSeek-R1-Distill-Qwen-32B, overall Abstain Accuracy reached 0.733, and 0.762 on gpt-oss-120b — significantly ahead of the strongest baselines (0.604 and 0.648, respectively)."
How to Apply
- In QA pipelines using reasoning models (DeepSeek-R1, GPT-o series, etc.): after the model responds, extract the reasoning trace, reconstruct q* using a Query Reconstruction Prompt, compute cosine similarity with sentence embeddings against the original query, and use it as a guardrail to block responses below a threshold.
- If cost is a concern, start with a single module instead of the full ensemble: for math/knowledge QA services, the SE module alone (sentence transformer all-MiniLM-L6-v2) delivers competitive performance (84.2%), while bias/safety services can leverage a guardrail model like Granite-Guardian.
- If existing confidence-based filters (token probabilities, asking the model about its confidence) are misbehaving with reasoning models, consider replacing them with TRACE INVERSION. It is especially effective for services with many unanswerable, false-premise, or subjective questions.
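The single-module guardrail described above can be sketched minimally as follows. To keep the example dependency-free, a toy token-overlap (Jaccard) similarity stands in for the SE module's sentence-embedding cosine similarity; the function names and the 0.5 threshold are illustrative, not from the paper.

```python
def token_overlap_similarity(q: str, q_star: str) -> float:
    """Jaccard overlap of lowercase word sets -- a toy stand-in for
    the SE module's sentence-embedding cosine similarity."""
    a, b = set(q.lower().split()), set(q_star.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def guardrail(query: str, reconstructed_query: str, threshold: float = 0.5) -> bool:
    """Return True when the response should be blocked (abstain)."""
    return token_overlap_similarity(query, reconstructed_query) < threshold

# A reconstructed query that drifts from the original trips the guardrail.
q = "what is the capital of france"
print(guardrail(q, "what is the capital city of france"))  # False: aligned, answer passes
print(guardrail(q, "list famous museums in paris"))        # True: misaligned, abstain
```

In production the similarity function would be swapped for the embedding model, and the threshold tuned on held-out answerable/unanswerable examples.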
Code Example
# Example implementation of the core TRACE INVERSION logic
from sentence_transformers import SentenceTransformer, util

model_embed = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: Generate reasoning trace with LLM
def get_reasoning_trace(llm, query):
    prompt = f"Let's think step by step.\n\nQuestion: {query}\n\nReasoning:"
    return llm.generate(prompt)

# Step 2: Reconstruct the original question from the trace alone
QUERY_RECONSTRUCTION_PROMPT = """
You are a puzzle solver. Given the following reasoning trace,
reconstruct the initial question by interpreting the steps in the reasoning trace.
Do not answer the question.
Reasoning Trace:
{reasoning_trace}
Reconstructed query:
"""

def reconstruct_query(llm, reasoning_trace):
    prompt = QUERY_RECONSTRUCTION_PROMPT.format(reasoning_trace=reasoning_trace)
    return llm.generate(prompt)

# Step 3-A: Measure distance using sentence embedding similarity (SE Module)
def se_similarity(original_query, reconstructed_query, threshold=0.85):
    emb_q = model_embed.encode(original_query, convert_to_tensor=True)
    emb_q_star = model_embed.encode(reconstructed_query, convert_to_tensor=True)
    score = util.cos_sim(emb_q, emb_q_star).item()
    should_abstain = score < threshold
    return score, should_abstain

# Step 3-B: LLM evaluation module (TrInv-LLM Module)
TRINV_LLM_PROMPT = """
Do the following two prompts convey the same framing, intent, and context?
Prompt 1: {q1}
Prompt 2: {q2}
Select YES or NO:
Final answer:
"""

def trinv_llm_check(llm, original_query, reconstructed_query):
    prompt = TRINV_LLM_PROMPT.format(q1=original_query, q2=reconstructed_query)
    # Assumes llm.generate returns only the completion, not the echoed prompt
    response = llm.generate(prompt)
    should_abstain = 'NO' in response.upper()
    return should_abstain

# Ensemble: majority vote
def trace_inversion(llm, query, threshold=0.85):
    trace = get_reasoning_trace(llm, query)
    q_star = reconstruct_query(llm, trace)
    _, se_abstain = se_similarity(query, q_star, threshold)
    llm_abstain = trinv_llm_check(llm, query, q_star)
    # GROUND module requires a Granite-Guardian call (omitted here)
    votes = [se_abstain, llm_abstain]  # 3 votes when GROUND is included
    should_abstain = sum(votes) > len(votes) / 2  # majority vote
    return should_abstain, q_star, trace
Original Abstract
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. Low similarity score between the initial query and reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.