Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
TL;DR Highlight
A new method for determining when an LLM should abstain from answering — it reverse-analyzes the model's reasoning trace to reconstruct 'what question the model actually answered' and compares it against the original question.
Who Should Read
Backend/ML engineers who need to filter out hallucinations or inappropriate responses in LLM-based services. Especially developers looking to deploy reasoning models like DeepSeek-R1 or GPT-o series in production.
Core Mechanics
- Existing abstention methods decide whether to refuse based on 'model confidence,' but this approach fails particularly for reasoning models (models using CoT), which frequently produce high-confidence hallucinations.
- A new framework called 'Query Misalignment' reinterprets hallucinations not as 'wrong answers' but as 'answers to a different question' — the perspective that a model receiving user query q internally transforms it into q* before answering.
- TRACE INVERSION operates in 3 steps: ① Generate the model's reasoning trace → ② Reconstruct q*, the question the model actually answered, from the trace alone → ③ Compare similarity between the original query q and q*, flagging abstention when the difference is large.
- Similarity measurement uses majority voting across an ensemble of 3 modules: sentence embedding cosine similarity (SE), LLM evaluation (TrInv-LLM), and grounding verification using IBM's Granite-Guardian-3.3-8b (GROUND).
- The best-performing module varies by domain: SE is strongest for math problems (84.2%), TrInv-LLM for reading comprehension (73.3%), and GROUND for bias/safety-related domains (75.2%).
- Adding CoT prompts to existing baselines drops abstention performance by an average of 2.6% — evidence that existing methods are ill-suited for reasoning models.
Evidence
- "TRACE INVERSION achieved top performance over the strongest existing baseline in 33 out of 36 configurations across 4 models (phi-4, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, gpt-oss-120b) × 9 datasets. Average Abstain Accuracy improved by 8.7% over competing methods, with +11.6% on phi-4 and +9.5% on Qwen2.5-32B. The gap versus baselines is more pronounced on datasets containing unanswerable questions: baselines show 13–20%+ performance gaps between answerable vs. unanswerable across domains, while TRACE INVERSION's gap is only 3–6%. On DeepSeek-R1-Distill-Qwen-32B, overall Abstain Accuracy reached 0.733, and 0.762 on gpt-oss-120b — significantly ahead of the strongest baselines (0.604 and 0.648, respectively)."
How to Apply
- In QA pipelines using reasoning models (DeepSeek-R1, GPT-o series, etc.): after the model responds, extract the reasoning trace, reconstruct q* using a Query Reconstruction Prompt, compute cosine similarity with sentence embeddings against the original query, and use it as a guardrail to block responses below a threshold.
- If cost is a concern, start with a single module instead of the full ensemble: for math/knowledge QA services, the SE module alone (sentence transformer all-MiniLM-L6-v2) delivers competitive performance (84.2%), while bias/safety services can leverage a guardrail model like Granite-Guardian.
- If existing confidence-based filters (token probabilities, asking the model about its confidence) are misbehaving with reasoning models, consider replacing them with TRACE INVERSION. It is especially effective for services with many unanswerable, false-premise, or subjective questions.
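The single-module guardrail described above can be sketched minimally as follows. To keep the example dependency-free, a toy token-overlap (Jaccard) similarity stands in for the SE module's sentence-embedding cosine similarity; the function names and the 0.5 threshold are illustrative, not from the paper.

```python
def token_overlap_similarity(q: str, q_star: str) -> float:
    """Jaccard overlap of lowercase word sets -- a toy stand-in for
    the SE module's sentence-embedding cosine similarity."""
    a, b = set(q.lower().split()), set(q_star.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def guardrail(query: str, reconstructed_query: str, threshold: float = 0.5) -> bool:
    """Return True when the response should be blocked (abstain)."""
    return token_overlap_similarity(query, reconstructed_query) < threshold

# A reconstructed query that drifts from the original trips the guardrail.
q = "what is the capital of france"
print(guardrail(q, "what is the capital city of france"))  # False: aligned, answer passes
print(guardrail(q, "list famous museums in paris"))        # True: misaligned, abstain
```

In production the similarity function would be swapped for the embedding model, and the threshold tuned on held-out answerable/unanswerable examples.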
Code Example
# Example implementation of the core TRACE INVERSION logic
from sentence_transformers import SentenceTransformer, util

model_embed = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: Generate reasoning trace with LLM
def get_reasoning_trace(llm, query):
    prompt = f"Let's think step by step.\n\nQuestion: {query}\n\nReasoning:"
    return llm.generate(prompt)

# Step 2: Reconstruct the original question from the trace alone
QUERY_RECONSTRUCTION_PROMPT = """
You are a puzzle solver. Given the following reasoning trace,
reconstruct the initial question by interpreting the steps in the reasoning trace.
Do not answer the question.
Reasoning Trace:
{reasoning_trace}
Reconstructed query:
"""

def reconstruct_query(llm, reasoning_trace):
    prompt = QUERY_RECONSTRUCTION_PROMPT.format(reasoning_trace=reasoning_trace)
    return llm.generate(prompt)

# Step 3-A: Measure distance using sentence embedding similarity (SE Module)
def se_similarity(original_query, reconstructed_query, threshold=0.85):
    emb_q = model_embed.encode(original_query, convert_to_tensor=True)
    emb_q_star = model_embed.encode(reconstructed_query, convert_to_tensor=True)
    score = util.cos_sim(emb_q, emb_q_star).item()
    should_abstain = score < threshold
    return score, should_abstain

# Step 3-B: LLM evaluation module (TrInv-LLM Module)
TRINV_LLM_PROMPT = """
Do the following two prompts convey the same framing, intent, and context?
Prompt 1: {q1}
Prompt 2: {q2}
Select YES or NO:
Final answer:
"""

def trinv_llm_check(llm, original_query, reconstructed_query):
    prompt = TRINV_LLM_PROMPT.format(q1=original_query, q2=reconstructed_query)
    # Assumes llm.generate returns only the completion, not the echoed prompt
    response = llm.generate(prompt)
    should_abstain = 'NO' in response.upper()
    return should_abstain

# Ensemble: majority vote
def trace_inversion(llm, query, threshold=0.85):
    trace = get_reasoning_trace(llm, query)
    q_star = reconstruct_query(llm, trace)
    _, se_abstain = se_similarity(query, q_star, threshold)
    llm_abstain = trinv_llm_check(llm, query, q_star)
    # GROUND module requires a Granite-Guardian call (omitted here)
    votes = [se_abstain, llm_abstain]  # 3 votes when GROUND is included
    should_abstain = sum(votes) > len(votes) / 2  # majority vote
    return should_abstain, q_star, trace
Original Abstract
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. Low similarity score between the initial query and reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.