DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

TL;DR Highlight

A training-free method that estimates difficulty in real time from an LLM's internal hidden states, cutting reasoning short on easy problems and letting the model think deeply on hard ones.

Who Should Read

ML engineers deploying reasoning models (LRMs) such as DeepSeek-R1 or QwQ-32B in production who want to cut token cost and latency. Developers whose costs are inflated by needlessly long CoT caused by 'overthinking'.

Core Mechanics

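The core of the 'overthinking' problem in LRMs (Large Reasoning Models): even on easy problems, the model keeps reflecting and exploring after it has already found the correct answer, wasting tokens. For QwQ-32B, the correct answer first appears well before the end of the reasoning trace (median r_early = 0.327), yet generation continues to the end.
Problem difficulty changes dynamically during reasoning: it falls as the reasoning path proceeds correctly, and rises or stays flat when the path heads in the wrong direction. Prior methods looked only at a static difficulty estimated before reasoning starts.
The step embedding of each reasoning step (the hidden state at each \n\n boundary) encodes difficulty linearly: with R² ≈ 0.8, a linear regressor alone is enough to predict the current difficulty.
DyCon fits a ridge regressor offline on 600 MATH training samples. At inference it predicts difficulty at every step and dynamically suppresses the logits of reflection keywords such as 'wait', 'hmm', and 'let me reconsider'.
When difficulty is low, reflection-token logits are cut heavily so reasoning ends quickly; when difficulty is high, they are barely touched so the model can think as long as it needs. This is continuous (soft) control rather than a simple on/off termination.
It works without training the LRM (training-free), and a regressor fitted on MATH data transfers as-is to entirely different domains such as coding, commonsense reasoning, and science QA.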
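
The regression target behind the difficulty estimate can be illustrated with a small sketch. It assumes, following the code example later in this document, that a step's label is the log-transformed, min-max-normalized amount of reasoning still remaining at that step. Counting whitespace-separated words instead of model tokens, and the helper name `step_difficulty_labels`, are assumptions made purely for illustration.

```python
import math


def step_difficulty_labels(trace):
    """Return one label in [0, 1] per step of a reasoning trace.

    Steps are separated by blank lines. The label for a step is
    log1p(words remaining after that step), min-max normalized over the trace.
    Word counts stand in for token counts here (an assumption).
    """
    steps = [s for s in trace.split("\n\n") if s.strip()]
    if not steps:
        return []
    lengths = [len(s.split()) for s in steps]

    # Words still to be generated after each step boundary.
    remaining = []
    left = sum(lengths)
    for n in lengths:
        left -= n
        remaining.append(left)

    log_r = [math.log1p(r) for r in remaining]
    lo, hi = min(log_r), max(log_r)
    if hi == lo:  # single-step trace: nothing to normalize
        return [0.0 for _ in log_r]
    return [(v - lo) / (hi - lo) for v in log_r]


trace = "Let x be the unknown.\n\nThen 2x + 3 = 11, so 2x = 8.\n\nSo x = 4."
labels = step_difficulty_labels(trace)
print([round(v, 3) for v in labels])
```

Within a single trace these labels only decrease, because the remaining length shrinks with every step. The regressor fitted on such labels is what predicts difficulty from the hidden state at inference time.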

Evidence

On math benchmarks, tokens drop by up to 40.6% (QwQ-32B on MMLU algebra: 2133→1266 tokens) while accuracy actually rises by 2.0pp (95.0→97.0).
On non-math benchmarks, tokens drop by up to 52.2% (DeepSeek-R1-Distill-Qwen-7B on TriviaQA: 1295→619 tokens), and accuracy on GPQA-Diamond rises by 8.6pp (38.4→47.0).
With a static difficulty coefficient, Math-500 accuracy drops by 3.6pp (92.0→88.4) and AIME2024 by 16.7pp (53.3→33.3), showing that dynamic difficulty modeling is essential.
The regressor reaches R² ≈ 0.8, confirming that step embeddings encode difficulty strongly. The ridge regressor (R² ≈ 0.8) is considerably more accurate than a Random Forest (R² = 0.64) and also gives better downstream performance.
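
The token savings quoted above follow directly from the before/after token counts. The short check below recomputes them; the function name `token_reduction_pct` is illustrative, and the numbers are the ones reported in this section.

```python
def token_reduction_pct(before, after):
    """Percentage of tokens saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# (model, benchmark, tokens before, tokens after) as quoted in this section
reported = [
    ("QwQ-32B", "MMLU algebra", 2133, 1266),
    ("DeepSeek-R1-Distill-Qwen-7B", "TriviaQA", 1295, 619),
]
for model, bench, before, after in reported:
    pct = token_reduction_pct(before, after)
    print(f"{model} / {bench}: {pct:.1f}% fewer tokens")
```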

How to Apply

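If you are serving a reasoning model such as DeepSeek-R1 or QwQ-32B: fit a ridge regressor from that model's step embeddings to difficulty offline on 600 MATH samples, then at inference predict difficulty from the current hidden state at each \n\n boundary and reduce the logits of reflection keywords ('wait', 'alternatively', 'let me reconsider', etc.) in proportion to (1 - difficulty).
For non-math domains (code generation, QA, etc.), the regressor fitted on MATH can be used as-is. If the domain shift is large, refitting on a mix of MATH + GPQA + CommonsenseQA may give more accurate difficulty predictions.
If you need more aggressive cost savings via early exit: once difficulty falls to a threshold or below, you can sharply raise the logit of the </think> token to force reasoning to end. This can cost accuracy on hard problems, so set the threshold conservatively.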
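
The early-exit variant described in the last point could look roughly like the sketch below. The function name `apply_early_exit` and the default values `exit_threshold=0.1` and `boost=10.0` are assumptions for illustration, not values from the paper. `end_think_id` stands for the vocabulary index of the `</think>` token, which would have to be looked up from the model's tokenizer.

```python
def apply_early_exit(logits, difficulty, end_think_id,
                     exit_threshold=0.1, boost=10.0):
    """Return a copy of `logits` with the end-of-thinking token boosted
    when the estimated difficulty is at or below `exit_threshold`.

    exit_threshold and boost are illustrative defaults (assumptions).
    """
    adjusted = list(logits)
    if difficulty <= exit_threshold:
        adjusted[end_think_id] += boost
    return adjusted


toy_logits = [1.0, 0.5, -2.0]   # toy vocabulary; index 2 plays the role of </think>
easy = apply_early_exit(toy_logits, difficulty=0.05, end_think_id=2)
hard = apply_early_exit(toy_logits, difficulty=0.60, end_think_id=2)
print(easy, hard)
```

A lower `exit_threshold` makes the forced exit rarer, which is the conservative setting the text recommends for workloads with hard problems.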

Code Example

# Sketch of DyCon's core logic (based on Hugging Face transformers)
import numpy as np
from sklearn.linear_model import Ridge

# === OFFLINE: Regressor Fitting ===
# 1. Generate CoT for 600 MATH samples
# 2. Collect the hidden state and remaining_length at each \n\n boundary

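def fit_difficulty_regressor(step_embeddings, remaining_lengths):
    """Fit a ridge regressor from step embeddings to normalized difficulty.

    step_embeddings: (N, d), remaining_lengths: (N,)
    """
    # Log-transform, then min-max normalize the targets to [0, 1]
    log_r = np.log1p(np.asarray(remaining_lengths, dtype=float))
    r_min, r_max = log_r.min(), log_r.max()
    # Guard against division by zero when all targets are identical
    d_normalized = (log_r - r_min) / max(r_max - r_min, 1e-12)

    # Ridge regression (linear regression with L2 regularization)
    regressor = Ridge(alpha=1.0)
    regressor.fit(step_embeddings, d_normalized)
    return regressor, r_min, r_max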

# === ONLINE: Difficulty-Aware Logit Intervention ===
REFLECTION_TOKENS = ["wait", "alternatively", "let me reconsider", "hmm", "actually"]
THRESHOLD_TAU = 0.5

def compute_logit_bias(estimated_difficulty, logits, reflection_token_ids, vocab_mean_logit):
    """
    estimated_difficulty: float in [0, 1] (0=easy, 1=hard)
    logits: (vocab_size,) tensor
    Returns {token_id: amount to subtract from that token's logit}.
    """
    biases = {}
    for token_id in reflection_token_ids:
        logit_i = logits[token_id].item()
        margin = max(logit_i - vocab_mean_logit, 0)  # m_{t,i}

        if estimated_difficulty >= THRESHOLD_TAU:
            # Hard problem: sqrt damps the penalty (for margins above 1)
            bias_magnitude = (1 - estimated_difficulty) * np.sqrt(margin)
        else:
            # Easy problem: suppress strongly
            bias_magnitude = (1 - estimated_difficulty) * margin

        biases[token_id] = bias_magnitude
    return biases

# At each \n\n boundary in the inference loop:
# 1. hidden_state = outputs.hidden_states[best_layer][0, -1].float().cpu().numpy()
#    (requires a forward pass with output_hidden_states=True)
# 2. difficulty = float(np.clip(regressor.predict(hidden_state.reshape(1, -1))[0], 0.0, 1.0))
# 3. biases = compute_logit_bias(difficulty, current_logits, reflection_ids, mean_logit)
# 4. for token_id, bias in biases.items(): current_logits[token_id] -= bias
# 5. next_token = sample(softmax(current_logits))
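
To see what the soft suppression does to sampling, the toy example below applies the same bias rule to a four-token vocabulary and compares the probability of a reflection token before and after. It is a pure-Python sketch under stated assumptions: the logits are made up, `vocab_mean_logit` is taken to be the plain mean over the vocabulary, and the helper names `softmax` and `suppress` are illustrative.

```python
import math

THRESHOLD_TAU = 0.5  # same illustrative threshold as in the sketch above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suppress(logits, reflection_ids, difficulty):
    """Return a copy of `logits` with the difficulty-scaled bias subtracted."""
    mean_logit = sum(logits) / len(logits)  # assumed meaning of vocab_mean_logit
    out = list(logits)
    for i in reflection_ids:
        margin = max(logits[i] - mean_logit, 0.0)
        scale = math.sqrt(margin) if difficulty >= THRESHOLD_TAU else margin
        out[i] -= (1.0 - difficulty) * scale
    return out


logits = [2.0, 5.0, 1.0, 0.0]   # made-up logits for a toy 4-token vocabulary
reflection_ids = [1]            # pretend token 1 is "wait"

p_base = softmax(logits)[1]
p_hard = softmax(suppress(logits, reflection_ids, difficulty=0.9))[1]
p_easy = softmax(suppress(logits, reflection_ids, difficulty=0.2))[1]
print(f"base={p_base:.3f} hard={p_hard:.3f} easy={p_easy:.3f}")
```

With these numbers the reflection token stays almost as likely as before when difficulty is high and becomes much less likely when difficulty is low, which is the continuous behavior the method aims for.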

Terminology

LRM: Short for Large Reasoning Model. A model such as DeepSeek-R1 or QwQ-32B that is trained to reason by generating a long Chain-of-Thought inside <think>...</think> tags.

overthinking: The phenomenon where an LRM keeps reflecting and re-checking even after it has found the correct answer, for example spending thousands of tokens on a simple addition problem.

CoT: Short for Chain-of-Thought. A technique in which the reasoning process is written out as text step by step, along the lines of 'first compute A, then do B...'.

step embedding: The model's internal hidden-state vector at each step of the reasoning process (steps are delimited by \n\n). It compresses what the model has understood up to that point.

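logit: The raw score the model assigns to each candidate before choosing the next token; softmax converts logits to probabilities. Lowering a token's logit lowers the probability that it is selected.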

ridge regression: Linear regression with added L2 regularization to prevent overfitting. Here it serves as an ultra-lightweight model mapping a hidden-state vector to a difficulty score.

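R²: The coefficient of determination. Usually between 0 and 1 (it can go negative for a model that fits worse than always predicting the mean); the closer to 1, the better the predictions explain the actual values. R² = 0.8 means 80% of the variance is explained.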

training-free: A method that uses the existing model as-is, without additional training or fine-tuning. Here only a small ridge regressor is fitted; the original LRM parameters are not touched at all.
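
For readers who want the R² figure made concrete, the sketch below computes the coefficient of determination from its standard definition, 1 - SS_res / SS_tot. The data points are made up for illustration and are not results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up difficulty labels and predictions, for illustration only
y_true = [0.0, 0.25, 0.5, 0.75, 1.0]
y_pred = [0.1, 0.2, 0.5, 0.8, 0.9]
r2 = r_squared(y_true, y_pred)
print(f"R^2 = {r2:.2f}")
```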

Related Resources

DyCon GitHub Repository: https://github.com/yu-lin-li/DyCon

Original Abstract

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Project page and code are available at https://github.com/yu-lin-li/DyCon.