A Study of Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Mar 12, 2026 • Deyu Zou, Yongqiang Chen, Fan Feng +4

TL;DR Highlight

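A study that analyzes why RL-trained LLM agents stop asking questions and fail to use the information they gather (the "self-locking" phenomenon), and that achieves up to 60% performance improvement by injecting a simple directional signal.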

Who Should Read

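ML engineers who train multi-turn dialogue or question-answering agents with RL, especially developers whose agents give up on gathering information or ignore information they have already obtained.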

Core Mechanics

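LLM agents trained with RL fall into "Information Self-Locking (SeL)": even as reward rises, the agent asks increasingly uninformative questions and fails to use the information it collects to update its internal belief.
Decomposing agent behavior into Action Selection (AS, what to ask) and Belief Tracking (BT, how to internalize the information obtained) shows that neither capability improves even while reward goes up.
A vicious cycle is proven mathematically (Theorem 3.4): weak BT masks the reward contribution of good AS, and poor AS shrinks the training material available to BT.
The AREW (Advantage REWeighting) framework is proposed: simple binary signals that mark "Did this question elicit new information?" and "Was the belief updated in the right direction?" as +1/-1 are added to the advantage in the policy gradient.
On two models, Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct, AREW effectively resolves self-locking under all three RL algorithms tested: PPO, GRPO, and GSPO.
Even with 50% noise in the critique signal, performance stays competitive with the vanilla baseline, so the method works without a perfect signal.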
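
The reweighting described above can be written compactly. This is a sketch reconstructed from the code sample on this page rather than copied from the paper: the symbols c_t (the +1/-1/0 critique at step t), P (the set of steps with c_t = +1), and N (the set with c_t = -1) are introduced here for illustration, and the per-count normalization is an assumption taken from that sample.

```latex
\hat{A}_t = A_t + \lambda\, u_t,
\qquad
u_t =
\begin{cases}
  +\dfrac{1}{|\mathcal{P}|} & \text{if } c_t = +1,\\[4pt]
  -\dfrac{1}{|\mathcal{N}|} & \text{if } c_t = -1,\\[4pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
\sum_t u_t = 0 \ \text{ when } \mathcal{P} \neq \emptyset \text{ and } \mathcal{N} \neq \emptyset .
```

Under this normalization the added term sums to zero over an episode that has both positive and negative critiques, so it shifts credit between steps without changing the total advantage. That reading matches the abstract's description of "reallocating" the learning signal.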

Evidence

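Performance improves over vanilla PPO in 27 of 28 settings across 7 datasets; on PE-GS=3 with Qwen, the score rises from 18.33 to 80.33 (+62.0 points).
On MediQ, replacing all patient feedback with "Unknown" hurts less after RL training: the drop shrinks from 10.75 points (41.25 → 30.50) to 5.50 points (61.00 → 55.50). This shows empirically that RL actually reduces the agent's reliance on interaction.
On FloDial-Hard with Llama-3.1-8B, the score rises from 31.00 (vanilla) to 49.00 (AREW AS+BT), a gain of +18.0 points.
Proposition 4.1 proves mathematically that AREW is effective at improving AS when the weighted accuracy of the critique is 50% or higher.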
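
The 50% threshold has a simple intuition that the following sketch illustrates. It is not Proposition 4.1 itself: it assumes a plain (unweighted) noise model in which each +1/-1 critique is flipped independently with some probability, and the function names are hypothetical. Under that assumption, a critique that is correct with probability `accuracy` agrees with the true direction by 2·accuracy − 1 on average, which is positive only above 50%.

```python
import random


def expected_alignment(accuracy):
    """Expected value of noisy_critique * true_critique for +1/-1 labels
    that are correct with probability `accuracy` (assumed noise model)."""
    return 2.0 * accuracy - 1.0


def flip_with_noise(critiques, flip_prob, rng):
    """Flip each +1/-1 critique independently with probability `flip_prob`."""
    return [-c if rng.random() < flip_prob else c for c in critiques]


def empirical_alignment(true_critiques, flip_prob, seed=0):
    """Average agreement between noisy and true critiques over one sample."""
    rng = random.Random(seed)
    noisy = flip_with_noise(true_critiques, flip_prob, rng)
    return sum(n * t for n, t in zip(noisy, true_critiques)) / len(true_critiques)


if __name__ == "__main__":
    truth = [1, -1] * 5000
    for flip_prob in (0.0, 0.25, 0.5):
        print(flip_prob,
              expected_alignment(1.0 - flip_prob),
              round(empirical_alignment(truth, flip_prob), 3))
```

In this toy model a 50% flip rate is the boundary case where the critique term averages out to zero, which is at least consistent with the reported result that heavily corrupted critiques stay competitive with the vanilla baseline instead of falling below it. The paper's own noise setup and its definition of weighted accuracy may differ.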

How to Apply

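In a multi-turn agent, after each dialogue turn record as +1/-1 whether "this question elicited new information" (AS critique) and whether "the agent's belief score moved toward the correct answer" (BT critique), then add the value to the RL advantage as λ*u_t (λ is a hyperparameter; training becomes unstable if it is too large).
In systems that must gather information over several questions, such as medical diagnosis chatbots or customer support agents, if the agent keeps repeating the same question or clings to its initial judgment, try adding intermediate-step critiques in the AREW style.
A prompt structure that explicitly separates the Action Round (question generation) from the Update Round (belief update) makes it easier to measure the AS and BT critiques independently (see the prompt templates in Figures 6-8 of the paper).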
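
The two critiques described above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the paper's implementation: the helper names, the rule that an answer is "new" if it is neither "Unknown" nor a repeat, and the use of a single scalar belief score for the ground-truth answer are all hypothetical choices made for illustration.

```python
def as_critique(observation, seen_observations, uninformative=("unknown",)):
    """+1 if the answer carries new information, -1 otherwise (assumed rule)."""
    text = observation.strip().lower()
    if text in uninformative or text in seen_observations:
        return -1
    return 1


def bt_critique(prev_score, new_score, tol=1e-6):
    """+1 if the belief score on the correct answer rose, -1 if it fell,
    0 (abstain) if it did not move beyond the tolerance."""
    delta = new_score - prev_score
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def critique_episode(observations, truth_scores):
    """observations: environment answers, one per turn.
    truth_scores: belief score on the ground-truth answer before the first
    turn and after each Update Round, so len(observations) + 1 values."""
    seen = set()
    as_critiques, bt_critiques = [], []
    for t, obs in enumerate(observations):
        as_critiques.append(as_critique(obs, seen))
        seen.add(obs.strip().lower())
        bt_critiques.append(bt_critique(truth_scores[t], truth_scores[t + 1]))
    return as_critiques, bt_critiques


if __name__ == "__main__":
    answers = ["Fever for 3 days", "Unknown", "fever for 3 days"]
    scores = [0.2, 0.5, 0.5, 0.4]
    print(critique_episode(answers, scores))
```

Both critiques need only information that is already available during training (the environment's answers and the ground-truth label), which is what makes them cheap to obtain compared with a learned process reward model.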

Code Example

# Core idea of AREW: advantage reweighting.
# In an existing PPO-style pipeline, only the advantage needs to change.

import torch


def compute_arew_advantage(advantages, as_critiques, bt_critiques, lambda_weight=0.2):
    """
    advantages: per-step advantage tensor (e.g. from GAE), assumed to be ordered
        [Action 0, Update 0, Action 1, Update 1, ...]
    as_critiques: one critique per Action Round (+1: informative, -1: uninformative, 0: abstain)
    bt_critiques: one critique per Update Round (+1: belief improved, -1: degraded, 0: abstain)
    """
    # Interleave AS and BT critiques so they line up with the alternating
    # Action/Update steps (assumes both lists have the same length).
    critiques = []
    for as_c, bt_c in zip(as_critiques, bt_critiques):
        critiques.extend([as_c, bt_c])

    critiques = torch.tensor(critiques, dtype=advantages.dtype, device=advantages.device)
    if critiques.shape != advantages.shape:
        raise ValueError("expected one advantage per Action/Update step")

    # Indicator vectors for positively and negatively critiqued steps
    pos = (critiques == 1).to(advantages.dtype)
    neg = (critiques == -1).to(advantages.dtype)

    # u_t: directional weight, normalized by the number of steps of each sign
    # (clamp avoids division by zero when one sign is absent)
    u = pos / pos.sum().clamp(min=1) - neg / neg.sum().clamp(min=1)

    # Advantage reweighting: A_hat_t = A_t + lambda * u_t
    adjusted_advantages = advantages + lambda_weight * u

    return adjusted_advantages

# In the actual RL training loop:
# adj_adv = compute_arew_advantage(advantages, as_critiques, bt_critiques)
# loss = -torch.mean(log_probs * adj_adv)  # policy gradient loss
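
The last comment above uses a plain policy-gradient loss. In PPO proper, the adjusted advantage would instead enter the clipped surrogate objective. The following dependency-free sketch shows where it plugs in; `ppo_clipped_loss` and its argument names are hypothetical, and a real training loop would use tensor operations and usually add value and entropy terms.

```python
import math


def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean negative clipped surrogate over steps (plain-Python illustration).

    advantages: per-step advantages, e.g. the AREW-adjusted values.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate values
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    adjusted = [1.0, -1.0, 0.5]  # stand-in for AREW-adjusted advantages
    print(ppo_clipped_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], adjusted))
```

Because the critique only changes the advantage values, the clipping logic itself stays untouched, which is why the same adjustment can be dropped into different policy-gradient algorithms.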

Terminology

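Active Reasoning: A style of reasoning in which the agent does not answer passively but actively asks questions to fill in the information it lacks. Similar to a detective going out to find evidence.

POMDP: Partially Observable MDP. A mathematical model of decision-making when the full environment cannot be seen and only partial information is available. Like driving in fog, the problem is to make the best choice while only part of the state is visible.

Belief Tracking (BT): The ability to keep updating, from the information gathered during the dialogue, an estimate of how likely each candidate answer currently is. Similar to a doctor adjusting diagnostic confidence as the interview proceeds.

Action Selection (AS): The ability to choose which question to ask next. Good AS picks the question expected to yield the most information.

Policy Gradient: An RL training method that raises the probability of actions that led to good outcomes and lowers the probability of actions that led to bad ones.

Advantage: In RL, a value indicating how much better an action was than average. Positive means better than average; negative means worse.

PPO: Proximal Policy Optimization. One of the most widely used algorithms for training LLMs with RL. It keeps training stable by clipping so that the policy does not change too abruptly.

Outcome-based Reward: A training setup that gives no reward for intermediate steps and scores only whether the final result is right or wrong. Similar to being graded only after handing in the answer sheet.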
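
To make the Belief Tracking entry concrete, here is the idealized POMDP version of a belief update, written as a Bayes-rule sketch. It is an illustration only: an LLM agent does not literally run this computation, and the function name, the dictionary representation, and the example likelihoods are hypothetical.

```python
def update_belief(belief, likelihoods):
    """Bayes-rule belief update over candidate answers.

    belief: dict mapping hypothesis -> prior probability.
    likelihoods: dict mapping hypothesis -> P(observation | hypothesis).
    """
    unnormalized = {h: p * likelihoods.get(h, 0.0) for h, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        # Observation is impossible under every hypothesis: keep the prior
        return dict(belief)
    return {h: p / total for h, p in unnormalized.items()}


if __name__ == "__main__":
    prior = {"flu": 0.5, "cold": 0.5}
    # An informative answer shifts the belief toward the better-supported hypothesis
    print(update_belief(prior, {"flu": 0.8, "cold": 0.2}))
    # An answer that is equally likely under every hypothesis leaves it unchanged
    print(update_belief(prior, {"flu": 0.5, "cold": 0.5}))
```

The second call shows why uninformative questions matter: an answer that is equally likely under every hypothesis cannot move the belief, so poor Action Selection leaves Belief Tracking with nothing to learn from.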

Original Abstract

Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.