Beyond Chain-of-Thought: A Latent Reasoning Computational Mode Inside LLMs

Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models

Jan 12, 2026 · Zhenghao He, Guangzhi Xiong, Bohan Liu +2

TL;DR Highlight

Steering a single latent feature inside the model, with no CoT prompt, can lift reasoning accuracy substantially and, on large models, to a level comparable with CoT.

Who Should Read

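ML engineers weighing the trade-off between LLM inference cost and performance, especially developers who serve open-source models (LLaMA, Qwen, Gemma) themselves and want to cut the long-token cost of CoT.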

Core Mechanics

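CoT prompting is not the only way to trigger reasoning: steering a single internal latent feature can put the model into the same reasoning mode.
With an SAE (Sparse Autoencoder), the reasoning-related latent features can be narrowed down to just a handful.
Intervening only at the first generated token raises LLaMA-3.1-8B accuracy on GSM8K from 24.5% to 73.3%.
On large models such as LLaMA-3.3-70B, latent steering reaches accuracy comparable to CoT (88.8% vs 96.1% on GSM8K) while emitting about 5x fewer tokens (53 vs 268).
The steering also bypasses Qwen's /no_think control token: the internal computation overrides a prompt-level instruction to suppress reasoning.
This reasoning feature only indicates that the model has entered a reasoning mode; it does not track answer quality (no correlation with whether the answer is correct).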

Evidence

LLaMA-3.1-8B, GSM8K: Direct 24.5% → Steered Direct 73.3% (by intervening on the single feature #8629 alone)
LLaMA-3.3-70B, GSM8K: Steered Direct 88.8% vs CoT 96.1%, with 53 vs 268 tokens (about 80% fewer)
Random-feature steering reaches 26.1±3.4% accuracy vs 73.3% for reasoning-feature steering, confirming that the effect is specific to that feature
Point-biserial correlation for distinguishing CoT vs Direct mode: r=0.14, p=0.006 (weak but statistically significant)
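The "about 80% fewer" and "about 5x" figures are two views of the same 53 vs 268 token counts. A minimal sketch of that arithmetic follows; the helper name `token_savings` is hypothetical and only the numbers come from the results above.

```python
def token_savings(steered_tokens: float, cot_tokens: float) -> tuple[float, float]:
    """Return (fraction of tokens saved, reduction factor) for steered vs CoT output."""
    saved_fraction = 1.0 - steered_tokens / cot_tokens
    reduction_factor = cot_tokens / steered_tokens
    return saved_fraction, reduction_factor


# LLaMA-3.3-70B on GSM8K: 53 tokens (Steered Direct) vs 268 tokens (CoT)
saved, factor = token_savings(53, 268)

# Accuracy given up for that saving, in percentage points (CoT minus Steered Direct)
accuracy_gap = 96.1 - 88.8

print(f"saved={saved:.1%}, factor={factor:.2f}x, accuracy gap={accuracy_gap:.1f} pts")
# prints: saved=80.2%, factor=5.06x, accuracy gap=7.3 pts
```

Reading the two numbers together makes the trade-off explicit: roughly a fivefold cut in output tokens for a gap of about 7 accuracy points on this benchmark.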

How to Apply

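In an open-source LLM inference pipeline, load a pretrained SAE from Goodfire or GemmaScope, find the features with the highest differential score between CoT and Direct prompts, and boost one of them at the first generation step with α=15~25; this can induce reasoning without CoT.
For token-cost-sensitive services running a large model such as LLaMA-3.3-70B, switching from CoT to latent steering can give similar accuracy with far fewer output tokens, which particularly helps streaming response latency.
For models such as Qwen3 that suppress reasoning with the /no_think token, activation steering can serve as a fallback, in place of prompting, when specific cases need reasoning mode forced on.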
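The feature-selection step in the first item can be sketched as follows. This is a toy illustration, not the paper's implementation: it assumes the differential score of a feature is its mean SAE activation under CoT prompts minus its mean under Direct prompts, and the function names and activation values are made up.

```python
from statistics import mean


def differential_scores(cot_acts, direct_acts):
    """Per-feature mean activation under CoT prompts minus mean under Direct prompts.

    Each argument is a list of rows (one per prompt), each row a list of
    SAE feature activations. The score definition itself is an assumption.
    """
    n_features = len(cot_acts[0])
    return [
        mean(row[j] for row in cot_acts) - mean(row[j] for row in direct_acts)
        for j in range(n_features)
    ]


def top_features(scores, k=1):
    """Indices of the k features with the highest differential score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy activations: 2 prompts per condition, 3 SAE features
cot_acts = [[0.1, 2.0, 0.0], [0.2, 1.8, 0.1]]
direct_acts = [[0.1, 0.2, 0.0], [0.3, 0.1, 0.1]]

scores = differential_scores(cot_acts, direct_acts)
best = top_features(scores, k=1)
print(best)  # feature 1 fires far more under CoT than under Direct
```

In a real pipeline the rows would come from encoding the residual stream of the chosen layer with the pretrained SAE, and the top-ranked candidates would still need a causal check by steering them.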

Code Example

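# Conceptual example of latent steering with a Goodfire SAE (LLaMA-3.1-8B)
# pip install goodfire transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# 1. Load the tokenizer and model (the forward hook reads the layer output directly)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
)

# 2. Load the SAE (provided by Goodfire)
# NOTE: placeholder loader, not a confirmed goodfire API. The SAE must be trained on
# the same model and layer that is being steered (here LLaMA-3.1-8B, layer 19).
# from goodfire import SAE
# sae = SAE.load("Goodfire/Llama-3.1-8B-Instruct-SAE-l19")

# 3. Register the residual-injection hook
REASONING_FEATURE_IDX = 8629  # LLaMA-3.1-8B reasoning feature
STEERING_LAYER = 19
STEERING_ALPHA = 15
first_step_done = False  # reset to False before every new generate() call

def steering_hook(module, inputs, output):
    global first_step_done
    if first_step_done:
        return output  # steer only on the first forward pass (first generated token)

    # Decoder layers return a tuple in some transformers versions, a bare tensor in others
    is_tuple = isinstance(output, tuple)
    hidden = output[0] if is_tuple else output  # (batch, seq, hidden)

    # SAE encode -> boost one feature -> decode -> add only the delta to the residual
    # (if the feature is inactive, scale is 0 and nothing is added)
    # z = sae.encode(hidden[:, -1, :])
    # z_steered = z.clone()
    # scale = z[:, REASONING_FEATURE_IDX].abs().mean()
    # z_steered[:, REASONING_FEATURE_IDX] += STEERING_ALPHA * scale
    # delta = sae.decode(z_steered) - sae.decode(z)
    # hidden[:, -1, :] += delta.to(hidden.dtype)

    first_step_done = True
    return (hidden,) + output[1:] if is_tuple else hidden

# Register the hook on the steering layer
hook = model.model.layers[STEERING_LAYER].register_forward_hook(steering_hook)

# 4. Generate from a Direct prompt (reasoning is induced without CoT)
prompt = "Question: James runs 3 sprints of 60 meters, 3 times a week. How many meters does he run in a week?\n\nGive me the answer directly."
try:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
finally:
    hook.remove()  # always detach the hook, even if generation fails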

Terminology

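SAE: A tool that decomposes the model's entangled internal activation signals into meaningful sparse components. Much as a filter picks out specific frequencies from a radio signal, it can isolate the 'reasoning-related signal' inside the model.

Latent Feature: A hidden dimension in the model's internal representation space that is associated with a specific behavior or concept. It is invisible from outside but acts like an internal switch that shapes model behavior.

Latent Steering: A technique that induces a desired behavior by directly manipulating the model's internal activations without changing the prompt. Conceptually it is like a cheat that edits memory directly instead of going through the app UI.

Residual Injection: Rather than overwriting the original activation with the SAE-modified one, only the change (delta) is added to it. Like grafting instead of replacing the whole organ in surgery, this minimizes side effects.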
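A toy illustration of why adding only the delta matters, in plain Python. The two-feature SAE, its weights, and the function names are hypothetical; the sketch assumes a ReLU encoder and a linear decoder. The point is that whatever the SAE cannot reconstruct is lost on overwrite but preserved on injection.

```python
def encode(h, W_enc):
    """ReLU encoder: one activation per SAE feature (W_enc is n_features x d)."""
    return [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W_enc]


def decode(z, W_dec):
    """Linear decoder: weighted sum of feature directions (W_dec is n_features x d)."""
    d = len(W_dec[0])
    return [sum(z[f] * W_dec[f][i] for f in range(len(z))) for i in range(d)]


def residual_inject(h, W_enc, W_dec, feature, boost):
    """Add decode(z_steered) - decode(z) to h instead of replacing h."""
    z = encode(h, W_enc)
    z_steered = list(z)
    z_steered[feature] += boost
    delta = [a - b for a, b in zip(decode(z_steered, W_dec), decode(z, W_dec))]
    return [x + dx for x, dx in zip(h, delta)]


# Toy SAE that only "sees" the first two of three hidden dimensions
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
h = [0.5, 0.2, 0.9]  # the third component is outside the SAE's reconstruction

steered = residual_inject(h, W_enc, W_dec, feature=1, boost=2.0)
overwritten = decode([0.5, 2.2], W_dec)  # what a full overwrite would write back

print(steered)      # third component (0.9) survives
print(overwritten)  # third component is zeroed out
```

With a linear decoder the delta reduces to the boost times the decoder direction of the steered feature, so the SAE's reconstruction error never enters the residual stream.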

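CoT: Short for Chain-of-Thought. A prompting technique, such as 'think step by step', that gets the model to write out its intermediate reasoning as text. It is like an LLM showing its work on a math problem when solving something hard.

Mechanistic Interpretability: A research field that tries to dissect, at the level of internal mechanisms, why an LLM produces a particular output. The approach looks inside the black-box model as if taking an MRI.

Point-biserial Correlation: A statistic that measures the correlation between a continuous variable (feature activation value) and a binary variable (CoT or not). It is a standardized difference between the two group means.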
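The "standardized difference between group means" can be written out directly. The sketch below uses the textbook formula r = (M1 - M0) / s * sqrt(n1 * n0 / n^2), with s the population standard deviation of all values; the activation values are invented for illustration and are not from the paper.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(values, labels):
    """Point-biserial correlation between continuous values and 0/1 labels."""
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    # Difference in group means, scaled by overall spread and group balance
    return (mean(group1) - mean(group0)) / pstdev(values) * sqrt(n1 * n0 / n**2)


# Toy data: feature activation per prompt, and whether the prompt was CoT (1) or Direct (0)
activations = [0.9, 1.1, 1.0, 0.2, 0.4, 0.3]
is_cot = [1, 1, 1, 0, 0, 0]

r = point_biserial(activations, is_cot)
print(f"r = {r:.3f}")
```

The value equals the ordinary Pearson correlation computed with the binary labels treated as numbers, so a well-separated toy set like this one gives r close to 1, whereas the reported r=0.14 indicates a weak association.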

Original Abstract

Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.