How LLM Temperature Affects Performance: Hot vs. Cold?

Exploring the Impact of Temperature on Large Language Models: Hot or Cold?

Jun 8, 2025 • Lujun Li, Lama Sleem, Niccolò Gentile +2

TL;DR Highlight

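This study shows experimentally that the best LLM temperature setting differs across six capabilities, and it builds an automatic selector that picks the setting for a given prompt.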

Who Should Read

Backend developers who call LLM APIs with the default temperature (1.0), and AI engineers trying to improve model output quality in RAG or agent systems.

Core Mechanics

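The optimal temperature differs sharply by capability: translation and summarization do best near 0, while creativity and causal reasoning peak around 1.3.
Instruction Following is safe at temperatures up to 1.0. Above that, performance falls off sharply at a point that depends on model size (1.0 to 1.3 for small, 1.3 to 1.6 for medium, and 1.6 to 1.9 for large models).
Machine Translation is the most temperature-sensitive capability, with performance varying by up to 192% on small models.
Larger models are more robust to temperature changes, so tuning temperature matters most for small models.
The paper proposes a 'BERT-based Selector': a fine-tuned BERT model that classifies the main capability an input prompt requires and automatically picks the matching optimal temperature.
Comparing 4-bit quantization (a technique that shrinks model size) with FP16 precision, the optimal temperatures are nearly identical; only the performance level differs, by 10 to 20%.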
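
The sharp drop described for Instruction Following can be located programmatically from a temperature sweep. The following is a minimal sketch, assuming a sweep of (temperature, score) pairs is already available. The `find_mutation_temperature` helper, the 20% relative-drop threshold, and the sample scores are all hypothetical illustrations, not the paper's procedure or data.

```python
from typing import Optional, Sequence, Tuple


def find_mutation_temperature(
    sweep: Sequence[Tuple[float, float]],
    max_relative_drop: float = 0.2,  # assumed threshold, not from the paper
) -> Optional[float]:
    """Return the first temperature at which the score falls by more than
    max_relative_drop relative to the previous sweep point, or None."""
    ordered = sorted(sweep)  # sort by temperature
    for (_, prev_score), (cur_temp, cur_score) in zip(ordered, ordered[1:]):
        if prev_score > 0 and (prev_score - cur_score) / prev_score > max_relative_drop:
            return cur_temp
    return None


# Made-up scores for illustration only (not results from the paper).
example_sweep = [(0.1, 0.62), (0.4, 0.61), (0.7, 0.60), (1.0, 0.58), (1.3, 0.31), (1.6, 0.12)]
mutation_temp = find_mutation_temperature(example_sweep)
print(mutation_temp)
```

A stricter or looser threshold moves the detected point, so the threshold should be chosen against the metric's normal run-to-run noise.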

Evidence

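Maximum performance variation across temperatures on small models: MT 192.32%, CT (creativity) 186.81%, IF (instruction following) 154.65%.
On Llama-3.2-1B, the BERT Selector raises average SuperGLUE accuracy from 0.252 to 0.494, an improvement of about 96%.
On Meta-Llama-3-8B, the BERT Selector gives a small gain in average SuperGLUE accuracy, from 0.600 to 0.612; on the single WIC task, accuracy goes from 0.547 to 0.556.
Spearman correlation between temperature and performance: IF -0.37, MT -0.40, SUMM -0.45, a statistically significant negative correlation (reported p-value 0.00).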
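
The improvement percentages above are relative changes, and the correlation figures are rank correlations. The sketch below reproduces the 96% figure from the two reported accuracies and shows how a Spearman coefficient could be computed for a temperature sweep. The `spearman` helper is a plain-Python stand-in written for illustration, and the sweep scores are invented, not the paper's data.

```python
from typing import List, Sequence


def relative_change_pct(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.
    Assumes neither input is constant (otherwise the variance is zero)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Reported SuperGLUE averages for Llama-3.2-1B (from the summary above).
print(f"{relative_change_pct(0.252, 0.494):.1f}%")

# Invented sweep: the score falls monotonically as temperature rises.
temps = [0.1, 0.4, 0.7, 1.0, 1.3]
scores = [0.61, 0.60, 0.57, 0.50, 0.33]
print(spearman(temps, scores))
```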

How to Apply

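For translation or summarization API calls, lowering temperature to 0.1 to 0.3 improves quality. Conversely, 1.0 to 1.3 works best for brainstorming and story generation.
In agent or RAG systems, cap temperature at 0.7 or below on nodes where instruction following matters (tool calls, JSON output, and so on) to keep them stable.
When running small models (1B to 4B), adding a routing layer that classifies the prompt type (translation, creative, reasoning, and so on) and applies a different temperature per type can improve performance.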
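
The routing layer in the last point can be sketched as a classifier followed by a lookup. The paper's selector is a fine-tuned BERT classifier; the version below swaps in a naive keyword heuristic purely so the example runs without a model. The `classify_prompt` function, its keyword lists, and the route names are illustrative assumptions, and the temperature values are taken from the ranges quoted in this summary.

```python
from typing import Dict, Tuple

# Temperatures per route, following the ranges quoted in this summary.
ROUTE_TEMPERATURES: Dict[str, float] = {
    "translation": 0.1,
    "summarization": 0.1,
    "instruction_following": 0.7,
    "creative": 1.3,
}
DEFAULT_ROUTE = "instruction_following"

# Hypothetical keyword lists standing in for a trained classifier.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "translation": ("translate",),
    "summarization": ("summarize", "summarise", "tl;dr"),
    "creative": ("brainstorm", "story", "poem"),
}


def classify_prompt(prompt: str) -> str:
    """Return the first route whose keyword appears in the prompt."""
    lowered = prompt.lower()
    for route, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return route
    return DEFAULT_ROUTE


def route_temperature(prompt: str) -> Tuple[str, float]:
    """Classify the prompt and look up the temperature for its route."""
    route = classify_prompt(prompt)
    return route, ROUTE_TEMPERATURES[route]


print(route_temperature("Translate to Korean: Hello world"))
print(route_temperature("Write a short story about a robot"))
```

Falling back to the instruction-following route keeps unclassified prompts at the conservative 0.7 cap rather than at a high creative setting.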

Code Example

# Per-task temperature guidelines (based on the paper's results)
from openai import OpenAI

TEMPERATURE_MAP = {
    'machine_translation': 0.1,    # lower is better
    'summarization': 0.1,          # lower is better
    'instruction_following': 0.7,  # 1.0 or below recommended
    'creativity': 1.3,             # medium-to-high values work best
    'causal_reasoning': 1.3,       # slightly higher values help
    'in_context_learning': 0.4,    # lower values are more stable
}

def get_optimal_temperature(task_type: str, model_size: str = 'large') -> float:
    """
    model_size: 'small' (1B-4B), 'medium' (6B-13B), 'large' (40B-80B)
    Smaller models are more sensitive to temperature, so cap them more conservatively.
    Unknown task types fall back to 0.7.
    """
    base_temp = TEMPERATURE_MAP.get(task_type, 0.7)
    if model_size == 'small':
        # Small models reach their mutation temperature earlier, so cap lower
        return min(base_temp, 0.7)
    elif model_size == 'medium':
        return min(base_temp, 1.0)
    return base_temp  # large models use the guideline value as is

# Usage example
client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = 'machine_translation'
temp = get_optimal_temperature(task, model_size='small')
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': 'Translate to Korean: Hello world'}],
    temperature=temp,
)
print(response.choices[0].message.content)

Terminology

temperature: A value that controls how varied the LLM's choice of next token is. Near 0 the model almost always picks the most probable token; higher values make the choice more random and diverse.

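logit: The raw score an LLM assigns to each candidate token. Temperature divides these scores before they go into the softmax function, which reshapes the resulting distribution.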

softmax: A function that converts a list of numbers into probabilities between 0 and 1 that sum to 1. LLMs use it when choosing the next token.

Top-P sampling: A sampling method that keeps only the most probable tokens, adding candidates until their cumulative probability reaches P. Also called nucleus sampling.

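AWQ (4-bit quantization): A technique that compresses model weights to 4 bits instead of 16- or 32-bit floating point, reducing memory use and improving speed while keeping the accuracy loss small.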

SuperGLUE: A benchmark that evaluates an LLM's natural language understanding across multiple tasks. A harder version of GLUE.

Mutation Temperature: A concept defined in this paper: the threshold temperature above which model performance drops abruptly.

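BERT-based Selector: A fine-tuned BERT model that looks at the input prompt and classifies which task it is (translation, creative, reasoning, and so on). The optimal temperature is then selected automatically from the classification result.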
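
The temperature, logit, softmax, and Top-P entries above fit together in a few lines. The following is a minimal sketch in plain Python using a made-up four-token logit vector; the function names are illustrative and not taken from the paper's code.

```python
import math
from typing import Dict, List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for four candidate tokens
cold = softmax_with_temperature(logits, 0.3)  # sharper: mass concentrates on token 0
hot = softmax_with_temperature(logits, 1.5)   # flatter: mass spreads across tokens
nucleus = top_p_filter(hot, 0.9)              # drops the low-probability tail

print([round(x, 3) for x in cold])
print([round(x, 3) for x in hot])
print(sorted(nucleus))
```

Lowering the temperature sharpens the distribution toward the top token, while Top-P trims the tail after the distribution has been computed, which is why the two settings are often tuned together.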

Related Resources

GitHub: temperature_eval (experiment code)

Original Abstract

The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.