Exploring the Impact of Temperature on Large Language Models: Hot or Cold?
TL;DR Highlight
An experimental study showing that the optimal LLM temperature varies by task type across six capabilities, plus a BERT-based selector that picks the temperature automatically.
Who Should Read
Backend developers using the default temperature (1.0) for LLM API calls, or AI engineers looking to improve output quality in RAG/agent systems.
Core Mechanics
- Optimal temperature varies dramatically by capability — translation/summarization works best near 0, while creativity/causal reasoning peaks around 1.3
- Instruction following is safe below 1.0 but degrades sharply past the model's mutation temperature, which grows with size: roughly 1.0-1.3 for small models, 1.3-1.6 for medium, 1.6-1.9 for large
- A BERT-based automatic selector can choose the right temperature per task, significantly improving performance on smaller models
- Performance swing from temperature alone: up to 192% for machine translation, 187% for creativity
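Mechanically, temperature divides the logits before the softmax, which is why low values sharpen the output distribution toward the top token and high values flatten it. A minimal sketch (standard-library Python; the logit values are illustrative, not from the paper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature divides the logits before softmax:
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-greedy: top token dominates
hot = softmax_with_temperature(logits, 1.3)   # flatter: more sampling diversity
```

At temperature 0.1 the top token carries almost all the probability mass, which is the behavior translation and summarization benefit from; at 1.3 the mass spreads out, which helps creative generation.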
Evidence
- Small model max performance variation by temperature: MT 192.32%, CT (creativity) 186.81%, IF (instruction following) 154.65%
- Llama-3.2-1B with BERT Selector: SuperGLUE average accuracy 0.252 → 0.494 (~96% improvement)
- Meta-Llama-3-8B with BERT Selector: significant improvements across multiple benchmarks
How to Apply
- For translation/summarization API calls, set temperature to 0.1-0.3 for better quality. For brainstorming/story generation, 1.0-1.3 is optimal.
- For agent/RAG system nodes where instruction following matters (tool calls, JSON output), cap temperature at 0.7 for stability.
- Small models (1B-4B) are extremely sensitive to temperature — use a task-specific temperature selector or at minimum avoid the default 1.0 for all tasks.
Code Example
```python
# Temperature guidelines by task type (based on paper results)
TEMPERATURE_MAP = {
    'machine_translation': 0.1,   # lower is better
    'summarization': 0.1,         # lower is better
    'instruction_following': 0.7, # 1.0 or below recommended
    'creativity': 1.3,            # medium-to-high values are optimal
    'causal_reasoning': 1.3,      # slightly higher values help
    'in_context_learning': 0.4,   # lower values are more stable
}

def get_optimal_temperature(task_type: str, model_size: str = 'large') -> float:
    """
    model_size: 'small' (1B-4B), 'medium' (6B-13B), 'large' (40B-80B)
    Smaller models are more sensitive to temperature changes,
    so cap their temperature more conservatively.
    """
    base_temp = TEMPERATURE_MAP.get(task_type, 0.7)
    if model_size == 'small':
        # Smaller models have a lower mutation temperature, so cap lower
        return min(base_temp, 0.7)
    elif model_size == 'medium':
        return min(base_temp, 1.0)
    return base_temp  # use as-is for large models

# Usage example (openai>=1.0 client API)
from openai import OpenAI

client = OpenAI()
task = 'machine_translation'
temp = get_optimal_temperature(task, model_size='small')
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': 'Translate to Korean: Hello world'}],
    temperature=temp,
)
```
Terminology
Related Resources
Original Abstract
The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.
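The paper's selector fine-tunes BERT to map a prompt to its task type and then to a temperature. The sketch below shows only the overall shape of such a selector; the keyword-based `classify_task` is a hypothetical stand-in for the fine-tuned BERT classifier, and the temperature values are the task-level guidelines from above, not the paper's learned outputs.

```python
# Per-prompt temperature selection: classify the task, then look up a
# temperature. A trivial keyword rule plays the role of the paper's
# fine-tuned BERT classifier.
TASK_TEMPERATURE = {
    'machine_translation': 0.1,
    'summarization': 0.1,
    'creativity': 1.3,
    'default': 0.7,  # conservative fallback for unrecognized tasks
}

def classify_task(prompt: str) -> str:
    """Placeholder for the paper's BERT task classifier."""
    p = prompt.lower()
    if 'translate' in p:
        return 'machine_translation'
    if 'summarize' in p or 'tl;dr' in p:
        return 'summarization'
    if 'story' in p or 'poem' in p:
        return 'creativity'
    return 'default'

def select_temperature(prompt: str) -> float:
    return TASK_TEMPERATURE[classify_task(prompt)]
```

In practice the classifier would run once per request, in front of the LLM call, and its output would be passed as the `temperature` parameter; this is where the paper reports the largest gains for small and medium models.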