GPT-5: Key characteristics, pricing and system card
TL;DR Highlight
OpenAI launched the GPT-5 model family (regular, mini, nano) focused on stable, less error-prone practical improvements rather than revolutionary leaps, with aggressively competitive pricing.
Who Should Read
Backend/fullstack developers using or evaluating OpenAI API, or team leads focused on LLM model selection and cost optimization.
Core Mechanics
- In ChatGPT, GPT-5 is a hybrid system with a router auto-switching between fast and deep reasoning models. In the API, it's 3 variants (regular/mini/nano) × 4 reasoning levels (minimal/low/medium/high).
- 272K input tokens, 128K output tokens (including reasoning tokens), supports text and image multimodal input.
- Input pricing halved vs. GPT-4o, with 90% token caching discount — very aggressive cost positioning.
- Reduced hallucination rates were highlighted by Simon Willison, though the community debated whether Claude Sonnet/Opus remains more reliable for daily coding.
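The token limits above have a practical consequence: reasoning tokens are billed as output and count against the 128K output budget. A minimal sketch of a pre-flight check (the function name and thresholds are illustrative, taken from the figures above):

```python
# Assumed limits from the article: 272K input, 128K output tokens.
MAX_INPUT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000  # includes hidden reasoning tokens

def fits_context(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request fits within GPT-5's token limits.

    Reasoning tokens count against the output budget, so leave
    headroom when running at higher reasoning levels.
    """
    return (input_tokens <= MAX_INPUT_TOKENS
            and requested_output_tokens <= MAX_OUTPUT_TOKENS)

print(fits_context(250_000, 32_000))   # fits
print(fits_context(300_000, 32_000))   # input exceeds 272K
```

In practice you would measure `input_tokens` with a tokenizer rather than estimate it, but the headroom logic is the same.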
Evidence
- GPT-5 being 'incremental' rather than 'revolutionary' was seen as evidence of diminishing returns from pure scaling — the shift to router optimization and sub-model composition itself signals the old approach's limits.
- Simon's reduced hallucination claims were debated — some argued Claude 4 Sonnet/Opus was more reliable for daily coding tasks.
- Token caching with 90% discount makes repeated similar requests significantly cheaper — beneficial for chatbot and agent use cases.
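The caching economics are easy to quantify. A back-of-the-envelope sketch (the per-million-token prices are assumptions for illustration; check OpenAI's pricing page for current figures, only the 90% discount ratio comes from the article):

```python
# Assumed input price in USD per 1M tokens; the 90% cached discount
# is from the article, the base price is an illustrative assumption.
PRICE_PER_M_INPUT = 1.25
PRICE_PER_M_CACHED = PRICE_PER_M_INPUT * 0.10  # 90% discount

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    """Cost of one request's input, split into cached vs. fresh tokens."""
    fresh = total_tokens - cached_tokens
    return (fresh * PRICE_PER_M_INPUT
            + cached_tokens * PRICE_PER_M_CACHED) / 1_000_000

# A chatbot turn carrying a 20K-token history, 18K of it a cache hit:
print(input_cost(20_000, 0))        # 0.025  (no caching)
print(input_cost(20_000, 18_000))   # 0.00475 (~81% cheaper)
```

This is why the discount matters most for chatbots and agents: the long, stable prefix (system prompt, tools, prior turns) dominates token volume and is exactly the part that caches.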
How to Apply
- If you use GPT-4o or o3 via the API, swapping in GPT-5 halves input costs at equal or better quality. Start testing at reasoning level 'minimal' to keep reasoning-token costs in check.
- For chat-based services, leverage the 90% token caching discount by keeping stable content (system prompt, tool definitions) at the start of the prompt and appending conversation turns after it, so the prefix stays cacheable across requests.
- Use the reasoning level parameter to optimize cost per use case: 'minimal' for simple Q&A, 'high' for complex math/code tasks.
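A minimal sketch of that per-use-case routing. The four effort values mirror the API's reasoning levels named above; the task categories and payload shape are illustrative assumptions, not OpenAI's API surface:

```python
# Illustrative mapping from task category to reasoning level.
# Categories are assumptions; the four levels come from the article.
EFFORT_BY_TASK = {
    "simple_qa": "minimal",
    "summarization": "low",
    "data_extraction": "medium",
    "math_or_code": "high",
}

def build_request(task_type: str, prompt: str) -> dict:
    """Assemble a request payload with a task-appropriate reasoning level."""
    effort = EFFORT_BY_TASK.get(task_type, "medium")  # mid-tier default
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

req = build_request("simple_qa", "What is the capital of France?")
print(req["reasoning"]["effort"])  # minimal
```

Since reasoning tokens are billed as output, routing simple traffic to 'minimal' is often a larger cost lever than model choice itself.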
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from the ground up without frameworks and get a concrete feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine that runs without fsync, achieving roughly 65% higher write performance under identical conditions. The core idea is a structure that combines preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent, re-downloading it even after deletion. Concerns have been raised about a possible GDPR violation and the environmental cost of doing this across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.