GPT-5: Key characteristics, pricing and system card
TL;DR Highlight
OpenAI launched the GPT-5 model family (regular, mini, nano) focused on stable, less error-prone practical improvements rather than revolutionary leaps, with aggressively competitive pricing.
Who Should Read
Backend/fullstack developers using or evaluating OpenAI API, or team leads focused on LLM model selection and cost optimization.
Core Mechanics
- In ChatGPT, GPT-5 is a hybrid system with a router auto-switching between fast and deep reasoning models. In the API, it's 3 variants (regular/mini/nano) × 4 reasoning levels (minimal/low/medium/high).
- 272K input tokens, 128K output tokens (including reasoning tokens), supports text and image multimodal input.
- Input pricing halved vs. GPT-4o, with 90% token caching discount — very aggressive cost positioning.
- Reduced hallucination rates were highlighted by Simon Willison, though the community debated whether Claude Sonnet/Opus remains more reliable for daily coding.
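The token limits above have a practical consequence: reasoning tokens are billed as output and count against the 128K output budget. A minimal sketch of a pre-flight check (the function name and thresholds are illustrative, taken from the figures above):

```python
# Assumed limits from the article: 272K input, 128K output tokens.
MAX_INPUT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000  # includes hidden reasoning tokens

def fits_context(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request fits within GPT-5's token limits.

    Reasoning tokens count against the output budget, so leave
    headroom when running at higher reasoning levels.
    """
    return (input_tokens <= MAX_INPUT_TOKENS
            and requested_output_tokens <= MAX_OUTPUT_TOKENS)

print(fits_context(250_000, 32_000))   # fits
print(fits_context(300_000, 32_000))   # input exceeds 272K
```

In practice you would measure `input_tokens` with a tokenizer rather than estimate it, but the headroom logic is the same.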
Evidence
- GPT-5 being 'incremental' rather than 'revolutionary' was seen as evidence of diminishing returns from pure scaling — the shift to router optimization and sub-model composition itself signals the old approach's limits.
- Simon's reduced hallucination claims were debated — some argued Claude 4 Sonnet/Opus was more reliable for daily coding tasks.
- Token caching with 90% discount makes repeated similar requests significantly cheaper — beneficial for chatbot and agent use cases.
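The caching economics are easy to quantify. A back-of-the-envelope sketch (the per-million-token prices are assumptions for illustration; check OpenAI's pricing page for current figures, only the 90% discount ratio comes from the article):

```python
# Assumed input price in USD per 1M tokens; the 90% cached discount
# is from the article, the base price is an illustrative assumption.
PRICE_PER_M_INPUT = 1.25
PRICE_PER_M_CACHED = PRICE_PER_M_INPUT * 0.10  # 90% discount

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    """Cost of one request's input, split into cached vs. fresh tokens."""
    fresh = total_tokens - cached_tokens
    return (fresh * PRICE_PER_M_INPUT
            + cached_tokens * PRICE_PER_M_CACHED) / 1_000_000

# A chatbot turn carrying a 20K-token history, 18K of it a cache hit:
print(input_cost(20_000, 0))        # 0.025  (no caching)
print(input_cost(20_000, 18_000))   # 0.00475 (~81% cheaper)
```

This is why the discount matters most for chatbots and agents: the long, stable prefix (system prompt, tools, prior turns) dominates token volume and is exactly the part that caches.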
How to Apply
- If you use GPT-4o or o3 via the API, swapping in GPT-5 halves input costs at equal or better quality. Start testing at reasoning level 'minimal' to keep reasoning-token costs in check.
- For chat-based services, leverage the 90% token caching discount by keeping stable content (system prompt, tool definitions) at the start of the prompt and appending conversation turns after it, so the prefix stays cacheable across requests.
- Use the reasoning level parameter to optimize cost per use case: 'minimal' for simple Q&A, 'high' for complex math/code tasks.
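A minimal sketch of that per-use-case routing. The four effort values mirror the API's reasoning levels named above; the task categories and payload shape are illustrative assumptions, not OpenAI's API surface:

```python
# Illustrative mapping from task category to reasoning level.
# Categories are assumptions; the four levels come from the article.
EFFORT_BY_TASK = {
    "simple_qa": "minimal",
    "summarization": "low",
    "data_extraction": "medium",
    "math_or_code": "high",
}

def build_request(task_type: str, prompt: str) -> dict:
    """Assemble a request payload with a task-appropriate reasoning level."""
    effort = EFFORT_BY_TASK.get(task_type, "medium")  # mid-tier default
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

req = build_request("simple_qa", "What is the capital of France?")
print(req["reasoning"]["effort"])  # minimal
```

Since reasoning tokens are billed as output, routing simple traffic to 'minimal' is often a larger cost lever than model choice itself.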
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from the ground up without frameworks and get a concrete feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine that runs without fsync, achieving roughly 65% higher write performance under identical conditions. The core idea is a structure that combines preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent, re-downloading it even after deletion. Concerns have been raised about a possible GDPR violation and the environmental cost of doing this across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.