This new technique saves 60% of my token expenses
TL;DR Highlight
Instructing an LLM to answer in a telegraphic style that drops articles, conjunctions, filler words, and copulas while keeping nouns, verbs, and key modifiers is reported to cut response tokens by about 60%.
Who Should Read
Backend developers concerned about API costs and token optimization, especially those using GPT-4-level models for simple tasks such as summarization, classification, and data extraction.
Core Mechanics
- A typical response runs to hundreds of tokens; forcing a 'caveman' style is reported to compress it to around 40 tokens while conveying the same meaning.
- Key prompt pattern: 'Drop articles, conjunctions, filler words, copulas. Keep nouns, verbs, key modifiers only.' The instruction explicitly removes articles (a, the), conjunctions (and, but), and copulas (is, are).
- The approach resembles telegram writing, and the source also likens it to American Sign Language (ASL). The aim is to raise meaning density by removing padding words.
- However, this technique is only valid for pipelines where 'readable responses' are not required. It's not suitable for responses exposed to end-users.
- The source also notes that 80% of prompts can be handled without expensive models (GPT-4, Claude Opus), so routing requests to cheaper models may be a more fundamental cost reduction than a compression style.
- The two approaches combine: route simple requests to smaller models (GPT-4o mini, Haiku, etc.) and apply the compression style on top.
Evidence
- A 60% reduction in token count compared to normal responses is reported, with one case where a response of hundreds of tokens was compressed to around 40 tokens. Because API cost is the sum of input-token and output-token charges, a 60% cut in output tokens reduces only the output portion of the bill, so the overall saving is smaller than 60% and grows with the share of cost that comes from output.
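The relationship between output compression and total cost can be sketched with a little arithmetic. The function names, the token counts, and the per-million-token prices below are hypothetical values chosen for illustration, not figures from the source or from any provider's price list. The sketch also charges for the extra input tokens that the style instruction itself adds.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request; prices are per one million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_fraction(input_tokens, output_tokens, input_price, output_price,
                     output_reduction=0.60, extra_prompt_tokens=0):
    """Fraction of total cost saved when output shrinks by output_reduction.

    extra_prompt_tokens models the style instruction added to the input.
    """
    baseline = request_cost(input_tokens, output_tokens,
                            input_price, output_price)
    compressed = request_cost(input_tokens + extra_prompt_tokens,
                              output_tokens * (1 - output_reduction),
                              input_price, output_price)
    return 1 - compressed / baseline


# Hypothetical prices: 1.0 per million input tokens, 4.0 per million output tokens.
input_heavy = savings_fraction(1000, 300, 1.0, 4.0, extra_prompt_tokens=30)
output_heavy = savings_fraction(200, 800, 1.0, 4.0, extra_prompt_tokens=30)
print(f"input-heavy request saves  {input_heavy:.1%}")
print(f"output-heavy request saves {output_heavy:.1%}")
```

Under these assumed numbers both savings stay below 60%, and the output-heavy request saves more than the input-heavy one.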
How to Apply
- In internal pipelines where humans do not read the response directly (classification, extraction, summarization, etc.), add a telegraphic style instruction to the system prompt. Example: 'Respond in compressed telegraphic style. Drop articles, conjunctions, filler words, copulas. Keep nouns, verbs, key modifiers only.'
- Create a router that first determines the complexity of the task, sending simple classification/summarization to GPT-4o mini or Claude Haiku and only complex reasoning to expensive models. Adding the compression style on top compounds the savings.
- If the response must be parsed, combine the telegraphic style with JSON mode or structured output so the response stays machine-readable while the text inside it is compressed.
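A minimal sketch of the routing idea described above, assuming the caller already knows the task type. The `Route` type, the `route_request` function, the `SIMPLE_TASKS` set, and the placeholder model names are all hypothetical; a real router would need its own complexity check and real model identifiers. The sketch also leaves the telegraphic instruction out whenever the response is shown to end users.

```python
from dataclasses import dataclass

TELEGRAPHIC_INSTRUCTION = (
    "Respond in compressed telegraphic style. "
    "Drop articles, conjunctions, filler words, copulas. "
    "Keep nouns, verbs, key modifiers only."
)

# Assumption: these task labels count as "simple" for routing purposes.
SIMPLE_TASKS = {"classification", "extraction", "summarization"}


@dataclass(frozen=True)
class Route:
    model: str
    system_prompt: str


def route_request(task_type: str, user_facing: bool,
                  small_model: str = "small-model",
                  large_model: str = "large-model") -> Route:
    """Pick a model tier and decide whether to compress the response style."""
    model = small_model if task_type in SIMPLE_TASKS else large_model
    # Telegraphic output is only for responses that no end user reads.
    system_prompt = "" if user_facing else TELEGRAPHIC_INSTRUCTION
    return Route(model=model, system_prompt=system_prompt)


print(route_request("classification", user_facing=False))
print(route_request("multi-step reasoning", user_facing=True))
```

Keeping the model choice and the style choice as separate decisions means either one can be tuned without touching the other.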
Code Example
system_prompt = """
Respond in compressed telegraphic style.
Drop articles, conjunctions, filler words, copulas.
Keep nouns, verbs, key modifiers only.
Meaning density over readability.
Write like a telegram costs per word.
"""
# Example input
user_message = "What are the main causes of climate change?"
# Normal response example (~80 tokens)
# "Climate change is primarily caused by the burning of fossil fuels, which releases greenhouse gases..."
# Telegraphic response example (~20 tokens)
# "Fossil fuel burning → CO2 rise → heat trap. Also: deforestation, agriculture, industry emissions."
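Before relying on a figure like 60%, it may help to measure the reduction on your own responses. The sketch below is one rough way to do that. It uses a whitespace word count as a crude stand-in for a real tokenizer, which is an assumption: actual token counts depend on the model's tokenizer and will differ. The helper names and both sample strings are hypothetical and were written for this sketch; they are not model output.

```python
def rough_token_count(text: str) -> int:
    """Whitespace word count, used here as a crude proxy for token count."""
    return len(text.split())


def reduction(normal: str, compressed: str) -> float:
    """Fraction by which the compressed response is shorter than the normal one."""
    baseline = rough_token_count(normal)
    if baseline == 0:
        raise ValueError("normal response is empty")
    return 1 - rough_token_count(compressed) / baseline


# Hypothetical responses written for this sketch, not real model output.
normal = "Climate change is mainly caused by the burning of fossil fuels."
telegraphic = "Fossil fuel burning main cause climate change."

print(f"rough reduction: {reduction(normal, telegraphic):.0%}")
```

For billing-accurate numbers, replace `rough_token_count` with the token usage reported by the API you call.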