BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
TL;DR Highlight
Nine plain-text prompting patterns can drive LLMs to generate explosively long outputs without any jailbreak, and a simple one-line system-prompt defense cuts output length by more than half on most models
Who Should Read
Backend/MLOps developers running LLM APIs in services who worry about operational costs and latency. Especially teams exposing LLMs in public chatbots or multi-tenant environments.
Core Mechanics
- Plain natural-language prompts without jailbreaks can push output length to the 5,000 token limit across all 9 models tested including GPT-5, Claude-Sonnet, and Gemini-2.5-Flash
- The two most powerful patterns, 'explicit length forcing' (e.g., 'make 1,200 quizzes') and 'tokenizer stress' (emoji/unicode combinations), reach up to 69% CSR@5k (the share of responses hitting the 5,000-token cap)
- Even when the model issues a refusal response, output often doesn't get shorter — a self-contradictory pattern where it says 'I can't do that, but here's a brief summary...' then generates thousands of tokens
- Adding a single line 'Please provide a concise, precise response without unnecessary elaboration.' at the start of the system prompt reduces average tokens by 92% on Gemini-2.5-Flash (647→51) and 88% on Qwen-3-8B (1,301→152)
- Gemma-2-9B-It is relatively robust with high refusal rates against overflow strategies, while GPT-5 tends to follow long outputs without refusal — a difference in alignment design philosophy
- When the same prompt is run 4 times, GPT-5, Claude-Sonnet, and Gemma series produce nearly consistent output lengths, but LLaMA-3.x and Gemini-2.5-Flash show significant run-to-run length instability
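The run-to-run instability noted above can be quantified with a simple length-dispersion statistic. A minimal sketch using the coefficient of variation over repeated output lengths; the model names and token counts below are made-up illustration data, not figures from the paper:

```python
# Sketch: quantifying run-to-run output-length instability for one prompt.
# "runs" maps hypothetical model names to output token counts observed
# across repeated calls with the same prompt.
import statistics

def length_instability(token_counts):
    """Coefficient of variation (std/mean) of output lengths across runs;
    higher values mean a less predictable output budget."""
    mean = statistics.mean(token_counts)
    return statistics.pstdev(token_counts) / mean if mean else 0.0

runs = {
    "stable-model":   [980, 1010, 995, 1005],   # consistent lengths
    "unstable-model": [150, 4800, 900, 5000],   # large per-run swings
}
for name, counts in runs.items():
    print(f"{name}: CV = {length_instability(counts):.2f}")
```

A CV near zero indicates stable budgets (the GPT-5/Claude-Sonnet/Gemma pattern described above), while values near one flag the LLaMA-3.x/Gemini-2.5-Flash style of instability.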
Evidence
- Average token reduction with conciseness reminder: GPT-5 1,933→1,365 (30%), Claude-Sonnet 310→112 (64%), Gemini-2.5-Flash 647→51 (92%), Qwen-3-8B 1,301→152 (88%), Gemma-3-4B-It 950→70 (93%)
- Explicit forced length CSR@5k: GPT-5 63%, Claude-Sonnet 69%, LLaMA-3.2-3B 39.8% — benign baseline is near 0%
- Tokenizer stress CSR@3k: GPT-5 75.5%, Qwen-3-8B 51.5%, Gemini-2.5-Flash 38.8%
- Cross-model correlation within LLaMA family 69-71% (high), GPT-5 and Claude-Sonnet correlate 51-54% with other families — overflow patterns transfer along model lineages
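The CSR@k and ECDF metrics cited above are straightforward to compute from per-response token counts. A minimal sketch with made-up sample counts (not the paper's data):

```python
# Sketch: cap-saturation rate (CSR@k) and an ECDF point from a list of
# per-response token counts. The counts below are hypothetical examples.
def csr_at_k(token_counts, k):
    """Fraction of responses whose length reaches at least k tokens."""
    return sum(1 for t in token_counts if t >= k) / len(token_counts)

def ecdf(token_counts, x):
    """Empirical CDF: fraction of responses with length <= x tokens."""
    return sum(1 for t in token_counts if t <= x) / len(token_counts)

token_counts = [120, 4800, 5000, 5000, 310, 2900, 5000, 75]
print(f"CSR@5k = {csr_at_k(token_counts, 5000):.3f}")  # 3/8 hit the 5,000-token cap
print(f"CSR@3k = {csr_at_k(token_counts, 3000):.3f}")
print(f"ECDF(1000) = {ecdf(token_counts, 1000):.3f}")
```

High CSR@k with a heavy right tail in the ECDF is exactly the signature the benchmark uses to flag overflow risk.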
How to Apply
- When connecting LLMs to public APIs or chatbots, add 'Please provide a concise, precise response without unnecessary elaboration.' as the first line of the system prompt to reduce verbosity by 50-90% on most models with no additional configuration
- Add a validation layer on user prompts that filters keywords like '1,000 item list,' 'copy entire text,' 'infinite repeat,' or enforce max_tokens through a gateway to reduce overflow DoS risk
- When choosing models for cost/latency stability, prioritize models like Gemma-2-9B-It with high refusal rates and low out-of-budget ratios; for models like GPT-5 that tend to follow long outputs, always set max_tokens limits
Code Example
# Example of adding a conciseness reminder to the system prompt (OpenAI SDK)
import openai

client = openai.OpenAI()

CONCISENESS_REMINDER = "Please provide a concise, precise response without unnecessary elaboration."

def chat_with_overflow_defense(user_message: str, max_tokens: int = 1000) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=max_tokens,  # hard cap as a second line of defense
        messages=[
            # Per the mitigation above, the reminder goes first in the system prompt.
            {"role": "system", "content": f"{CONCISENESS_REMINDER} You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
# Example overflow-inducing prompt patterns (for detection/filtering)
import re

OVERFLOW_PATTERNS = [
    r"\b\d[\d,]{2,}\s*(unique|different|distinct)",  # '1,200 unique items'
    r"\b(all|every|each)\b.*\b(integer|permutation|combination)",  # implicit enumeration
    r"\bwithout (stopping|end|limit)",  # infinite generation
    r"\bfull text\b|\bverbatim\b|\btranscribe\b",  # quote attack
]

def has_overflow_risk(prompt: str) -> bool:
    return any(re.search(p, prompt, re.IGNORECASE) for p in OVERFLOW_PATTERNS)
Original Abstract
We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation-a fixed conciseness reminder-attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.