BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
TL;DR Highlight
Nine plain-text prompting patterns can drive LLMs to generate explosively long outputs without any jailbreak, and a simple one-line system-prompt defense cuts output length by more than half on most models
Who Should Read
Backend/MLOps developers running LLM APIs in services who worry about operational costs and latency. Especially teams exposing LLMs in public chatbots or multi-tenant environments.
Core Mechanics
- Plain natural-language prompts without jailbreaks can push output length to the 5,000 token limit across all 9 models tested including GPT-5, Claude-Sonnet, and Gemini-2.5-Flash
- The two most powerful patterns, 'explicit length forcing' (e.g., 'make 1,200 quizzes') and 'tokenizer stress' (emoji/unicode combinations), reach up to 69% CSR@5k (the share of responses hitting the 5,000-token cap)
- Even when the model issues a refusal response, output often doesn't get shorter — a self-contradictory pattern where it says 'I can't do that, but here's a brief summary...' then generates thousands of tokens
- Adding a single line 'Please provide a concise, precise response without unnecessary elaboration.' at the start of the system prompt reduces average tokens by 92% on Gemini-2.5-Flash (647→51) and 88% on Qwen-3-8B (1,301→152)
- Gemma-2-9B-It is relatively robust with high refusal rates against overflow strategies, while GPT-5 tends to follow long outputs without refusal — a difference in alignment design philosophy
- When the same prompt is run 4 times, GPT-5, Claude-Sonnet, and Gemma series produce nearly consistent output lengths, but LLaMA-3.x and Gemini-2.5-Flash show significant run-to-run length instability
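The run-to-run instability noted above can be quantified with a simple length-dispersion statistic. A minimal sketch using the coefficient of variation over repeated output lengths; the model names and token counts below are made-up illustration data, not figures from the paper:

```python
# Sketch: quantifying run-to-run output-length instability for one prompt.
# "runs" maps hypothetical model names to output token counts observed
# across repeated calls with the same prompt.
import statistics

def length_instability(token_counts):
    """Coefficient of variation (std/mean) of output lengths across runs;
    higher values mean a less predictable output budget."""
    mean = statistics.mean(token_counts)
    return statistics.pstdev(token_counts) / mean if mean else 0.0

runs = {
    "stable-model":   [980, 1010, 995, 1005],   # consistent lengths
    "unstable-model": [150, 4800, 900, 5000],   # large per-run swings
}
for name, counts in runs.items():
    print(f"{name}: CV = {length_instability(counts):.2f}")
```

A CV near zero indicates stable budgets (the GPT-5/Claude-Sonnet/Gemma pattern described above), while values near one flag the LLaMA-3.x/Gemini-2.5-Flash style of instability.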
Evidence
- Average token reduction with conciseness reminder: GPT-5 1,933→1,365 (30%), Claude-Sonnet 310→112 (64%), Gemini-2.5-Flash 647→51 (92%), Qwen-3-8B 1,301→152 (88%), Gemma-3-4B-It 950→70 (93%)
- Explicit forced length CSR@5k: GPT-5 63%, Claude-Sonnet 69%, LLaMA-3.2-3B 39.8% — benign baseline is near 0%
- Tokenizer stress CSR@3k: GPT-5 75.5%, Qwen-3-8B 51.5%, Gemini-2.5-Flash 38.8%
- Cross-model correlation within LLaMA family 69-71% (high), GPT-5 and Claude-Sonnet correlate 51-54% with other families — overflow patterns transfer along model lineages
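The CSR@k and ECDF metrics cited above are straightforward to compute from per-response token counts. A minimal sketch with made-up sample counts (not the paper's data):

```python
# Sketch: cap-saturation rate (CSR@k) and an ECDF point from a list of
# per-response token counts. The counts below are hypothetical examples.
def csr_at_k(token_counts, k):
    """Fraction of responses whose length reaches at least k tokens."""
    return sum(1 for t in token_counts if t >= k) / len(token_counts)

def ecdf(token_counts, x):
    """Empirical CDF: fraction of responses with length <= x tokens."""
    return sum(1 for t in token_counts if t <= x) / len(token_counts)

token_counts = [120, 4800, 5000, 5000, 310, 2900, 5000, 75]
print(f"CSR@5k = {csr_at_k(token_counts, 5000):.3f}")  # 3/8 hit the 5,000-token cap
print(f"CSR@3k = {csr_at_k(token_counts, 3000):.3f}")
print(f"ECDF(1000) = {ecdf(token_counts, 1000):.3f}")
```

High CSR@k with a heavy right tail in the ECDF is exactly the signature the benchmark uses to flag overflow risk.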
How to Apply
- When connecting LLMs to public APIs or chatbots, add 'Please provide a concise, precise response without unnecessary elaboration.' as the first line of the system prompt to reduce verbosity by 50-90% on most models with no additional configuration
- Add a validation layer on user prompts that filters keywords like '1,000 item list,' 'copy entire text,' 'infinite repeat,' or enforce max_tokens through a gateway to reduce overflow DoS risk
- When choosing models for cost/latency stability, prioritize models like Gemma-2-9B-It with high refusal rates and low out-of-budget ratios; for models like GPT-5 that tend to follow long outputs, always set max_tokens limits
Code Example
# Example of adding a conciseness reminder to the system prompt (OpenAI SDK)
import openai

client = openai.OpenAI()

CONCISENESS_REMINDER = "Please provide a concise, precise response without unnecessary elaboration."

def chat_with_overflow_defense(user_message: str, max_tokens: int = 1000) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=max_tokens,  # hard cap as a second line of defense
        messages=[
            # Per the mitigation above, the reminder goes first in the system prompt.
            {"role": "system", "content": f"{CONCISENESS_REMINDER} You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
# Example overflow-inducing prompt patterns (for detection/filtering)
import re

OVERFLOW_PATTERNS = [
    r"\b\d[\d,]{2,}\s*(unique|different|distinct)",  # '1,200 unique items'
    r"\b(all|every|each)\b.*\b(integer|permutation|combination)",  # implicit enumeration
    r"\bwithout (stopping|end|limit)",  # infinite generation
    r"\bfull text\b|\bverbatim\b|\btranscribe\b",  # quote attack
]

def has_overflow_risk(prompt: str) -> bool:
    return any(re.search(p, prompt, re.IGNORECASE) for p in OVERFLOW_PATTERNS)
Original Abstract
We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation-a fixed conciseness reminder-attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.