Dynamic Context Evolution for Scalable Synthetic Data Generation
TL;DR Highlight
A framework that drives cross-batch duplication and repetition to zero in large-scale synthetic data generation with LLMs, using three mechanisms (VTS + Semantic Memory + Adaptive Prompt Evolution).
Who Should Read
ML engineers and data-pipeline developers who generate large volumes of synthetic training data with LLM APIs, especially those hitting duplicate-data problems when generating across hundreds to thousands of batches.
Core Mechanics
- Prompting each batch independently causes LLMs to regenerate the same concepts, a phenomenon called 'cross-batch mode collapse'; in the education domain, 34% of the last 50 batches duplicated the first 50.
- DCE combines three mechanisms: ① VTS (Verbalized Tail Sampling, the model itself discards ideas it deems 'obvious') ② Semantic Memory (stores embeddings in ChromaDB to reject similar ideas) ③ Adaptive Prompt Evolution (reconstructs the prompt based on memory state for each batch).
- VTS removes 'obvious' ideas and dedup removes 'semantically duplicated' ones; the two operate on nearly disjoint sets (96.9% of VTS rejections are semantically novel by the dedup criterion), so they must be used together.
- Adaptive Prompt cycles through four strategies in a round-robin fashion: Gap targeting (focus on less generated categories), Assumption inversion (reverse assumptions of recent ideas), Cross-industry stimulus (introduce perspectives from other industries), Constraint variation (impose extreme constraints).
- Compared with GPT-5-mini, Claude Haiku 4.5 is far more repetitive (dedup rejection rate 30.1% vs 5.7%); DCE cuts Claude's rejection rate from 30.1% to 11.0%, a 19.1-point drop. The same pipeline works without swapping models.
- The cost is approximately $0.50 per 1,000 candidates, with no fine-tuning or custom architectures required; it operates with standard API calls only.
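The four-strategy round-robin in the Adaptive Prompt bullet above can be sketched in a few lines. The strategy names follow the summary; the template wording is invented for illustration.

```python
# Hedged sketch of the round-robin strategy rotation described above.
# Strategy names come from the summary; templates are hypothetical.
STRATEGIES = [
    ("gap_targeting", "Focus on these under-generated categories: {gaps}"),
    ("assumption_inversion", "Invert a core assumption of these recent ideas: {recent}"),
    ("cross_industry_stimulus", "Borrow a perspective from an unrelated industry."),
    ("constraint_variation", "Impose one extreme constraint (e.g. zero budget)."),
]

def strategy_for_batch(batch_idx: int) -> tuple[str, str]:
    """Cycle through the four strategies, one per batch."""
    return STRATEGIES[batch_idx % len(STRATEGIES)]
```

Rotation rather than random choice guarantees every strategy fires once in any window of four consecutive batches.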
Evidence
- Collapse rate: DCE 0.0 ± 0.0% vs naive 5.6 ± 2.0% (3 seeds, packaging domain). In the education domain, DCE reduces the naive 34% collapse to 0%.
- HDBSCAN cluster count: DCE yields a consistent 17~18 clusters per seed vs naive's high variance of 2~17 (seeds 42/123/456: DCE 18/18/17, naive 2/14/17).
- Downstream classifier (DeBERTa-base) F1: DCE 30.5% vs naive 15.2% in the packaging domain (approximately 2x). In the education domain, relaxing δ to 0.90 achieves 44.9% F1 compared to naive.
- VTS analysis: 96.9% (826 of 852) of the ideas VTS rejects are semantically novel by the dedup criterion, i.e. VTS removes only the 'obvious' without destroying diversity.
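The collapse numbers above imply a duplication metric along these lines: the share of later-batch ideas whose embedding is within cosine similarity δ of some earlier idea. This is a minimal pure-Python sketch; the paper's exact collapse definition and embedding model may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def duplication_rate(early, late, delta=0.85):
    """Fraction of `late` vectors that near-duplicate any `early` vector."""
    dup = sum(1 for v in late if any(cosine(v, e) >= delta for e in early))
    return dup / len(late)
```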
How to Apply
- If generating training data with LLM APIs in 100+ batches: set up ChromaDB as the backend and, before each batch, inject the 10 most recent ideas, dense clusters, and unexplored categories into the prompt. Adding a cosine-similarity dedup filter at δ=0.85 immediately drops collapse to 0%.
- If domain-specific δ tuning is needed: relax δ to 0.90 for domains with high natural duplication rates (education/Q&A) to preserve training-set size, and keep the default δ=0.85 for diverse domains (packaging/creative). If F1 performance is low, try relaxing δ first.
- If the model (e.g. GPT-5-mini) does not expose temperature/top-p parameters in the API: token-level diversity control is impossible, so concept-level approaches like DCE are the only option. Prompt VTS to make the model emit only 'ideas with P < 0.10'.
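The memory-plus-dedup step in the first bullet can be sketched without ChromaDB: a plain list stands in for the vector store, and `admit` applies the δ=0.85 cosine filter. In the real pipeline this would be a ChromaDB collection queried with actual embedding-model vectors; everything here is a stand-in.

```python
import math

class SemanticMemory:
    """Minimal stand-in for the ChromaDB-backed semantic memory."""

    def __init__(self, delta=0.85):
        self.delta = delta      # dedup threshold from the summary
        self.vectors = []
        self.texts = []

    def _cos(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def admit(self, text, vec):
        """Store the idea unless it near-duplicates memory; True if stored."""
        if any(self._cos(vec, v) >= self.delta for v in self.vectors):
            return False
        self.vectors.append(vec)
        self.texts.append(text)
        return True

    def recent(self, k=10):
        """The k most recent admitted ideas, for prompt injection."""
        return self.texts[-k:]
```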
Code Example
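A minimal, self-contained sketch of one DCE batch step combining the three mechanisms: adaptive prompt construction, VTS filtering on the model's self-assessed obviousness probability, and semantic-memory dedup. The LLM and embedding calls are mocked; the thresholds τ=0.10 and δ=0.85 follow the summary, and all function signatures are assumptions, not the authors' API.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

STRATEGIES = ["gap_targeting", "assumption_inversion",
              "cross_industry_stimulus", "constraint_variation"]

def build_prompt(batch_idx, recent_ideas):
    """Adaptive Prompt Evolution: rotate strategies, inject memory state."""
    strategy = STRATEGIES[batch_idx % len(STRATEGIES)]
    return (f"Strategy: {strategy}. Avoid these recent ideas: {recent_ideas}. "
            "Tag each idea with its estimated probability of being obvious; "
            "only emit ideas you judge to have P < 0.10.")

def run_batch(llm, embed, memory_texts, memory_vecs, batch_idx,
              tau=0.10, delta=0.85):
    """One DCE batch. `llm(prompt)` yields (idea, p_obvious) pairs."""
    prompt = build_prompt(batch_idx, memory_texts[-10:])
    accepted = []
    for idea, p_obvious in llm(prompt):
        if p_obvious >= tau:        # VTS: drop self-assessed obvious ideas
            continue
        vec = embed(idea)
        if any(cos(vec, v) >= delta for v in memory_vecs):
            continue                # semantic memory: reject near-duplicates
        memory_texts.append(idea)
        memory_vecs.append(vec)
        accepted.append(idea)
    return accepted
```

In production the mock `llm` would be a standard chat-completion API call (no fine-tuning needed) and `embed` a real embedding model, with the two memory lists replaced by a persistent ChromaDB collection.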
Original Abstract
Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.