Dynamic Context Evolution for Scalable Synthetic Data Generation
TL;DR Highlight
A framework that drives cross-batch duplication and repetition to zero in large-scale synthetic data generation with LLMs, using three mechanisms (VTS + Semantic Memory + Adaptive Prompt Evolution).
Who Should Read
ML engineers and data-pipeline developers who generate large volumes of synthetic training data with LLM APIs, especially those hitting duplicate-data problems when generating across hundreds to thousands of batches.
Core Mechanics
- Prompting each batch independently causes LLMs to regenerate the same concepts, a phenomenon called 'cross-batch mode collapse'; in the education domain, 34% of the last 50 batches duplicated the first 50.
- DCE combines three mechanisms: ① VTS (Verbalized Tail Sampling, the model itself discards ideas it deems 'obvious') ② Semantic Memory (stores embeddings in ChromaDB to reject similar ideas) ③ Adaptive Prompt Evolution (reconstructs the prompt based on memory state for each batch).
- VTS removes 'obvious' ideas and dedup removes 'semantically duplicated' ones; the two operate on nearly disjoint sets (96.9% of VTS rejections are semantically novel by the dedup criterion), so they must be used together.
- Adaptive Prompt cycles through four strategies in a round-robin fashion: Gap targeting (focus on less generated categories), Assumption inversion (reverse assumptions of recent ideas), Cross-industry stimulus (introduce perspectives from other industries), Constraint variation (impose extreme constraints).
- Compared with GPT-5-mini, Claude Haiku 4.5 is far more repetitive (dedup rejection rate 30.1% vs 5.7%); DCE cuts Claude's rejection rate from 30.1% to 11.0%, a 19.1-point drop. The same pipeline works without swapping models.
- The cost is approximately $0.50 per 1,000 candidates, with no fine-tuning or custom architectures required; it operates with standard API calls only.
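The four-strategy round-robin in the Adaptive Prompt bullet above can be sketched in a few lines. The strategy names follow the summary; the template wording is invented for illustration.

```python
# Hedged sketch of the round-robin strategy rotation described above.
# Strategy names come from the summary; templates are hypothetical.
STRATEGIES = [
    ("gap_targeting", "Focus on these under-generated categories: {gaps}"),
    ("assumption_inversion", "Invert a core assumption of these recent ideas: {recent}"),
    ("cross_industry_stimulus", "Borrow a perspective from an unrelated industry."),
    ("constraint_variation", "Impose one extreme constraint (e.g. zero budget)."),
]

def strategy_for_batch(batch_idx: int) -> tuple[str, str]:
    """Cycle through the four strategies, one per batch."""
    return STRATEGIES[batch_idx % len(STRATEGIES)]
```

Rotation rather than random choice guarantees every strategy fires once in any window of four consecutive batches.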
Evidence
- Collapse rate: DCE 0.0 ± 0.0% vs naive 5.6 ± 2.0% (3 seeds, packaging domain). In the education domain, DCE reduces the naive 34% collapse to 0%.
- HDBSCAN cluster count: DCE yields a consistent 17~18 clusters per seed vs naive's high variance of 2~17 (seeds 42/123/456: DCE 18/18/17, naive 2/14/17).
- Downstream classifier (DeBERTa-base) F1: DCE 30.5% vs naive 15.2% in the packaging domain (approximately 2x). In the education domain, relaxing δ to 0.90 achieves 44.9% F1 compared to naive.
- VTS analysis: 96.9% (826 of 852) of the ideas VTS rejects are semantically novel by the dedup criterion, i.e. VTS removes only the 'obvious' without destroying diversity.
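The collapse numbers above imply a duplication metric along these lines: the share of later-batch ideas whose embedding is within cosine similarity δ of some earlier idea. This is a minimal pure-Python sketch; the paper's exact collapse definition and embedding model may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def duplication_rate(early, late, delta=0.85):
    """Fraction of `late` vectors that near-duplicate any `early` vector."""
    dup = sum(1 for v in late if any(cosine(v, e) >= delta for e in early))
    return dup / len(late)
```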
How to Apply
- If generating training data with LLM APIs in 100+ batches: set up ChromaDB as the backend and, before each batch, inject the 10 most recent ideas, dense clusters, and unexplored categories into the prompt. Adding a cosine-similarity dedup filter at δ=0.85 immediately drops collapse to 0%.
- If domain-specific δ tuning is needed: relax δ to 0.90 for domains with high natural duplication rates (education/Q&A) to preserve training-set size, and keep the default δ=0.85 for diverse domains (packaging/creative). If F1 performance is low, try relaxing δ first.
- If the model (e.g. GPT-5-mini) does not expose temperature/top-p parameters in the API: token-level diversity control is impossible, so concept-level approaches like DCE are the only option. Prompt VTS to make the model emit only 'ideas with P < 0.10'.
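The memory-plus-dedup step in the first bullet can be sketched without ChromaDB: a plain list stands in for the vector store, and `admit` applies the δ=0.85 cosine filter. In the real pipeline this would be a ChromaDB collection queried with actual embedding-model vectors; everything here is a stand-in.

```python
import math

class SemanticMemory:
    """Minimal stand-in for the ChromaDB-backed semantic memory."""

    def __init__(self, delta=0.85):
        self.delta = delta      # dedup threshold from the summary
        self.vectors = []
        self.texts = []

    def _cos(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def admit(self, text, vec):
        """Store the idea unless it near-duplicates memory; True if stored."""
        if any(self._cos(vec, v) >= self.delta for v in self.vectors):
            return False
        self.vectors.append(vec)
        self.texts.append(text)
        return True

    def recent(self, k=10):
        """The k most recent admitted ideas, for prompt injection."""
        return self.texts[-k:]
```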
Code Example
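A minimal, self-contained sketch of one DCE batch step combining the three mechanisms: adaptive prompt construction, VTS filtering on the model's self-assessed obviousness probability, and semantic-memory dedup. The LLM and embedding calls are mocked; the thresholds τ=0.10 and δ=0.85 follow the summary, and all function signatures are assumptions, not the authors' API.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

STRATEGIES = ["gap_targeting", "assumption_inversion",
              "cross_industry_stimulus", "constraint_variation"]

def build_prompt(batch_idx, recent_ideas):
    """Adaptive Prompt Evolution: rotate strategies, inject memory state."""
    strategy = STRATEGIES[batch_idx % len(STRATEGIES)]
    return (f"Strategy: {strategy}. Avoid these recent ideas: {recent_ideas}. "
            "Tag each idea with its estimated probability of being obvious; "
            "only emit ideas you judge to have P < 0.10.")

def run_batch(llm, embed, memory_texts, memory_vecs, batch_idx,
              tau=0.10, delta=0.85):
    """One DCE batch. `llm(prompt)` yields (idea, p_obvious) pairs."""
    prompt = build_prompt(batch_idx, memory_texts[-10:])
    accepted = []
    for idea, p_obvious in llm(prompt):
        if p_obvious >= tau:        # VTS: drop self-assessed obvious ideas
            continue
        vec = embed(idea)
        if any(cos(vec, v) >= delta for v in memory_vecs):
            continue                # semantic memory: reject near-duplicates
        memory_texts.append(idea)
        memory_vecs.append(vec)
        accepted.append(idea)
    return accepted
```

In production the mock `llm` would be a standard chat-completion API call (no fine-tuning needed) and `embed` a real embedding model, with the two memory lists replaced by a persistent ChromaDB collection.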
Original Abstract
Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.