Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Models
TL;DR Highlight
LLM context compression quality is determined by data distribution, not model architecture — and the decoder's training data dominates over the encoder's.
Who Should Read
ML engineers integrating or customizing context compression techniques (LLMLingua, ICAE, etc.) in RAG pipelines or long-context workloads. Developers designing encoder-decoder compression systems or using domain-specific LLMs as compression modules.
Core Mechanics
- Higher input text entropy (information complexity) leads to worse compression quality — math/code-heavy text is much harder to compress than general web text
- When encoder and decoder are pretrained on different data distributions, compression quality drops sharply — Qwen2.5-Base + Qwen2.5-Coder combo drops BLEU from 80.02 to 64.61
- Compression quality is dominated by the decoder's training distribution, not the encoder's — matching the decoder distribution is far more effective than swapping the encoder
- Scaling up model size cannot fully overcome distribution mismatch — a matched 500M+500M pair performs as well as or better than a mismatched 500M+1B pair
- Extreme distribution mismatch (decoder trained on random text) produces completely meaningless noise as output
- For compute efficiency, a small encoder + large decoder is the optimal split — enlarging the encoder adds latency without meaningful quality gains
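The first finding can be made concrete with a rough proxy: simple Shannon entropy over the character distribution (a simplification, not the paper's encoder-measured entropy) already separates symbol-heavy code from plain prose. The sample strings below are illustrative.

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

general_text = "the cat sat on the mat and the dog lay by the warm door"
code_text = "def f(x): return {k: v ** 2 for k, v in x.items() if v != 0}"

# Symbol-heavy code text scores higher entropy than repetitive prose,
# tracking the finding that it is harder to compress faithfully.
print(f"general: {unigram_entropy(general_text):.2f} bits/char")
print(f"code:    {unigram_entropy(code_text):.2f} bits/char")
```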
Evidence
- Matched encoder-decoder distribution: F1 96.96%, BLEU 80.02% vs. mismatched (Base+Coder): F1 92.15%, BLEU 64.61% — a ~19% relative BLEU drop
- As decoder distribution mismatch worsens (D1→D6), F1 monotonically drops 81.57%→69.44%, BLEU plummets 42.12%→27.02%
- Scaling the decoder from 500M→1B under mismatch improves F1 from 75.86% to 82.87% — only barely above the matched 500M+500M pair (81.57%)
- Encoder compression latency: 200M→1B grows 1.4ms→7.4ms (5x), decoder latency grows only 4.7ms→5.5ms (17%) — encoder scaling costs far more in latency
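The relative figures in the bullets above follow directly from the raw numbers; a quick sanity check (values copied from the bullets):

```python
# BLEU drop from distribution mismatch: 80.02 -> 64.61
bleu_drop = (80.02 - 64.61) / 80.02   # ~0.193, the ~19% relative drop
# Encoder latency when scaling 200M -> 1B: 1.4 ms -> 7.4 ms
enc_ratio = 7.4 / 1.4                 # ~5.3x
# Decoder latency over the same scaling: 4.7 ms -> 5.5 ms
dec_growth = (5.5 - 4.7) / 4.7        # ~0.17, i.e. ~17%
print(f"{bleu_drop:.1%} BLEU drop, {enc_ratio:.1f}x encoder latency, "
      f"{dec_growth:.1%} decoder latency growth")
# -> 19.3% BLEU drop, 5.3x encoder latency, 17.0% decoder latency growth
```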
How to Apply
- When attaching an external compression module (encoder) to RAG, fine-tune on data that matches the domain of the answer-generating LLM (decoder) — that's more impactful than tuning the encoder
- For domain-specific compression (code, math docs, etc.), pick a decoder pretrained on that domain data instead of swapping the encoder — decoder choice takes priority
- When designing a compression system, allocate budget toward a smaller encoder and larger decoder — enlarging the encoder just adds latency with minimal quality gain
Code Example
# Compression module selection checklist (runnable sketch; model names and
# load_domain_data are illustrative placeholders, not a real API)
# Core principle: prioritize matching the decoder's training distribution

# Map each candidate decoder to its dominant pretraining domain
DECODER_DOMAINS = {
    'Qwen2.5-Coder-7B': 'code',   # code-specialized
    'Qwen2.5-7B': 'general',
}

def load_domain_data(domain):
    # Placeholder: load a fine-tuning corpus for the given domain
    return f'{domain}_corpus'

# 1. Identify the domain to process
domain = 'code'  # or 'math', 'general', 'legal'

# 2. Check the decoder's (generative LLM's) training domain
decoder_model = 'Qwen2.5-Coder-7B'
decoder_domain = DECODER_DOMAINS[decoder_model]

# 3. Select the encoder (compressor): it must match the decoder's domain
#    Bad:        encoder=general, decoder=code    -> BLEU drops sharply
#    Good:       encoder=code,    decoder=code    -> optimal
#    Acceptable: encoder=general, decoder=general -> consistent

# 4. Fine-tuning data should also follow the decoder's domain
finetune_data = load_domain_data(decoder_domain)

# 5. Size allocation: keep the encoder small and the decoder large
encoder_size = '500M'
decoder_size = '7B'
# Scaling the encoder to 1B mainly adds latency with minimal quality gain
Original Abstract
The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research only focus on model-side improvements, the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework to systematically investigate it. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.