Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Models
TL;DR Highlight
LLM context compression quality is determined by data distribution, not model architecture — and the decoder's training data dominates over the encoder's.
Who Should Read
ML engineers integrating or customizing context compression techniques (LLMLingua, ICAE, etc.) in RAG pipelines or long-context workloads. Developers designing encoder-decoder compression systems or using domain-specific LLMs as compression modules.
Core Mechanics
- Higher input text entropy (information complexity) leads to worse compression quality — math/code-heavy text is much harder to compress than general web text
- When encoder and decoder are pretrained on different data distributions, compression quality drops sharply — Qwen2.5-Base + Qwen2.5-Coder combo drops BLEU from 80.02 to 64.61
- Compression quality is dominated by the decoder's training distribution, not the encoder's — matching the decoder distribution is far more effective than swapping the encoder
- Scaling up model size cannot fully overcome distribution mismatch — a matched 500M+500M pair performs as well as or better than a mismatched 500M+1B pair
- Extreme distribution mismatch (decoder trained on random text) produces completely meaningless noise as output
- For compute efficiency, a small encoder + large decoder is the optimal split — enlarging the encoder adds latency without meaningful quality gains
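The first finding can be made concrete with a rough proxy: simple Shannon entropy over the character distribution (a simplification, not the paper's encoder-measured entropy) already separates symbol-heavy code from plain prose. The sample strings below are illustrative.

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

general_text = "the cat sat on the mat and the dog lay by the warm door"
code_text = "def f(x): return {k: v ** 2 for k, v in x.items() if v != 0}"

# Symbol-heavy code text scores higher entropy than repetitive prose,
# tracking the finding that it is harder to compress faithfully.
print(f"general: {unigram_entropy(general_text):.2f} bits/char")
print(f"code:    {unigram_entropy(code_text):.2f} bits/char")
```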
Evidence
- Matched encoder-decoder distribution: F1 96.96%, BLEU 80.02% vs. mismatched (Base+Coder): F1 92.15%, BLEU 64.61% — a ~19% relative BLEU drop
- As decoder distribution mismatch worsens (D1→D6), F1 monotonically drops 81.57%→69.44%, BLEU plummets 42.12%→27.02%
- Scaling the decoder from 500M→1B under mismatch improves F1 from 75.86% to 82.87% — only barely above the matched 500M+500M pair (81.57%)
- Encoder compression latency: 200M→1B grows 1.4ms→7.4ms (5x), decoder latency grows only 4.7ms→5.5ms (17%) — encoder scaling costs far more in latency
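The relative figures in the bullets above follow directly from the raw numbers; a quick sanity check (values copied from the bullets):

```python
# BLEU drop from distribution mismatch: 80.02 -> 64.61
bleu_drop = (80.02 - 64.61) / 80.02   # ~0.193, the ~19% relative drop
# Encoder latency when scaling 200M -> 1B: 1.4 ms -> 7.4 ms
enc_ratio = 7.4 / 1.4                 # ~5.3x
# Decoder latency over the same scaling: 4.7 ms -> 5.5 ms
dec_growth = (5.5 - 4.7) / 4.7        # ~0.17, i.e. ~17%
print(f"{bleu_drop:.1%} BLEU drop, {enc_ratio:.1f}x encoder latency, "
      f"{dec_growth:.1%} decoder latency growth")
# -> 19.3% BLEU drop, 5.3x encoder latency, 17.0% decoder latency growth
```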
How to Apply
- When attaching an external compression module (encoder) to RAG, fine-tune on data that matches the domain of the answer-generating LLM (decoder) — that's more impactful than tuning the encoder
- For domain-specific compression (code, math docs, etc.), pick a decoder pretrained on that domain data instead of swapping the encoder — decoder choice takes priority
- When designing a compression system, allocate budget toward a smaller encoder and larger decoder — enlarging the encoder just adds latency with minimal quality gain
Code Example
# Compression module selection checklist (runnable sketch; model names and
# load_domain_data are illustrative placeholders, not a real API)
# Core principle: prioritize matching the decoder's training distribution

# Map each candidate decoder to its dominant pretraining domain
DECODER_DOMAINS = {
    'Qwen2.5-Coder-7B': 'code',   # code-specialized
    'Qwen2.5-7B': 'general',
}

def load_domain_data(domain):
    # Placeholder: load a fine-tuning corpus for the given domain
    return f'{domain}_corpus'

# 1. Identify the domain to process
domain = 'code'  # or 'math', 'general', 'legal'

# 2. Check the decoder's (generative LLM's) training domain
decoder_model = 'Qwen2.5-Coder-7B'
decoder_domain = DECODER_DOMAINS[decoder_model]

# 3. Select the encoder (compressor): it must match the decoder's domain
#    Bad:        encoder=general, decoder=code    -> BLEU drops sharply
#    Good:       encoder=code,    decoder=code    -> optimal
#    Acceptable: encoder=general, decoder=general -> consistent

# 4. Fine-tuning data should also follow the decoder's domain
finetune_data = load_domain_data(decoder_domain)

# 5. Size allocation: keep the encoder small and the decoder large
encoder_size = '500M'
decoder_size = '7B'
# Scaling the encoder to 1B mainly adds latency with minimal quality gain
Original Abstract
The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research only focus on model-side improvements, the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework to systematically investigate it. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.