Info-Gain Sampler

01

Abstract

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation.

To address this, we propose the Info-Gain Sampler, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers.

02

Key Findings

Five headline numbers from the paper.

Reasoning · Avg. Pass@1

+11.6 pp

Largest gain on Semi-AR MDMs

On SDAR-8B-Chat, Info-Gain raises average Pass@1 over LookUM, the strongest concurrent lookahead baseline, from 54.6% to 66.2% at K=1 (+11.6 pp) and from 45.1% to 55.8% at K=2 (+10.7 pp).

Cumulative Entropy ↓

78.4 → 48.6

38% lower trajectory uncertainty

On reasoning tasks, cumulative entropy H̃ drops from 78.4 to 48.6, about 62% of the best greedy baseline's value (a 38% reduction). Lower H̃ correlates with higher accuracy (Pearson r = −0.70).

Creative Writing · Win Rate

62.8%

Average win-rate vs. baselines

Across temperatures and configs, Info-Gain wins 62.8% of head-to-heads on AlpacaEval, peaking at 80.3% against Entropy at high stochasticity (τ = 1.5).

Text-to-Image · GenEval

58.2

+1.9 pp on GenEval, FID 43.3 → 38.1

On MMaDa with τ=0.4, Info-Gain raises GenEval avg. to 58.2 and improves ImageNet-512 FID by 5.2 points and IS by 9.7 points.

Practical Overhead

+24% time

Training-free, near-parallel

Acceleration via threshold γ = 0.8 keeps generation time within +24% and GPU memory within +20% of greedy samplers — no extra training, no KV-cache surgery.

Why it works

“Bidirectional attention lets MDMs look ahead. Greedy samplers throw that gift away. Info-Gain spends a tiny budget asking: which token, once revealed, makes the rest of the sequence easier?”

03

Method

Score each candidate by immediate certainty plus expected future information gain.

Info-Gain Sampler workflow diagram — At each decoding step, candidate actions are scored by **immediate certainty** plus the **information gain** they induce on the remaining masked positions. Acceleration via early-stopping keeps the per-step overhead low.

The myopia of greedy MDM samplers

Confidence/Entropy/Margin samplers commit to the locally easiest token at every step. They never ask: “if I commit here, do the other masks become easier or harder?” In MDMs, where attention is bidirectional, that question is cheap to answer.

The Info-Gain objective

For each candidate decoding action a, score it by the reduction in expected uncertainty over the remaining masked tokens after committing a. Combine this future term with the usual immediate certainty term and pick the highest-scoring action.
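
A minimal sketch of this scoring rule, under stated assumptions: `toy_model` is a hypothetical three-position stand-in for an MDM forward pass, immediate certainty is taken as the negative entropy at the chosen position, and the two terms are summed with a hypothetical `weight` of 1.0. None of these choices is confirmed by the paper; the sketch only shows how a lookahead term can overturn a greedy choice.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position (assumption for this sketch)


def toy_model(tokens):
    """Hypothetical stand-in for an MDM forward pass: per-position token probabilities."""
    probs = np.empty((3, 2))
    probs[0] = [0.6, 0.4]  # position 0: uncertain on its own
    # position 1: uniform until position 0 is revealed, then it copies position 0
    probs[1] = [0.5, 0.5] if tokens[0] == MASK else np.eye(2)[tokens[0]]
    probs[2] = [0.9, 0.1]  # position 2: locally confident, independent of the rest
    return probs


def entropy(p):
    """Shannon entropy (nats) along the last axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


def info_gain_scores(tokens, candidates, model, weight=1.0):
    """Score each (position, token) action: immediate certainty + weight * info gain."""
    h_now = entropy(model(tokens))
    scores = []
    for pos, tok in candidates:
        trial = tokens.copy()
        trial[pos] = tok  # tentatively commit the action
        rest = [i for i in range(len(trial)) if trial[i] == MASK]
        h_next = entropy(model(trial))
        # reduction in total entropy over the positions that remain masked
        gain = h_now[rest].sum() - h_next[rest].sum()
        scores.append(-h_now[pos] + weight * gain)
    return np.array(scores)


tokens = np.array([MASK, MASK, MASK])
candidates = [(0, 0), (1, 0), (2, 0)]
scores = info_gain_scores(tokens, candidates, toy_model)
best_action = candidates[int(np.argmax(scores))]
greedy_position = int(np.argmin(entropy(toy_model(tokens))))
print("greedy position:", greedy_position, "| info-gain action:", best_action)
```

In this toy setup a greedy entropy sampler would reveal position 2 first, while the combined score prefers position 0 because revealing it resolves position 1.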

Why it stays fast

Sample N = 8 candidates per step (not exhaustive).
Score them in a single forward pass, batched.
Skip lookahead when local certainty > γ = 0.8.
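
The three steps above can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: `propose_or_commit` is a hypothetical helper, "local certainty" is taken to be the maximum token probability at a masked position, and candidate positions are drawn uniformly from the masked set. The returned batch is what a single batched forward pass would score.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position (assumption for this sketch)


def propose_or_commit(tokens, probs, rng, n_candidates=8, gamma=0.8):
    """Commit greedily when confident; otherwise build a batch of candidate sequences."""
    masked = np.flatnonzero(tokens == MASK)
    conf = probs[masked].max(axis=-1)  # assumed notion of local certainty
    top = int(np.argmax(conf))
    if conf[top] > gamma:
        # Confident enough: skip lookahead and take the greedy action.
        pos = int(masked[top])
        return "commit", (pos, int(probs[pos].argmax()))
    # Otherwise sample N candidate actions and stack them into one batch,
    # so that all of them can be scored together in one forward pass.
    batch = np.tile(tokens, (n_candidates, 1))
    positions = rng.choice(masked, size=n_candidates)
    for row, pos in enumerate(positions):
        batch[row, pos] = rng.choice(probs.shape[1], p=probs[pos])
    return "lookahead", batch


rng = np.random.default_rng(0)
tokens = np.array([MASK, MASK, MASK])

confident = np.array([[0.95, 0.05], [0.5, 0.5], [0.6, 0.4]])
kind_a, action = propose_or_commit(tokens, confident, rng)

uncertain = np.array([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]])
kind_b, batch = propose_or_commit(tokens, uncertain, rng)
print(kind_a, action, "|", kind_b, batch.shape)
```

Each row of `batch` differs from the current sequence by exactly one committed token, which is why the candidates can share one batched model call.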

04

Results

Best in bold. All numbers from the paper.

Semi-AR MDM Reasoning (Pass@1, %)

SDAR-8B-Chat with block size 16, τ_token=0.7. H̃ = cumulative entropy (lower is better).

K	Sampler	GSM8K	MATH500	HumanEval	MBPP	Avg. ↑	H̃ ↓
2	Entropy	42.2	24.4	26.2	20.6	28.4	238.6
	Confidence	47.2	36.6	24.4	20.2	32.1	204.1
	Margin	45.2	22.4	19.5	19.8	26.7	230.9
	KLASS	50.4	32.3	30.7	26.6	35.0	210.3
	LookUM	75.3	44.9	28.2	31.8	45.1	103.2
	Info-Gain	82.7	54.6	46.3	39.4	55.8	74.1
1	Entropy	68.8	44.6	37.8	49.0	50.1	120.4
	Confidence	67.9	51.4	42.1	46.2	51.9	117.4
	Margin	65.3	40.2	32.3	43.2	45.3	138.2
	KLASS	69.9	42.3	45.7	46.6	51.1	105.3
	LookUM	80.3	60.0	38.2	39.8	54.6	53.7
	Info-Gain	87.9	61.8	62.2	53.0	66.2	41.0
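
As a quick illustrative check of the link between H̃ and accuracy, the snippet below computes the Pearson correlation between the Avg. and H̃ columns using only the six K=2 rows of the table above. It is a six-point sanity check, not a reproduction of the correlation reported across configs in the paper.

```python
import numpy as np

# K=2 rows of the table above, in order:
# Entropy, Confidence, Margin, KLASS, LookUM, Info-Gain
avg_pass1 = np.array([28.4, 32.1, 26.7, 35.0, 45.1, 55.8])
cum_entropy = np.array([238.6, 204.1, 230.9, 210.3, 103.2, 74.1])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson r.
r = float(np.corrcoef(avg_pass1, cum_entropy)[0, 1])
print(f"Pearson r over the six K=2 rows: {r:.2f}")
```

On these rows the correlation is strongly negative: samplers with lower cumulative entropy also have higher average Pass@1.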

Text-to-Image · GenEval (MMaDa)

τ_token=0.4, 50-step cosine scheduler.

Method	Single ↑	Two ↑	Count ↑	Color ↑	Pos ↑	Attr ↑	Avg ↑
Uniform	94.1	66.7	38.4	78.2	19.0	28.8	54.2
Entropy	94.3	67.3	46.0	79.9	17.8	26.8	55.3
Confidence	93.8	69.7	46.3	81.9	16.0	27.0	56.0
Margin	94.0	68.7	47.3	80.1	19.0	29.0	56.3
Info-Gain	97.5	68.7	47.5	79.8	25.0	32.0	58.2

Creative Writing · Win-Rate (%)

Length-controlled win-rate against three baselines on AlpacaEval (SDAR-8B-Chat).

τ	K	vs Confidence	vs Entropy	vs Margin
0.5	1	65.8	59.1	63.6
0.5	2	68.9	70.4	64.7
1.0	1	57.7	60.1	55.2
1.0	2	61.1	65.7	57.5
1.5	1	53.0	60.3	54.6
1.5	2	70.1	80.3	66.8

05

Analysis

Why cumulative entropy is the right signal — and how Info-Gain reshapes the trajectory.

Cumulative entropy trajectories during decoding — **Cumulative entropy trajectories.** Info-Gain stabilizes uncertainty far earlier than the greedy Entropy baseline, producing globally cleaner decoding paths.

Scatter plot of accuracy versus cumulative entropy — **Accuracy vs. cumulative entropy.** Across configs, lower H̃ predicts higher accuracy (Pearson r = −0.70). Info-Gain points cluster in the low-H̃, high-accuracy region.

ImageNet-512 qualitative comparison — **Qualitative ImageNet-512 samples.** Same prompt, same model (MMaDa), only the sampler differs. Info-Gain produces sharper, more globally coherent compositions.

Temperature sensitivity of cumulative entropy — **Temperature sensitivity.** Greedy baselines blow up under temperature scaling. Info-Gain keeps trajectory uncertainty stable across position and token temperatures.

06

Cite

@inproceedings{yang2026improving,
  title     = {Improving Sampling for Masked Diffusion Models via Information Gain},
  author    = {Yang, Kaisen and Teoh, Jayden and Yang, Kaicheng
               and Zhang, Yitong and Lamb, Alex},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.18176}
}