🎉 News: Accepted to ICML 2026!
ICML 2026 · Masked Diffusion · Training-free Sampler

Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang¹, Jayden Teoh², Kaicheng Yang³, Yitong Zhang⁴, Alex Lamb¹

¹Tsinghua University  ·  ²Singapore Management University  ·  ³Shanghai Jiao Tong University  ·  ⁴Beihang University

TL;DR Existing MDM samplers are myopic: they pick the locally most certain token without thinking ahead. Info-Gain Sampler additionally rewards decisions that reduce uncertainty over the rest of the sequence, and consistently beats greedy baselines across reasoning, coding, image generation, and creative writing.
01

Abstract

Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models, but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty. Through failure-case analysis, we identify a fundamental limitation: such samplers neglect the downstream impact of current decoding choices and fail to minimize cumulative uncertainty.

We propose the Info-Gain Sampler, a principled, training-free decoding framework that balances immediate uncertainty with information gain over future masked tokens. Across reasoning, coding, creative writing, and image generation, Info-Gain Sampler consistently outperforms prior greedy baselines — including a concurrent lookahead method — while keeping practical overhead minimal.

02

Key Findings

Five headline numbers from the paper.

Reasoning · Avg. Pass@1
+11.6 pp

Largest gain on Semi-AR MDMs

On SDAR-8B-Chat, Info-Gain raises average Pass@1 over LookUM, the strongest concurrent lookahead baseline, from 54.6% to 66.2% at K=1 (+11.6 pp) and from 45.1% to 55.8% at K=2 (+10.7 pp).

Cumulative Entropy ↓
78.4 → 48.6

38% lower trajectory uncertainty

On reasoning tasks, cumulative entropy drops from 78.4 to 48.6, a 38% reduction relative to the best greedy baseline. Lower H̃ correlates with higher accuracy (Pearson r = −0.70).

Creative Writing · Win Rate
63.1%

Average win-rate vs. baselines

Across temperatures and configs, Info-Gain wins 63.1% of head-to-heads on AlpacaEval, peaking at 80.3% against Entropy at high stochasticity (τ = 1.5).

Text-to-Image · GenEval
58.2

+1.9 pp on GenEval, FID 43.3 → 38.1

On MMaDa with τ=0.4, Info-Gain raises GenEval avg. to 58.2 and improves ImageNet-512 FID by 5.2 points and IS by 9.7 points.

Practical Overhead
+24% time

Training-free, near-parallel

Acceleration via threshold γ = 0.8 keeps generation time within +24% and GPU memory within +20% of greedy samplers — no extra training, no KV-cache surgery.

Why it works
“Bidirectional attention lets MDMs look ahead. Greedy samplers throw that gift away. Info-Gain spends a tiny budget asking: which token, once revealed, makes the rest of the sequence easier?”
03

Method

Score each candidate by its immediate certainty plus the expected information gain it yields over the remaining masked tokens.

[Figure: Info-Gain Sampler workflow diagram]
At each decoding step, candidate actions are scored by immediate certainty plus the information gain they induce on the remaining masked positions. Acceleration via early-stopping keeps the per-step overhead low.

The myopia of greedy MDM samplers

Confidence/Entropy/Margin samplers commit to the locally easiest token at every step. They never ask: “if I commit here, do the other masks become easier or harder?” In MDMs, where attention is bidirectional, that question is cheap to answer.

The Info-Gain objective

For each candidate decoding action a, compute how much committing to a reduces the expected uncertainty over the remaining masked tokens. Add this future term to the usual immediate-certainty term and pick the highest-scoring action.
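
One way to write this objective down is sketched here. The notation is an assumption made for illustration and is not taken from the paper: a = (i, v) commits token v at masked position i, M is the set of currently masked positions, x ⊕ a is the sequence after committing a, and H is Shannon entropy. The equal weighting of the two terms is also an assumption; the paper may weight or normalize them differently.

```latex
\[
  J(a) \;=\;
  \underbrace{-\,H\!\big(p_\theta(x_i \mid x)\big)}_{\text{immediate certainty}}
  \;+\;
  \underbrace{\sum_{j \in M \setminus \{i\}}
    \Big[\, H\!\big(p_\theta(x_j \mid x)\big)
          - H\!\big(p_\theta(x_j \mid x \oplus a)\big) \Big]}_{\text{information gain on remaining masks}},
  \qquad
  a^\star = \arg\max_{a} J(a).
\]
```

Under this reading, a purely greedy Entropy sampler corresponds to keeping only the first term.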

Why it stays fast

  • Sample N = 8 candidates per step (not exhaustive).
  • Score them in a single forward pass, batched.
  • Skip lookahead when local certainty > γ = 0.8.
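
The three points above can be sketched as a single decoding step. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict`, `info_gain_step`, `toy_predict`, and the `MASK` marker are hypothetical names, the score simply adds negative entropy (immediate certainty) to the entropy reduction on the other masked positions, and the candidates are evaluated in a plain loop where a real sampler would batch them into one forward pass.

```python
import math
import random

MASK = None  # hypothetical marker for a position that is still masked


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def info_gain_step(predict, seq, n_candidates=8, gamma=0.8, rng=random):
    """Choose one (position, token) pair to commit.

    `predict(seq)` is an assumed interface standing in for one MDM forward
    pass: it returns {position: probabilities} for every masked position.
    """
    dists = predict(seq)
    base_h = {i: entropy(p) for i, p in dists.items()}

    # Fast path: skip lookahead when the most certain position exceeds gamma.
    greedy_pos = max(dists, key=lambda i: max(dists[i]))
    top_p = max(dists[greedy_pos])
    if top_p > gamma or len(dists) == 1:
        return greedy_pos, dists[greedy_pos].index(top_p)

    # Sample up to n_candidates distinct positions, with one sampled token each.
    positions = rng.sample(sorted(dists), k=min(n_candidates, len(dists)))
    best, best_score = None, -math.inf
    for i in positions:
        tok = rng.choices(range(len(dists[i])), weights=dists[i])[0]
        trial = list(seq)
        trial[i] = tok
        after = predict(trial)  # a real sampler would batch these calls
        gain = sum(base_h[j] - entropy(p) for j, p in after.items())
        score = -base_h[i] + gain  # immediate certainty + information gain
        if score > best_score:
            best, best_score = (i, tok), score
    return best


def toy_predict(seq):
    """Toy stand-in model over 3 binary positions; position 1 determines position 2."""
    out = {}
    if seq[0] is MASK:
        out[0] = [0.7, 0.3]
    if seq[1] is MASK:
        out[1] = [0.6, 0.4]
    if seq[2] is MASK:
        if seq[1] is MASK:
            out[2] = [0.5, 0.5]
        else:
            out[2] = [0.99, 0.01] if seq[1] == 0 else [0.01, 0.99]
    return out


pos, tok = info_gain_step(toy_predict, [MASK, MASK, MASK], rng=random.Random(0))
print(pos)  # 1: the pivot position wins, although position 0 is locally most certain
```

In this toy, a greedy confidence sampler would commit position 0 first (top probability 0.7), while the lookahead score prefers position 1 because revealing it nearly resolves position 2. Raising the certainty of position 0 above γ would trigger the fast path and skip lookahead entirely.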
04

Results

Best in bold. All numbers from the paper.

Semi-AR MDM Reasoning (Pass@1, %)

SDAR-8B-Chat with block size 16, τ_token = 0.7. H̃ = cumulative entropy (lower is better).

K   Sampler      GSM8K   MATH500   HumanEval   MBPP   Avg. ↑   H̃ ↓
2   Entropy      42.2    24.4      26.2        20.6   28.4     238.6
2   Confidence   47.2    36.6      24.4        20.2   32.1     204.1
2   Margin       45.2    22.4      19.5        19.8   26.7     230.9
2   KLASS        50.4    32.3      30.7        26.6   35.0     210.3
2   LookUM       75.3    44.9      28.2        31.8   45.1     103.2
2   Info-Gain    82.7    54.6      46.3        39.4   55.8     74.1
1   Entropy      68.8    44.6      37.8        49.0   34.9     120.4
1   Confidence   67.9    51.4      42.1        46.2   51.9     117.4
1   Margin       65.3    40.2      32.3        43.2   45.3     138.2
1   KLASS        69.9    42.3      45.7        46.6   51.1     105.3
1   LookUM       80.3    60.0      38.2        39.8   54.6     53.7
1   Info-Gain    87.9    61.8      62.2        53.0   66.2     41.0
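
The link between H̃ and accuracy can be checked on the rows of this table alone. The snippet below is a small sketch that computes the Pearson correlation between the Avg. and H̃ columns, with values copied from the table; `rows` and `pearson` are illustrative names. It covers only these 12 configurations, so it is not expected to reproduce the paper's reported r = −0.70, which is computed over the paper's own set of configurations.

```python
import math

# (Avg. Pass@1 in %, cumulative entropy H̃) for the 12 rows of the table above.
rows = [
    (28.4, 238.6), (32.1, 204.1), (26.7, 230.9),   # K=2: Entropy, Confidence, Margin
    (35.0, 210.3), (45.1, 103.2), (55.8, 74.1),    # K=2: KLASS, LookUM, Info-Gain
    (34.9, 120.4), (51.9, 117.4), (45.3, 138.2),   # K=1: Entropy, Confidence, Margin
    (51.1, 105.3), (54.6, 53.7), (66.2, 41.0),     # K=1: KLASS, LookUM, Info-Gain
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


acc = [a for a, _ in rows]
h_tilde = [h for _, h in rows]
r = pearson(acc, h_tilde)
print(f"Pearson r over the 12 table rows: {r:.2f}")
```

On these rows the correlation is strongly negative, in line with the claim that lower cumulative entropy goes with higher accuracy.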

Text-to-Image · GenEval (MMaDa)

τ_token = 0.4, 50-step cosine scheduler.

Method       Single ↑   Two ↑   Count ↑   Color ↑   Pos ↑   Attr ↑   Avg ↑
Uniform      94.1       66.7    38.4      78.2      19.0    28.8     54.2
Entropy      94.3       67.3    46.0      79.9      17.8    26.8     55.3
Confidence   93.8       69.7    46.3      81.9      16.0    27.0     56.0
Margin       94.0       68.7    47.3      80.1      19.0    29.0     56.3
Info-Gain    97.5       68.7    47.5      79.8      25.0    32.0     58.2

Creative Writing · Win-Rate (%)

Length-controlled win-rate against three baselines on AlpacaEval (SDAR-8B-Chat).

τ     K   vs Confidence   vs Entropy   vs Margin
0.5   1   65.8            59.1         63.6
0.5   2   68.9            70.4         64.7
1.0   1   57.7            60.1         55.2
1.0   2   61.1            65.7         57.5
1.5   1   53.0            60.3         54.6
1.5   2   70.1            80.3         66.8
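
The headline 63.1% average and the 80.3% peak can be cross-checked against this table. The sketch below takes the unweighted mean of the 18 cells; whether the paper aggregates in exactly this way is an assumption, and `win_rates` is an illustrative name.

```python
# Win rates (%) copied from the table above, keyed by (τ, K).
# Each tuple holds: vs Confidence, vs Entropy, vs Margin.
win_rates = {
    (0.5, 1): (65.8, 59.1, 63.6),
    (0.5, 2): (68.9, 70.4, 64.7),
    (1.0, 1): (57.7, 60.1, 55.2),
    (1.0, 2): (61.1, 65.7, 57.5),
    (1.5, 1): (53.0, 60.3, 54.6),
    (1.5, 2): (70.1, 80.3, 66.8),
}

cells = [v for row in win_rates.values() for v in row]
mean_win = sum(cells) / len(cells)

# Configuration whose row contains the single highest win rate.
peak_cfg, peak_row = max(win_rates.items(), key=lambda kv: max(kv[1]))

print(f"mean over {len(cells)} cells = {mean_win:.2f}%")
print(f"peak = {max(peak_row)}% at (τ, K) = {peak_cfg}")
```

The mean comes out at about 63.05%, consistent with the reported 63.1%, and the peak sits at τ = 1.5, K = 2 against the Entropy baseline.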
05

Analysis

Why cumulative entropy is the right signal — and how Info-Gain reshapes the trajectory.

[Figure: Cumulative entropy trajectories during decoding]
Cumulative entropy trajectories. Info-Gain stabilizes uncertainty far earlier than the greedy Entropy baseline, producing globally cleaner decoding paths.

[Figure: Scatter plot of accuracy versus cumulative entropy]
Accuracy vs. cumulative entropy. Across configs, lower H̃ predicts higher accuracy (Pearson r = −0.70). Info-Gain points cluster in the low-H̃, high-accuracy corner.

[Figure: ImageNet-512 qualitative comparison]
Qualitative ImageNet-512 samples. Same prompt, same model (MMaDa), only the sampler differs. Info-Gain produces sharper, more globally coherent compositions.

[Figure: Temperature sensitivity of cumulative entropy]
Temperature sensitivity. The cumulative entropy of greedy baselines rises sharply under temperature scaling, whereas Info-Gain keeps trajectory uncertainty stable across position and token temperatures.
06

Cite

@article{yang2026improving,
  title   = {Improving Sampling for Masked Diffusion Models via Information Gain},
  author  = {Yang, Kaisen and Teoh, Jayden and Yang, Kaicheng
             and Zhang, Yitong and Lamb, Alex},
  journal = {arXiv preprint arXiv:2602.18176},
  year    = {2026}
}