ICML 2026 accepted paper · arXiv updated May 22, 2026
ICML 2026 · Masked Diffusion Models · Training-free Decoding

Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang¹, Jayden Teoh², Kaicheng Yang³, Yitong Zhang⁴, Alex Lamb¹

¹Tsinghua University  ·  ²Singapore Management University  ·  ³Shanghai Jiao Tong University  ·  ⁴Beihang University

TL;DR Existing MDM samplers are myopic: they select locally certain tokens without accounting for downstream effects. Info-Gain Sampler scores decoding actions by both immediate uncertainty and information gained over remaining masked positions, improving reasoning accuracy by 2.9-11.6 pp and achieving a 62.8% average creative-writing win rate.
Training-free: No model updates
Bidirectional lookahead: One batched pass
Broad evaluation: Reasoning, code, text, image
01

Abstract

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation.

To address this, we propose the Info-Gain Sampler, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers.

02

Key Findings

Five headline numbers from the paper.

Reasoning · Avg. Pass@1
+11.6 pp

Largest gain on Semi-AR MDMs

On SDAR-8B-Chat, Info-Gain raises average accuracy over LookUM, the strongest concurrent lookahead baseline, from 54.6% to 66.2% at K=1 (+11.6 pp) and from 45.1% to 55.8% at K=2 (+10.7 pp).

Cumulative Entropy ↓
78.4 → 48.6

38% lower trajectory uncertainty

On reasoning tasks, cumulative entropy H̃ drops from 78.4 to 48.6, a 38% reduction, and sits at only ~50% of the best greedy baseline. Lower H̃ correlates with higher accuracy (Pearson r = −0.70).

Creative Writing · Win Rate
62.8%

Average win-rate vs. baselines

Across temperatures and configs, Info-Gain wins 62.8% of head-to-heads on AlpacaEval, peaking at 80.3% against Entropy at high stochasticity (τ = 1.5).

Text-to-Image · GenEval
58.2

+1.9 pp on GenEval, FID 43.3 → 38.1

On MMaDa with τ=0.4, Info-Gain raises GenEval avg. to 58.2 and improves ImageNet-512 FID by 5.2 points and IS by 9.7 points.

Practical Overhead
+24% time

Training-free, near-parallel

Acceleration via threshold γ = 0.8 keeps generation time within +24% and GPU memory within +20% of greedy samplers — no extra training, no KV-cache surgery.

Why it works
“Bidirectional attention lets MDMs look ahead. Greedy samplers throw that gift away. Info-Gain spends a tiny budget asking: which token, once revealed, makes the rest of the sequence easier?”
03

Method

Score each candidate by immediate cost plus expected future information gain.

[Figure: Info-Gain Sampler workflow diagram]
At each decoding step, candidate actions are scored by immediate certainty plus the information gain they induce on the remaining masked positions. Acceleration via early-stopping keeps the per-step overhead low.

The myopia of greedy MDM samplers

Confidence/Entropy/Margin samplers commit to the locally easiest token at every step. They never ask: “if I commit here, do the other masks become easier or harder?” In MDMs, where attention is bidirectional, that question is cheap to answer.
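
To make the three greedy rules concrete, here is a minimal sketch in plain Python. The function names, the `masked` dictionary, and its probability values are illustrative assumptions, not taken from the paper or from any library. Each rule looks only at one position's own predictive distribution, which is why none of them can notice how a commitment changes the other masked positions.

```python
import math

def confidence(p):
    """Probability of the most likely token."""
    return max(p)

def entropy(p):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def margin(p):
    """Gap between the two most likely tokens."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

def greedy_pick(dists, rule):
    """dists maps masked position -> token probabilities (illustrative format).
    Returns the position that looks locally easiest under the given rule."""
    if rule == "confidence":
        return max(dists, key=lambda i: confidence(dists[i]))
    if rule == "entropy":
        return min(dists, key=lambda i: entropy(dists[i]))
    if rule == "margin":
        return max(dists, key=lambda i: margin(dists[i]))
    raise ValueError(f"unknown rule: {rule}")

# Made-up distributions for three masked positions.
masked = {
    3: [0.90, 0.05, 0.05],
    7: [0.50, 0.45, 0.05],
    9: [0.40, 0.30, 0.30],
}
picks = {rule: greedy_pick(masked, rule) for rule in ("confidence", "entropy", "margin")}
print(picks)
```

In this toy case all three rules agree on position 3, and none of them consults what happens to positions 7 and 9 afterwards.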

The Info-Gain objective

For each candidate decoding action a, score it by the reduction in expected uncertainty over the remaining masked tokens after committing a. Combine this future term with the usual immediate certainty term and pick the highest-scoring action.
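
One plausible reading of this objective, sketched in plain Python: the score of an action is the entropy reduction it induces on the remaining masked positions, minus the entropy of the position being committed (the immediate cost). The equal weighting of the two terms, the function names, and the hard-coded `lookahead` table standing in for a model forward pass are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of one categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def total_entropy(dists, skip=()):
    """Summed entropy over masked positions, ignoring those in `skip`."""
    return sum(entropy(p) for pos, p in dists.items() if pos not in skip)

def info_gain_score(action, dists_before, predict_after):
    """Assumed score: (entropy removed from the other masks) - (immediate cost).
    `predict_after(action)` is a hypothetical hook returning the distributions
    of the remaining masked positions once `action` has been committed."""
    pos, _token = action
    immediate_cost = entropy(dists_before[pos])
    before = total_entropy(dists_before, skip=(pos,))
    after = total_entropy(predict_after(action))
    return (before - after) - immediate_cost

# Made-up numbers: committing position 1 sharpens positions 0 and 2,
# while committing position 0 or 2 leaves the other masks unchanged.
dists_before = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.5, 0.5]}
lookahead = {
    (0, 0): {1: [0.6, 0.4], 2: [0.5, 0.5]},
    (1, 0): {0: [0.95, 0.05], 2: [0.99, 0.01]},
    (2, 0): {0: [0.9, 0.1], 1: [0.6, 0.4]},
}

scores = {a: info_gain_score(a, dists_before, lookahead.get) for a in lookahead}
best_action = max(scores, key=scores.get)
greedy_position = min(dists_before, key=lambda pos: entropy(dists_before[pos]))
print("info-gain picks", best_action, "| greedy entropy picks position", greedy_position)
```

With these toy values the greedy entropy rule commits position 0, the locally easiest one, whereas the lookahead score prefers position 1 because it makes the other two positions much easier.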

Why it stays fast

  • Sample N = 8 candidates per step (not exhaustive).
  • Score them in a single forward pass, batched.
  • Skip lookahead when local certainty > γ = 0.8.
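
The three points above can be sketched as a single decoding step. This is a rough illustration under stated assumptions: "local certainty" is taken to be the largest token probability at any masked position, candidates are drawn by picking a position uniformly and a token from its distribution, and `score_batch` is a hypothetical callable standing in for the one batched forward pass. None of these names come from the paper.

```python
import random

N_CANDIDATES = 8   # N from the list above
GAMMA = 0.8        # certainty threshold from the list above

def decode_step(dists, score_batch, rng):
    """dists maps masked position -> token probabilities (illustrative format).
    Returns (action, used_lookahead) where action = (position, token)."""
    # Fast path: if some position is already certain enough, commit greedily.
    best_pos = max(dists, key=lambda i: max(dists[i]))
    best_prob = max(dists[best_pos])
    if best_prob > GAMMA:
        return (best_pos, dists[best_pos].index(best_prob)), False

    # Otherwise sample a small candidate set instead of enumerating all actions.
    positions = list(dists)
    candidates = []
    for _ in range(N_CANDIDATES):
        pos = rng.choice(positions)
        token = rng.choices(range(len(dists[pos])), weights=dists[pos])[0]
        candidates.append((pos, token))

    # One call scores every candidate together (stand-in for a batched pass).
    scores = score_batch(candidates)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], True

rng = random.Random(0)

confident = {0: [0.95, 0.05], 1: [0.5, 0.5]}
fast_action, fast_used_lookahead = decode_step(confident, score_batch=None, rng=rng)

uncertain = {0: [0.6, 0.4], 1: [0.5, 0.5]}
batch_sizes = []

def placeholder_scores(candidates):
    """Hypothetical scorer; a real one would return info-gain scores."""
    batch_sizes.append(len(candidates))
    return [max(uncertain[pos]) for pos, _ in candidates]

slow_action, slow_used_lookahead = decode_step(uncertain, placeholder_scores, rng)
print(fast_action, fast_used_lookahead, slow_action, slow_used_lookahead)
```

The scorer is invoked at most once per step and only when no position clears the threshold, which is the property that keeps the overhead small.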
04

Results

All numbers from the paper.

Semi-AR MDM Reasoning (Pass@1, %)

SDAR-8B-Chat with block size 16, τtoken=0.7. H̃ = cumulative entropy (lower is better).

K  Sampler     GSM8K  MATH500  HumanEval  MBPP  Avg. ↑    H̃ ↓
2  Entropy      42.2     24.4       26.2  20.6    28.4  238.6
2  Confidence   47.2     36.6       24.4  20.2    32.1  204.1
2  Margin       45.2     22.4       19.5  19.8    26.7  230.9
2  KLASS        50.4     32.3       30.7  26.6    35.0  210.3
2  LookUM       75.3     44.9       28.2  31.8    45.1  103.2
2  Info-Gain    82.7     54.6       46.3  39.4    55.8   74.1
1  Entropy      68.8     44.6       37.8  49.0    34.9  120.4
1  Confidence   67.9     51.4       42.1  46.2    51.9  117.4
1  Margin       65.3     40.2       32.3  43.2    45.3  138.2
1  KLASS        69.9     42.3       45.7  46.6    51.1  105.3
1  LookUM       80.3     60.0       38.2  39.8    54.6   53.7
1  Info-Gain    87.9     61.8       62.2  53.0    66.2   41.0
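
The headline reasoning gains can be recomputed from the Avg. column of the table above. The short check below copies those values as printed; the dictionary layout and the helper name are illustrative choices only.

```python
# Avg. Pass@1 (%) per sampler, copied from the table above, keyed by K.
avg_pass_at_1 = {
    2: {"Entropy": 28.4, "Confidence": 32.1, "Margin": 26.7,
        "KLASS": 35.0, "LookUM": 45.1, "Info-Gain": 55.8},
    1: {"Entropy": 34.9, "Confidence": 51.9, "Margin": 45.3,
        "KLASS": 51.1, "LookUM": 54.6, "Info-Gain": 66.2},
}

def gain_over_best_baseline(row, ours="Info-Gain"):
    """Return (strongest baseline, absolute gain in percentage points)."""
    baselines = {name: acc for name, acc in row.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(row[ours] - baselines[best], 1)

gains = {k: gain_over_best_baseline(row) for k, row in avg_pass_at_1.items()}
for k, (name, delta) in gains.items():
    print(f"K={k}: +{delta} pp over {name}")
```

In both settings the strongest baseline is LookUM, and the K=1 gap is the +11.6 pp figure quoted in the key findings.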

Text-to-Image · GenEval (MMaDa)

τtoken=0.4, 50-step cosine scheduler.

Method      Single ↑  Two ↑  Count ↑  Color ↑  Pos ↑  Attr ↑  Avg ↑
Uniform         94.1   66.7     38.4     78.2   19.0    28.8   54.2
Entropy         94.3   67.3     46.0     79.9   17.8    26.8   55.3
Confidence      93.8   69.7     46.3     81.9   16.0    27.0   56.0
Margin          94.0   68.7     47.3     80.1   19.0    29.0   56.3
Info-Gain       97.5   68.7     47.5     79.8   25.0    32.0   58.2

Creative Writing · Win-Rate (%)

Length-controlled win-rate against three baselines on AlpacaEval (SDAR-8B-Chat).

τ    K  vs Confidence  vs Entropy  vs Margin
0.5  1           65.8        59.1       63.6
0.5  2           68.9        70.4       64.7
1.0  1           57.7        60.1       55.2
1.0  2           61.1        65.7       57.5
1.5  1           53.0        60.3       54.6
1.5  2           70.1        80.3       66.8
05

Analysis

Why cumulative entropy is the right signal — and how Info-Gain reshapes the trajectory.

[Figure: Cumulative entropy trajectories during decoding]
Cumulative entropy trajectories. Info-Gain stabilizes uncertainty far earlier than the greedy Entropy baseline, producing globally cleaner decoding paths.

[Figure: Scatter plot of accuracy versus cumulative entropy]
Accuracy vs. cumulative entropy. Across configs, lower H̃ predicts higher accuracy (Pearson r = −0.70). Info-Gain points cluster in the bottom-right corner (low H̃, high accuracy).

[Figure: ImageNet-512 qualitative comparison]
Qualitative ImageNet-512 samples. Same prompt, same model (MMaDa), only the sampler differs. Info-Gain produces sharper, more globally coherent compositions.

[Figure: Temperature sensitivity of cumulative entropy]
Temperature sensitivity. Greedy baselines blow up under temperature scaling. Info-Gain keeps trajectory uncertainty stable across position and token temperatures.
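
For readers who want to see how an accuracy-versus-H̃ correlation of this kind is computed, the sketch below applies the standard Pearson formula to the twelve (Avg., H̃) pairs of the SDAR-8B-Chat reasoning table in the Results section. This is only an illustration on that one table: the paper's r = −0.70 is reported across configs, so the value printed here need not match it. The helper name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (Avg. Pass@1 %, cumulative entropy) pairs from the SDAR-8B-Chat table,
# K=2 rows first, then K=1 rows, in the order printed there.
rows = [
    (28.4, 238.6), (32.1, 204.1), (26.7, 230.9),
    (35.0, 210.3), (45.1, 103.2), (55.8, 74.1),
    (34.9, 120.4), (51.9, 117.4), (45.3, 138.2),
    (51.1, 105.3), (54.6, 53.7), (66.2, 41.0),
]
accuracy = [acc for acc, _ in rows]
cumulative_entropy = [h for _, h in rows]

r = pearson_r(accuracy, cumulative_entropy)
print(f"Pearson r on the SDAR-8B-Chat table rows: {r:.2f}")
```

A negative r means that samplers with lower cumulative entropy tend to score higher, which is the direction of the relationship described in the scatter-plot caption.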
06

Cite

@inproceedings{yang2026improving,
  title     = {Improving Sampling for Masked Diffusion Models via Information Gain},
  author    = {Yang, Kaisen and Teoh, Jayden and Yang, Kaicheng
               and Zhang, Yitong and Lamb, Alex},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.18176}
}