PixelDiT: Pixel Diffusion Transformers for Image Generation

¹NVIDIA  ²University of Rochester
†Project Lead and Main Advising

Say Goodbye to VAEs

Direct Pixel Space Optimization

Latent Diffusion Models (LDMs) such as Stable Diffusion rely on a Variational Autoencoder (VAE) to compress images into latents, and this compression is lossy.

  • × Lossy reconstruction: VAEs blur high-frequency details such as text and fine texture.
  • × Artifacts: compression artifacts can confuse the generation process.
  • × Misalignment: two-stage training creates a mismatch between the autoencoder and diffusion objectives.

Pixel Models change the game:

  • End-to-end: trained and sampled directly in pixel space.
  • High-fidelity editing: preserves fine details during editing.
  • Simplicity: a single-stage training pipeline.

Method: Dual-Level Architecture

We introduce a Dual-Level DiT Architecture to make pixel-space diffusion efficient.

PixelDiT uses a Patch-Level Pathway for global semantics and a Pixel-Level Pathway for texture refinement, connected via Pixel-wise AdaLN.
Text prompts feed semantic tokens through AdaLN-Zero into the joint DiT block; the patch-level MM-DiT handles global semantics while the pixel pathway performs dense refinement.

Our architecture employs Pixel Token Compaction to reduce the computational cost of attention over dense pixel tokens, and Pixel-wise AdaLN to condition per-pixel updates on the global semantic context; a rough sketch of both components follows.
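The exact layer definitions are given in the paper; the following minimal PyTorch sketch only illustrates the two ideas as described above. All module names, shapes, and the assumption that pixel tokens are ordered contiguously patch by patch are ours, not PixelDiT's released implementation.

import torch
import torch.nn as nn


class PixelwiseAdaLN(nn.Module):
    """Sketch of Pixel-wise AdaLN: per-pixel shift/scale/gate predicted from
    the corresponding patch-level semantic token (our assumptions)."""

    def __init__(self, pixel_dim: int, patch_dim: int, patch_size: int):
        super().__init__()
        self.pixels_per_patch = patch_size ** 2
        self.norm = nn.LayerNorm(pixel_dim, elementwise_affine=False)
        # One linear layer maps each patch token to per-pixel modulation params.
        self.to_mod = nn.Linear(patch_dim, 3 * pixel_dim)

    def forward(self, pixel_tokens: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # pixel_tokens: (B, N_pix, C_pix), with the p*p pixels of each patch contiguous.
        # patch_tokens: (B, N_patch, C_patch), where N_patch * p*p == N_pix.
        mod = self.to_mod(patch_tokens)                            # (B, N_patch, 3*C_pix)
        mod = mod.repeat_interleave(self.pixels_per_patch, dim=1)  # broadcast to each pixel
        shift, scale, gate = mod.chunk(3, dim=-1)
        return gate * (self.norm(pixel_tokens) * (1 + scale) + shift)


def compact_pixel_tokens(pixel_tokens: torch.Tensor, group: int) -> torch.Tensor:
    """Sketch of Pixel Token Compaction: fold each contiguous group of `group`
    pixel tokens into one token by channel concatenation, so attention runs
    over N_pix / group tokens instead of N_pix."""
    b, n, c = pixel_tokens.shape
    return pixel_tokens.reshape(b, n // group, group * c)


def expand_pixel_tokens(compact_tokens: torch.Tensor, group: int) -> torch.Tensor:
    """Inverse of compact_pixel_tokens: unfold compacted tokens back to pixels."""
    b, m, gc = compact_tokens.shape
    return compact_tokens.reshape(b, m * group, gc // group)

In this reading, attention would operate on the compacted tokens while Pixel-wise AdaLN is applied at full pixel resolution, so every pixel is modulated by the semantics of the patch it belongs to.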

Content Consistency in Image Editing

PixelDiT edits stay faithful!

FlowEdit exposes how VAE reconstruction in FLUX can warp fine text when tracing the full flow path (n_min = 0). The comparison shows PixelDiT keeping the brick-wall lettering intact because it denoises directly in pixel space: no lossy VAE, no baked-in reconstruction artifacts.

We simply plug the pretrained PixelDiT flow into FlowEdit and obtain clean local edits without re-rendering the entire scene; a schematic of this editing loop is sketched below.

FlowEdit comparison (n_min = 0): real image, FLUX, and PixelDiT outputs. PixelDiT keeps the wall typography legible, while FLUX smears characters after VAE reconstruction.

Prompt edit: “A bicycle parked on the sidewalk…” → “A motorcycle parked on the sidewalk…”
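For readers curious what "plugging the pretrained flow into FlowEdit" looks like, here is a simplified sketch of an inversion-free, velocity-difference editing loop in the spirit of FlowEdit with n_min = 0. The model(z, t, prompt) interface returning a flow velocity, the timestep schedule, and the n_avg averaging are our assumptions; the official FlowEdit algorithm also skips the earliest, noisiest steps via an n_max parameter, omitted here for brevity.

import torch


@torch.no_grad()
def flowedit_style_edit(model, x_src, c_src, c_tar, num_steps=50, n_avg=1):
    """Inversion-free editing sketch in the spirit of FlowEdit (n_min = 0).

    `model(z, t, prompt)` is assumed to return the flow velocity at time t;
    all names and the schedule are our assumptions, not an official API.
    """
    # Rectified-flow convention: t = 1 is pure noise, t = 0 is the clean image.
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    z_fe = x_src.clone()  # the evolving edited image, initialized at the source

    for n in range(num_steps):
        t, t_next = ts[n], ts[n + 1]
        delta_v = torch.zeros_like(x_src)
        for _ in range(n_avg):
            noise = torch.randn_like(x_src)
            # Forward-noise the source along the straight flow path.
            z_src = (1.0 - t) * x_src + t * noise
            # Couple the edited trajectory to the same noise realization.
            z_tar = z_fe + (z_src - x_src)
            # Edit direction: difference of target- and source-conditioned velocities.
            delta_v += model(z_tar, t, c_tar) - model(z_src, t, c_src)
        delta_v /= n_avg
        # Euler step on the edit trajectory (t decreases, so t_next - t < 0).
        z_fe = z_fe + (t_next - t) * delta_v

    return z_fe

Because the edited trajectory is driven only by the difference of the two velocity fields, regions where the source and target prompts agree receive almost no update, which is the mechanism that helps keep the background and the wall text intact.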

Performance

State-of-the-art on ImageNet 256×256

Method               Space    Params   GFLOPs   gFID (↓)   IS (↑)   Recall (↑)
DiT-XL               Latent   675M     238      2.27       278.2    0.57
SiT-XL               Latent   675M     238      2.06       270.3    0.59
REPA (SiT-XL)        Latent   675M     238      1.42       305.7    0.65
ADM-U                Pixel    554M     2240     4.59       186.7    0.52
PixelFlow-XL         Pixel    677M     5818     1.98       282.1    0.60
PixNerd-XL           Pixel    700M     268      1.93       298.0    0.60
JiT-G                Pixel    2B       766      1.82       292.6    0.62
PixelDiT-XL (Ours)   Pixel    797M     311      1.61       292.7    0.64

Comparison of class-conditioned generation on ImageNet 256×256. PixelDiT outperforms prior pixel-space models and closes the gap with latent models.

State-of-the-art on ImageNet 512×512

Method               Space    Params   gFID (↓)   sFID (↓)   IS (↑)   Recall (↑)
DiT-XL               Latent   675M     3.04       5.02       240.8    0.54
SiT-XL               Latent   675M     2.62       4.18       252.2    0.57
REPA (SiT-XL)        Latent   675M     2.08       4.19       274.6    0.58
ADM                  Pixel    554M     3.85       5.86       221.7    0.53
PixNerd-XL           Pixel    700M     2.84       5.95       245.6    0.59
EPG                  Pixel    583M     2.35       –          295.4    0.57
JiT-H                Pixel    956M     1.94       –          309.1    –
PixelDiT-XL (Ours)   Pixel    797M     1.80       5.53       279.4    0.66

Comparison of class-conditioned generation on ImageNet 512×512.

PixelDiT Podcast Deep Dive

A long-form conversation covering the motivation behind PixelDiT, architectural choices, and how pixel-space diffusion stacks up against latent pipelines.

Provided by AI Papers Slop (YouTube)


BibTeX

@article{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yu, Yongsheng and Xiong, Wei and Nie, Weili and Sheng, Yichen and Liu, Shiqiu and Luo, Jiebo},
  journal={arXiv preprint arXiv:2511.20645},
  year={2025}
}