PixelDiT: Pixel Diffusion Transformers for Image Generation

¹NVIDIA  ²University of Rochester
†Project Lead and Main Advising

Say Goodbye to VAEs

Direct Pixel Space Optimization

Latent Diffusion Models (LDMs) such as Stable Diffusion rely on a Variational Autoencoder (VAE) to compress images into latents, and this compression is lossy.

  • × Lossy reconstruction: VAEs blur high-frequency details such as text and fine texture.
  • × Artifacts: compression artifacts can confuse the generation process.
  • × Misalignment: two-stage training creates a mismatch between the autoencoder and diffusion objectives.

Pixel Models change the game:

  • End-to-end: trained and sampled directly in pixel space.
  • High-fidelity editing: preserves fine details during editing.
  • Simplicity: a single-stage training pipeline.

Method: Dual-Level Architecture

We introduce a Dual-Level DiT Architecture to make pixel-space diffusion efficient.

PixelDiT uses a Patch-Level Pathway for global semantics and a Pixel-Level Pathway for texture refinement, connected via Pixel-wise AdaLN.
Text prompts feed semantic tokens through AdaLN-Zero into the joint DiT block; the patch-level MM-DiT handles global semantics while the pixel pathway performs dense refinement.

Our architecture employs Pixel Token Compaction to reduce the computational cost of attention over dense pixel tokens, and Pixel-wise AdaLN to condition per-pixel updates on the global semantic context; a rough sketch of both components follows.
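The exact layer definitions are given in the paper; the following minimal PyTorch sketch only illustrates the two ideas as described above. All module names, shapes, and the assumption that pixel tokens are ordered contiguously patch by patch are ours, not PixelDiT's released implementation.

import torch
import torch.nn as nn


class PixelwiseAdaLN(nn.Module):
    """Sketch of Pixel-wise AdaLN: per-pixel shift/scale/gate predicted from
    the corresponding patch-level semantic token (our assumptions)."""

    def __init__(self, pixel_dim: int, patch_dim: int, patch_size: int):
        super().__init__()
        self.pixels_per_patch = patch_size ** 2
        self.norm = nn.LayerNorm(pixel_dim, elementwise_affine=False)
        # One linear layer maps each patch token to per-pixel modulation params.
        self.to_mod = nn.Linear(patch_dim, 3 * pixel_dim)

    def forward(self, pixel_tokens: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # pixel_tokens: (B, N_pix, C_pix), with the p*p pixels of each patch contiguous.
        # patch_tokens: (B, N_patch, C_patch), where N_patch * p*p == N_pix.
        mod = self.to_mod(patch_tokens)                            # (B, N_patch, 3*C_pix)
        mod = mod.repeat_interleave(self.pixels_per_patch, dim=1)  # broadcast to each pixel
        shift, scale, gate = mod.chunk(3, dim=-1)
        return gate * (self.norm(pixel_tokens) * (1 + scale) + shift)


def compact_pixel_tokens(pixel_tokens: torch.Tensor, group: int) -> torch.Tensor:
    """Sketch of Pixel Token Compaction: fold each contiguous group of `group`
    pixel tokens into one token by channel concatenation, so attention runs
    over N_pix / group tokens instead of N_pix."""
    b, n, c = pixel_tokens.shape
    return pixel_tokens.reshape(b, n // group, group * c)


def expand_pixel_tokens(compact_tokens: torch.Tensor, group: int) -> torch.Tensor:
    """Inverse of compact_pixel_tokens: unfold compacted tokens back to pixels."""
    b, m, gc = compact_tokens.shape
    return compact_tokens.reshape(b, m * group, gc // group)

In this reading, attention would operate on the compacted tokens while Pixel-wise AdaLN is applied at full pixel resolution, so every pixel is modulated by the semantics of the patch it belongs to.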

Content Consistency in Image Editing

PixelDiT edits stay faithful!

FlowEdit exposes how VAE reconstruction in FLUX can warp fine text when tracing the full flow path (n_min = 0). The comparison shows PixelDiT keeping the brick-wall lettering intact because it denoises directly in pixel space: no lossy VAE, no baked-in reconstruction artifacts.

We simply plug the pretrained PixelDiT flow into FlowEdit and obtain clean local edits without re-rendering the entire scene; a schematic of this editing loop is sketched below.

FlowEdit comparison (n_min = 0): real image, FLUX, and PixelDiT outputs. PixelDiT keeps the wall typography legible, while FLUX smears characters after VAE reconstruction.

Prompt edit: “A bicycle parked on the sidewalk…” → “A motorcycle parked on the sidewalk…”
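For readers curious what "plugging the pretrained flow into FlowEdit" looks like, here is a simplified sketch of an inversion-free, velocity-difference editing loop in the spirit of FlowEdit with n_min = 0. The model(z, t, prompt) interface returning a flow velocity, the timestep schedule, and the n_avg averaging are our assumptions; the official FlowEdit algorithm also skips the earliest, noisiest steps via an n_max parameter, omitted here for brevity.

import torch


@torch.no_grad()
def flowedit_style_edit(model, x_src, c_src, c_tar, num_steps=50, n_avg=1):
    """Inversion-free editing sketch in the spirit of FlowEdit (n_min = 0).

    `model(z, t, prompt)` is assumed to return the flow velocity at time t;
    all names and the schedule are our assumptions, not an official API.
    """
    # Rectified-flow convention: t = 1 is pure noise, t = 0 is the clean image.
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    z_fe = x_src.clone()  # the evolving edited image, initialized at the source

    for n in range(num_steps):
        t, t_next = ts[n], ts[n + 1]
        delta_v = torch.zeros_like(x_src)
        for _ in range(n_avg):
            noise = torch.randn_like(x_src)
            # Forward-noise the source along the straight flow path.
            z_src = (1.0 - t) * x_src + t * noise
            # Couple the edited trajectory to the same noise realization.
            z_tar = z_fe + (z_src - x_src)
            # Edit direction: difference of target- and source-conditioned velocities.
            delta_v += model(z_tar, t, c_tar) - model(z_src, t, c_src)
        delta_v /= n_avg
        # Euler step on the edit trajectory (t decreases, so t_next - t < 0).
        z_fe = z_fe + (t_next - t) * delta_v

    return z_fe

Because the edited trajectory is driven only by the difference of the two velocity fields, regions where the source and target prompts agree receive almost no update, which is the mechanism that helps keep the background and the wall text intact.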

Performance

State-of-the-art on ImageNet 256×256

Method               Space    Params   GFLOPs   gFID (↓)   IS (↑)   Recall (↑)
DiT-XL               Latent   675M     238      2.27       278.2    0.57
SiT-XL               Latent   675M     238      2.06       270.3    0.59
REPA (SiT-XL)        Latent   675M     238      1.42       305.7    0.65
ADM-U                Pixel    554M     2240     4.59       186.7    0.52
PixelFlow-XL         Pixel    677M     5818     1.98       282.1    0.60
PixNerd-XL           Pixel    700M     268      1.93       298.0    0.60
JiT-G                Pixel    2B       766      1.82       292.6    0.62
PixelDiT-XL (Ours)   Pixel    797M     311      1.61       292.7    0.64

Comparison of class-conditioned generation on ImageNet 256×256. PixelDiT outperforms prior pixel-space models and closes the gap with latent models.

State-of-the-art on ImageNet 512×512

Method               Space    Params   gFID (↓)   sFID (↓)   IS (↑)   Recall (↑)
DiT-XL               Latent   675M     3.04       5.02       240.8    0.54
SiT-XL               Latent   675M     2.62       4.18       252.2    0.57
REPA (SiT-XL)        Latent   675M     2.08       4.19       274.6    0.58
ADM                  Pixel    554M     3.85       5.86       221.7    0.53
PixNerd-XL           Pixel    700M     2.84       5.95       245.6    0.59
EPG                  Pixel    583M     2.35       –          295.4    0.57
JiT-H                Pixel    956M     1.94       –          309.1    –
PixelDiT-XL (Ours)   Pixel    797M     1.80       5.53       279.4    0.66

Comparison of class-conditioned generation on ImageNet 512×512.

PixelDiT Podcast Deep Dive

A long-form conversation covering the motivation behind PixelDiT, architectural choices, and how pixel-space diffusion stacks up against latent pipelines.

Provided by AI Papers Slop (YouTube)


BibTeX

@article{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yu, Yongsheng and Xiong, Wei and Nie, Weili and Sheng, Yichen and Liu, Shiqiu and Luo, Jiebo},
  journal={arXiv preprint arXiv:2511.20645},
  year={2025}
}