Diffusion Transformers (DiTs), Simply Explained
AI, But Simple Issue #44

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
The diffusion transformer (DiT), proposed by William Peebles at UC Berkeley and Saining Xie at New York University, is a type of generative AI model that blends two powerful machine learning techniques: diffusion models and transformers.

DiT Architecture
The original paper, “Scalable Diffusion Models with Transformers”, was released in 2022. Feel free to check it out on arXiv!
What is diffusion?
Diffusion refers to the “scattering” of particles. An example is sunlight passing through clouds: the rays scatter and spread out in different directions. This diffusion arises from the random motion of the light particles.
This is similar to what happens in diffusion models (Ho et al., 2020) used for image generation. Random noise is added to the image, progressively turning it into scattered pixels of color that deviate further and further from the original.
Specifically, diffusion models add Gaussian noise to training data successively (forward diffusion process), then learn to reconstruct the data by reversing this noising process (reverse diffusion process).
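The forward process described above has a convenient closed form: a noisy version of an image at any step t can be sampled directly, without looping through every intermediate step. Here is a minimal sketch in NumPy, following the notation of Ho et al. (2020); the linear noise schedule and step count are the paper's defaults, and the 8×8 array is a toy stand-in for a real image.

```python
import numpy as np

T = 1000                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # cumulative products, \bar{alpha}_t

def forward_diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)     # Gaussian noise
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))            # toy "image"
x_noisy = forward_diffuse(x0, t=999, rng=rng)
```

By the final step, `alpha_bars[-1]` is nearly zero, so `x_noisy` is almost pure Gaussian noise with no trace of the original image left.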

After training, these models can generate high-quality images starting from random noise—iteratively removing it to create full images of just about anything they were trained on.
What are transformers?
A transformer model (Vaswani et al., 2017) is a type of neural network architecture that has become the foundation for many modern natural language processing (NLP) tasks, such as chatbots, translation, and text generation. It is the core architecture behind GPT-3.
The main innovation behind the transformer is the self-attention mechanism, an extension of the attention mechanism originally used with RNNs for machine translation. Self-attention is extremely useful, allowing the model to learn global relationships across the entire input.
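Scaled dot-product self-attention can be sketched in a few lines: every token produces a query, key, and value, and each position's output is a weighted average of all values. The random weight matrices below are stand-ins for learned parameters; this is a single-head sketch, not a full transformer layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Every position attends to every other."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((5, d))               # 5 toy tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```

Because the attention weights connect every token to every other token, the mechanism captures global relationships in one step—this is exactly why DiTs can use it on sequences of image patches instead of text tokens.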