Introduction: The New Frontier of Generation
Diffusion models have surpassed GANs as the state of the art in image generation, powering systems like DALL-E, Stable Diffusion, and Midjourney. The underlying idea is surprisingly simple: a forward process gradually adds Gaussian noise to an image until it is completely destroyed; a neural network then learns the reverse process, removing noise step by step to reconstruct the original image (or generate a new one).
Unlike GANs, diffusion models offer stable training, greater diversity in results, and a rigorous probabilistic framework. The trade-off is a slower generation process (hundreds of denoising steps), mitigated by techniques like DDIM and latent diffusion.
What You Will Learn
- The forward process: how noise progressively destroys an image
- The reverse process: how a neural network learns to remove noise
- DDPM: Denoising Diffusion Probabilistic Models
- DDIM: faster generation with fewer steps
- Text conditioning: generating images from text descriptions with CLIP
- Stable Diffusion: diffusion in latent space
- Practical implementation with Hugging Face Diffusers
The Forward Process: Adding Noise
The forward process is a Markov chain that progressively adds Gaussian noise to the image over T steps. At each step t, a small amount of noise is added according to a predefined schedule (linear or cosine). After sufficient steps (typically T=1000), the original image is completely destroyed and becomes pure Gaussian noise.
A fundamental property is that we can jump directly to any step t without computing all the intermediate steps, thanks to the closed-form expression x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε, where ᾱ_t is the cumulative product of the α values up to step t and ε is standard Gaussian noise:
import torch
import torch.nn as nn

class DiffusionSchedule:
    """Schedule for the diffusion process"""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear beta schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        # Cumulative alpha: product of all alphas up to t
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alpha_cumprod = torch.sqrt(self.alpha_cumprod)
        self.sqrt_one_minus_alpha_cumprod = torch.sqrt(1.0 - self.alpha_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Forward process: q(x_t | x_0) - adds noise to x_0 in one shot"""
        if noise is None:
            noise = torch.randn_like(x_0)
        sqrt_alpha = self.sqrt_alpha_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus = self.sqrt_one_minus_alpha_cumprod[t].view(-1, 1, 1, 1)
        # x_t = sqrt(alpha_cumprod_t) * x_0 + sqrt(1 - alpha_cumprod_t) * noise
        return sqrt_alpha * x_0 + sqrt_one_minus * noise

# Demo: progressive noise
schedule = DiffusionSchedule()
image = torch.randn(1, 3, 64, 64)  # Stand-in for an original image (random tensor)
for t in [0, 250, 500, 750, 999]:
    t_tensor = torch.tensor([t])
    noisy = schedule.add_noise(image, t_tensor)
    print(f"Step {t}: noise level = {schedule.sqrt_one_minus_alpha_cumprod[t]:.4f}")
The Reverse Process: Removing Noise
The reverse process is where the magic happens. A neural network (typically a U-Net) learns to predict the noise added at each step. Starting from pure Gaussian noise, the model gradually removes noise over T steps, generating a coherent image.
The training objective is simple: minimize the difference between the actual noise added and the noise predicted by the network. This MSE loss on noise has proven extremely effective.
class SimpleUNet(nn.Module):
    """Simplified U-Net for noise prediction"""

    def __init__(self, channels=3, time_emb_dim=256):
        super().__init__()
        # Time embedding: transforms the timestep into a vector
        # (time_emb_dim must match enc3's channel count, 256, see forward)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )
        # Encoder
        self.enc1 = self._block(channels, 64)
        self.enc2 = self._block(64, 128)
        self.enc3 = self._block(128, 256)
        # Bottleneck
        self.bottleneck = self._block(256, 512)
        # Decoder with skip connections
        self.dec3 = self._block(512 + 256, 256)
        self.dec2 = self._block(256 + 128, 128)
        self.dec1 = self._block(128 + 64, 64)
        self.final = nn.Conv2d(64, channels, 1)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU()
        )

    def forward(self, x, t):
        # Embed the timestep and inject it at the deepest encoder level
        # (full U-Nets inject sinusoidal time embeddings into every block)
        t_emb = self.time_mlp(t.float().unsqueeze(-1))
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e3 = e3 + t_emb[:, :, None, None]  # broadcast over spatial dims
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.final(d1)

# Training: predict the noise added at a random timestep
model = SimpleUNet()
schedule = DiffusionSchedule()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(x_0):
    t = torch.randint(0, schedule.num_timesteps, (x_0.size(0),))
    noise = torch.randn_like(x_0)
    x_t = schedule.add_noise(x_0, t, noise)
    noise_pred = model(x_t, t)
    loss = nn.functional.mse_loss(noise_pred, noise)
    return loss
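The training loop above teaches the network to predict noise; at generation time the process is run in reverse. Below is a minimal sketch of DDPM ancestral sampling, written against any noise predictor with the signature model(x, t). The `dummy` predictor and the small tensor shapes are stand-ins so the sketch runs on its own; in practice you would pass the trained U-Net.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """DDPM ancestral sampling: start from pure noise, denoise step by step."""
    betas = torch.linspace(beta_start, beta_end, num_timesteps)
    alphas = 1.0 - betas
    alpha_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # x_T: pure Gaussian noise
    for t in reversed(range(num_timesteps)):
        t_tensor = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_tensor)  # predicted noise at step t
        # Posterior mean: subtract the predicted noise, rescaled by the schedule
        coef = betas[t] / torch.sqrt(1.0 - alpha_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # add fresh noise
        else:
            x = mean  # final step is deterministic
    return x

# Stand-in "model" that predicts zero noise, just to exercise the loop
dummy = lambda x, t: torch.zeros_like(x)
sample = ddpm_sample(dummy, (1, 3, 8, 8), num_timesteps=50)
print(sample.shape)  # torch.Size([1, 3, 8, 8])
```

Note that every iteration except the last injects fresh Gaussian noise: this stochasticity is what makes DDPM sampling a Markov chain, and it is exactly what DDIM removes.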
DDPM and DDIM: Speed vs Quality
DDPM
DDPM (Denoising Diffusion Probabilistic Models) uses all T steps in the sampling process, ensuring high quality but requiring hundreds or thousands of network evaluations.
DDIM
DDIM (Denoising Diffusion Implicit Models) accelerates sampling by using a non-Markovian process that allows skipping steps. With only 50-100 steps it achieves quality comparable to DDPM with 1000 steps, reducing generation time by 10-20 times.
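The deterministic DDIM update (η = 0) can be sketched as follows: the model's noise prediction is first used to estimate the clean image, which is then re-noised directly to an earlier timestep, skipping everything in between. The schedule values, the `dummy` noise predictor, and the stride of 20 (picking 50 of the 1000 timesteps) are illustrative stand-ins.

```python
import torch

def ddim_step(x_t, eps_pred, alpha_cumprod, t, t_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to t_prev."""
    a_t = alpha_cumprod[t]
    a_prev = alpha_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    # Estimate the clean image implied by the current noise prediction
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps_pred) / torch.sqrt(a_t)
    # Jump directly to t_prev by re-noising that estimate with the same eps
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps_pred

# 50-step sampling over a 1000-step schedule: the network is evaluated
# only at every 20th timestep instead of all 1000.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_cumprod = torch.cumprod(1.0 - betas, dim=0)
timesteps = list(range(999, -1, -20))  # 50 of the 1000 steps

x = torch.randn(1, 3, 8, 8)
dummy = lambda x, t: torch.zeros_like(x)  # stand-in noise predictor
for i, t in enumerate(timesteps):
    t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
    x = ddim_step(x, dummy(x, t), alpha_cumprod, t, t_prev)
print(x.shape)
```

Because no fresh noise is injected, the same starting x_T always maps to the same image, which also makes DDIM useful for interpolation in noise space.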
Stable Diffusion: Diffusion in Latent Space
Stable Diffusion applies the diffusion process not in pixel space but in a compressed latent space (typically 64x64 instead of 512x512). An autoencoder (VAE) compresses the image into latent space, diffusion operates in this reduced space, and the decoder reconstructs the final image. This reduces computational requirements by approximately 50 times, making generation possible on consumer GPUs.
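The compression claim can be sanity-checked with a little bookkeeping, assuming Stable Diffusion v1's VAE, which downsamples each spatial side by a factor of 8 and uses 4 latent channels:

```python
# Dimensionality bookkeeping behind latent diffusion (Stable Diffusion v1):
# a 512x512 RGB image becomes a 64x64x4 latent tensor.
pixel_elems = 512 * 512 * 3                  # image in pixel space
latent_elems = (512 // 8) * (512 // 8) * 4   # 64 x 64 x 4 latent
ratio = pixel_elems / latent_elems
print(pixel_elems, latent_elems, ratio)  # 786432 16384 48.0
```

The 48x reduction in elements is where the "approximately 50 times" figure comes from; the actual compute saving also depends on the U-Net architecture.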
Text-to-Image: Conditioning with CLIP
Text-to-image generation uses CLIP (Contrastive Language-Image Pre-training) to encode the text prompt into an embedding that guides the denoising process. Classifier-free guidance balances prompt adherence and diversity: higher values produce images more faithful to the text but less varied.
from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image from text
prompt = "A serene Japanese garden with cherry blossoms, digital art"
image = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5  # Classifier-free guidance strength
).images[0]
image.save("japanese_garden.png")
print(f"Generated image: {image.size}")
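Under the hood, the guidance_scale parameter controls classifier-free guidance: the U-Net is evaluated twice per step, once with the text embedding and once unconditionally, and the two noise predictions are blended. A minimal sketch of the blending rule, using stand-in tensors in place of real model outputs:

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    # Blend the unconditional and text-conditioned noise predictions:
    # scale > 1 extrapolates toward the prompt, at the cost of diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = torch.zeros(1, 4, 64, 64)  # stand-in unconditional prediction
eps_cond = torch.ones(1, 4, 64, 64)     # stand-in text-conditioned prediction
eps = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
print(eps.mean().item())  # 7.5
```

A scale of 1.0 recovers the plain conditional prediction; typical values of 7 to 8 push well past it, which is why high scales trade diversity for prompt adherence.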
ControlNet and the Future
ControlNet adds precise spatial control to generation: hand-drawn sketches, depth maps, human poses, and Canny edges can guide generation, maintaining the desired composition while the model adds details and style.
The field of generative models evolves rapidly: consistency models promise single-step generation, video diffusion generates coherent videos, and multimodal models combine text, images, and audio in a unified framework.
Next Steps in the Series
- In the next article we will explore Reinforcement Learning
- We will see how agents learn from rewards: Q-Learning, DQN, and PPO
- We will implement an agent that learns to play with OpenAI Gymnasium