Introduction: Creating Images from Nothing
Image generation is one of the most spectacular results of generative AI. Describing an image in natural language and seeing it materialize in seconds was science fiction just a few years ago. Today, tools like Stable Diffusion, DALL-E, and Midjourney make it accessible to anyone.
But how do these models actually work? This article breaks down the architecture of Diffusion Models, explains the differences between the main tools, and offers practical prompt engineering techniques for professional-quality images.
What You'll Learn in This Article
- How Diffusion Models work: from noise to image
- Stable Diffusion architecture: Text Encoder, UNet, and VAE
- Practical comparison: Stable Diffusion vs DALL-E vs Midjourney
- Text-to-image, image-to-image, and inpainting
- Image-specific prompt engineering
- Copyright and ethical considerations
How Diffusion Models Work
Diffusion Models are based on a counter-intuitive principle: destroying an image by progressively adding noise, and then training a neural network to reverse the process. If the model learns to remove noise effectively, it can start from pure noise and generate a completely new image.
Forward Process: Adding Noise
In the forward process, you start from a real image and add Gaussian noise in T progressive steps. After enough steps, the original image is completely unrecognizable: only random noise remains. This process is fixed (it has no learned parameters) and requires no training.
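There is a convenient closed form for this: x_t can be sampled directly from x_0 in one shot as x_t = sqrt(alpha-bar_t) * x_0 + sqrt(1 - alpha-bar_t) * eps, where alpha-bar_t is the cumulative product of the per-step signal fractions. A minimal NumPy sketch, with an illustrative linear beta schedule (the constants are typical DDPM defaults, not tied to any specific model):

```python
import numpy as np

# Linear noise schedule: beta_t grows from 1e-4 to 0.02 over T steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha-bar_t: how much signal survives at step t

def add_noise(x0, t, rng):
    """Sample x_t directly from x_0 (closed-form forward process)."""
    eps = rng.standard_normal(x0.shape)  # fresh Gaussian noise
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))   # stand-in for a normalized image
x_early, _ = add_noise(x0, 10, rng)  # early step: still mostly signal
x_late, _ = add_noise(x0, 999, rng)  # late step: almost pure noise
print(alpha_bars[10], alpha_bars[999])
```

Note how `alpha_bars` decays toward zero: by the last step almost no signal remains, which is exactly why the fully noised image is unrecognizable.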
Reverse Process: Removing Noise
The reverse process is where the magic happens. A neural network (typically a UNet) is trained to predict the noise to remove at each step. Starting from pure noise, the model iteratively removes noise for T steps, generating a coherent image.
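One denoising step can be sketched as follows. The dummy predictor stands in for the trained UNet (which would predict the actual noise), and the update rule is the standard DDPM posterior mean plus a small amount of fresh noise; all constants are illustrative:

```python
import numpy as np

# Same illustrative schedule as in the forward process
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def dummy_noise_predictor(xt, t):
    # Stand-in for the trained UNet: in practice this network
    # predicts the noise present in x_t; here it returns zeros.
    return np.zeros_like(xt)

def reverse_step(xt, t, rng):
    """One DDPM denoising step: compute x_{t-1} from x_t."""
    eps_hat = dummy_noise_predictor(xt, t)
    mean = (xt - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                   # final step: no noise is added
    sigma = np.sqrt(betas[t])         # a simple variance choice
    return mean + sigma * rng.standard_normal(xt.shape)

# Full sampling loop: start from pure noise, walk back from t = T-1 to t = 0
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
for t in reversed(range(T)):
    x = reverse_step(x, t, rng)
print(x.shape)
```

With a real trained predictor, this same loop is what turns pure noise into a coherent image; samplers like DDIM change the update rule to get away with far fewer than T steps.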
Stable Diffusion Architecture
Stable Diffusion operates in latent space, not pixel space. A VAE Encoder compresses the image (512x512 pixels) into a compact latent representation (64x64). The UNet works on this reduced space, making the process much more efficient. A Text Encoder (CLIP) converts the text prompt into embeddings that guide generation. Finally, the VAE Decoder converts the latent representation back to a full-resolution image.
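To see why working in latent space pays off, compare the raw sizes. The 8x spatial compression and the 4-channel latent are the standard Stable Diffusion configuration; the rest is just arithmetic:

```python
# Pixel space: a 512x512 RGB image
pixel_values = 512 * 512 * 3

# Latent space: the VAE compresses each spatial dimension by 8x
# and uses 4 channels (standard for Stable Diffusion)
latent_values = 64 * 64 * 4

print(pixel_values)                  # 786432
print(latent_values)                 # 16384
print(pixel_values / latent_values)  # 48.0
```

The UNet therefore denoises roughly 48x fewer values per step than a pixel-space model would, which is what makes generation feasible on consumer GPUs.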
Stable Diffusion: Open Source and Self-Hosted
Stable Diffusion is the reference model for open source image generation. Released by Stability AI in 2022, it can be downloaded for free and run on consumer hardware: with a GPU that has 8+ GB of VRAM, you can generate high-quality images locally.
```python
# Image generation with Stable Diffusion XL and diffusers
from diffusers import StableDiffusionXLPipeline
import torch

# Load the model (the first run downloads several GB of weights)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # Half precision to save VRAM
    variant="fp16",
    use_safetensors=True
)
pipe = pipe.to("cuda")

# Generate an image
prompt = "A serene Japanese garden at sunset, koi pond with lily pads, cherry blossom trees, soft golden light, photorealistic, 8k"
negative_prompt = "blurry, low quality, distorted, ugly, deformed"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,  # More steps = better quality (20-50)
    guidance_scale=7.5,      # How closely to follow the prompt (5-15)
    width=1024,
    height=1024
).images[0]

image.save("japanese_garden.png")
print("Image generated!")
```
DALL-E: OpenAI's API Solution
DALL-E 3 by OpenAI offers excellent quality through a simple API. It requires no local GPU or complex setup, but has a per-image cost.
```python
# Image generation with DALL-E 3 via API
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic cityscape at night with neon lights reflecting on wet streets, cyberpunk style, dramatic lighting",
    size="1024x1024",  # 1024x1024, 1024x1792, 1792x1024
    quality="hd",      # "standard" or "hd"
    n=1                # Number of images (DALL-E 3: only 1)
)

image_url = response.data[0].url
revised_prompt = response.data[0].revised_prompt
print(f"Image URL: {image_url}")
print(f"Revised prompt by DALL-E: {revised_prompt}")

# DALL-E 3 pricing:
# Standard 1024x1024: $0.040/image
# HD 1024x1024:       $0.080/image
# HD 1024x1792:       $0.120/image
```
Comparison: Stable Diffusion vs DALL-E vs Midjourney
Image Generation Tools Comparison
| Feature | Stable Diffusion | DALL-E 3 | Midjourney |
|---|---|---|---|
| Type | Open source, self-hosted | Proprietary API | SaaS (Discord/Web) |
| Cost | Free (your hardware) | $0.04-0.12/image | $10-60/month |
| Quality | High (depends on model) | Very high | Excellent (artistic style) |
| Customization | Total (LoRA, ControlNet) | Limited | Medium (style parameters) |
| Data privacy | Total (local) | Data sent to OpenAI | Data sent to Midjourney |
| Speed | Depends on GPU | ~15-30 seconds | ~30-60 seconds |
| API integration | Yes (local or cloud) | Yes (REST API) | No (Discord/Web only) |
Generation Modes
Diffusion models support several generation modes, each with specific applications.
Text-to-Image
The base mode: from a text prompt, generates a completely new image. It's the most widely used mode and the one that captured the public's imagination.
Image-to-Image
Starts from an existing image and transforms it following a prompt. Useful for stylistic variations, quality improvement, or transforming sketches into complete images.
```python
# Image-to-image: transform an existing image
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load the starting image
init_image = Image.open("sketch.png").resize((1024, 1024))

# Transform: from sketch to realistic image
result = pipe(
    prompt="Detailed architectural rendering of a modern house, photorealistic, professional photography",
    image=init_image,
    strength=0.75,  # How much to modify the original (0-1)
    guidance_scale=7.5,
    num_inference_steps=30
).images[0]

result.save("house_rendering.png")
```
Inpainting
Inpainting allows modifying specific regions of an image while keeping the rest intact. Useful for removing objects, adding elements, or correcting defects.
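As a sketch, inpainting in diffusers pairs the original image with a black-and-white mask in which white marks the region to regenerate. The mask construction below is fully runnable with Pillow alone; the pipeline call is wrapped in a helper because it needs a CUDA GPU and a model download, and the checkpoint name is one common choice among several:

```python
from PIL import Image, ImageDraw

# Build the inpainting mask: white pixels mark the region to regenerate,
# black pixels are kept unchanged from the original image
mask = Image.new("L", (512, 512), 0)            # start all black: keep everything
draw = ImageDraw.Draw(mask)
draw.rectangle([150, 150, 360, 360], fill=255)  # white box: repaint this area
mask.save("mask.png")

def inpaint(photo_path: str, prompt: str):
    """Repaint the masked region of an image (requires a CUDA GPU)."""
    import torch
    from diffusers import StableDiffusionInpaintPipeline
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting",
        torch_dtype=torch.float16
    ).to("cuda")
    init = Image.open(photo_path).resize((512, 512))
    return pipe(prompt=prompt, image=init, mask_image=mask,
                num_inference_steps=30).images[0]

# Example (run on a machine with a GPU):
# inpaint("photo.png", "a stone fountain, photorealistic").save("inpainted.png")
```

Soft-edged (feathered) masks usually blend the regenerated region more naturally than the hard rectangle used here.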
Prompt Engineering for Images
Prompt engineering for images follows different rules than text prompting. The typical structure includes subject, style, quality, and technical parameters.
Structure of an Effective Image Prompt
- Subject: what you want in the image ("a red fox sitting on a rock")
- Environment: where it is ("in a misty forest at dawn")
- Style: which artistic style ("oil painting style", "photorealistic", "anime")
- Lighting: type of light ("soft golden hour light", "dramatic side lighting")
- Quality: technical keywords ("8k resolution", "highly detailed", "sharp focus")
- Composition: how it's framed ("close-up portrait", "wide landscape shot")
```python
# Image prompt builder
def build_image_prompt(
    subject: str,
    environment: str = "",
    style: str = "photorealistic",
    lighting: str = "natural light",
    quality: str = "8k, highly detailed, sharp focus",
    extra: str = ""
) -> dict:
    """Build a structured prompt for image generation."""
    parts = [subject]
    if environment:
        parts.append(environment)
    parts.extend([style, lighting, quality])
    if extra:
        parts.append(extra)
    prompt = ", ".join(parts)

    # Standard negative prompt for high quality
    negative = "blurry, low quality, distorted, deformed, ugly, watermark, text, signature, cropped, worst quality, low resolution"

    return {"prompt": prompt, "negative_prompt": negative}

# Examples of structured prompts
prompts = [
    build_image_prompt(
        subject="A majestic snow leopard",
        environment="on a Himalayan mountain peak",
        style="National Geographic photography",
        lighting="dramatic sunset backlighting"
    ),
    build_image_prompt(
        subject="A cozy Italian cafe interior",
        environment="narrow cobblestone street visible through window",
        style="warm watercolor illustration",
        lighting="soft warm afternoon light"
    )
]

for p in prompts:
    print(f"Prompt: {p['prompt']}\n")
```
Customization: LoRA and ControlNet
Stable Diffusion offers two powerful mechanisms for customizing generation:
- LoRA for images: trains an adapter on a few images of a specific style or subject (e.g., your face, a brand style). Result: the model generates images in that specific style
- ControlNet: provides an additional control signal to the model: a human pose (skeleton), image edges (canny edge), a depth map, or a segmentation. The model generates the image respecting that structural signal
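A hedged sketch of the ControlNet flow: derive a control signal from a reference image, then condition generation on it. The edge detector here is a crude gradient threshold standing in for a real Canny filter, and the checkpoint names are common community choices, not the only options:

```python
import numpy as np
from PIL import Image

def rough_edges(img: Image.Image, threshold: float = 30.0) -> Image.Image:
    """Crude gradient-magnitude edge map (a stand-in for a real Canny filter)."""
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    gy, gx = np.gradient(gray)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return Image.fromarray(((mag > threshold) * 255).astype(np.uint8))

def generate_with_controlnet(edge_map: Image.Image, prompt: str):
    """Generate an image whose structure follows the edge map (requires a GPU)."""
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=edge_map, num_inference_steps=30).images[0]

# Build a control signal from a reference image (here: a synthetic test image
# with a single vertical edge between a black and a white half)
ref = Image.new("RGB", (512, 512), "white")
ref.paste((0, 0, 0), (0, 0, 256, 512))
edges = rough_edges(ref)
print(edges.size)
```

In practice you would extract edges from a real photo or drawing, and swap in other ControlNet variants (pose, depth, segmentation) with the same pipeline structure.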
Copyright and Ethics in Image Generation
AI image generation raises significant legal and ethical questions that cannot be ignored.
Copyright Issues
- Training data: models are trained on billions of images collected from the web, many copyright-protected
- Artistic style: generating images "in the style of [living artist]" is ethically questionable
- Output ownership: who owns the rights to an AI-generated image? Legislation is still evolving
Ethical Best Practices
- Don't use AI to imitate living artists' styles without permission
- Disclose when an image is AI-generated
- Don't generate content that could be used for disinformation
- Respect the usage guidelines of models and providers
State of AI Copyright Legislation (2025)
The legal situation is rapidly evolving. The European Union with the AI Act requires transparency about training data. In the US, several lawsuits are ongoing between artists and AI companies. The trend is toward stricter regulation, with mandatory disclosure and potentially compensation for training data.
Conclusions
AI image generation has opened unprecedented creative possibilities, but requires technical understanding to achieve professional results. Stable Diffusion offers flexibility and total control for those willing to invest in customization. DALL-E 3 is ideal for those seeking immediate quality via API. Midjourney excels in artistic style.
Copyright and ethical questions are real and evolving. As professionals, it's our responsibility to use these tools ethically and respect artists' rights.
In the next article, we'll return to code with Generative AI for Software Development: GitHub Copilot, Claude Code, Cursor, and best practices for using AI assistants in daily programming.