Diffusion Models (Stable Diffusion, DALL·E Basics)


Diffusion models have pushed generative AI to a new level. Starting from random noise, they synthesize high-fidelity images and compose complex scenes from text prompts. This post walks through the intuition, mathematics, architectures, training objective, guidance techniques, Latent Diffusion, and end-to-end pipelines behind diffusion, and also discusses the design patterns of systems such as Stable Diffusion and DALL·E.

1) Big Picture: The Core Idea of Diffusion

A diffusion model is built on two stages:

Forward Process: Gaussian noise is added to a clean image over many steps until it becomes pure noise.
Reverse Process: A neural network learns to remove noise at each step, so that a meaningful image is reconstructed at the end.

During training, the model learns to estimate the noise that was added to a noised image at any timestep. At inference, the same model generates a new image by removing noise step by step.

2) Mathematical Framework: Forward and Reverse Diffusion

Suppose we have an image x₀ drawn from the real data distribution. The forward process adds noise step by step according to a predefined variance schedule β_t, producing the noised samples x_t:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) x_{t-1}, β_t I)

Thanks to a closed form, x_t can also be sampled directly from x₀:

x_t = √(ᾱ_t) x_0 + √(1 - ᾱ_t) ε,   where ε ~ N(0, I), α_t = 1 - β_t, and ᾱ_t = ∏_{s=1}^t α_s
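The closed form above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the linear schedule endpoints (1e-4 and 0.02), the 1000-step length, and the helper names `linear_beta_schedule`, `cumulative_alphas`, and `q_sample` are all assumptions made for the example, and a short list of floats stands in for an image.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (the endpoint values are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t; index 0 holds alpha_bar_1."""
    alphas_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar.append(running)
    return alphas_bar

def q_sample(x0, t, alphas_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alphas_bar[t]), math.sqrt(1.0 - alphas_bar[t])
    return [a * x + b * e for x, e in zip(x0, eps)], eps

betas = linear_beta_schedule(1000)
alphas_bar = cumulative_alphas(betas)
x_t, eps = q_sample([0.5, -0.25, 1.0], t=999, alphas_bar=alphas_bar, rng=random.Random(0))
```

With this schedule ᾱ_t shrinks steadily towards zero, so at the last step x_t is almost entirely noise and carries very little of x_0.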

In the reverse process, a neural network ε_θ(x_t, t, c) estimates the noise ε. The training objective reduces to a simple MSE:

L_simple = E_{t, x_0, ε} [ || ε - ε_θ(x_t, t, c) ||^2 ]

Here c is optional conditioning, such as a text embedding. This simple objective is what makes diffusion models practical and stable to train.
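To make the objective concrete, here is a small sketch that estimates L_simple for one sample with one random timestep. Everything in it is illustrative: the toy ᾱ schedule, the function name `simple_loss`, and the two stand-in "models" are assumptions, and the conditioning c is left out for brevity. The `oracle` cheats by knowing x_0, so it recovers the noise exactly, while `zero_model` always predicts zero.

```python
import math
import random

def simple_loss(x0, eps_model, alphas_bar, rng):
    """One-sample Monte Carlo estimate of L_simple for a single x_0."""
    t = rng.randrange(len(alphas_bar))           # uniform random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # the noise the model must recover
    a, b = math.sqrt(alphas_bar[t]), math.sqrt(1.0 - alphas_bar[t])
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)

# Toy schedule: alpha_bar falls from 0.99 to about 0.01 over 11 steps (assumed values).
alphas_bar = [0.99 - 0.098 * i for i in range(11)]
x0 = [0.5, -0.25, 1.0]

def oracle(x_t, t):
    """Cheats by knowing x_0, so it can invert the noising exactly."""
    a, b = math.sqrt(alphas_bar[t]), math.sqrt(1.0 - alphas_bar[t])
    return [(xt - a * x) / b for xt, x in zip(x_t, x0)]

def zero_model(x_t, t):
    """Always predicts zero noise, so its loss is the mean of eps squared."""
    return [0.0] * len(x_t)

loss_oracle = simple_loss(x0, oracle, alphas_bar, random.Random(0))
loss_zero = simple_loss(x0, zero_model, alphas_bar, random.Random(0))
```

A trained denoiser sits between these two extremes: the closer its prediction is to the true ε, the closer the loss gets to zero.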

3) Architecture: U-Net, Time Embeddings, and Cross-Attention

U-Net Backbone: An encoder-decoder with skip connections. The encoder progressively downsamples and the decoder upsamples. Skip connections help retain fine details.
Time Embeddings: The timestep t is mapped to sinusoidal or learned embeddings so the model knows how much noise is currently present.
Cross-Attention (Text-to-Image): Image features attend to the embeddings produced by the text encoder, which keeps generated images coherent with the prompt.
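As an illustration of the time-embedding idea, the sketch below builds a sinusoidal embedding in plain Python. The function name `timestep_embedding`, the `max_period` default of 10000, and the sine-then-cosine layout are assumptions for this example; real implementations differ in small details and usually pass the result through a small MLP.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Map an integer timestep to a `dim`-long vector of sines and cosines (dim even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_early = timestep_embedding(10, dim=8)
emb_late = timestep_embedding(900, dim=8)
```

Nearby timesteps get similar vectors and distant ones get clearly different vectors, which gives the denoiser a smooth signal for the current noise level.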

Modern diffusion models also rely on architectural tricks such as group normalization, residual blocks, self-attention at selected resolutions, and efficient memory layouts.

4) Noise Schedules and Samplers

The β schedule can be linear, cosine, or learned. Inference uses samplers such as DDPM, DDIM, PLMS, and DPM++. Fewer steps give faster sampling, while more steps generally give higher quality; the right balance depends on the application.
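The sketch below shows one common form of the cosine schedule and a simple way to pick fewer timesteps for fast sampling. The offset `s=0.008`, the `max_beta=0.999` clip, the evenly spaced subsampling, and all function names are assumptions chosen for illustration; individual samplers use their own spacing rules.

```python
import math

def cosine_alphas_bar(num_steps, s=0.008):
    """alpha_bar_t for t = 0..T from a squared-cosine curve, normalised so alpha_bar_0 = 1."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]

def betas_from_alphas_bar(alphas_bar, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid beta_t = 1."""
    return [min(1 - alphas_bar[t] / alphas_bar[t - 1], max_beta)
            for t in range(1, len(alphas_bar))]

def subsample_timesteps(num_train_steps, num_inference_steps):
    """Evenly spaced, descending timesteps for a shorter sampling run."""
    stride = num_train_steps // num_inference_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_inference_steps]

alphas_bar = cosine_alphas_bar(1000)
betas = betas_from_alphas_bar(alphas_bar)
timesteps = subsample_timesteps(1000, 50)
```

Here a model trained with 1000 steps is sampled with only 50, visiting every 20th timestep from 999 downwards.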

5) Classifier-Free Guidance

Classifier-free guidance is a popular technique for text conditioning. During training, the conditioning is dropped with some probability so that a single model learns both the unconditional and the conditional prediction. At inference, a guidance scale w combines the two predictions (for w > 1 this extrapolates past the conditional prediction rather than interpolating):

ε_guided = ε_uncond + w (ε_cond - ε_uncond)

Increasing w improves prompt adherence but also raises the risk of artifacts. In practice, values between 3 and 12 are common.
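The guidance formula is a one-line elementwise operation. The sketch below applies it to two small hand-made noise predictions; the helper name `apply_guidance` and the example values are assumptions for illustration only.

```python
def apply_guidance(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

no_guidance = apply_guidance(eps_uncond, eps_cond, w=1.0)   # equals eps_cond
strong = apply_guidance(eps_uncond, eps_cond, w=7.5)        # pushed past eps_cond
```

Setting w = 0 returns the unconditional prediction and w = 1 returns the conditional one. Larger values move further along the direction in which the prompt changes the prediction, which is where both the stronger prompt adherence and the artifact risk come from.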

6) Latent Diffusion: The Core of Stable Diffusion

Running diffusion directly in pixel space is computationally heavy for high-resolution images. Latent Diffusion Models first map the image into a compact latent space z using the encoder of a pretrained autoencoder or VAE. Diffusion runs on z, and at the end the decoder reconstructs the image from z. Benefits:

A large reduction in computational cost
Better memory efficiency and faster sampling
Scalable training while maintaining high quality
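A quick back-of-the-envelope calculation shows where the savings come from. The numbers below assume the configuration used by Stable Diffusion v1 (8× spatial downsampling and 4 latent channels), which is not stated in the text above; the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many times fewer values the denoiser handles in latent space."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)

shape = latent_shape(512, 512)          # (4, 64, 64)
ratio = compression_ratio(512, 512)     # 786432 / 16384 = 48.0
```

Under these assumptions the denoiser processes 48 times fewer values per step than it would in pixel space, before counting the further savings from attention layers running on a smaller spatial grid.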

7) Text Encoders: CLIP and Tokenization

In text-to-image diffusion, a text encoder processes the prompt. Stable Diffusion variants often use encoders such as CLIP or T5. Tokenization happens at the subword level, and prompt engineering changes how much control you have over semantics and style.

8) DALL·E Basics

The DALL·E family is known primarily for text-to-image generation. Early versions combined tokenized image representations, based on a discrete autoencoder or VQ-VAE, with autoregressive decoders, while newer systems adopt diffusion. Core ideas:

Aligning text and images in a joint representation space
A powerful decoder that generates high-quality scenes matching the prompt
Safety filters and policy guardrails

9) Training Objective and Practical Losses

The simple MSE loss is quite effective. Some tasks benefit from combining it with perceptual losses, VAE reconstruction losses, or prior-preservation losses. Mixed-precision training, gradient checkpointing, and distributed data parallelism are important for large-scale training.

10) Sampling Pipeline: Pseudocode

# Pseudocode: Latent diffusion sampling with guidance
# Inputs: text prompt, steps T, guidance scale w
# 1) Encode text to conditioning c using text encoder
# 2) Initialize latent z_T ~ N(0, I)
# 3) For t = T ... 1:
#       ε_uncond = model(z_t, t, c_uncond)
#       ε_cond   = model(z_t, t, c)
#       ε        = ε_uncond + w * (ε_cond - ε_uncond)
#       z_{t-1}  = denoise_step(z_t, ε, t)   # sampler update (DDIM/DPM++)
# 4) Decode z_0 via VAE decoder to image
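The pseudocode above can be turned into a tiny runnable loop. The version below is a toy under stated assumptions: `toy_denoiser` is a stand-in for the U-Net that pretends the clean latent equals a scalar `cond` (a placeholder for the text conditioning), the ᾱ schedule is made up, the update is a deterministic DDIM step, and the final VAE decode is omitted.

```python
import math
import random

def toy_denoiser(z_t, t, cond, alphas_bar):
    """Stand-in for the U-Net: assumes the clean latent equals `cond` everywhere."""
    a, b = math.sqrt(alphas_bar[t]), math.sqrt(1.0 - alphas_bar[t])
    return [(z - a * cond) / b for z in z_t]

def ddim_step(z_t, eps, t, alphas_bar):
    """Deterministic DDIM update from step t to t-1."""
    a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
    z0_est = [(z - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for z, e in zip(z_t, eps)]
    return [math.sqrt(a_prev) * z0 + math.sqrt(1 - a_prev) * e
            for z0, e in zip(z0_est, eps)]

def sample(cond, steps, w, size, seed):
    # alphas_bar[0] = 1 (clean) down to 0.01 at t = steps (toy schedule).
    alphas_bar = [1.0 - 0.99 * t / steps for t in range(steps + 1)]
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(size)]           # z_T ~ N(0, I)
    for t in range(steps, 0, -1):
        eps_uncond = toy_denoiser(z, t, 0.0, alphas_bar)     # "empty prompt"
        eps_cond = toy_denoiser(z, t, cond, alphas_bar)
        eps = [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
        z = ddim_step(z, eps, t, alphas_bar)
    return z                                                 # a real pipeline would VAE-decode this

z0 = sample(cond=0.8, steps=20, w=1.0, size=4, seed=0)
z0_strong = sample(cond=0.8, steps=20, w=2.0, size=4, seed=0)
```

In this toy setup the result lands on w times `cond`: with w = 1 the sampler recovers 0.8, and with w = 2 it overshoots to 1.6. That is a crude picture of why a high guidance scale can over-saturate images.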

11) Prompt Engineering

Positive prompt: State the desired content and style clearly: medium, camera lens, lighting, composition, artist references.
Negative prompt: List unwanted artifacts such as blur, low resolution, extra fingers, and distortions.
Tune the guidance scale and step count for the task.
Fix the seed to ensure reproducibility.
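The sketch below illustrates two of these habits: building prompts from a fixed template and fixing the seed so that the starting noise is reproducible. The helper names `build_prompt` and `initial_latent` and the example prompt text are hypothetical, and a short list of floats stands in for the real latent tensor.

```python
import random

def build_prompt(subject, style, lighting, lens):
    """Join prompt fragments in a fixed order so runs are comparable."""
    return ", ".join([subject, style, lighting, lens])

def initial_latent(seed, size):
    """Starting noise z_T drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

prompt = build_prompt("portrait of an old sailor", "oil painting",
                      "soft window light", "85mm lens")
negative_prompt = "blurry, low resolution, extra fingers, distorted"

same_a = initial_latent(seed=42, size=4)
same_b = initial_latent(seed=42, size=4)
different = initial_latent(seed=43, size=4)
```

Because the starting noise is identical for the same seed, any change in the output can be attributed to the prompt or to the sampler settings rather than to chance.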

12) ControlNet, LoRA, and Adapters

ControlNet uses an auxiliary network to add conditioning signals such as edges, poses, and depth maps. LoRA adds lightweight low-rank adapters that make it possible to fine-tune a large model while training only a small number of parameters. These techniques are popular for domain adaptation and style transfer.
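The low-rank idea behind LoRA fits in a few lines. The sketch below keeps a base weight W frozen and adds a scaled update B·A of rank r. The matrices, the `alpha / r` scaling convention, the 768-dimensional layer size, and the helper names are assumptions for illustration, and nested lists stand in for tensors.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)                                  # A is r x d_in, B is d_out x r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_in, d_out, r):
    """Parameters in the adapter versus the full weight matrix."""
    return r * (d_in + d_out), d_in * d_out

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]         # frozen 2 x 3 base weight
A = [[0.5, 0.5, 0.5]]                           # rank-1 down-projection
B_zero = [[0.0], [0.0]]                         # zero init: adapter starts as a no-op
B_trained = [[1.0], [-1.0]]                     # pretend values after fine-tuning
x = [1.0, 2.0, 3.0]

y_start = lora_forward(x, W, A, B_zero, alpha=1.0)
y_tuned = lora_forward(x, W, A, B_trained, alpha=1.0)
adapter, full = trainable_params(d_in=768, d_out=768, r=8)
```

For a 768 × 768 layer at rank 8, the adapter trains 12,288 parameters instead of 589,824, which is why LoRA checkpoints are small and cheap to swap.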

13) Evaluation: FID, KID, CLIP-Score

FID: Fréchet distance in Inception feature space; lower is better.
KID: Kernel Inception Distance; has an unbiased estimator and is more reliable on small sample sizes.
CLIP-Score: Measures text-image alignment; higher is better.
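The arithmetic behind two of these metrics can be sketched in simplified form. This is not a full FID or CLIP-Score implementation: real FID uses full covariance matrices of Inception features, while this version assumes diagonal covariances so that the matrix square root reduces to elementwise square roots. The cosine similarity is only the core of CLIP-Score, which is typically rescaled and computed on CLIP embeddings. All names and inputs are made up.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

identical = frechet_distance_diag([0.0, 1.0], [1.0, 4.0], [0.0, 1.0], [1.0, 4.0])
shifted = frechet_distance_diag([0.0, 1.0], [1.0, 4.0], [3.0, 1.0], [1.0, 1.0])
aligned = cosine_similarity([1.0, 0.0], [2.0, 0.0])
```

Identical feature statistics give a distance of zero. Shifting the mean or changing the spread increases the distance, which matches the "lower is better" reading of FID.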

14) Ethics, Safety, and Responsible Use

The power of diffusion models comes with risks: deepfakes, misinformation, and bias. Responsible deployment calls for:

Content filters and policy guardrails
Watermarking and provenance signals
Dataset transparency and consent-aware curation
Bias audits and red-teaming

15) Practical Workflows: End-to-End

Dataset preparation: high-quality captions, deduplication, NSFW filtering, language normalization.
Text encoder selection: CLIP or T5 family; tokenizer hygiene.
Autoencoder/VAE: pretrain or adopt an existing one; assess reconstruction fidelity.
UNet training: noise schedule, augmentation, mixed precision.
Validation: prompt suites, alignment metrics, human eval loops.
Optimization: LoRA adapters, distillation, fast samplers.
Deployment: ONNX/TensorRT, batching, memory pinning, cache warmup.
Monitoring: abuse detection, drift, latency SLOs, cost tracking.

16) Code Skeletons (Illustrative)

Training time noise addition:

# Forward noising with NumPy: sample x_t directly from x_0
import numpy as np

def add_noise(x0, t, alphas_bar):
    eps = np.random.standard_normal(np.shape(x0))   # ε ~ N(0, I)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    return x_t, eps

Loss computation:

# ε_θ predicts the added noise; `unet` stands for any denoiser that returns an array
pred_eps = unet(x_t, t_embed, cond_embed)      # forward pass
loss = ((pred_eps - eps) ** 2).mean()          # simple objective: MSE against the true noise

Sampler step (DDIM-style sketch):

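# Deterministic DDIM-style update (η = 0), valid for t ≥ 1; details omitted
from math import sqrt
x0_est = (x_t - sqrt(1 - alphas_bar[t]) * pred_eps) / sqrt(alphas_bar[t])       # estimate of the clean sample
x_prev = sqrt(alphas_bar[t-1]) * x0_est + sqrt(1 - alphas_bar[t-1]) * pred_eps  # this is x_{t-1}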

17) Stable Diffusion: Pieces That Matter

Autoencoder: Compresses images into a latent z while preserving perceptual quality.
UNet with Cross-Attention: Denoising conditioned on text embeddings.
Tokenization and CLIP: Capture prompt semantics.
Samplers: DDIM, DPM++, and others for fast yet high-quality sampling.
Extensions: Img2Img, Inpainting, ControlNet, LoRA fine-tunes.

18) DALL·E: System View

In DALL·E systems, aligning text and image representations is central. Training uses large captioned datasets, and safety layers and usage policies are essential in production. Modern variants adopt diffusion decoders, which improves quality and controllability.

19) Common Failure Modes and Remedies

Over-saturated or washed-out colors: Tune the guidance scale, sampler choice, and step count; refine negative prompts.
Anatomical errors: Use pose conditioning via ControlNet, a high-resolution fix pass, and two-stage upscaling.
Weak text rendering: Use models with explicit text control, higher resolution, or specialized fine-tunes.
Prompt drift: Fix the seed, tighten negative prompts, and reduce randomness through sampler parameters (a temperature-like effect).

20) Scaling Laws, Cost, and Throughput

Quality improves with dataset size, model capacity, and compute. In practical deployments, batching, caching text-encoder outputs for repeated prompts, half-precision decoding, and lazy loading improve throughput. Distilled samplers and latent upscalers reduce latency.
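One inexpensive throughput trick is to avoid re-running the text encoder for prompts that have already been seen. The sketch below memoises embeddings with `functools.lru_cache`. The function `slow_text_encoder` is a hypothetical stand-in that returns a fake embedding, and the cache size of 1024 is an arbitrary assumption.

```python
from functools import lru_cache

encoder_calls = 0

def slow_text_encoder(prompt):
    """Hypothetical stand-in for an expensive text-encoder forward pass."""
    global encoder_calls
    encoder_calls += 1
    return tuple(float(ord(ch) % 7) for ch in prompt[:4])

@lru_cache(maxsize=1024)
def cached_embedding(prompt):
    """Memoise embeddings so repeated prompts skip the encoder."""
    return slow_text_encoder(prompt)

for p in ["a red fox", "a red fox", "a blue bird", "a red fox"]:
    cached_embedding(p)
```

Four requests trigger only two encoder calls here. In a real service the cached value would be the embedding tensor, and the cache size would be tuned against available memory.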

21) Mini Project: Prompted Portrait Generator

Load a text encoder and a pretrained latent diffusion model.
Build prompt templates: style, lighting, lens, composition.
Sweep the guidance scale and benchmark samplers.
Build a negative prompt library.
Add batch generation, best-of-n selection, and a simple aesthetic scorer.
Add watermarking and content policy filters.
Export pipeline: web-friendly formats and thumbnails.
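The best-of-n selection step in this project could look roughly like the sketch below. Both `generate` and `aesthetic_score` are hypothetical placeholders: the first returns a fake "image" as a list of numbers instead of calling a diffusion pipeline, and the second is a trivial heuristic where a real project would use a learned scorer.

```python
import random

def generate(prompt, seed):
    """Hypothetical generator: returns a fake 'image' as a list of pixel values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(16)]

def aesthetic_score(image):
    """Hypothetical scorer: prefers mean brightness near 0.5."""
    mean = sum(image) / len(image)
    return -abs(mean - 0.5)

def best_of_n(prompt, seeds):
    """Generate one candidate per seed and keep the highest-scoring one."""
    candidates = [(aesthetic_score(generate(prompt, s)), s) for s in seeds]
    best_score, best_seed = max(candidates)
    return best_seed, best_score

best_seed, best_score = best_of_n("studio portrait, soft light", seeds=range(8))
```

Keeping the winning seed alongside the score makes it possible to regenerate the chosen image later at a higher step count or resolution.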

22) Conclusion

Diffusion models have become the backbone of today's generative AI. With forward and reverse stochastic processes, UNet-based denoisers, and clever conditioning mechanisms, these models deliver extraordinary fidelity. Stable Diffusion's latent strategy made democratization possible, while DALL·E demonstrated robust patterns for text-image alignment and safety.

To go further, try hands-on projects with ControlNet, LoRA, and domain-specific fine-tunes, and give responsible AI practices, safety, and evaluation a central place in your production cycles.