Variational Autoencoders (VAE)

🧠 Variational Autoencoders (VAE): A Complete Guide

A Variational Autoencoder (VAE) is a powerful generative model that learns complex data distributions and generates new, realistic samples from them. It combines probabilistic inference with deep learning. This post walks through VAEs step by step: intuition first, then the mathematical foundations, followed by architecture, training, evaluation, and finally advanced variants and production considerations.

🎯 From Autoencoder to Variational Autoencoder

An autoencoder is a neural network that compresses its input into a lower-dimensional code (the encoder) and then reconstructs the input from that code (the decoder). Classical autoencoders are powerful, but their latent space has no clear probabilistic semantics, so there is no principled way to sample from it. A VAE addresses this gap: it learns a distribution over the latent space so that samples can be generated in a principled way.

The Core Idea of a VAE

For each input x, the encoder predicts a latent distribution q(z|x), usually a Gaussian parameterized by a mean and a log-variance.
The decoder learns p(x|z) so that x can be reconstructed from z.
The training objective sets the trade-off between reconstruction quality and regularization of the latent distribution.

📐 Probabilistic Foundations

In generative modeling we want to model the data distribution p(x). We assume a latent-variable model with a hidden variable z, so that p(x) = ∫ p(x|z) p(z) dz. Maximizing this directly is hard because the integral over z is generally intractable, so we turn to variational inference, where q(z|x) is a tractable approximate posterior.

Evidence Lower Bound (ELBO)

We derive a lower bound on log p(x), called the ELBO:

ELBO(x) = E_{z ~ q(z|x)} [ log p(x|z) ] - KL( q(z|x) || p(z) )

The first term is the expected reconstruction log-likelihood; the second penalizes q(z|x) for drifting away from the prior p(z). During training we maximize the ELBO or, equivalently, minimize the negative ELBO.
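The bound can be derived in a few lines. The sketch below uses the same symbols as the ELBO formula and additionally writes p(z|x) for the true, intractable posterior, a symbol the text does not introduce explicitly. Because a KL divergence is never negative, dropping the last term gives the lower bound.

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{z \sim q(z|x)}\left[ \log \frac{p(x|z)\,p(z)}{q(z|x)} \right]
     + \mathrm{KL}\left( q(z|x) \,\|\, p(z|x) \right) \\
  &= \underbrace{\mathbb{E}_{z \sim q(z|x)}\left[ \log p(x|z) \right]
     - \mathrm{KL}\left( q(z|x) \,\|\, p(z) \right)}_{\mathrm{ELBO}(x)}
     + \mathrm{KL}\left( q(z|x) \,\|\, p(z|x) \right) \\
  &\ge \mathrm{ELBO}(x)
\end{aligned}
```

The gap between log p(x) and the ELBO is exactly KL(q(z|x) || p(z|x)), so the bound is tight when the approximate posterior matches the true one.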

Reparameterization Trick

Sampling z directly from q(z|x) is a stochastic operation that blocks gradient flow to the encoder parameters. The reparameterization trick fixes this: z = μ(x) + σ(x) ⊙ ε, where ε is drawn from a standard normal. The randomness now enters as an external input to the network, so gradients with respect to μ and σ are well defined.
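To see why this helps, here is a minimal NumPy sketch for a scalar latent. It estimates the gradients of E[z²] with respect to μ and σ by differentiating through z = μ + σ·ε. The test function z² and the names `reparameterize` and `pathwise_grads` are illustrative assumptions, chosen because the exact gradients (2μ and 2σ) are known.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Return z = mu + sigma * eps, with eps drawn outside the model."""
    return mu + sigma * eps

def pathwise_grads(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2] via reparameterization."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)   # noise does not depend on mu or sigma
    z = reparameterize(mu, sigma, eps)
    # Chain rule through z: dz/dmu = 1 and dz/dsigma = eps.
    grad_mu = np.mean(2.0 * z * 1.0)
    grad_sigma = np.mean(2.0 * z * eps)
    return grad_mu, grad_sigma

grad_mu, grad_sigma = pathwise_grads(mu=1.5, sigma=0.5)
# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(grad_mu, grad_sigma)
```

In a real VAE the same chain rule is applied automatically by the framework's autodiff; the sketch only makes the two partial derivatives explicit.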

🏗️ VAE Architecture Design

Encoder: convolutional or fully connected layers that map the input x to the latent parameters μ(x) and log σ²(x).
Sampling layer: builds z using the reparameterization trick.
Decoder: produces the reconstruction distribution p(x|z) from the latent z. A Bernoulli likelihood is popular for images, and a Gaussian likelihood for continuous data.
Prior: usually a standard normal, p(z) = N(0, I). Advanced VAEs also use learnable priors.

Loss Function (Practical)

Loss = ReconstructionLoss + Beta * KL(q(z|x) || p(z))

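The reconstruction loss depends on the data type: binary cross-entropy for binary images, MSE or another log-likelihood for real-valued images. The KL term regularizes the latent space. The hyperparameter Beta controls the strength of the bottleneck; with Beta = 1 and a proper negative log-likelihood as the reconstruction loss, this is the negative ELBO.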
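For a diagonal Gaussian q(z|x) and a standard normal prior, the KL term has a closed form. The NumPy sketch below computes it per example and combines it with a reconstruction loss that is assumed to be precomputed; the function names `gaussian_kl` and `beta_vae_loss` are illustrative, not from any library.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) per example, summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def beta_vae_loss(recon_loss, mu, log_var, beta=1.0):
    """Batch mean of per-example reconstruction loss plus Beta-weighted KL."""
    return float(np.mean(recon_loss + beta * gaussian_kl(mu, log_var)))

mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
kl = gaussian_kl(mu, log_var)            # per-example KL values
loss = beta_vae_loss(np.array([10.0, 10.0]), mu, log_var, beta=4.0)
print(kl, loss)
```

The first example matches the prior exactly, so its KL is zero; the second has its mean shifted by one unit in a single dimension, which costs 0.5 nats.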

🧪 Minimal TensorFlow/Keras VAE (MNIST)

import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2

# Encoder
inputs = tf.keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

def sample_z(args):
    m, lv = args
    eps = tf.random.normal(shape=tf.shape(m))
    return m + tf.exp(0.5 * lv) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder
decoder_inputs = tf.keras.Input(shape=(latent_dim,))
y = layers.Dense(7 * 7 * 64, activation="relu")(decoder_inputs)
y = layers.Reshape((7, 7, 64))(y)
y = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(y)
y = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(y)
outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(y)
decoder = Model(decoder_inputs, outputs)

recon = decoder(z)
vae = Model(inputs, recon)

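# Loss: per-image sums, averaged over the batch
# binary_crossentropy returns a per-pixel mean, so scale by 28 * 28 to get a per-image sum
recon_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(inputs, recon)) * 28 * 28
# Sum the KL over latent dimensions before averaging over the batch, so it is on the same scale as recon_loss
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
)
# Assumes TF 2.x tf.keras, where add_loss accepts symbolic tensors; newer Keras versions may need a custom train_step
vae.add_loss(recon_loss + kl_loss)
vae.compile(optimizer="adam")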

The code above suits MNIST-style images whose pixel values are binary or scaled to [0, 1]. After training, the decoder can be used on its own to generate new samples from a latent z.

🧪 Minimal PyTorch VAE (Skeleton)

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(784, 400)
        self.mu = nn.Linear(400, latent_dim)
        self.lvar = nn.Linear(400, latent_dim)
    def forward(self, x):
        x = x.view(-1, 784)
        h = F.relu(self.fc1(x))
        return self.mu(h), self.lvar(h)

class Decoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(latent_dim, 400)
        self.fc2 = nn.Linear(400, 784)
    def forward(self, z):
        h = F.relu(self.fc1(z))
        return torch.sigmoid(self.fc2(h))

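def reparam(mu, lvar):
    # z = mu + sigma * eps, with sigma = exp(0.5 * log-variance)
    std = torch.exp(0.5 * lvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def vae_loss(x, recon, mu, lvar):
    # Reconstruction term: BCE summed over pixels and over the batch
    bce = F.binary_cross_entropy(recon, x.view(-1, 784), reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)), summed over latent dims and over the batch
    kld = -0.5 * torch.sum(1 + lvar - mu.pow(2) - lvar.exp())
    # Divide by the batch size to get the mean per-example negative ELBO
    return (bce + kld) / x.size(0)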
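The skeleton stops short of a training loop. The sketch below shows one way the pieces might be wired together. To stay self-contained it folds the encoder and decoder into a single hypothetical `TinyVAE` module and trains on a random placeholder batch rather than real MNIST data, so the loss values themselves are not meaningful.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Compact stand-in for the Encoder/Decoder pair above (hypothetical name)."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.enc = nn.Linear(784, 400)
        self.mu = nn.Linear(400, latent_dim)
        self.lvar = nn.Linear(400, latent_dim)
        self.dec1 = nn.Linear(latent_dim, 400)
        self.dec2 = nn.Linear(400, 784)

    def forward(self, x):
        h = F.relu(self.enc(x.view(-1, 784)))
        mu, lvar = self.mu(h), self.lvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * lvar)  # reparameterization
        recon = torch.sigmoid(self.dec2(F.relu(self.dec1(z))))
        return recon, mu, lvar

def vae_loss(x, recon, mu, lvar):
    """Per-example negative ELBO: summed BCE plus closed-form KL, averaged over the batch."""
    bce = F.binary_cross_entropy(recon, x.view(-1, 784), reduction="sum")
    kld = -0.5 * torch.sum(1 + lvar - mu.pow(2) - lvar.exp())
    return (bce + kld) / x.size(0)

torch.manual_seed(0)
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 1, 28, 28)  # placeholder batch in [0, 1]; replace with a real DataLoader

losses = []
for step in range(5):
    recon, mu, lvar = model(x)
    loss = vae_loss(x, recon, mu, lvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses)
```

With a real dataset, the inner loop would iterate over batches from a `DataLoader` and the same three calls (`zero_grad`, `backward`, `step`) would apply unchanged.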

🔬 Intuition: What the Latent Space Means

A VAE's latent space tends to be smooth and meaningful. Linear interpolation between two points usually produces semantically smooth transitions; with handwritten digits, for example, a 3 can be seen morphing gradually into an 8. This property helps downstream tasks such as retrieval and controllable generation.
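Such an interpolation is only a few lines of NumPy. In the sketch below the two latent codes are made-up values standing in for encoder outputs, and `interpolate_latents` is a hypothetical helper name.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=8):
    """Return `steps` points on the straight line from z_start to z_end (inclusive)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]   # shape (steps, 1)
    return (1.0 - t) * z_start + t * z_end      # shape (steps, latent_dim)

z_three = np.array([-1.0, 0.5])   # hypothetical latent code of a "3"
z_eight = np.array([1.0, -0.5])   # hypothetical latent code of an "8"
path = interpolate_latents(z_three, z_eight, steps=5)
print(path)
```

Each row of `path` would then be passed through a trained decoder to render one frame of the morph.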

🧭 Hyperparameters & Training Tips

Latent dim: too small a dimension underfits, while too large a dimension can make posterior collapse more likely. For MNIST, 2 to 16 is usually enough; for complex images, 64 to 256 is common.
Beta (in Beta-VAE): increasing the KL weight can improve disentanglement but may degrade reconstruction. Tune Beta for each dataset.
KL Annealing: keep the KL weight low at the start and raise it gradually during training, so the decoder first learns to reconstruct.
Likelihood choice: BCE for binary data; for continuous data, a Gaussian (MSE) or a discretized logistic mixture works better.
Optimizer: Adam is a good default; start with a learning rate of 1e-3.
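KL annealing from the list above can be as simple as a linear warm-up. This is one possible schedule among several (cyclical and sigmoid schedules are also used); the name `kl_weight` and the default of 10,000 warm-up steps are assumptions for illustration.

```python
def kl_weight(step, warmup_steps=10_000, max_beta=1.0):
    """Linearly ramp the KL weight from 0 to max_beta over warmup_steps, then hold."""
    if warmup_steps <= 0:
        return max_beta
    return max_beta * min(1.0, step / warmup_steps)

schedule = [kl_weight(s, warmup_steps=1000) for s in (0, 250, 500, 1000, 5000)]
print(schedule)  # [0.0, 0.25, 0.5, 1.0, 1.0]
```

Inside a training loop, the loss would be computed as `recon_loss + kl_weight(step) * kl_loss`.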

⚠️ Common Pitfalls

Posterior collapse: if the decoder is very powerful, q(z|x) can become equal to the prior, leaving no information in z. Remedies: KL annealing, word dropout (for text), a weaker decoder, free bits.
Blurry reconstructions: pixel-wise Gaussian or BCE losses can produce blur. Remedies: perceptual loss, autoregressive decoders, hierarchical VAEs.
Mode coverage vs sharpness: VAEs cover modes well but can look less sharp. Hybrid models are one way forward.
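Of the posterior-collapse remedies listed above, free bits is the least self-explanatory. The NumPy sketch below shows one common formulation, assumed here: a floor on the batch-averaged KL of each latent dimension. The helper name `free_bits_kl` is hypothetical and the floor is measured in nats.

```python
import numpy as np

def free_bits_kl(mu, log_var, free_bits=0.5):
    """KL to N(0, I) with a per-dimension floor (in nats) on the batch-averaged KL."""
    kl_per_dim = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var))  # (batch, latent_dim)
    kl_mean = kl_per_dim.mean(axis=0)                               # average over the batch
    # Dimensions already below the floor contribute a constant, so in an autodiff
    # framework they would receive no gradient pushing them further toward the prior.
    return float(np.sum(np.maximum(kl_mean, free_bits)))

collapsed = free_bits_kl(np.zeros((4, 3)), np.zeros((4, 3)), free_bits=0.5)
active = free_bits_kl(np.full((4, 3), 2.0), np.zeros((4, 3)), free_bits=0.5)
print(collapsed, active)
```

A fully collapsed posterior pays the floor on every dimension (3 × 0.5 here), while an active posterior pays its true KL, so the penalty no longer rewards squeezing the last bits of information out of z.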

🧩 Advanced Variants

Beta-VAE: scales the KL term by Beta in an attempt to increase disentanglement.
Vector-Quantized VAE (VQ-VAE): learns codebook indices instead of a continuous z; effective for audio and image synthesis.
Conditional VAE (CVAE): supplies a label y alongside x and learns q(z|x,y) and p(x|z,y), which allows class-conditioned samples to be generated.
Hierarchical VAE: captures high-level semantics with multiple levels of latent variables.
Flow-VAE: enriches the posterior with normalizing flows.
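The codebook lookup at the heart of VQ-VAE, mentioned in the list above, can be sketched in NumPy as a nearest-neighbour search. This covers only the quantization step; a full VQ-VAE also needs a straight-through gradient estimator and codebook/commitment losses, which are omitted. The codebook values and the name `quantize` are made up for illustration.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to the index and vector of its nearest codebook entry."""
    # Squared Euclidean distance between every encoder vector and every code: (n, K)
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # K=3 codes of dimension 2
z_e = np.array([[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]])        # encoder outputs
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 2 0]
```

The decoder then receives `z_q` rather than `z_e`, so the latent representation is a grid of discrete indices.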

🧰 Evaluation Metrics

Reconstruction error: BCE or MSE on a validation set.
Negative log-likelihood estimate: approximated with importance-weighted bounds.
FID/KID (images): quality and realism of the generated set.
Latent traversals: a visual check of interpretability and disentanglement.
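To make the importance-weighted estimate concrete, the sketch below uses a toy one-dimensional model, an assumption chosen because log p(x) is known exactly: p(z) = N(0, 1) and p(x|z) = N(z, 1), so p(x) = N(0, 2). With k = 1 the estimator reduces to the ELBO; larger k tightens the bound. The function names are illustrative.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def iw_bound(x, q_mean, q_var, k, rng):
    """Importance-weighted lower bound on log p(x) using k samples from q(z|x)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(k)
    # log w = log p(z) + log p(x|z) - log q(z|x) for the toy model described above
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_var)
    m = np.max(log_w)                       # numerically stable log-mean-exp
    return float(m + np.log(np.mean(np.exp(log_w - m))))

rng = np.random.default_rng(0)
x = 1.0
true_log_px = float(log_normal(x, 0.0, 2.0))          # exact marginal for the toy model
exact_q = iw_bound(x, x / 2.0, 0.5, k=64, rng=rng)    # exact posterior: bound is tight
elbo_runs = [iw_bound(x, 0.0, 1.0, k=1, rng=rng) for _ in range(2000)]    # k=1 is the ELBO
iwae_runs = [iw_bound(x, 0.0, 1.0, k=50, rng=rng) for _ in range(2000)]
print(true_log_px, exact_q, np.mean(elbo_runs), np.mean(iwae_runs))
```

With a deliberately poor q (the prior), the averaged k = 1 bound sits well below log p(x), while the k = 50 bound lands much closer to it. For a trained VAE, q and p would be the encoder and decoder, and the true value would be unknown.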

🧪 Mini Project: CVAE on Fashion-MNIST

Load the dataset and normalize the images.
In the encoder, concatenate x with the label y and compute μ and log σ².
In the decoder, feed z together with y to produce the reconstruction.
Use reconstruction + KL as the loss, and make sure the label conditioning is actually applied in both networks.
After training, generate a grid of samples for each class.
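The conditioning in steps 2 and 3 usually comes down to concatenating a one-hot label. The NumPy sketch below shows only the input construction, using a random placeholder batch instead of Fashion-MNIST; the helper names and `latent_dim = 16` are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[labels]

def encoder_input(images, labels, num_classes=10):
    """Flatten images and append the one-hot label: the input for q(z|x, y)."""
    flat = images.reshape(len(images), -1).astype(np.float32)
    return np.concatenate([flat, one_hot(labels, num_classes)], axis=1)

def decoder_input(z, labels, num_classes=10):
    """Append the one-hot label to the latent code: the input for p(x|z, y)."""
    return np.concatenate([z.astype(np.float32), one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28)) / 255.0   # placeholder batch scaled to [0, 1]
labels = np.array([0, 3, 3, 9])
enc_in = encoder_input(images, labels)                     # shape (4, 784 + 10)
z = rng.standard_normal((4, 16))                           # latent_dim = 16 is an assumed value
dec_in = decoder_input(z, labels)                          # shape (4, 16 + 10)
print(enc_in.shape, dec_in.shape)
```

At generation time, sampling z from the prior and pairing it with a chosen label through `decoder_input` yields samples of that class.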

🏭 Production Considerations

Export: save the trained decoder separately for lightweight serving.
Throughput: parallelize sampling and take advantage of hardware acceleration.
Monitoring: add checks for output drift, diversity, and failure modes.
Safety: put policy guardrails in place for the privacy and misuse risks of synthetic data generation.

🔄 VAE vs GAN vs Diffusion

Criteria	VAE	GAN	Diffusion
Training Stability	Good, ELBO-based	Can be unstable, adversarial	Stable, but sampling is slow
Sharpness	Sometimes blurry	Very sharp	Very high quality
Latent Semantics	Meaningful, disentanglement possible	Latent input, but no encoder by default	No explicit low-dimensional latent, but guidance is possible

🧭 Checklist: When Implementing a VAE

Is the input scaling correct?
Does the likelihood match the data (binary vs continuous)?
Is the KL weight schedule set and Beta tuned?
Have you sanity-checked with latent traversals?
Is the decoder capacity balanced?

📦 Code Snippet: Sampling After Training (Keras)

# Assumes the decoder is already trained and latent_dim is known
import numpy as np
z = np.random.normal(0.0, 1.0, size=(16, latent_dim)).astype("float32")
samples = decoder.predict(z)
# Visualize samples as images with values in 0..1
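For the visualization step, one option is to tile the batch into a single image before plotting. The sketch below is a plain NumPy helper (the name `tile_images` is hypothetical), and it uses random values as a stand-in for the output of `decoder.predict(z)`.

```python
import numpy as np

def tile_images(samples, rows, cols):
    """Arrange (rows*cols, H, W, 1) images into one (rows*H, cols*W) array."""
    n, h, w = samples.shape[0], samples.shape[1], samples.shape[2]
    if n != rows * cols:
        raise ValueError("rows * cols must equal the number of samples")
    grid = samples.reshape(rows, cols, h, w)        # drop the channel axis
    return grid.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)

samples = np.random.default_rng(0).random((16, 28, 28, 1))   # stand-in for decoder.predict(z)
grid = tile_images(samples, rows=4, cols=4)
print(grid.shape)  # (112, 112)
```

The resulting array can be shown with any image viewer, for example `matplotlib.pyplot.imshow(grid, cmap="gray")` if matplotlib is installed.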

📚 Further Reading & Extensions

Importance-Weighted Autoencoders
Discrete Latents with Gumbel-Softmax
Diffusion decoders with VAE encoders (hybrid)
Semisupervised learning with VAEs

✅ Conclusion

The VAE is an elegant framework that is useful for both generative modeling and representation learning. Its probabilistic foundation makes it principled, and the reparameterization trick makes it trainable with deep networks. With the right architecture, loss choice, and training policy, a VAE yields robust and useful models.

As a next step, you can experiment with Beta-VAE or VQ-VAE, or build controlled generation with a CVAE. If your goal is ultra-sharp images, try combining a VAE with perceptual losses or a hybrid adversarial setup.