📸 Convolutional Neural Networks (CNN) — Complete Guide
Convolutional Neural Networks (CNNs) are the backbone of computer vision. They use layers specially designed to capture patterns in images and other spatial data: convolutional filters, pooling, and hierarchical feature extraction. In this blog we will walk step by step through CNN theory, architecture, implementation, and best practices.
🔍 1 — Intuition: Why CNNs?
Traditional fully-connected networks treat every pixel as an independent feature, which inflates the parameter count and throws away spatial context. CNNs exploit that spatial structure: thanks to local connectivity and weight sharing, filters learn local patterns (edges, textures), and higher layers compose them into more complex structures.
🧮 2 — Convolution Operation (Mathematical Intuition)
Convolution is a sliding-window operation: a small kernel (filter) slides over the image, and a dot product at each position produces a feature map. For an input image I and a kernel K, the 2D discrete operation is:
(I * K)[i, j] = Σ_m Σ_n I[i + m, j + n] * K[m, n]
Strictly speaking, this index form (with + offsets and no kernel flip) is cross-correlation, which is what deep learning frameworks actually implement; true convolution would flip the kernel. The concept is the same either way: weight sharing over local patterns.
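To make the sliding-window idea concrete, here is a toy NumPy sketch of the cross-correlation that frameworks compute (purely illustrative, not an optimized implementation):

import numpy as np

def cross_correlate2d(image, kernel):
    # 'Valid' cross-correlation of a 2D image with a 2D kernel (no flip, stride 1)
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])                  # Sobel-like vertical edge detector
print(cross_correlate2d(image, kernel).shape)       # (3, 3) feature map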
🔧 3 — Kernels / Filters
- Kernel size (3x3, 5x5): small kernels (3x3) are the most popular because stacking them grows the receptive field.
- Number of kernels = number of output channels (feature maps); see the shape check after this list.
- Weight sharing → the same filter is applied at every spatial position → translation equivariance (pooling then adds a degree of invariance).
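A quick PyTorch shape check makes this channel bookkeeping explicit (the 32x32 RGB input is just an example):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 16, 32, 32]) -> 16 feature maps
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3]) -> 16 kernels, each spanning 3 input channels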
↔️ 4 — Stride and Padding
Stride controls how many pixels the kernel moves at each step (stride=1 is the most common). Padding controls the output size: 'same' padding preserves the input size, while 'valid' means no border padding, so the output shrinks.
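The output spatial size follows the standard formula out = (n + 2p - k) // s + 1; a quick sketch:

def conv_output_size(n, k, p, s):
    # n: input size, k: kernel size, p: padding, s: stride
    return (n + 2 * p - k) // s + 1

print(conv_output_size(224, 3, 1, 1))  # 224 -> 'same' padding preserves the size
print(conv_output_size(224, 3, 0, 1))  # 222 -> 'valid' padding shrinks it
print(conv_output_size(224, 3, 1, 2))  # 112 -> stride 2 halves the resolution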
📉 5 — Pooling (Max / Average)
Pooling layers reduce spatial resolution and increase translation invariance. Max-pooling is the most common; it preserves local maxima. Modern architectures sometimes prefer strided convolutions over pooling.
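A quick shape comparison of max-pooling versus a strided convolution (the 56x56 feature map is just an example):

import torch
import torch.nn as nn

x = torch.randn(1, 32, 56, 56)

pool = nn.MaxPool2d(kernel_size=2, stride=2)                     # parameter-free downsampling
strided = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)  # learned downsampling

print(pool(x).shape)     # torch.Size([1, 32, 28, 28])
print(strided(x).shape)  # torch.Size([1, 32, 28, 28])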
⚡ 6 — Activation & Batch Normalization
A non-linearity (ReLU/LeakyReLU) follows each convolution. Batch Normalization stabilizes training, allows higher learning rates, and also provides a mild regularization effect.
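A common way to package this is a small Conv → BatchNorm → ReLU helper; a minimal PyTorch sketch:

import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # Conv bias is omitted because BatchNorm provides its own learnable shift
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )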
🏛️ 7 — Famous CNN Architectures (brief)
- LeNet-5 (1998): an early architecture with convolution, pooling, and FC layers, built for digit recognition.
- AlexNet (2012): the deep CNN behind the ImageNet breakthrough, using ReLU, dropout, and data augmentation.
- VGG (2014): stacks of 3x3 convolutions in deeper networks (VGG16/19); simple and effective but heavy.
- ResNet (2015): introduced residual connections, making very deep networks (50/101/152 layers) feasible.
- Inception / MobileNet / EfficientNet: families focused on computational trade-offs and mobile/efficient deployment.
🔭 8 — Receptive Field
The receptive field is the region of the input image that a given neuron can see. Stacking layers and using larger kernels both grow the receptive field. Architectures like ResNet use skip connections so that very deep networks capture such features without degradation.
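The receptive field of a stack of layers can be computed layer by layer: it grows by (k - 1) times the accumulated stride at each layer. A small sketch:

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, ordered from input to output
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs see a 5x5 region; three see 7x7 (one reason 3x3 stacks are popular)
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7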
📐 9 — Common Loss Functions
- Classification: Cross-Entropy Loss (categorical / binary)
- Localization / Detection: Smooth L1, IoU based losses
- Segmentation: Dice Loss, IoU Loss, Binary Cross-Entropy per pixel
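As an example of the segmentation losses above, here is a minimal soft Dice loss sketch for binary masks (assumes logits of shape (N, 1, H, W) and 0/1 targets of the same shape):

import torch

def soft_dice_loss(logits, targets, eps=1e-6):
    # Soft Dice: 1 - mean per-sample Dice coefficient computed on probabilities
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * targets).sum(dims)
    union = probs.sum(dims) + targets.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()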
🛡️ 10 — Regularization Techniques
- Data Augmentation (rotation, flip, color jitter); see the torchvision sketch after this list
- Dropout (FC layers / spatial dropout)
- Weight Decay (L2 regularization)
- Early Stopping, Mixup, Cutout, CutMix
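A typical torchvision augmentation pipeline as a sketch (the exact transforms and the ImageNet mean/std values are common defaults, not a prescription):

import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])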
🔁 11 — Transfer Learning
Using CNNs pretrained on ImageNet as feature extractors is common practice. The usual recipe: freeze the backbone and fine-tune the top, task-specific layers. This pays off enormously in limited-data scenarios.
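A minimal fine-tuning sketch (assumes a recent torchvision; the 5-class target task is hypothetical):

import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

num_classes = 5  # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trained from scratch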
🔎 12 — Visualization Techniques
- Filter visualization (what filters learn)
- Activation maps / feature maps (a forward-hook sketch follows this list)
- Grad-CAM / Guided Backpropagation for class-specific localization
- t-SNE / UMAP on deep features for cluster visualization
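As a starting point for feature-map inspection, here is a minimal sketch that captures an intermediate activation with a PyTorch forward hook (resnet18 and the layer2 choice are only illustrative):

import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
feature_maps = {}

def save_activation(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Register a hook on an intermediate layer, then run one image through the network
model.layer2.register_forward_hook(save_activation("layer2"))
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))
print(feature_maps["layer2"].shape)  # torch.Size([1, 128, 28, 28])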
🏗️ 13 — Practical CNN Pattern (Example)
Typical image classification pipeline:
- Data loading + balanced splits (train/val/test)
- Preprocessing (resize, normalize, augment)
- Backbone (Conv blocks + BatchNorm + ReLU + Pooling)
- Global Average Pooling
- FC head + dropout + softmax
- Optimizer (Adam/SGD with momentum), LR scheduler, early stopping
💻 14 — Python Example: Keras (TensorFlow)
from tensorflow.keras import layers, models, optimizers

def build_simple_cnn(input_shape=(224, 224, 3), num_classes=10):
    model = models.Sequential()
    # Block 1: two 3x3 convs, then downsample
    model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=input_shape))
    model.add(layers.BatchNormalization())
    model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    # Block 2
    model.add(layers.Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    # Classification head
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation='softmax'))
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
🐍 15 — Python Example: PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        # Block 1: conv -> BN -> ReLU, conv -> ReLU, downsample
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.dropout(x, 0.25, training=self.training)  # only drop during training
        # Block 2
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.conv4(x))
        x = self.pool(x)
        # Head
        x = self.gap(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, 0.5, training=self.training)
        x = self.fc2(x)
        return x
📈 16 — Training Tips & Tricks
- Normalize images with dataset mean/std (e.g., ImageNet statistics) for pretrained backbones.
- Use learning rate scheduling (ReduceLROnPlateau, CosineAnnealing).
- Prefer SGD with momentum for large-scale datasets; Adam is great for faster convergence on smaller datasets.
- Use mixed-precision (FP16) to accelerate training on modern GPUs.
- Monitor validation metrics and use early stopping to prevent overfitting.
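A minimal PyTorch training-loop sketch tying these tips together (it reuses the SimpleCNN defined above; train_loader, val_loader, and evaluate() are hypothetical placeholders):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # mixed precision on GPU

best_val, patience, bad_epochs = 0.0, 5, 0
for epoch in range(50):
    model.train()
    for images, labels in train_loader:          # hypothetical DataLoader
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

    val_acc = evaluate(model, val_loader)        # hypothetical helper returning accuracy
    if val_acc > best_val:
        best_val, bad_epochs = val_acc, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # early stopping
            break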
📊 17 — Evaluation Metrics for CV
- Classification: accuracy, top-k accuracy, confusion matrix
- Detection: mAP (mean Average Precision) @ IoU thresholds
- Segmentation: IoU (Jaccard), Dice coefficient, pixel accuracy
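For segmentation, IoU and Dice can be computed directly from binary masks; a small NumPy sketch:

import numpy as np

def iou_and_dice(pred_mask, true_mask):
    # Both inputs: binary masks of the same shape (values 0/1)
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    iou = intersection / union if union else 1.0
    denom = pred.sum() + true.sum()
    dice = 2 * intersection / denom if denom else 1.0
    return iou, dice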
🚀 18 — Deployment & Production
- Model quantization (INT8) for latency-sensitive inference on edge devices
- TensorRT / ONNX / TorchScript for optimized serving (see the export sketch after this list)
- Batching, caching, and asynchronous pipelines for throughput
- Monitoring: latency, accuracy drift, input distribution drift
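A minimal ONNX export sketch for the SimpleCNN defined above (the file name, axis names, and opset version are illustrative choices):

import torch

model = SimpleCNN(num_classes=10).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # must match the expected input shape
torch.onnx.export(
    model, dummy_input, "simple_cnn.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
    opset_version=17,
)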
⚠️ 19 — Common Pitfalls
- Overfitting small dataset without augmentation
- Data leakage between train/val/test splits (shuffle with care for temporal data)
- Wrong normalization statistics for pretrained models
- Ignoring class imbalance — use weighted loss or balanced sampling
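For class imbalance, one simple option is a weighted cross-entropy; a sketch with hypothetical class counts:

import torch
import torch.nn as nn

class_counts = torch.tensor([5000., 500., 250.])                    # hypothetical per-class counts
weights = class_counts.sum() / (len(class_counts) * class_counts)   # rarer classes get larger weights
criterion = nn.CrossEntropyLoss(weight=weights)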
🧪 20 — Mini Case Studies
A) CIFAR-10 classification
Use a small CNN or ResNet baseline → augment with random crops + flips → tune the LR schedule. A small ResNet typically reaches 85%+ accuracy with proper training.
B) Medical image segmentation
U-Net backbone + heavy augmentation + class-balanced Dice loss → improves lesion segmentation performance substantially.
🏁 Conclusion
CNNs are powerful models designed for structured spatial learning. With the right preprocessing, architecture choice, regularization, and training recipes you can reach state-of-the-art performance. The code snippets and best practices above give you a solid foundation; the next step is to work on Transfer Learning and efficient backbones such as MobileNet/EfficientNet.