Model Optimization (Hyperparameter Tuning, GridSearchCV)

In this mega blog we will learn every aspect of Model Optimization in detail: Hyperparameter vs Parameter, GridSearchCV, RandomizedSearchCV, Bayesian Optimization (Optuna/Hyperopt), practical tuning strategies, cross-validation integration, and tuning on pipelines.

⚙️ Model Optimization: The Ultimate Guide to Hyperparameter Tuning

In Machine Learning, a good model architecture and clean data are important, but hyperparameter tuning often provides a big boost in final performance. This blog is a step-by-step playbook for everyone from beginners to advanced practitioners: theory, algorithms, practical heuristics, and production tips, complete with code templates.

🎯 Learning outcomes

  • A clear distinction between hyperparameters and model parameters
  • How GridSearchCV and RandomizedSearchCV work, and their trade-offs
  • The core idea of Bayesian optimization and practical libraries (Optuna, Hyperopt)
  • Early stopping, warm-start, incremental training strategies
  • Pipeline-safe tuning and how to avoid leakage
  • Compute-efficient tuning: successive halving, Hyperband, pruning
  • Production-ready considerations: reproducibility, logging, model registry

🔎 Part A — Core Concepts: Parameters vs Hyperparameters

Parameters are the quantities a model learns from the training data (such as the coefficients of a linear regression). Hyperparameters are the settings fixed before training, such as the learning rate, the number of trees, or the regularization strength. Hyperparameters control the model's learning dynamics and capacity; getting them right often has a large impact on generalization.
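
A minimal sketch of the distinction in scikit-learn (the synthetic dataset is just for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# C is a hyperparameter: chosen before training.
clf = LogisticRegression(C=0.5)
clf.fit(X, y)

# coef_ and intercept_ are parameters: learned from the data during fit().
print("Learned coefficients:", clf.coef_)
print("Learned intercept:", clf.intercept_)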

❗ Part B — Why is Hyperparameter Tuning Necessary?

  • Default parameters are not optimal for every dataset.
  • Small improvements in a metric (AUC, RMSE, F1) can translate into large business ROI.
  • Tuning often corrects the bias-variance trade-off, keeping both overfitting and underfitting in check.

🔍 Part C — Search Strategies (Grid, Random, Bayesian)

The simplest approach is an exhaustive grid search, but it is expensive. The alternatives provide better compute-versus-performance trade-offs.

1) GridSearchCV (Exhaustive)

GridSearchCV evaluates all parameter combinations (with cross-validation). It is reliable but computationally expensive, and best suited to small parameter spaces.

2) RandomizedSearchCV

RandomizedSearchCV draws random samples from predefined distributions; for the same compute it can explore more diverse combinations. On large search spaces it is often better than grid search.

3) Bayesian Optimization (Optuna/Hyperopt/Scikit-Optimize)

Bayesian methods build a probabilistic surrogate model (e.g., a Gaussian Process or Tree-structured Parzen Estimator) and search in an informed way by maximizing expected improvement. They are compute-efficient and often state-of-the-art for black-box tuning.

4) Successive Halving / Hyperband / ASHA

Resource-aware methods: they run short trials, shut down poorly performing configurations early, and give promising ones more resources. Very useful for large-scale hyperparameter tuning.
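
A minimal sketch using scikit-learn's successive-halving search (HalvingRandomSearchCV is still experimental, hence the enable import; the estimator and ranges below are illustrative):

# The experimental enable import is required before the halving searches.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

param_dist = {"n_estimators": randint(50, 500), "max_depth": randint(3, 20)}

# Each rung gives surviving candidates more of the budget resource
# (n_samples by default); weak configurations are dropped early.
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    factor=3,          # keep roughly the top 1/3 of candidates per rung
    random_state=42,
    n_jobs=-1,
)
# search.fit(X_train, y_train)  # X_train/y_train: your training split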

🧩 Part D — Cross-Validation और Pipelines के साथ Tuning

To keep hyperparameter tuning leakage-free, selection and preprocessing must happen inside the pipeline. The integration of scikit-learn's Pipeline with GridSearchCV/RandomizedSearchCV ensures that scaling/imputation is fit only on the training folds.

  • Pipeline + GridSearchCV ⇒ safe preprocessing within CV.
  • Choose GroupKFold/TimeSeriesSplit if the data is grouped or time-dependent (see the sketch after this list).
  • Keep the scoring metric problem-specific (PR-AUC for imbalanced classification, RMSE for regression).
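
A minimal sketch of a time-aware splitter plugged into a search (Ridge and the alpha grid are placeholders; the rows of X_train/y_train must be in chronological order):

from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import Ridge

# TimeSeriesSplit keeps each training fold strictly before its validation fold,
# so the search never trains on the future.
tscv = TimeSeriesSplit(n_splits=5)
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=tscv,
                    scoring="neg_root_mean_squared_error")
# grid.fit(X_train, y_train)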

🧰 Part E — Model-wise Hyperparameter Focus

The following summary lists the most impactful hyperparameters for everyday models:

  • Random Forest: key hyperparameters n_estimators, max_depth, max_features, min_samples_leaf. Tips: start with n_estimators=200–500, tune max_features (sqrt/log2), and regularize depth.
  • XGBoost / LightGBM: key hyperparameters learning_rate, n_estimators, max_depth, subsample, colsample_bytree, reg_lambda, min_child_weight. Tips: use early stopping (a sketch follows this list), pair a small learning_rate (0.01–0.1) with a larger n_estimators, and tune colsample/subsample.
  • Logistic Regression: key hyperparameters C (inverse regularization strength), penalty (l1/l2/elasticnet), solver. Tips: standardize features; use L1 for sparsity.
  • Neural Networks: key hyperparameters learning_rate, batch_size, epochs, optimizer, dropout, architecture. Tips: use learning-rate schedules and early stopping; tune with ASHA/Hyperband.
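
Since early stopping comes up repeatedly for boosting, here is a minimal sketch with the xgboost sklearn wrapper (X_train/y_train denotes your training split; recent xgboost versions take early_stopping_rounds in the constructor, older versions pass it to fit()):

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a validation slice purely for early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

model = XGBClassifier(
    n_estimators=2000,          # upper bound; early stopping picks the real count
    learning_rate=0.05,
    early_stopping_rounds=50,   # stop if validation logloss stalls for 50 rounds
    eval_metric="logloss",
    random_state=42,
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration:", model.best_iteration)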

🔧 Part F — GridSearchCV: Deep-dive & Best Practices

The GridSearchCV API is powerful, but it needs to be used correctly. Below are some best practices:

  • Keep preprocessing (scaler, imputer, encoder) inside the Pipeline so CV does not leak.
  • Use StratifiedKFold for classification to maintain class balance in folds.
  • Avoid exhaustive grids on many hyperparams; prefer 1–2 params per stage.
  • Prefer RandomizedSearch for large spaces; then grid-search around the best region.
  • Use n_jobs=-1 carefully (memory contention); use resource-aware scheduling on clusters.

Code: GridSearchCV Template (scikit-learn)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),  # scaling doesn't affect trees; kept to show a full pipeline
    ("clf", RandomForestClassifier(random_state=42, n_jobs=-1))
])

param_grid = {
    "clf__n_estimators": [200, 400],
    "clf__max_depth": [None, 8, 12],
    "clf__max_features": ["sqrt", 0.3],
    "clf__min_samples_leaf": [1, 3, 5]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1, verbose=2)
gs.fit(X_train, y_train)  # X_train, y_train: your training split
print(gs.best_params_, gs.best_score_)

🎲 Part G — RandomizedSearchCV & Parameter Distributions

RandomizedSearchCV allows sampling from distributions (uniform, log-uniform), which is especially powerful for continuous ranges and log scales.

from scipy.stats import randint, uniform, loguniform
from sklearn.model_selection import RandomizedSearchCV

# Reuses `pipe` and `cv` from the GridSearchCV template above.
param_dist = {
    "clf__n_estimators": randint(100, 1000),
    "clf__max_depth": randint(3, 20),
    "clf__max_features": uniform(0.1, 0.9),  # uniform(loc, scale) samples from [0.1, 1.0]
    # for scale-sensitive params (e.g., a learning rate) prefer loguniform(1e-4, 1e-1)
}

rs = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=50, cv=cv, scoring="roc_auc", n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(rs.best_params_, rs.best_score_)

🧠 Part H — Bayesian Optimization (Optuna) Practical

Optuna and Hyperopt are automatic hyperparameter-search libraries that provide intelligent sampling and pruning. Optuna's study/pruner API is very flexible.

import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 3, 20)
    max_features = trial.suggest_float("max_features", 0.1, 1.0)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, n_jobs=-1, random_state=42)
    score = cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(5, shuffle=True, random_state=42), scoring="roc_auc").mean()
    return score

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
# Note: MedianPruner only acts when the objective reports intermediate values via
# trial.report(); with a single cross_val_score per trial, nothing gets pruned.
study.optimize(objective, n_trials=100, n_jobs=1)
print("Best:", study.best_params, study.best_value)

⏱️ Part I — Early Stopping, Warm-start & Incremental Training

Early stopping halts training when the validation metric stops improving; in boosting and neural nets it brings both performance gains and compute savings. Warm-start allows reusing a previous model state (e.g., incrementally adding trees), which is useful for iterative tuning.
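
A minimal warm-start sketch with a random forest (X_val/y_val is an assumed validation split, used only for scoring):

from sklearn.ensemble import RandomForestClassifier

# warm_start=True keeps the existing trees; raising n_estimators adds new ones
# instead of retraining from scratch.
rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=42)
rf.fit(X_train, y_train)

for n in (200, 300, 400):
    rf.n_estimators = n
    rf.fit(X_train, y_train)          # fits only the additional trees
    print(n, rf.score(X_val, y_val))  # stop growing once the score plateaus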

💡 Part J — Compute-efficient Strategies

  • RandomizedSearch → narrowed grid → localized GridSearch
  • Successive Halving / Hyperband for large-scale parallel tuning
  • Use small subsamples for initial screening, then full data for final tuning
  • Parallelize across trials on cluster (Ray Tune, Dask, Optuna with RDB storage)

📏 Part K — Metric Choice & Multi-objective Tuning

Metric selection should reflect business goals: AUC vs PR-AUC vs recall@k. For multi-objective (accuracy + latency), use Pareto optimization or weighted objectives and present trade-offs.
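
For example, a minimal Optuna multi-objective sketch (the latency measurement is a crude placeholder for whatever you would actually measure):

import time
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        random_state=42,
    )
    score = cross_val_score(clf, X_train, y_train, cv=3, scoring="roc_auc").mean()
    start = time.perf_counter()
    clf.fit(X_train, y_train).predict(X_train[:100])  # crude train+predict latency proxy
    latency = time.perf_counter() - start
    return score, latency

# Two objectives: maximize AUC, minimize latency; best_trials is the Pareto front.
study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=30)
print(len(study.best_trials), "Pareto-optimal trials")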

📚 Part L — Logging, Reproducibility & MLOps

  • Log all trials (params, seed, CV folds, scores); MLflow, Weights & Biases, or Optuna storage make searches traceable.
  • Fix random_state across numpy, scikit-learn, and your framework to ensure reproducibility (a minimal seed helper is sketched after this list).
  • Register the best model in a model registry with metadata (feature list, preprocessing, hyperparams, evaluation metrics).
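
A minimal seed-fixing helper, assuming a plain numpy/scikit-learn stack (deep-learning frameworks need their own seed calls on top):

import os
import random
import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Fix the common sources of randomness for reproducible experiments."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # scikit-learn has no global seed: pass random_state=seed to every estimator/splitter.

set_seeds(42)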

🗺️ Part M — Practical Playbook

  1. Baseline model with default params → record CV score + runtime.
  2. Quick RandomizedSearch on large ranges (n_iter ~ 50–100) to find promising regions.
  3. Run Optuna (20–100 trials) with pruning for tighter search.
  4. Narrow grid around best region → run GridSearchCV for fine tuning (small grid).
  5. Final model validation on holdout/test set + calibration if needed (a calibration sketch follows this list).
  6. Serialize model + store artifacts + document training data snapshot and code.
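
For step 5, a minimal calibration sketch reusing the fitted gs object from the GridSearchCV template in Part F:

from sklearn.calibration import CalibratedClassifierCV

# Wrap the tuned estimator; "isotonic" needs ample data, "sigmoid" is safer on small sets.
calibrated = CalibratedClassifierCV(gs.best_estimator_, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
# calibrated.predict_proba(X_test) now yields better-calibrated probabilities.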

💻 Part N — Advanced Python Templates

Optuna with Pipeline (XGBoost). For per-fold early stopping, combine this with the eval_set pattern shown in Part E.

import optuna
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        # suggest_loguniform is deprecated; use suggest_float(..., log=True)
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10, log=True),
    }
    # use_label_encoder was removed in xgboost 2.x, so it is not passed here
    pipe = Pipeline([("scale", StandardScaler()), ("xgb", XGBClassifier(**params, eval_metric="logloss", random_state=42))])
    score = cross_val_score(pipe, X_train, y_train, cv=StratifiedKFold(5, shuffle=True, random_state=42), scoring="roc_auc", n_jobs=1).mean()
    return score

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100)
print(study.best_params)

⚠️ Part O — Common Pitfalls & How to Avoid Them

  • Data leakage from preprocessing on the full dataset: always preprocess inside the CV pipeline.
  • Tuning on the test set: use a proper holdout; the test set is only for final evaluation.
  • Overfitting the validation metric: use nested CV for an unbiased estimate when needed (see the sketch after this list).
  • Ignoring compute cost: track runtime and memory, and include them in the decision.
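
A minimal nested-CV sketch (the inner grid is deliberately small; the outer score estimates the performance of the whole tuning procedure, not of one fixed configuration):

from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The inner loop picks hyperparameters; the outer loop scores the tuning
# procedure itself, so the estimate is not biased by the search.
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]},
    cv=inner_cv, scoring="roc_auc",
)
nested_scores = cross_val_score(inner_search, X_train, y_train, cv=outer_cv, scoring="roc_auc")
print("Nested CV AUC: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))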

📝 Part P — Hands-on Assignments

  1. Baseline vs Tuned: Build a baseline RandomForest (defaults) on a dataset of your choice; then compare the best params found by RandomizedSearchCV (50 iterations) and GridSearchCV, and report CV AUC and runtime.
  2. Optuna Challenge: Run 100 Optuna trials for XGBoost; enable early stopping and pruning; analyze the best trial's parameters and validation curves.
  3. Resource-aware tuning: Implement Successive Halving (HalvingGridSearchCV) and report the compute savings.
  4. Nested CV: Implement nested CV on a small dataset to obtain an unbiased generalization estimate.

🧾 Part Q — Quick Reference / Cheat-sheet

  • Quick search → RandomizedSearch; refine → GridSearch.
  • Large compute budget → Bayesian (Optuna) + early stopping + pruning.
  • Very large dataset → subsample for search, then full-data retrain.
  • Time-series → TimeSeriesSplit + no shuffling.
  • Imbalanced → optimize PR-AUC/recall and use stratified folds.

🏁 Conclusion

Model Optimization is not just parameter tweaking; it is disciplined experimental design: safe pipelines, correct CV, compute-aware search, and reproducible logging. GridSearchCV is reliable but expensive; RandomizedSearch and Bayesian methods are compute-efficient; ASHA/Hyperband speed up large searches. In production, good practice is to keep a log of all trials, place the best model in a registry, and build a monitoring and periodic re-tuning pipeline. With practice this becomes both an art and a science; to get started, follow the playbook and assignments above.