Instruction Tuning & RLHF (Reinforcement Learning from Human Feedback)

Comprehensive practical guide on instruction tuning and RLHF: data collection, annotation, reward models, PPO training, safety, evaluation, tooling, and deployment best-practices for aligning LLMs to human preferences.

Instruction Tuning & RLHF: A Practical and Comprehensive Guide

This guide covers the principles and practical implementation of Instruction Tuning and RLHF (Reinforcement Learning from Human Feedback) step by step. We discuss how high-quality instruction datasets are built, how to collect human preference data, how to train a reward model, how to run an RL loop such as PPO, and what all of this means for production-ready deployment and monitoring.

1. Introduction: Why Instruction Tuning and RLHF?

Pre-trained LLMs already have general language understanding and generation capabilities, but these models often do not follow user instructions reliably or safely.

Instruction tuning is a supervised step that trains the model on instruction-response pairs so that it better understands the structure and intent of user prompts. RLHF then goes a step further for human value alignment: humans rank/label the model's outputs, we build a reward model from those preferences, and we optimise the policy against it so that outputs become more helpful, truthful, and harmless.

2. High-level pipeline overview

The main components of a typical Instruction Tuning + RLHF pipeline:

  1. Instruction dataset: curated instruction-response pairs (supervised data).
  2. Supervised fine-tuning (SFT): supervised training of the model on the instruction data.
  3. Preference data collection: pairs/triplets of model outputs with ranking/choice labels from humans.
  4. Reward model (RM): training a reward predictor from the preference labels.
  5. RL optimization: fine-tuning the policy/model with an algorithm such as PPO to maximize the reward model's score.
  6. Evaluation & safety checks: automated + human evaluation, adversarial tests, and deployment gating.
  7. Monitoring & continuous loop: field feedback, drift detection, and retraining cycles.

3. Instruction Tuning: Data & Format

The quality of the instruction dataset strongly influences model behaviour. Some basic guidelines:

  • Clear templates: Use consistent templates like Instruction: ... Input: ... Response: ... so the structure stays clear for both the tokenizer and the model (a template sketch follows this list).
  • Diversity: Include different phrasings, languages, and difficulty levels.
  • Refusal examples: Add examples where the model should politely refuse or ask for clarification.
  • Negative/edge cases: Intentionally include malicious prompts, ambiguous questions, and privacy-sensitive prompts with labelled safe responses.
  • Human-reviewed quality: Prefer human-reviewed examples over purely synthetic data; if you do use synthetic data, add human-in-the-loop verification.
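
For illustration, here is a minimal template-formatting helper. The field names and separators are an example, not a fixed standard; adapt them to your tokenizer and chat format.

# illustrative prompt-formatting helper; field names are an example, not a standard
def format_example(instruction: str, context: str, response: str) -> str:
    """Render one instruction-tuning example into a single training string."""
    prompt = f"Instruction: {instruction}\n"
    if context:
        prompt += f"Input: {context}\n"
    prompt += "Response: "
    return prompt + response  # during SFT, loss is typically computed on the response part

print(format_example(
    "Summarize the text in one sentence.",
    "RLHF aligns language models to human preferences using a learned reward model.",
    "RLHF aligns LLMs to human preferences via a reward model learned from rankings.",
))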

4. Supervised Fine-Tuning (SFT) — practical recipe

The goal of SFT is to teach the base model instruction-following behaviour. Common steps:

  1. Tokenize inputs & responses using the model tokenizer; ensure special tokens for instruction boundaries if needed (a loss-masking sketch follows this list).
  2. Use causal LM or seq2seq fine-tuning depending on model architecture.
  3. Train with teacher forcing and monitor generation quality on dev set.
  4. Use validation metrics such as perplexity plus task-specific automated metrics and small human eval samples.
  5. Save checkpoints and note which examples caused regressions for later analysis.
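
As a sketch of steps 1 and 3, the snippet below tokenizes prompt and response separately and masks the prompt tokens out of the loss. This is a common but not universal SFT choice; it assumes a Hugging Face causal-LM tokenizer, and -100 is the ignore index used by transformers' cross-entropy loss.

# sketch: tokenize prompt + response and mask prompt tokens out of the loss
def build_sft_features(tokenizer, prompt: str, response: str, max_len: int = 1024):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + response_ids)[:max_len]
    # -100 marks positions the loss ignores, so only response tokens are learned
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}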

5. Preference Data Collection — design & tooling

Preference data is the core asset from which the reward model is built. Some best practices:

  • Pairwise or ranking format: Show annotators two or more candidate responses and ask which is better and why; the reason text is optional but useful.
  • Annotation guidelines: Give clear criteria (helpfulness, correctness, relevance, safety, conciseness). Calibration rounds and agreement checks are essential.
  • Annotator training: Provide examples and edge cases; run periodic calibration to prevent drift.
  • Sampling strategy: Sample prompts from production logs, synthetic hard cases, and random prompts to get a balanced dataset.
  • Quality control: Insert gold prompts, test annotators, and compute inter-annotator agreement (Cohen's kappa, Krippendorff's alpha); an agreement sketch follows this list.
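
A small sketch of the agreement check on pairwise labels, assuming scikit-learn is installed; the annotator labels below are hypothetical A/B choices.

# sketch: inter-annotator agreement on pairwise preference labels
from sklearn.metrics import cohen_kappa_score

# hypothetical labels: which candidate each annotator preferred per prompt
annotator_1 = ["A", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "B", "B", "A", "B", "A"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 0 indicate chance-level agreement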

6. Reward Model (RM) — training & architecture

The reward model's job is to predict the preference labels. Some important points:

  • Architecture: A small-to-mid size transformer-based classifier/regressor that takes prompt+response and outputs a scalar score.
  • Loss: Use a pairwise loss such as Bradley-Terry, i.e. cross-entropy over the softmaxed pair of scores. Example: if response A is preferred over B, maximize P(A > B) = exp(score_A) / (exp(score_A) + exp(score_B)), which equals sigmoid(score_A - score_B); a PyTorch sketch follows this list.
  • Regularization: Prevent overfitting to annotator idiosyncrasies — use dropout, weight decay, and early stopping.
  • Calibration: Calibrate scores across different prompt distributions and annotator populations if needed.
  • Evaluation: Hold-out preference test set; measure accuracy of predicted pairwise rankings and correlation with human ratings.
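
A minimal PyTorch sketch of that pairwise objective, assuming the reward model returns one scalar per prompt+response; the tensor and model names are placeholders.

# sketch: Bradley-Terry style pairwise loss for a reward model
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """score_chosen / score_rejected: shape (batch,) scalar scores from the RM.
    Minimizing this maximizes P(chosen > rejected) = sigmoid(score_chosen - score_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# hypothetical usage inside a training step
# scores_chosen = reward_model(prompt_plus_chosen_ids)      # (batch,)
# scores_rejected = reward_model(prompt_plus_rejected_ids)  # (batch,)
# loss = pairwise_reward_loss(scores_chosen, scores_rejected)
# loss.backward(); optimizer.step()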

7. RL Optimization (PPO) — conceptual overview

PPO and other policy-gradient algorithms fine-tune the policy (the LLM) to maximize reward by generating outputs that score higher according to the reward model. High-level steps:

  1. Initialize policy: Start with SFT checkpoint as policy model to provide reasonable starting behavior.
  2. Rollouts: For a batch of prompts, generate K candidate responses (or sample multiple token-level rollouts).
  3. Reward computation: Score each generated response using the reward model.
  4. PPO update: Compute policy gradients with the clipped objective, and include a KL penalty to control deviation from the reference policy (the SFT checkpoint); a reward-shaping sketch follows this list.
  5. Repeat: Iterate with fresh rollouts, monitor reward, and run safety/regression checks frequently.
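
As a sketch of how the reward-model score and the KL term are commonly combined, here is one sequence-level formulation; exact formulations differ across implementations, and the coefficient and names are assumptions.

# sketch: combine RM score with a KL penalty against the frozen reference (SFT) policy
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_reference: torch.Tensor,
                  kl_coeff: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) sequence-level reward from the reward model.
    logprobs_*: (batch, seq_len) log-probabilities of the generated tokens.
    Returns a sequence-level reward penalized by a simple KL estimate."""
    kl_per_token = logprobs_policy - logprobs_reference   # per-token KL estimate under policy samples
    kl_seq = kl_per_token.sum(dim=-1)                      # (batch,)
    return rm_score - kl_coeff * kl_seq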

8. PPO practical considerations & hyperparameters

PPO training is sensitive; practical tips:

  • KL-constraint: Use a KL penalty or trust-region to prevent catastrophic policy drift. Typical approach: add KL regularizer against SFT policy with coefficient beta and tune it.
  • Reward scaling: Normalize rewards across batch to stabilize updates.
  • Batch sizes and epochs: Moderate batch sizes and multiple epochs per batch can be used, but monitor overfitting to reward model quirks.
  • Advantage estimation: Use generalized advantage estimation (GAE) for variance reduction (a GAE sketch follows this list).
  • Safety guardrails: Clip gradients, use reward model ensembles, and include rejection templates so policy learns to refuse unsafe requests.
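
A compact sketch of GAE over per-step rewards and value estimates for one rollout; the gamma and lam values below are common defaults, not prescriptions.

# sketch: generalized advantage estimation (GAE) for a single trajectory
import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """rewards, values: (T,) per-step rewards and value estimates for one rollout.
    Returns advantages of shape (T,)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # treat the end of the rollout as terminal
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages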

9. Avoiding reward hacking & reward model weaknesses

Reward models can be gamed: the policy learns shortcuts that inflate reward without being genuinely better. Mitigations:

  • Adversarial evaluation: Search for prompts where high-reward outputs are actually low-quality; add to preference labeling.
  • Ensemble RMs: Use multiple reward models trained on diverse annotator pools to reduce single-model bias.
  • Regular human audits: Periodically sample high-reward outputs and have humans rate them for real-world quality.
  • Conservative reward shaping: Penalize overly long or repetitive outputs; include explicit penalties for hallucinations if detectable (a toy shaping example follows this list).
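
A toy example of such conservative shaping, subtracting a length penalty from the RM score; the token budget and coefficient are arbitrary illustrations.

# sketch: penalize overly long outputs on top of the reward-model score
def length_shaped_score(rm_score: float, num_tokens: int,
                        max_free_tokens: int = 256, length_coeff: float = 0.01) -> float:
    """Subtract a small penalty per token beyond a free budget."""
    overflow = max(0, num_tokens - max_free_tokens)
    return rm_score - length_coeff * overflow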

10. Safety, refusal behavior, and policy constraints

The goal of alignment is not only helpfulness but also safe behavior. Implement these practices:

  • Refusal dataset: Provide many examples where the correct action is to refuse or ask for clarification.
  • Safety filters: Pre- and post-generation filters for toxic content, PII leakage, and illicit instructions (a gating sketch follows this list).
  • Human-in-the-loop for edge cases: Route high-risk queries to human reviewers and use those interactions to improve models.
  • Explainability: Log reasons for refusals and provide users with clear messaging when the model refuses.
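
A minimal post-generation gating sketch that also logs a refusal reason; the upstream classifier scores and the threshold are hypothetical.

# sketch: post-generation safety gate with a logged refusal reason
def gate_response(response: str, toxicity_score: float, contains_pii: bool,
                  toxicity_threshold: float = 0.8):
    """Return (text_to_show, refusal_reason or None); scores come from upstream classifiers."""
    if contains_pii:
        return "I can't share that because it may expose personal information.", "pii"
    if toxicity_score >= toxicity_threshold:
        return "I can't help with that request.", "toxicity"
    return response, None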

11. Evaluation strategy — automated + human

Evaluation must be multi-dimensional:

  • Automated tests: task-specific metrics, safety classifier pass rates, perplexity, and regression suites.
  • Human evaluations: pairwise preference tests, Likert-scale ratings for helpfulness/truthfulness, and scenario-based checks (a win-rate helper follows this list).
  • Longitudinal monitoring: check for performance drift on production prompts over time.
  • Coverage tests: ensure model performs across languages, dialects, and demographic groups representative of users.
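
For the pairwise human evaluations, a small helper that summarizes judgments into a win rate; counting ties as half a win is one reasonable convention among several.

# sketch: win rate of a candidate model vs a baseline from pairwise human judgments
def win_rate(judgments):
    """judgments: iterable of "candidate", "baseline", or "tie" per comparison."""
    judgments = list(judgments)
    wins = sum(1.0 for j in judgments if j == "candidate")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

print(win_rate(["candidate", "tie", "baseline", "candidate"]))  # 0.625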

12. Tools & infra — recommended stack

  • Annotation & labeling: Prolific/Scale/Label-studio/custom UI with gold checks and annotator dashboards.
  • Experiment tracking: Weights & Biases or MLflow for SFT, RM, and RL experiments.
  • Model training infra: Accelerate / Deepspeed / TorchElastic for distributed training; use bitsandbytes for k-bit memory reductions.
  • RL libs: Implement PPO with libraries or custom code; use stable-baselines style patterns adapted for autoregressive LMs.
  • Serving: Triton/TorchServe/custom FastAPI with rate limiting and canary rollout support.
  • Monitoring: Prometheus + Grafana for latency/throughput, plus user feedback dashboards for quality metrics.

13. Practical code snippets — simplified pseudo-examples

Below are simplified pseudo-code steps; production use requires adaptation and safety checks:

# 1) Supervised fine-tuning (SFT) sketch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("base-model")   # replace with your base checkpoint
model = AutoModelForCausalLM.from_pretrained("base-model")

# prepare the instruction dataset (prompt -> response) as tokenized input_ids + labels
# training_args = TrainingArguments(output_dir="sft-out", num_train_epochs=3,
#                                   per_device_train_batch_size=8)
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()

# 2) Reward model training (pairwise)
# reward_model inputs: prompt + response => scalar score
# for each labeled pair (chosen, rejected):
#     loss = -logsigmoid(score(prompt, chosen) - score(prompt, rejected))

# 3) PPO loop (conceptual)
# for prompts in dataloader:
#     gen_responses = policy.generate(prompts, do_sample=True)
#     rewards = reward_model.score(prompts, gen_responses)
#     rewards = rewards - kl_coeff * kl_divergence(policy, reference_policy, gen_responses)
#     advantages = compute_advantages(rewards, values)
#     ppo_update(policy, gen_responses, advantages)

14. Data governance, privacy & compliance

Preference and instruction datasets often contain user data. Necessary steps:

  • Collect consent and keep audit logs of which annotator/user each piece of data came from.
  • Apply PII removal and redaction pipelines.
  • Set up access control, encryption at rest and in transit, and retention policies.
  • Check license compliance for base models and datasets.

15. Cost & resource planning

RLHF costs include both compute and human labeling. Some planning tips:

  • Start with modest RM and small RL pilots before scaling to large models.
  • Use active learning to prioritize labeling for examples the RM/policy disagrees on.
  • Estimate human-labeling cost per comparison and budget for calibration rounds.
  • Use PEFT and k-bit tricks to reduce the GPU footprint of policy updates where possible (a LoRA sketch follows this list).
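
A sketch of wrapping the policy in LoRA adapters with the peft library; the target module names depend on the base architecture and are an assumption here.

# sketch: LoRA adapters to shrink the trainable footprint of policy updates
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # depends on the base model architecture
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)   # only adapter weights are trained
# model.print_trainable_parameters()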

16. Regression & rollout strategy

A conservative rollout is essential for productionization:

  • Canary deployments: Expose the model to a small percentage of users and watch real-world metrics.
  • Automated regression tests: Ensure the model matches baseline performance on core capabilities.
  • Rollback plan: Keep a delta-weight strategy and versioned deployments for quick rollback.

17. Common pitfalls & troubleshooting

  • Over-optimization on RM: If policy improves RM score but human ratings fall, collect more human labels and refine RM.
  • Poor annotator quality: leads to noisy RM — employ qualification tests and gold checks.
  • Catastrophic forgetting: Mix in SFT examples or use a rehearsal buffer to maintain base capabilities.
  • Mode collapse / repetitive outputs: add diversity penalties or explicit diversity objectives.

18. Example project plan (8-12 weeks)

  1. Week 1-2: Define instruction taxonomy, collect SFT dataset, build labeling UI.
  2. Week 3-4: SFT experiments and validation; deploy SFT checkpoint for internal testing.
  3. Week 5-6: Collect preference labels, train reward model, run audits.
  4. Week 7-8: Small-scale PPO runs, safety evaluation, and human-in-the-loop testing.
  5. Week 9-10: Canary rollout, monitor, collect production preferences and iterate.
  6. Week 11-12: Scale labeling & RL if metrics justify, and set up continuous retraining pipelines.

19. Assignments & hands-on exercises

  1. Prepare an instruction dataset of 1k high-quality pairs for a domain and run SFT; evaluate vs zero-shot.
  2. Collect 500 pairwise preference labels on SFT outputs, train a simple reward model, and report RM accuracy on held-out pairs.
  3. Run a small PPO loop on a distilled policy for a narrow domain and measure human-rated helpfulness before and after RLHF.
  4. Design and run adversarial tests to find reward-hacking examples; propose fixes based on findings.

20. Conclusion — practical takeaways

Together, Instruction Tuning and RLHF are the most effective way to bring LLMs to real-world, user-aligned behaviour. Success, however, requires high-quality human data, robust reward models, careful RL training with KL-penalty constraints, and thorough human evaluation. Start small, use conservative rollout strategies, and invest in monitoring and governance so that alignment remains sustainable and auditable.

--- End ---