Large Language Models (LLMs) Basics in Hindi — Transformers, Tokenization, Inference, RAG

🧠 Large Language Models (LLMs) Basics — Complete Hindi Guide

Large Language Models are a class of models that can understand and generate natural language. They are pre-trained on web-scale corpora and then fine-tuned for a variety of uses. This guide goes from zero to one, from core concepts to practical use, so that you can confidently build LLM-based text systems.

1) What Is an LLM and Why Does It Matter?

An LLM is a probabilistic sequence model that predicts the next token from the given context.
During pre-training it learns next-token prediction on massive amounts of unlabeled text.
In the process it implicitly picks up general world knowledge, syntax, semantics, and reasoning patterns.
Downstream tasks: question answering, summarization, translation, code generation, information extraction, creative writing, agents, and more.
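
To make the next-token idea concrete, here is a minimal sketch. It is not how a real LLM is built: it is a toy bigram model that estimates P(next token | previous token) from counts. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Hypothetical toy corpus, already split into word-level tokens
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" gets probability 2/3, "mat" gets 1/3
```

An LLM replaces the count table with a neural network conditioned on the whole preceding context, but the output has the same shape: a probability distribution over the next token.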

2) Tokenization Basics

LLMs do not operate directly on raw characters; they operate on tokens. Modern tokenizers are subword-based, for example BPE or WordPiece. Their goal is to keep the vocabulary size manageable while still covering out-of-vocabulary words by breaking them into subword units.

# Hugging Face style example (double quotes used throughout)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("unbelievable"))   # prints a subword split along the lines of ["un", "bel", "ievable"]; the exact pieces depend on the vocabulary

For Hindi, English, and bilingual datasets, SentencePiece-based tokenization tends to work well.
Tokenization quality affects both throughput and latency because inference runs one token at a time: the more tokens a tokenizer needs for the same text, the more decoding steps the model performs.
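
The subword idea can be sketched with a toy version of BPE training. This is a simplified illustration, assuming character-level starting symbols and a tiny hand-picked word list; production tokenizers add byte-level handling, word-boundary markers, and many more merges. All function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new symbol
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["low", "lower", "lowest", "low"], num_merges=2)
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

The unseen word "lowly" is still representable: the frequent piece "low" becomes one token and the rest falls back to smaller units.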

3) Transformer Architecture: The Magic of Attention

The Transformer, introduced as an encoder-decoder architecture, pushed RNNs and LSTMs out of the mainstream. LLMs are typically decoder-only Transformers that generate text autoregressively.

Self-Attention: each token attends to the other tokens in the context to build a contextual representation.
Positional Encoding: sinusoidal or learned position embeddings supply word-order information.
Residual Connections, LayerNorm, MLP: these building blocks are essential for keeping deep training stable.
KV Cache: caching the keys and values of earlier tokens during inference significantly reduces latency on long sequences.
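
A minimal sketch of single-head causal self-attention in plain Python may help here. It assumes the queries, keys, and values are given directly; a real Transformer derives them from learned linear projections and runs many heads in parallel. The input vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Scaled dot-product attention where token i only sees tokens 0..i.

    q, k, v are lists of vectors (one per token). Returns (outputs, weights);
    weights[i][j] is how much token i attends to token j.
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Scores against the current and earlier positions only (causal mask)
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        out = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        weights.append(w + [0.0] * (len(q) - i - 1))  # masked positions get weight 0
        outputs.append(out)
    return outputs, weights

# Hypothetical 3-token sequence with 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(p, 3) for p in row])
```

Because of the causal mask, the keys and values of earlier tokens never change when a new token is appended, which is the property a KV cache relies on.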

4) Pre-training Objectives

Decoder-only LLMs are trained with next-token prediction. Encoder-decoder models (such as the T5 family) often learn masked span corruption instead. In pre-training, the curriculum, deduplication, data quality, and domain diversity are all critical. High-quality data tends to yield better factuality, lower toxicity, and more robust generalization.
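
As an illustration of span corruption, the sketch below builds one T5-style (input, target) pair. The spans are chosen by hand here, whereas real pre-training samples them at random; the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer, and the function name is hypothetical.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    `spans` is a sorted, non-overlapping list of (start, end) index pairs
    with `end` exclusive. Each span is replaced by a sentinel in the input
    and spelled out after the same sentinel in the target.
    """
    inp, tgt, pos = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inp.extend(tokens[pos:start])   # keep the text before the span
        inp.append(sentinel)            # hide the span itself
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])   # the model has to reconstruct this
        pos = end
    inp.extend(tokens[pos:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```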

5) Fine-tuning Methods

Full fine-tuning: training the entire model on task-specific data. Demands heavy compute.
Parameter-efficient fine-tuning (PEFT): methods such as LoRA, LoRA-plus, Prefix/Prompt Tuning, and adapters, which need less GPU memory and adapt faster.
Instruction Tuning: teaching the model helpful behavior on curated instruction-response pairs.
Domain Adaptation: PEFT is very useful on enterprise or legal/medical domain corpora.

# High-level sketch of LoRA fine-tuning (demonstration only)
# 1) Load a base causal LM
# 2) Attach LoRA adapters
# 3) Train on an instruction dataset
# 4) Save only the adapters; the base weights stay shared
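
The arithmetic behind step 2 can be sketched without any deep learning library. The snippet below assumes one linear layer with a frozen weight W and a trainable low-rank update scaled by alpha / r, as in the LoRA formulation; the layer size, rank, and alpha are arbitrary illustration values.

```python
import random

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    r = len(A)                           # rank of the update
    base = matvec(W, x)                  # frozen path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r = 64, 64, 4               # hypothetical layer size and rank
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]    # zero init, so the adapter starts as a no-op

x = [1.0] * d_in
h = lora_forward(x, W, A, B, alpha=8)

full_params = d_in * d_out               # what full fine-tuning would update
lora_params = r * (d_in + d_out)         # what LoRA updates instead
print(full_params, lora_params)          # 4096 512
```

Only A and B need to be saved per task, which is why one set of base weights can be shared across many adapters.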

6) RLHF: Alignment from Human Feedback

Supervised instruction tuning alone is not enough. An RLHF pipeline typically runs in three stages: (1) SFT, i.e. supervised fine-tuning, (2) training a preference model or reward model, (3) policy optimization, where the model is optimized toward the desired outputs. The goal is to align the model toward helpful, honest, and harmless behavior.
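
For stage (2), the guide does not name a loss. A common choice, assumed here, is a pairwise Bradley-Terry style objective that pushes the reward of the human-preferred response above the reward of the rejected one. The scalar rewards below are made-up numbers standing in for reward-model outputs.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = r_chosen - r_rejected
    # Equivalent to log(1 + exp(-diff)) without overflow for large |diff|
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Hypothetical reward scores for (chosen, rejected) response pairs
print(round(pairwise_reward_loss(2.0, 0.0), 4))  # small loss: ranking is correct
print(round(pairwise_reward_loss(0.0, 0.0), 4))  # log(2): model cannot tell them apart
print(round(pairwise_reward_loss(0.0, 2.0), 4))  # large loss: ranking is inverted
```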

7) Inference: Decoding Strategies

Greedy decoding: picks the most probable token at every step; often repetitive.
Beam search: keeps several candidate hypotheses; useful for factual tasks, but with less diversity.
Sampling: temperature, top-k, and top-p (nucleus) sampling control creativity and diversity.
Repetition penalty and no-repeat n-gram: used to prevent loops and verbatim repetition.
KV Cache: critical for throughput with long prompts and streaming output.

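from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
torch.manual_seed(0)  # make the sampled output reproducible
# GPT-2 is trained mostly on English text, so an English prompt works best
inp = tok("Explain the importance of LLMs:", return_tensors="pt")
out = model.generate(
    **inp,
    max_new_tokens=160,             # limits generated tokens only; max_length would also count the prompt
    do_sample=True, temperature=0.8, top_p=0.9,
    pad_token_id=tok.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
)
print(tok.decode(out[0], skip_special_tokens=True))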
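
What the `temperature` and `top_p` arguments do can be sketched in plain Python. This is an illustrative re-implementation under simplifying assumptions (one decoding step, a four-token vocabulary with made-up logits), not the library's internal code.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample a token id using temperature scaling and nucleus (top-p) filtering."""
    # Temperature rescales logits: below 1 sharpens, above 1 flattens the distribution
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus; choices() normalizes the weights itself
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

logits = [4.0, 3.5, 1.0, -2.0]   # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_top_p(logits, rng=rng) for _ in range(1000)]
print({t: samples.count(t) for t in sorted(set(samples))})
```

With these logits the two most likely tokens already cover more than 90% of the probability mass, so tokens 2 and 3 are never sampled.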

8) Evaluation: Is the Model Actually Good?

LLM evaluation is multi-dimensional: intrinsic metrics (perplexity), task benchmarks (QA, reasoning), factuality checks, toxicity/bias audits, and human evaluation. In enterprise settings, red-teaming and safety evaluations are essential.

Always cross-verify automatic metrics against human judgments.
Design domain-specific eval suites that match your real users.
For long-context tasks, use retrieval-augmented probes.
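
Perplexity, the intrinsic metric mentioned above, is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have the probability the model assigned to each true token (the numbers below are invented):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical probabilities assigned to the correct next tokens
confident = [math.log(p) for p in (0.9, 0.8, 0.95)]
unsure = [math.log(p) for p in (0.25, 0.25, 0.25)]
print(round(perplexity(confident), 3))
print(round(perplexity(unsure), 3))  # 4.0, like guessing uniformly among 4 tokens
```

Lower is better, but perplexity only measures how well the model predicts text; it says nothing about factuality or safety, which is why the other evaluation dimensions are needed.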

9) Hallucinations and Mitigation

A hallucination is when the model presents incorrect facts in a confident tone. Ways to reduce it:

RAG (Retrieval-Augmented Generation): adding grounded context from an external knowledge base.
Tool use: verification through calculators, code interpreters, and search.
Constrained decoding: schemas, regex, or function-calling for structured outputs.
Prompt discipline: clear instructions, citations, step-by-step checking.
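
Constrained decoding can be illustrated with a toy greedy decoder that masks every token the output format does not allow. The "model" below is a hypothetical stand-in that returns fixed scores, and `</s>` is an assumed end-of-sequence marker; real systems apply the same masking idea to logits, driven by a grammar or JSON schema.

```python
def constrained_greedy(step_scores, allowed_sequences):
    """Greedy decoding restricted to a fixed set of allowed token sequences.

    step_scores: function(prefix) -> dict mapping token -> score
    allowed_sequences: list of token lists; the output must match one of them
    """
    prefix = []
    while True:
        # Tokens that keep the prefix consistent with at least one allowed sequence
        candidates = {seq[len(prefix)] for seq in allowed_sequences
                      if seq[:len(prefix)] == prefix and len(seq) > len(prefix)}
        if not candidates:
            return prefix
        scores = step_scores(prefix)
        # Everything outside `candidates` is masked; take the best remaining token
        prefix.append(max(candidates, key=lambda t: scores.get(t, float("-inf"))))

def fake_model(prefix):
    """Hypothetical model that would rather answer "maybe" if left unconstrained."""
    return {"maybe": 5.0, "yes": 2.0, "no": 1.0, "</s>": 0.5}

allowed = [["yes", "</s>"], ["no", "</s>"]]
print(constrained_greedy(fake_model, allowed))  # ['yes', '</s>']
```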

10) RAG: Practical System Design

Ingest documents: PDFs, web pages, databases.
Chunking: use overlapping windows to improve retrieval quality.
Embeddings: sentence-level vectorization; choose suitable models for multilingual settings.
Vector store: approximate nearest neighbor indexing (such as HNSW).
Query pipeline: retriever → ranker → context composer.
Generator: an LLM with a system prompt and citation formatting.
Evaluation: groundedness, faithfulness, latency, cost.
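
The chunking and retrieval steps can be sketched end to end with the standard library. Two loud assumptions: the "embedding" is a bag-of-words count vector standing in for a real embedding model, and the search is exact rather than an ANN index such as HNSW. The document text and all function names are invented.

```python
import math
from collections import Counter

def chunk_words(text, size=12, overlap=4):
    """Split text into windows of `size` words; neighbors share `overlap` words.

    Assumes size > overlap.
    """
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (exact search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("LoRA adds small low-rank adapter matrices to a frozen model. "
       "A KV cache stores past keys and values so decoding is faster. "
       "RAG retrieves passages and passes them to the generator as context.")
chunks = chunk_words(doc, size=12, overlap=4)
top = retrieve("how does the kv cache store keys and values", chunks, k=1)
print(top[0])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.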

11) Cost, Latency, and Scaling

The longer the context, the higher the compute cost; use context pruning and summaries.
Batching raises throughput; KV cache reuse, speculative decoding, and distillation reduce latency.
PEFT enables many-task personalization with small adapters.
Monitoring: token throughput, error rates, safety triggers, cache hit-rate.
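
One concrete way to see how cost grows with context length is to estimate KV cache memory. The sketch below assumes standard multi-head attention (no grouped-query sharing) and fp16 storage, and plugs in the GPT-2 small shape of 12 layers and 12 heads of size 64.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Memory for cached keys and values: two tensors per layer, one vector per token."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_value

# GPT-2 small shape, assuming fp16 (2 bytes per value)
for seq_len in (256, 1024):
    mib = kv_cache_bytes(12, 12, 64, seq_len) / 2**20
    print(seq_len, "tokens ->", mib, "MiB")
```

The cache grows linearly with both sequence length and batch size, so long contexts and large batches compete for the same GPU memory.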

12) Safety, Policy, and Governance

Responsible AI frameworks should include safety classifiers, content filtering, privacy protection, sensitive data redaction, and audit trails. Put models and prompts under version control, and adopt change management with approvals.

13) Practical Prompt Engineering

Keep the roles of system, developer, and user prompts clearly separated.
Few-shot examples make the task structure clear.
Output format: fix a JSON schema or markdown tables.
Guardrails: refusal cases, safety notices, and fallback instructions.

# Few-shot prompt sketch
instruction = (
    "You are a teaching assistant. Answer in the style of the examples below."
    "\n\nExample 1:\nQ: What is tokenization?\nA: It is the process of splitting text into pieces..."
    "\n\nQuestion:\nWhy is attention important in an LLM?\nAnswer:"
)

14) Open Source vs Proprietary LLMs

Open-source models offer transparency and customization; proprietary models often offer state-of-the-art performance and managed tooling. Decision factors: data privacy, cost, compliance, latency, language coverage, and integration ecosystem.

15) Multilingual and Code-LLMs

Multilingual LLMs handle script diversity and mixed-language inputs; tokenization and the pretraining corpus are critical.
Code-LLMs need syntax-aware pretraining and format-preserving decoding.
For evaluation, use unit tests, static analysis, and execution-based scoring.
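
Execution-based scoring can be sketched as "run each generated solution against unit tests and count the passes". The candidates below are hand-written stand-ins for model outputs, and the helper name is hypothetical. Note that `exec` runs arbitrary code, so a real evaluator would isolate it in a sandbox with time and memory limits.

```python
def passes_tests(candidate_src, tests, func_name):
    """Run candidate code and check it against (args, expected) test cases.

    WARNING: exec runs arbitrary code; use a sandbox for untrusted model output.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures
        return False

# Hypothetical model outputs for the task "write add(a, b)"
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b) return a + b",   # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
results = [passes_tests(src, tests, "add") for src in candidates]
print(results, "pass rate:", sum(results) / len(results))
```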

16) Hands-on: Minimal Text Generation API

# FastAPI sketch; double quotes only
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
gen = pipeline("text-generation", model="gpt2")

@app.get("/generate")
def generate(prompt: str, max_new_tokens: int = 120):
    out = gen(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.8, top_p=0.9)
    return {"text": out[0]["generated_text"]}

In production, add authentication, rate limits, logging, prompt templates, caching, and safety filters.

17) LLM-Ops: Observability and Lifecycle

Integrate the prompt/version registry and evaluation pipelines into CI/CD.
Catch regressions with offline replay datasets.
Feedback loops: thumbs up/down, flagged outputs, and human-in-the-loop review.
Shadow deployments and canary releases reduce risk.

18) Typical Failure Modes

Prompt leakage: unintended disclosure of internal instructions. Mitigations: strict parsing and separation.
Prompt injection: hostile instructions in user input. Mitigations: content validation, allow-lists, tool-use mediation.
Context overflow: truncation caused by long prompts. Mitigations: summarization and retrieval windows.
Repetition and mode collapse. Mitigations: decoding constraints and penalties.

19) Study Plan: 30 Days to LLM Pro

Days 1–5: Tokenization, attention, transformer blocks.
Days 6–10: Pre-training objectives, data quality, evaluation basics.
Days 11–15: HF ecosystem, inference APIs, decoding strategies.
Days 16–20: PEFT/LoRA, instruction datasets, safety.
Days 21–25: RAG, vector databases, retrieval pipelines.
Days 26–30: Full app: chat, function calling, monitoring, cost control.

20) Mini Project: Knowledge-Grounded QnA Bot

Ingest company docs and build embeddings.
Build a retriever and take the top-k passages.
Set style guidelines in the system prompt.
Have the LLM generate grounded answers and add citations.
Evaluate groundedness and latency; add caching and rate limits.
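
The grounding and citation step could look roughly like the prompt builder below. The passage structure (`text` and `source` keys), the wording of the instruction, and the example passages are all assumptions made for illustration; adapt them to whatever your retriever returns.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that asks the model to answer only from numbered passages."""
    context = "\n".join(
        f"[{i}] {p['text']} (source: {p['source']})"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "Cite passages by number, for example [1], and reply \"I don't know\" "
        "if the passages do not contain the answer.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved passages
passages = [
    {"text": "Employees accrue 1.5 vacation days per month.", "source": "hr_policy.pdf"},
    {"text": "Unused vacation days expire after 18 months.", "source": "hr_faq.html"},
]
prompt = build_grounded_prompt("How many vacation days do I earn per year?", passages)
print(prompt)
```

Numbering the passages in the prompt makes it straightforward to check afterwards whether each cited number actually supports the sentence that cites it, which feeds directly into the groundedness evaluation.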

21) Glossary

Attention: context-aware weighting mechanism.

KV Cache: stores past keys/values for faster inference.

LoRA: low-rank adaptation for parameter-efficient tuning.

RAG: retrieval-augmented generation for grounded answers.

RLHF: alignment from human feedback.

Top-p: nucleus sampling for diversity.

22) Conclusion

LLMs are the backbone of modern generative AI. With the right data, strong evaluation, safe deployment, and thoughtful prompt engineering, you can build high-quality text systems. The next step is hands-on projects: instruction tuning, a RAG-based chatbot, and domain-specific assistants.