Word Embeddings in NLP (Word2Vec, GloVe, FastText)

🧠 Word Embeddings in NLP

When we want to feed raw text into machine learning models in NLP, we first have to convert the text into a numerical representation (vectors). Simple methods such as Bag of Words (BoW) or TF-IDF capture only how frequent or how important a word is; they do not capture semantic meaning. This is where word embeddings come in.
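To make the limitation of count-based vectors concrete, here is a minimal Bag of Words sketch in plain Python. The two sentences and the helper `bow_vector` are made up for illustration; they are not taken from any library.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Return a Bag-of-Words count vector for `tokens` over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Two toy sentences with nearly the same meaning (illustrative data only).
doc_a = ["the", "movie", "was", "good"]
doc_b = ["the", "film", "was", "great"]

# Vocabulary: every distinct word seen in the two documents, in sorted order.
vocab = sorted(set(doc_a) | set(doc_b))

vec_a = bow_vector(doc_a, vocab)
vec_b = bow_vector(doc_b, vocab)

# The dot product only counts shared surface forms ("the", "was").
# "movie"/"film" and "good"/"great" live in unrelated dimensions.
overlap = sum(a * b for a, b in zip(vec_a, vec_b))

print(vocab)
print(vec_a)
print(vec_b)
print("overlap:", overlap)
```

The only overlap between the two vectors comes from the function words, even though the sentences say almost the same thing. Embeddings are meant to close that gap.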

Word embeddings are continuous (dense) vector representations that preserve semantic similarity and context between words. For example, 'king' and 'queen' will lie close together in the embedding space because their meanings are related. Word2Vec, GloVe and FastText are three of the most popular techniques.
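"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; the numbers are assumptions, not values from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (real ones typically have 100-300 dimensions).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, the related pair scores much higher than the unrelated one, which is the behaviour a trained embedding model is expected to show.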

1️⃣ Word2Vec

Word2Vec (by Google, 2013) is a neural-network-based model. It maps words into a dense vector space, typically with a few hundred dimensions. It has two main architectures:

CBOW (Continuous Bag of Words): predicts the target word from the surrounding context words.
Skip-gram: predicts the surrounding context words from a single target word.

      Example: "The cat sits on the mat"
      Skip-gram task (window size 3): Given "cat", predict ["The", "sits", "on", "the"]
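The training examples for both architectures can be generated with a sliding window. This is a minimal sketch: the helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and a window of 3 is assumed so that the output matches the example above. Real implementations add further tricks (such as negative sampling and subsampling of frequent words) that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every context word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, target) examples: the whole context predicts one word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "The cat sits on the mat".split()

# Skip-gram: one training pair per (target, context word).
cat_contexts = [ctx for tgt, ctx in skipgram_pairs(tokens, window=3) if tgt == "cat"]
print(cat_contexts)

# CBOW: the same window, but the context words jointly predict the target.
cbow = cbow_examples(tokens, window=3)
print(cbow[1])
```

Skip-gram produces several pairs per position while CBOW produces one example per position, which is one reason CBOW tends to train faster on the same corpus.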

🔹 Advantages of Word2Vec

Captures semantic relationships between words.
Trains efficiently on large datasets.
Supports analogies through vector arithmetic: "king - man + woman ≈ queen"
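The analogy works by vector arithmetic followed by a nearest-neighbour search. Below is a minimal sketch with hand-made 2-dimensional vectors (roughly "royalty" and "gender" axes). The values and the `analogy` helper are assumptions chosen for illustration, so the result is exact here, whereas a trained model only gives an approximate match.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy embeddings: [royalty, gender].
toy = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, -0.9],
    "man":    [0.1, 0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.8, 0.8],
    "apple":  [-0.5, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

Excluding the three input words from the candidates is standard practice; otherwise the nearest neighbour is often one of the inputs themselves.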

2️⃣ GloVe (Global Vectors for Word Representation)

GloVe (by Stanford) is a count-based approach closely related to matrix factorization. It builds a word-word co-occurrence matrix over the whole corpus and fits word vectors to the statistics in that matrix.
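For reference, the weighted least-squares objective from the GloVe paper can be written as follows. The notation is the paper's, not this article's: $X_{ij}$ is how often word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. The sum effectively runs over nonzero counts, since $f(0) = 0$.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function $f$ keeps very frequent pairs from dominating the loss; the paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.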

      Example:
      If "ice" co-occurs more often with "solid" and "steam" co-occurs more often with "gas",
      GloVe captures these differences in co-occurrence frequency as relationships in the vector space.
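The first step, counting co-occurrences, can be sketched in a few lines. The four-sentence corpus and the helper `cooccurrence_counts` are made up for illustration. The GloVe paper also down-weights distant pairs (a pair d words apart contributes 1/d); that weighting is left out here to keep the counts easy to read.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """Count how often each (word, context) pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts

# Toy corpus (illustrative only).
corpus = [
    ["ice", "is", "a", "cold", "solid"],
    ["steam", "is", "a", "hot", "gas"],
    ["ice", "is", "solid", "water"],
    ["steam", "is", "water", "gas"],
]

counts = cooccurrence_counts(corpus, window=4)

for probe in ["solid", "gas", "water"]:
    print(probe, counts[("ice", probe)], counts[("steam", probe)])
```

In this toy corpus "solid" separates "ice" from "steam", "gas" does the opposite, and "water" co-occurs equally with both, so it says nothing about the difference between them. GloVe is trained so that the vectors reflect exactly these kinds of contrasts.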

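Uses global statistical information from the whole corpus.
Training is fast because it operates on aggregated co-occurrence counts rather than on every context window, though the co-occurrence matrix itself can be large for big vocabularies.
Pre-trained embeddings are available (trained on Wikipedia and Common Crawl data).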

3️⃣ FastText

FastText (by Facebook AI Research) is an extension of Word2Vec. It breaks each word into character n-grams and learns a vector for every n-gram. Its biggest advantage is that it can also represent OOV (Out-of-Vocabulary) words by composing them from their n-grams.

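      Example:
      Word: "playing" (wrapped with boundary markers as "<playing>")
      Character 3-grams: "<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"
      Representation = sum of subword vectors (plus a vector for the whole word)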
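Extracting the character n-grams is straightforward. The sketch below follows the scheme described in the FastText paper (boundary markers `<` and `>`, n-gram lengths from 3 to 6); the function name `char_ngrams` is hypothetical and this is not the library's own code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# Only 3-grams, to keep the output short.
trigrams = char_ngrams("playing", min_n=3, max_n=3)
print(trigrams)

# Full range 3..6, as in the paper's default setting.
all_ngrams = char_ngrams("playing")
print(len(all_ngrams), "n-grams in total")
```

The boundary markers let the model tell a prefix such as "<pl" apart from the same letters in the middle of a word, and a suffix such as "ng>" apart from a word-internal "ng".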

Handles rare and misspelled words.
Well suited to languages with rich morphology.
Addresses the OOV problem.
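To illustrate how an unseen word can still get a vector, here is a toy sketch that sums the vectors of the n-grams it shares with known words. Everything in it is an assumption for illustration: the 2-dimensional n-gram vectors are hand-assigned rather than learned, and unseen n-grams are simply skipped, whereas the real FastText hashes n-grams into a fixed number of buckets so that every n-gram maps to some vector.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word` with '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

# Hypothetical n-gram vectors, as if learned from the words "playing" and "walked".
ngram_vectors = {g: [1.0, 0.0] for g in char_ngrams("playing")}
ngram_vectors.update({g: [0.0, 1.0] for g in char_ngrams("walked")})

def oov_vector(word, table, dim=2):
    """Sum the vectors of the word's known n-grams; return None if none are known."""
    known = [table[g] for g in char_ngrams(word) if g in table]
    if not known:
        return None
    return [sum(vec[d] for vec in known) for d in range(dim)]

# "walking" was never seen, but shares "<wa", "wal", "alk" with "walked"
# and "ing", "ng>" with "playing".
vec = oov_vector("walking", ngram_vectors)
print(vec)

# A word with no overlapping n-grams gets no vector in this simplified version.
print(oov_vector("xyz", ngram_vectors))
```

The resulting vector mixes information from both known words, which is how subword models carry over stems and suffixes to new word forms.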

📊 Comparison of Word2Vec, GloVe and FastText

Method	Approach	Strength	Weakness
Word2Vec	Neural Network (CBOW, Skip-gram)	Captures local context well	Does not use global co-occurrence statistics
GloVe	Matrix Factorization	Uses global statistics	Static vectors; no vector for OOV words
FastText	Subword Information	Handles OOV words	Larger models and heavier training

💡 Practical Use Cases of Word Embeddings

Sentiment Analysis
Machine Translation
Text Classification
Question Answering Systems
Chatbots
Semantic Search

⚡ Example with Gensim (Python)

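      from gensim.models import Word2Vec

      # Toy corpus: each sentence is a list of lowercase tokens
      sentences = [["i", "love", "nlp"], ["nlp", "is", "fun"]]
      # sg=1 selects Skip-gram (sg=0 would use CBOW); min_count=1 keeps every word
      # vector_size is the Gensim 4.x parameter name (older releases used size)
      model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
      print(model.wv["nlp"])  # 100-dimensional vector for "nlp"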

Word embeddings have revolutionized NLP. We can now move beyond simple Bag of Words and generate vectors that capture much of a word's meaning and context. If you are working on NLP projects, a solid understanding of Word2Vec, GloVe and FastText is essential.
