📘 Sentiment Analysis Project — Part 1: Data and Preprocessing
This part covers the foundation of any successful sentiment analysis project: data collection, exploratory data analysis, and a robust, production-aware preprocessing pipeline. A well-designed preprocessing stage reduces noise, preserves important signals such as negation, emojis, and punctuation, and produces stable input for both classical and deep learning models.
1. Project Scope and Problem Definition
Decide on the target formulation early: binary classification (positive versus negative), ternary classification (positive, neutral, negative), or fine-grained regression on star ratings. Business use cases include product review analytics, social media monitoring, support ticket triage, and market research. Define success metrics: accuracy, macro F1, precision for the positive class, or business KPIs such as escalation rate reduction.
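If scikit-learn is part of the stack, these metrics can be reported in a few lines. A minimal sketch follows; the y_true and y_pred lists are placeholder values for illustration, not real results.

from sklearn.metrics import classification_report, f1_score

# Hypothetical gold labels and model predictions, for illustration only
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

# Macro F1 weights every class equally, which matters when classes are imbalanced
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Per-class precision, recall and F1, including the positive class
print(classification_report(y_true, y_pred, zero_division=0))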
2. Data Sources and Collection
Common datasets and data sources:
- IMDB Movie Reviews for binary sentiment experiments
- Twitter corpora such as Sentiment140 and SemEval for social media signals
- Amazon and Yelp review datasets for product and business sentiment
- Company internal logs, support tickets, and feedback forms for domain specific models
Data collection tips: collect both text and metadata (user id, timestamp, product id). For social media, handle rate limits and terms of service. Store raw data immutably so it can be reprocessed whenever the pipeline changes.
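One minimal pattern for the raw layer, assuming newline-delimited JSON files (the file name and field names below are illustrative), is to append each collected record with its metadata and never rewrite the file:

import json
import time

def append_raw_record(path, text, user_id, product_id):
    """Append one raw record with metadata; the raw file is never modified in place."""
    record = {
        "text": text,
        "user_id": user_id,
        "product_id": product_id,
        "collected_at": time.time(),
    }
    with open(path, "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# append_raw_record("raw_reviews.jsonl", "Great battery life!", "u123", "p456")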
3. Exploratory Data Analysis (EDA)
Before building models, understand class balance, text length distribution, frequent tokens, and noise patterns. EDA helps decide preprocessing choices such as keeping or removing punctuation and handling emojis.
3.1 Quick Python EDA snippet
# pandas EDA
import pandas as pd
from collections import Counter

df = pd.read_csv("sentiment_dataset.csv")  # columns: text, label
print(df.label.value_counts())

# Text length in tokens (fillna guards against missing text)
df["text_len"] = df.text.fillna("").str.split().apply(len)
print(df["text_len"].describe())

# Frequent tokens (simple)
tokens = Counter()
for t in df.text.fillna("").str.lower().str.split():
    tokens.update(t)
print(tokens.most_common(30))
Key EDA checks (a short snippet covering them follows this list):
- Class balance and whether sampling is needed
- Short text fraction (e.g., tweets under 10 tokens)
- Presence of URLs, emojis, non-latin scripts and markup
- Frequent punctuation usage and emotive tokens such as exclamation marks
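A rough sketch of these checks, assuming the same df from the EDA snippet above and the emoji package installed:

import emoji

texts = df["text"].fillna("")

# Class balance (normalized counts make imbalance obvious)
print(df["label"].value_counts(normalize=True))

# Fraction of very short texts (e.g., under 10 tokens)
print((texts.str.split().apply(len) < 10).mean())

# Presence of URLs, emojis and non-ASCII (e.g., non-Latin) characters
print(texts.str.contains(r"http\S+|www\S+", regex=True).mean())
print(texts.apply(lambda t: emoji.emoji_count(t) > 0).mean())
print(texts.str.contains(r"[^\x00-\x7F]", regex=True).mean())

# Emotive punctuation such as exclamation marks
print(texts.str.count("!").describe())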
4. Preprocessing Principles
Preprocessing should be reproducible, deterministic and versioned. Save tokenizer versions, stoplists and any normalization maps. The goal is to remove noise while preserving signals that matter for sentiment such as negation, intensifiers, emojis and punctuation.
4.1 Design choices matrix
- Lowercasing: helpful for many tasks; preserve case for NER or case-sensitive signals
- Remove or map URLs: replace with <URL> token to preserve the signal of a link
- Emojis: map to textual tokens or sentiment lexicon (e.g., ":smile:", ":cry:")
- Negation: preserve tokens like "not" and handle scopes carefully
- Tokenization strategy: word-level tokens or character n-grams for classical ML; the model's own tokenizer for transformers
5. A Robust Preprocessing Pipeline
Below is a practical production-aware preprocessing pipeline. Implement as functions and persist artifacts.
5.1 Pipeline steps
- Unicode normalization
- Lowercasing (optional)
- URL, email and mention detection and replacement
- Contraction expansion map
- Emoji and emoticon mapping
- Tokenization appropriate to model family
- Stopword handling with domain aware exceptions
- Lemmatization or stemming if using classical pipelines
- Prune rare tokens and apply min_df / max_df in vectorizers
5.2 End-to-end preprocessing example (Python)
import re
import unicodedata

import emoji
import spacy

# Lightweight pipeline: tagger/lemmatizer only, parser and NER are not needed here
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

CONTRACTION_MAP = {
    "dont": "do not",
    "cant": "cannot",
    "i'm": "i am",
    "it's": "it is",
}
EMOJI_MAP = {
    "😊": ":smile:",
    "😢": ":cry:",
}
# Keep negation words even though spaCy marks them as stopwords
NEGATION_KEEP = {"not", "no", "never"}

def unicode_normalize(text):
    """Normalize Unicode so visually identical characters share one representation."""
    return unicodedata.normalize("NFKC", text)

def replace_urls_and_emails(text):
    """Strip URLs and email addresses."""
    text = re.sub(r"http\S+|www\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    return text

def expand_contractions(text):
    for k, v in CONTRACTION_MAP.items():
        text = re.sub(r"\b" + re.escape(k) + r"\b", v, text, flags=re.IGNORECASE)
    return text

def map_emojis(text):
    # replace_emoji calls the replacement with the emoji and its metadata dict
    return emoji.replace_emoji(text, replace=lambda e, data: EMOJI_MAP.get(e, " "))

def clean_text(text):
    text = unicode_normalize(text)
    text = replace_urls_and_emails(text)
    text = map_emojis(text)
    text = expand_contractions(text)
    # remove non-alphanumeric characters except basic punctuation that carries sentiment
    text = re.sub(r"[^a-zA-Z0-9!?.,;:\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    # drop stopwords, but keep negation words as a domain-aware exception
    return [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and (not token.is_stop or token.lower_ in NEGATION_KEEP)
    ]

# usage
sample = "I do not like the new update! It is awful :("
clean = clean_text(sample)
tokens = tokenize_and_lemmatize(clean)
print(clean)
print(tokens)
Notes on the example:
- Use spaCy lemmatizer for better morphological handling
- Keep a small set of punctuation tokens when they carry sentiment like exclamation and question marks
- For transformer models, skip manual lemmatization and use the model tokenizer instead (see the sketch after these notes)
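A minimal sketch, assuming the Hugging Face transformers library is installed and using bert-base-uncased as a stand-in for whichever checkpoint you actually train with:

from transformers import AutoTokenizer

# bert-base-uncased is only a placeholder; use the checkpoint that matches your model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "I do not like the new update! It is awful",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))

Note that the tokenizer keeps case handling, subword splitting and special tokens consistent with the pre-trained model, so no manual lemmatization or stopword removal is needed.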
6. Handling Negation and Intensifiers
Negation flips sentiment, and naive stopword removal can discard exactly the tokens that carry it. Two common strategies:
- Negation tagging: prefix tokens after a negation word with a marker until the next punctuation, e.g., "do not like it" -> ["do", "not", "neg_like", "neg_it"]
- Feature engineering: count negation words and intensifiers separately as numeric features
6.1 Simple negation tagger example
NEG_WORDS = {"not", "never", "no"}
PUNCT = {".", ",", "!", "?", ";", ":"}

def negation_tag(tokens):
    out = []
    neg = False
    for t in tokens:
        if t in NEG_WORDS:
            neg = True
            out.append(t)
            continue
        if t in PUNCT:
            # negation scope ends at punctuation
            neg = False
            out.append(t)
            continue
        out.append("neg_" + t if neg else t)
    return out

print(negation_tag(["i", "do", "not", "like", "it"]))
# -> ['i', 'do', 'not', 'neg_like', 'neg_it']
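For the second strategy, here is a rough sketch of counting negations and intensifiers as numeric features; the intensifier list is only a small illustrative sample:

NEG_WORDS = {"not", "never", "no"}
# Small illustrative intensifier list; extend it for real use
INTENSIFIERS = {"very", "really", "extremely", "so", "totally"}

def negation_intensifier_features(tokens):
    """Return simple numeric features that can be appended to a text vector."""
    return {
        "neg_count": sum(t in NEG_WORDS for t in tokens),
        "intensifier_count": sum(t in INTENSIFIERS for t in tokens),
        "exclamation_count": tokens.count("!"),
    }

print(negation_intensifier_features(["i", "really", "do", "not", "like", "it", "!"]))
# {'neg_count': 1, 'intensifier_count': 1, 'exclamation_count': 1}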
7. Feature Extraction for Classical Models
For fast baselines use count vectors, TF-IDF, character n-grams and lexicon-based features. Combine the text vector with engineered numeric features such as punctuation counts, emoji counts and length.
7.1 TF-IDF and n-gram example
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5, max_features=30000)
X = vectorizer.fit_transform(df["clean_text"])
Additional engineered features (a sketch combining them with the TF-IDF matrix follows this list):
- Counts: exclamation, question marks
- Uppercase ratio (for emphasis)
- Emoji positive/negative counts
- Sentiment lexicon scores (VADER, SentiWordNet)
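One minimal way to combine such features with the text vector, assuming the df and the TF-IDF matrix X from the snippets above, is to stack numeric columns next to the sparse matrix:

import numpy as np
from scipy.sparse import hstack, csr_matrix

texts = df["clean_text"].fillna("")

# Simple numeric features alongside the TF-IDF matrix
exclaim = texts.str.count("!").to_numpy()
question = texts.str.count(r"\?").to_numpy()
upper_ratio = df["text"].fillna("").apply(
    lambda t: sum(c.isupper() for c in t) / max(len(t), 1)
).to_numpy()

numeric = csr_matrix(np.column_stack([exclaim, question, upper_ratio]))
X_combined = hstack([X, numeric])  # X from the TfidfVectorizer above
print(X_combined.shape)

In practice these numeric columns usually benefit from scaling before being concatenated with TF-IDF weights.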
8. Word Embeddings for Deep Models
For deep learning models use pre-trained embeddings such as GloVe, Word2Vec or fastText, or contextual embeddings from the BERT family. Pre-trained vectors speed up convergence and often improve performance.
# load GloVe into embedding matrix (illustrative)
import numpy as np

emb_index = {}
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype="float32")
        emb_index[word] = vec
# build embedding matrix for tokenizer word_index
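To complete the illustration, a sketch of building that matrix from a word_index mapping; word_index is assumed here to come from a Keras-style tokenizer, but any token-to-integer dict works the same way:

# word_index maps token -> integer id, e.g. from a keras Tokenizer(...).word_index
embedding_dim = 100
num_words = len(word_index) + 1  # index 0 is reserved for padding

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    vec = emb_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec  # out-of-vocabulary words stay as zero vectors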
Use subword-aware embeddings like fastText for morphologically rich languages and robust OOV handling. For transformer-based models, use the tokenizer and embedding layers from Hugging Face models.
9. Vocabulary Pruning and Storage
Limit vocabulary by min_df and max_df, or keep top-K frequent tokens. Persist vectorizer, tokenizer, and embedding mappings using joblib or pickle and version them.
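A minimal persistence sketch with joblib, assuming the vectorizer from section 7.1 and a simple version tag embedded in the filename:

import joblib

PREPROC_VERSION = "v1.2.0"  # bump whenever the preprocessing pipeline changes

joblib.dump(vectorizer, f"tfidf_vectorizer_{PREPROC_VERSION}.joblib")

# Later, in training or serving code, load the exact same artifact
vectorizer = joblib.load(f"tfidf_vectorizer_{PREPROC_VERSION}.joblib")
X_new = vectorizer.transform(["the update is awful"])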
10. Testing and Validation of Preprocessing
Create a small unit test suite for preprocessing functions. Include edge cases: emojis, long strings, URLs, empty text, non-English characters. Ensure transformation produces reproducible tokens and consistent shapes.
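A small pytest-style sketch of such tests, assuming clean_text and tokenize_and_lemmatize from section 5.2 are importable (the module name in the comment is illustrative):

import pytest
# from preprocessing import clean_text, tokenize_and_lemmatize  # module name is illustrative

@pytest.mark.parametrize("raw", [
    "Check this out http://example.com 😊",  # URL plus emoji
    "",                                       # empty text must not crash
    "Café costs 5€!!!",                       # non-ASCII characters and symbols
    "x" * 10_000,                             # very long input
])
def test_clean_text_edge_cases(raw):
    cleaned = clean_text(raw)
    assert isinstance(cleaned, str)
    assert "http" not in cleaned

def test_emoji_is_mapped():
    assert ":smile:" in clean_text("great update 😊")

def test_tokenization_is_deterministic():
    text = clean_text("I do not like the new update!")
    assert tokenize_and_lemmatize(text) == tokenize_and_lemmatize(text)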
Wrap-up of Part 1
In this first part we covered data collection, EDA and a robust, production-aware preprocessing pipeline including Unicode normalization, URL handling, contraction expansion, emoji mapping, tokenization, negation handling and feature extraction. These steps are critical for reliable downstream model performance. Part 2 will focus on model choices, training procedures, hyperparameter tuning and evaluation strategies.
© Course content — persist preprocessing artifacts and document versions for production.