📘 Sentiment Analysis Project — Part 1: Data and Preprocessing
This part covers the foundation of any successful sentiment analysis project: data collection, exploratory data analysis, and a robust, production-aware preprocessing pipeline. A well-designed preprocessing stage reduces noise, preserves important signals such as negation, emojis, and punctuation, and produces stable input for both classical and deep learning models.
1. Project Scope and Problem Definition
Decide on the target formulation early: binary classification (positive versus negative), ternary classification (positive, neutral, negative), or fine-grained regression on star ratings. Business use cases include product review analytics, social media monitoring, support ticket triage, and market research. Define success metrics: accuracy, macro F1, precision for the positive class, or business KPIs such as escalation rate reduction.
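If scikit-learn is part of the stack, these metrics can be reported in a few lines. A minimal sketch follows; the y_true and y_pred lists are placeholder values for illustration, not real results.

from sklearn.metrics import classification_report, f1_score

# Hypothetical gold labels and model predictions, for illustration only
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

# Macro F1 weights every class equally, which matters when classes are imbalanced
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Per-class precision, recall and F1, including the positive class
print(classification_report(y_true, y_pred, zero_division=0))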
2. Data Sources and Collection
Common datasets and data sources:
- IMDB Movie Reviews for binary sentiment experiments
- Twitter corpora such as Sentiment140 and SemEval for social media signals
- Amazon and Yelp review datasets for product and business sentiment
- Company internal logs, support tickets, and feedback forms for domain specific models
Data collection tips: collect both text and metadata (user id, timestamp, product id). For social media, handle rate limits and terms of service. Store raw data immutably so it can be reprocessed whenever the pipeline changes.
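One minimal pattern for the raw layer, assuming newline-delimited JSON files (the file name and field names below are illustrative), is to append each collected record with its metadata and never rewrite the file:

import json
import time

def append_raw_record(path, text, user_id, product_id):
    """Append one raw record with metadata; the raw file is never modified in place."""
    record = {
        "text": text,
        "user_id": user_id,
        "product_id": product_id,
        "collected_at": time.time(),
    }
    with open(path, "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# append_raw_record("raw_reviews.jsonl", "Great battery life!", "u123", "p456")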
3. Exploratory Data Analysis (EDA)
Before building models, understand class balance, text length distribution, frequent tokens, and noise patterns. EDA helps decide preprocessing choices such as keeping or removing punctuation and handling emojis.
3.1 Quick Python EDA snippet
# pandas EDA
import pandas as pd
from collections import Counter

df = pd.read_csv("sentiment_dataset.csv")  # columns: text, label
print(df.label.value_counts())

# Text length in tokens (fillna guards against missing text)
df["text_len"] = df.text.fillna("").str.split().apply(len)
print(df["text_len"].describe())

# Frequent tokens (simple)
tokens = Counter()
for t in df.text.fillna("").str.lower().str.split():
    tokens.update(t)
print(tokens.most_common(30))
Key EDA checks (a short snippet covering them follows this list):
- Class balance and whether sampling is needed
- Short text fraction (e.g., tweets under 10 tokens)
- Presence of URLs, emojis, non-latin scripts and markup
- Frequent punctuation usage and emotive tokens such as exclamation marks
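A rough sketch of these checks, assuming the same df from the EDA snippet above and the emoji package installed:

import emoji

texts = df["text"].fillna("")

# Class balance (normalized counts make imbalance obvious)
print(df["label"].value_counts(normalize=True))

# Fraction of very short texts (e.g., under 10 tokens)
print((texts.str.split().apply(len) < 10).mean())

# Presence of URLs, emojis and non-ASCII (e.g., non-Latin) characters
print(texts.str.contains(r"http\S+|www\S+", regex=True).mean())
print(texts.apply(lambda t: emoji.emoji_count(t) > 0).mean())
print(texts.str.contains(r"[^\x00-\x7F]", regex=True).mean())

# Emotive punctuation such as exclamation marks
print(texts.str.count("!").describe())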
4. Preprocessing Principles
Preprocessing should be reproducible, deterministic and versioned. Save tokenizer versions, stoplists and any normalization maps. The goal is to remove noise while preserving signals that matter for sentiment such as negation, intensifiers, emojis and punctuation.
4.1 Design choices matrix
- Lowercasing: helpful for many tasks; preserve case for NER or case-sensitive signals
- Remove or map URLs: replace with <URL> token to preserve the signal of a link
- Emojis: map to textual tokens or sentiment lexicon (e.g., ":smile:", ":cry:")
- Negation: preserve tokens like "not" and handle scopes carefully
- Tokenization strategy: word-level tokens or character n-grams for classical ML; the model's own tokenizer for transformers
5. A Robust Preprocessing Pipeline
Below is a practical production-aware preprocessing pipeline. Implement as functions and persist artifacts.
5.1 Pipeline steps
- Unicode normalization
- Lowercasing (optional)
- URL, email and mention detection and replacement
- Contraction expansion map
- Emoji and emoticon mapping
- Tokenization appropriate to model family
- Stopword handling with domain aware exceptions
- Lemmatization or stemming if using classical pipelines
- Prune rare tokens and apply min_df / max_df in vectorizers
5.2 End-to-end preprocessing example (Python)
import re
import unicodedata

import emoji
import spacy

# Lightweight pipeline: tagger/lemmatizer only, parser and NER are not needed here
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

CONTRACTION_MAP = {
    "dont": "do not",
    "cant": "cannot",
    "i'm": "i am",
    "it's": "it is",
}
EMOJI_MAP = {
    "😊": ":smile:",
    "😢": ":cry:",
}
# Keep negation words even though spaCy marks them as stopwords
NEGATION_KEEP = {"not", "no", "never"}

def unicode_normalize(text):
    """Normalize Unicode so visually identical characters share one representation."""
    return unicodedata.normalize("NFKC", text)

def replace_urls_and_emails(text):
    """Strip URLs and email addresses."""
    text = re.sub(r"http\S+|www\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    return text

def expand_contractions(text):
    for k, v in CONTRACTION_MAP.items():
        text = re.sub(r"\b" + re.escape(k) + r"\b", v, text, flags=re.IGNORECASE)
    return text

def map_emojis(text):
    # replace_emoji calls the replacement with the emoji and its metadata dict
    return emoji.replace_emoji(text, replace=lambda e, data: EMOJI_MAP.get(e, " "))

def clean_text(text):
    text = unicode_normalize(text)
    text = replace_urls_and_emails(text)
    text = map_emojis(text)
    text = expand_contractions(text)
    # remove non-alphanumeric characters except basic punctuation that carries sentiment
    text = re.sub(r"[^a-zA-Z0-9!?.,;:\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    # drop stopwords, but keep negation words as a domain-aware exception
    return [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and (not token.is_stop or token.lower_ in NEGATION_KEEP)
    ]

# usage
sample = "I do not like the new update! It is awful :("
clean = clean_text(sample)
tokens = tokenize_and_lemmatize(clean)
print(clean)
print(tokens)
Notes on the example:
- Use spaCy lemmatizer for better morphological handling
- Keep a small set of punctuation tokens when they carry sentiment like exclamation and question marks
- For transformer models, skip manual lemmatization and use the model tokenizer instead (see the sketch after these notes)
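A minimal sketch, assuming the Hugging Face transformers library is installed and using bert-base-uncased as a stand-in for whichever checkpoint you actually train with:

from transformers import AutoTokenizer

# bert-base-uncased is only a placeholder; use the checkpoint that matches your model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "I do not like the new update! It is awful",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))

Note that the tokenizer keeps case handling, subword splitting and special tokens consistent with the pre-trained model, so no manual lemmatization or stopword removal is needed.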
6. Handling Negation and Intensifiers
Negation flips sentiment, and naive stopword removal can discard exactly the tokens that carry it. Two common strategies:
- Negation tagging: prefix tokens after a negation word with a marker until the next punctuation, e.g., "do not like it" -> ["do", "not", "neg_like", "neg_it"]
- Feature engineering: count negation words and intensifiers separately as numeric features
6.1 Simple negation tagger example
NEG_WORDS = {"not", "never", "no"}
PUNCT = {".", ",", "!", "?", ";", ":"}

def negation_tag(tokens):
    out = []
    neg = False
    for t in tokens:
        if t in NEG_WORDS:
            neg = True
            out.append(t)
            continue
        if t in PUNCT:
            # negation scope ends at punctuation
            neg = False
            out.append(t)
            continue
        out.append("neg_" + t if neg else t)
    return out

print(negation_tag(["i", "do", "not", "like", "it"]))
# -> ['i', 'do', 'not', 'neg_like', 'neg_it']
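For the second strategy, here is a rough sketch of counting negations and intensifiers as numeric features; the intensifier list is only a small illustrative sample:

NEG_WORDS = {"not", "never", "no"}
# Small illustrative intensifier list; extend it for real use
INTENSIFIERS = {"very", "really", "extremely", "so", "totally"}

def negation_intensifier_features(tokens):
    """Return simple numeric features that can be appended to a text vector."""
    return {
        "neg_count": sum(t in NEG_WORDS for t in tokens),
        "intensifier_count": sum(t in INTENSIFIERS for t in tokens),
        "exclamation_count": tokens.count("!"),
    }

print(negation_intensifier_features(["i", "really", "do", "not", "like", "it", "!"]))
# {'neg_count': 1, 'intensifier_count': 1, 'exclamation_count': 1}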
7. Feature Extraction for Classical Models
For fast baselines use count vectors, TF-IDF, character n-grams and lexicon-based features. Combine the text vector with engineered numeric features such as punctuation counts, emoji counts and length.
7.1 TF-IDF and n-gram example
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5, max_features=30000)
X = vectorizer.fit_transform(df["clean_text"])
Additional engineered features (a sketch combining them with the TF-IDF matrix follows this list):
- Counts: exclamation, question marks
- Uppercase ratio (for emphasis)
- Emoji positive/negative counts
- Sentiment lexicon scores (VADER, SentiWordNet)
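One minimal way to combine such features with the text vector, assuming the df and the TF-IDF matrix X from the snippets above, is to stack numeric columns next to the sparse matrix:

import numpy as np
from scipy.sparse import hstack, csr_matrix

texts = df["clean_text"].fillna("")

# Simple numeric features alongside the TF-IDF matrix
exclaim = texts.str.count("!").to_numpy()
question = texts.str.count(r"\?").to_numpy()
upper_ratio = df["text"].fillna("").apply(
    lambda t: sum(c.isupper() for c in t) / max(len(t), 1)
).to_numpy()

numeric = csr_matrix(np.column_stack([exclaim, question, upper_ratio]))
X_combined = hstack([X, numeric])  # X from the TfidfVectorizer above
print(X_combined.shape)

In practice these numeric columns usually benefit from scaling before being concatenated with TF-IDF weights.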
8. Word Embeddings for Deep Models
For deep learning models use pre-trained embeddings such as GloVe, Word2Vec or fastText, or contextual embeddings from the BERT family. Pre-trained vectors speed up convergence and often improve performance.
# load GloVe into embedding matrix (illustrative)
import numpy as np

emb_index = {}
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype="float32")
        emb_index[word] = vec
# build embedding matrix for tokenizer word_index
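To complete the illustration, a sketch of building that matrix from a word_index mapping; word_index is assumed here to come from a Keras-style tokenizer, but any token-to-integer dict works the same way:

# word_index maps token -> integer id, e.g. from a keras Tokenizer(...).word_index
embedding_dim = 100
num_words = len(word_index) + 1  # index 0 is reserved for padding

embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    vec = emb_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec  # out-of-vocabulary words stay as zero vectors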
Use subword-aware embeddings like fastText for morphologically rich languages and robust OOV handling. For transformer-based models, use the tokenizer and embedding layers from Hugging Face models.
9. Vocabulary Pruning and Storage
Limit vocabulary by min_df and max_df, or keep top-K frequent tokens. Persist vectorizer, tokenizer, and embedding mappings using joblib or pickle and version them.
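A minimal persistence sketch with joblib, assuming the vectorizer from section 7.1 and a simple version tag embedded in the filename:

import joblib

PREPROC_VERSION = "v1.2.0"  # bump whenever the preprocessing pipeline changes

joblib.dump(vectorizer, f"tfidf_vectorizer_{PREPROC_VERSION}.joblib")

# Later, in training or serving code, load the exact same artifact
vectorizer = joblib.load(f"tfidf_vectorizer_{PREPROC_VERSION}.joblib")
X_new = vectorizer.transform(["the update is awful"])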
10. Testing and Validation of Preprocessing
Create a small unit test suite for preprocessing functions. Include edge cases: emojis, long strings, URLs, empty text, non-English characters. Ensure transformation produces reproducible tokens and consistent shapes.
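A small pytest-style sketch of such tests, assuming clean_text and tokenize_and_lemmatize from section 5.2 are importable (the module name in the comment is illustrative):

import pytest
# from preprocessing import clean_text, tokenize_and_lemmatize  # module name is illustrative

@pytest.mark.parametrize("raw", [
    "Check this out http://example.com 😊",  # URL plus emoji
    "",                                       # empty text must not crash
    "Café costs 5€!!!",                       # non-ASCII characters and symbols
    "x" * 10_000,                             # very long input
])
def test_clean_text_edge_cases(raw):
    cleaned = clean_text(raw)
    assert isinstance(cleaned, str)
    assert "http" not in cleaned

def test_emoji_is_mapped():
    assert ":smile:" in clean_text("great update 😊")

def test_tokenization_is_deterministic():
    text = clean_text("I do not like the new update!")
    assert tokenize_and_lemmatize(text) == tokenize_and_lemmatize(text)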
Wrap-up of Part 1
In this first part we covered data collection, EDA and a robust, production-aware preprocessing pipeline including Unicode normalization, URL handling, contraction expansion, emoji mapping, tokenization, negation handling and feature extraction. These steps are critical for reliable downstream model performance. Part 2 will focus on model choices, training procedures, hyperparameter tuning and evaluation strategies.
© Course content — persist preprocessing artifacts and document versions for production.