Big Data & NLP · Data Scientist

Twitter Sentiment Analysis at Scale

PySpark · NLP · Databricks · Big Data · ML

A big data machine learning project that classifies tweets as positive or negative, trained on the 1.6 million labeled tweets of the Sentiment140 dataset. Built on Apache Spark and Databricks, it uses distributed computing to scale the NLP workload.

The Problem

Understanding public sentiment at scale requires processing millions of text entries efficiently. Traditional single-machine approaches don't scale. The challenge was building a distributed ML pipeline that handles massive text data while maintaining classification accuracy.

The Approach

Developed a PySpark pipeline: text preprocessing (lowercasing, URL removal, tokenization, stopword filtering), TF-IDF vectorization, and model comparison between Logistic Regression and Naive Bayes. Deployed on Databricks for scalable cloud-based processing.
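The four preprocessing steps can be illustrated on a single machine with plain Python. This is a minimal sketch, not the project's Spark code: the `preprocess` function, the regular expressions, and the tiny stopword list are all assumptions made for illustration. In a PySpark pipeline the same steps would typically be expressed as DataFrame transformations and feature stages.

```python
import re
from typing import List

# Assumed, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "it", "this"}

# Assumed patterns: URLs starting with http(s):// or www., and simple word tokens.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def preprocess(tweet: str) -> List[str]:
    """Lowercase, strip URLs, tokenize, and drop stopwords."""
    text = tweet.lower()                    # 1. lowercasing
    text = URL_PATTERN.sub(" ", text)       # 2. URL removal
    tokens = TOKEN_PATTERN.findall(text)    # 3. tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 4. stopword filtering


print(preprocess("This is the BEST day! http://example.com/xyz"))
# → ['best', 'day']
```

Removing URLs before tokenizing keeps link fragments such as "http" or "com" from entering the vocabulary as spurious features.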

Technical Details

Apache Spark & PySpark for distributed computing
TF-IDF vectorization for text feature extraction
Logistic Regression vs Naive Bayes comparison
Databricks cloud platform for scalable execution
1.6M tweet dataset (balanced: 800K positive, 800K negative)
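To make the TF-IDF step concrete, here is a small pure-Python sketch of the weighting. It assumes raw term counts for TF and the smoothed inverse document frequency log((N + 1) / (df + 1)) that the Spark MLlib documentation describes; the `tf_idf` function and the toy documents are hypothetical and are not taken from the project.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by (count in document) * log((N + 1) / (df + 1))."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)  # raw term frequency
        weighted.append({term: count * idf[term] for term, count in counts.items()})
    return weighted


# Toy, already-tokenized tweets (assumed data).
docs = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
vectors = tf_idf(docs)
print(vectors[2])
```

Terms that appear in fewer documents ("bad", "movie") receive a larger IDF than terms shared across documents ("good", "day"), which is what lets a linear classifier focus on the more distinctive words.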

Outcomes

Logistic Regression achieved 77.96% accuracy, outperforming Naive Bayes (76.37%) by 1.59 percentage points
Built production-ready pipeline on scalable cloud infrastructure
Demonstrated distributed ML capabilities for real-world NLP tasks
