Big Data & NLP · Data Scientist
Twitter Sentiment Analysis at Scale
A big data machine learning project that classifies tweets as positive or negative by processing 1.6 million labeled tweets from the Sentiment140 dataset. Built on Apache Spark and Databricks, leveraging distributed computing for scalable NLP processing.
The Problem
Understanding public sentiment at scale requires processing millions of text entries efficiently. Traditional single-machine approaches don't scale. The challenge was building a distributed ML pipeline that handles massive text data while maintaining classification accuracy.
The Approach
Developed a PySpark pipeline: text preprocessing (lowercasing, URL removal, tokenization, stopword filtering), TF-IDF vectorization, and model comparison between Logistic Regression and Naive Bayes. Deployed on Databricks for scalable cloud-based processing.
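The four cleaning steps named above can be illustrated in plain Python. This is a minimal sketch, not the project's actual code: the function name `preprocess`, the regular expressions, and the tiny stopword set are all assumptions chosen for illustration, and a Spark job would more likely express the same logic with built-in column functions or feature transformers.

```python
import re

# Hypothetical patterns and stopword list, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "to", "and", "of", "in", "it", "i", "my", "this", "that"}
)


def preprocess(tweet):
    """Lowercase, strip URLs, tokenize, and drop stopwords."""
    text = tweet.lower()                 # 1. lowercasing
    text = URL_RE.sub(" ", text)         # 2. URL removal
    tokens = TOKEN_RE.findall(text)      # 3. tokenization (letters, optional apostrophe)
    return [t for t in tokens if t not in STOPWORDS]  # 4. stopword filtering


print(preprocess("I LOVE this phone! http://t.co/abc"))  # ['love', 'phone']
```

Removing URLs before tokenizing keeps link fragments such as `http` or `t` out of the vocabulary, which would otherwise add noise to the TF-IDF features.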
Technical Details
- Apache Spark & PySpark for distributed computing
- TF-IDF vectorization for text feature extraction
- Logistic Regression vs Naive Bayes comparison
- Databricks cloud platform for scalable execution
- 1.6M tweet dataset (balanced: 800K positive, 800K negative)
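The components listed above could be wired together roughly as follows. This is a hedged sketch under stated assumptions, not the project's implementation: it assumes DataFrames with a hypothetical `clean_text` string column (URLs already removed) and a numeric `label` column holding 0.0 or 1.0, and it uses `HashingTF` followed by `IDF` as one common way to get TF-IDF features in Spark ML. The function names, column names, and `num_features` default are illustrative. The code needs an active SparkSession, as on a Databricks cluster.

```python
def build_pipelines(num_features=1 << 18):
    """Return (logistic_regression_pipeline, naive_bayes_pipeline)."""
    # Imported lazily so the module loads even where pyspark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression, NaiveBayes
    from pyspark.ml.feature import HashingTF, IDF, RegexTokenizer, StopWordsRemover

    def feature_stages():
        # Fresh stage objects per pipeline so the two pipelines share nothing.
        return [
            # RegexTokenizer lowercases by default and splits on the pattern.
            RegexTokenizer(inputCol="clean_text", outputCol="tokens", pattern=r"\W+"),
            StopWordsRemover(inputCol="tokens", outputCol="filtered"),
            HashingTF(inputCol="filtered", outputCol="tf", numFeatures=num_features),
            IDF(inputCol="tf", outputCol="features"),
        ]

    lr = Pipeline(stages=feature_stages() + [
        LogisticRegression(featuresCol="features", labelCol="label"),
    ])
    nb = Pipeline(stages=feature_stages() + [
        NaiveBayes(featuresCol="features", labelCol="label"),
    ])
    return lr, nb


def evaluate_accuracy(pipeline, train_df, test_df):
    """Fit the pipeline on train_df and return accuracy on test_df."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = pipeline.fit(train_df).transform(test_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)
```

With a train/test split from `DataFrame.randomSplit`, calling `evaluate_accuracy` once per pipeline gives the two numbers to compare. Because the dataset is balanced, a classifier that always predicts one class would score about 50%, which makes plain accuracy a reasonable headline metric here.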
Outcomes
- Logistic Regression achieved 77.96% accuracy (outperforming Naive Bayes at 76.37%)
- Built production-ready pipeline on scalable cloud infrastructure
- Demonstrated distributed ML capabilities for real-world NLP tasks