SamIeer/SentimentAnalysis

💬 Twitter Sentiment Analysis App

A complete end-to-end NLP project that classifies tweets as Positive or Negative using the Sentiment140 Dataset.
This project combines Natural Language Processing, Machine Learning, and MLOps — from data cleaning to model deployment.


🚀 Features

✅ Text Preprocessing (cleaning, tokenization, lemmatization)
✅ TF-IDF Vectorization for feature extraction
✅ Multiple ML models with cross-validation & hyperparameter tuning
✅ Streamlit web app for real-time sentiment prediction
✅ Model saving & reusability with joblib
✅ Fully Dockerized for consistent deployment
✅ GitHub Actions CI Workflow for automated testing & build
✅ Kubernetes manifest (manifest.yml) ready for cloud deployment (optional)


🧩 Project Structure

├── .github
│   └── workflows
│       └── sentimentlsis.yml
├── Docker-compose.yml
├── Dockerfile
├── dashboard.py
├── manifest.yml
├── requirements.txt
└── src
    ├── preprep.ipynb
    ├── sentiment_model.pkl
    └── tfidf_vectorizer.pkl

🧠 Tech Stack

| Category | Tools / Libraries |
|---|---|
| Language | Python |
| Data Handling | Pandas, NumPy |
| NLP | NLTK, Regex, Emoji |
| Feature Extraction | TF-IDF (sklearn) |
| Modeling | Logistic Regression, SVM, Random Forest |
| App Framework | Streamlit |
| Model Persistence | Joblib |
| Containerization | Docker |
| Automation | GitHub Actions |
| Deployment | Streamlit Cloud / Render / Kubernetes |

🧹 Data Preprocessing

  • Lowercasing text
  • Removing URLs, mentions, hashtags, and punctuation
  • Tokenization using nltk
  • Stopword removal
  • Lemmatization (WordNetLemmatizer)
  • Emoji handling (emoji.demojize)

These steps strip noise (links, handles, punctuation, filler words) so the model is trained mostly on words that carry sentiment.
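The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the tiny `DEMO_STOPWORDS` set are assumptions. In the project, the stopword list would presumably come from `nltk.corpus.stopwords`, the `lemmatize` argument would be `WordNetLemmatizer().lemmatize`, and `emoji.demojize` would be applied to the text first. The sketch also assumes that "removing hashtags" means dropping the whole hashtag token.

```python
import re
import string

# Hypothetical patterns for the noise the README says is removed.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Tiny stand-in list; the project would use NLTK's English stopwords instead.
DEMO_STOPWORDS = frozenset({"i", "the", "a", "is", "this", "to", "and", "so"})


def clean_tweet(text, stopwords=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()  # whitespace split stands in for nltk tokenization
    return " ".join(lemmatize(tok) for tok in tokens if tok not in stopwords)


print(clean_tweet("I LOVE this phone!! http://t.co/abc @apple #happy"))
# → love phone
```

Keeping the whole pipeline in one function makes it easy to call the same code during training and inside the app, which avoids a mismatch between the two.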


🧮 Feature Engineering — TF-IDF

Why TF-IDF?
It represents each tweet as a numerical vector based on word importance.

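$$\text{TFIDF}(w) = \text{TF}(w) \times \log\left(\frac{N}{\text{df}(w)}\right)$$

Here TF(w) is the frequency of word w in a tweet, N is the total number of tweets, and df(w) is the number of tweets that contain w.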

The vectorizer is TfidfVectorizer(max_features=5000, ngram_range=(1, 2)): it keeps the 5,000 most frequent unigrams and bigrams, a balance between accuracy and speed.
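To make the TF-IDF formula concrete, here is a small worked sketch in plain Python. It assumes TF(w) is the raw count of a word in a document and uses a three-document toy corpus; the function name `tfidf` and the corpus are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document using TF(w) * log(N / df(w))."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # df(w): number of documents that contain w at least once.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts (an assumption about TF)
        vectors.append(
            {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
        )
    return vectors


docs = ["good phone", "bad phone", "good good day"]
for doc, vector in zip(docs, tfidf(docs)):
    print(doc, "->", vector)
```

The word "day" appears in only one of three documents, so it gets the largest IDF, log(3). Note that scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers: by default it uses a smoothed IDF, ln((1 + N) / (1 + df(w))) + 1, and L2-normalizes each row.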


🤖 Model Training

| Model | Description | Accuracy (CV) |
|---|---|---|
| Logistic Regression | Simple & effective for text data | ✅ Best |
| SVM | Handles high-dimensional data | Good |
| Random Forest | Captures non-linear patterns | Moderate |

Performed:

  • 5-Fold Cross-Validation
  • GridSearchCV for hyperparameter tuning
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-score
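One way the cross-validation and grid search above might be wired together with scikit-learn is shown below. This is a sketch under assumptions: the ten toy texts stand in for the cleaned Sentiment140 tweets, labels are assumed to be encoded as 1 (positive) and 0 (negative), and the grid over `C` is hypothetical, since the README does not say which hyperparameters were searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (assumption): 1 = positive, 0 = negative.
texts = ["love phone", "great day", "happy good", "nice work", "awesome fun",
         "hate phone", "bad day", "sad awful", "terrible work", "boring slow"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit inside each fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical search space for the regularization strength.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# cv=5 gives stratified 5-fold cross-validation for classifiers.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means each fold fits its vocabulary on that fold's training split only, so the cross-validated score is not inflated by words seen in the held-out split. Precision, recall, and F1 could be requested the same way through the `scoring` argument.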

💾 Model Saving

The trained model and the fitted TF-IDF vectorizer are persisted with joblib:

import joblib
joblib.dump(model, 'sentiment_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

Streamlit Web App

Simple, interactive web app for real-time predictions.

Run locally:

streamlit run dashboard.py

App Flow:

  1. Input tweet text 📝
  2. Clean & preprocess
  3. Convert text → TF-IDF vector
  4. Predict sentiment using model
  5. Display result (😊 Positive / 😠 Negative)
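Steps 2 to 5 of the flow above fit in one small function. The sketch below is illustrative only: `predict_sentiment` is a hypothetical name, the two stub classes stand in for the loaded TF-IDF vectorizer and model so the example runs without the `.pkl` files, and the positive class is assumed to be encoded as 1 (the raw Sentiment140 labels are 0 and 4, so this assumes they were remapped).

```python
def predict_sentiment(text, vectorizer, model, preprocess=str.lower):
    """Clean the text, vectorize it, predict, and format the result."""
    cleaned = preprocess(text)                  # step 2: clean & preprocess
    features = vectorizer.transform([cleaned])  # step 3: text -> TF-IDF vector
    label = model.predict(features)[0]          # step 4: predict sentiment
    # step 5: assumes positive is encoded as 1
    return "😊 Positive" if label == 1 else "😠 Negative"


class StubVectorizer:
    """Stand-in for the loaded TfidfVectorizer; passes text through unchanged."""

    def transform(self, docs):
        return docs


class StubModel:
    """Stand-in for the loaded classifier; a keyword rule instead of a real model."""

    def predict(self, features):
        return [1 if "love" in doc else 0 for doc in features]


print(predict_sentiment("I LOVE this phone", StubVectorizer(), StubModel()))
print(predict_sentiment("Worst day ever", StubVectorizer(), StubModel()))
```

In `dashboard.py`, a function like this would typically receive the text from a Streamlit input widget such as `st.text_area` and pass its return value to `st.write`, with the real vectorizer and model loaded once via `joblib.load`.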

🐳 Docker Integration

docker build -t sentiment-app .
docker run -p 8501:8501 sentiment-app

📊 Results

  • Logistic Regression achieved ~85% accuracy on validation data
  • Clean UI for sentiment prediction
  • Fully automated CI/CD pipeline with Docker integration

Key Takeaways

  • Built a complete ML workflow: from preprocessing → training → deployment
  • Learned to ensure preprocessing consistency between training & inference
  • Containerized the app for reproducibility
  • Automated CI/CD with GitHub Actions
  • Gained experience with MLOps fundamentals

Setup Instructions

# Clone repo
git clone https://github.com//sentiment-analysis.git
cd sentiment-analysis
# Install dependencies
pip install -r requirements.txt
# Run Streamlit app
streamlit run dashboard.py

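Or run with Docker Compose (the compose file is named Docker-compose.yml, so it is passed explicitly):

docker-compose -f Docker-compose.yml up --build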

Author

Sameer Chauhan

MLOps & Machine Learning Engineer
💼 Passionate about bridging ML with real-world deployment through Docker, CI/CD, and automation.
