A complete end-to-end NLP project that classifies tweets as Positive or Negative using the Sentiment140 Dataset.
This project combines Natural Language Processing, Machine Learning, and MLOps — from data cleaning to model deployment.
✅ Text Preprocessing (cleaning, tokenization, lemmatization)
✅ TF-IDF Vectorization for feature extraction
✅ Multiple ML models with cross-validation & hyperparameter tuning
✅ Streamlit web app for real-time sentiment prediction
✅ Model saving & reusability with joblib
✅ Fully Dockerized for consistent deployment
✅ GitHub Actions CI Workflow for automated testing & build
✅ Kubernetes manifest included for optional cloud deployment
```
├── .github
│   └── workflows
│       └── sentimentlsis.yml
├── Docker-compose.yml
├── Dockerfile
├── dashboard.py
├── manifest.yml
├── requirements.txt
└── src
    ├── preprep.ipynb
    ├── sentiment_model.pkl
    └── tfidf_vectorizer.pkl
```
| Category | Tools / Libraries |
|---|---|
| Language | Python |
| Data Handling | Pandas, NumPy |
| NLP | NLTK, Regex, Emoji |
| Feature Extraction | TF-IDF (sklearn) |
| Modeling | Logistic Regression, SVM, Random Forest |
| App Framework | Streamlit |
| Model Persistence | Joblib |
| Containerization | Docker |
| Automation | GitHub Actions |
| Deployment | Streamlit Cloud / Render / Kubernetes |
- Lowercasing text
- Removing URLs, mentions, hashtags, and punctuation
- Tokenization using nltk
- Stopword removal
- Lemmatization (`WordNetLemmatizer`)
- Emoji handling (`emoji.demojize`)
This ensures the model sees only meaningful words.
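A minimal sketch of this pipeline is shown below; the helper name `clean_tweet` and the exact regex patterns are illustrative, not copied from the notebook:

```python
import re

import emoji
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time NLTK resources (safe to re-run)
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    """Lowercase, strip noise, tokenize, remove stopwords, lemmatize."""
    text = emoji.demojize(text.lower())              # 😊 -> :smiling_face_with_smiling_eyes:
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)             # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation, digits, leftover symbols
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS)
```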
Why TF-IDF?
It represents each tweet as a numerical vector based on word importance.
$$
\mathrm{TFIDF}(w) = \mathrm{TF}(w) \times \log\left(\frac{N}{\mathrm{df}(w)}\right)
$$

where $N$ is the total number of tweets and $\mathrm{df}(w)$ is the number of tweets containing the word $w$.
Used `TfidfVectorizer(max_features=5000, ngram_range=(1, 2))` for the best balance between accuracy and speed.
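A short sketch of the vectorization step with those settings; the variable names and tiny example texts are stand-ins for the preprocessed Sentiment140 splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned tweets from the preprocessing step (tiny stand-in data here)
train_texts = ["love this phone", "battery life is awful", "love the camera"]
val_texts = ["awful screen", "really love it"]

# Same settings as in the project: top 5000 unigram/bigram features
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_vec = tfidf.fit_transform(train_texts)  # fit on training data only
X_val_vec = tfidf.transform(val_texts)          # reuse the fitted vocabulary at inference
```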
| Model | Description | CV Performance |
|---|---|---|
| Logistic Regression | Simple & effective for text data | ✅ Best |
| SVM | Handles high-dimensional data | Good |
| Random Forest | Captures non-linear patterns | Moderate |
Performed:
- 5-Fold Cross-Validation
- GridSearchCV for hyperparameter tuning
- Evaluation Metrics: Accuracy, Precision, Recall, F1-score
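A minimal sketch of this selection step, continuing from the TF-IDF variables above; the parameter grid and the label arrays (`y_train`, `y_val`) are assumptions, not copied from the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score

# y_train / y_val are the 0 = negative, 1 = positive labels for the
# vectorized splits (from the Sentiment140 `target` column).

# 5-fold cross-validated accuracy for the baseline model
base_model = LogisticRegression(max_iter=1000)
print(cross_val_score(base_model, X_train_vec, y_train, cv=5, scoring="accuracy").mean())

# Hyperparameter tuning with GridSearchCV (illustrative grid)
grid = GridSearchCV(base_model, {"C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_train_vec, y_train)
model = grid.best_estimator_

# Precision / recall / F1 on the held-out validation split
print(classification_report(y_val, model.predict(X_val_vec)))
```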
Used `joblib` to persist the trained model and TF-IDF vectorizer:

```python
import joblib

joblib.dump(model, 'sentiment_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
```

The Streamlit app then runs each input through the same pipeline:

- Input tweet text 📝
- Clean & preprocess
- Convert text → TF-IDF vector
- Predict sentiment using model
- Display result (😊 Positive / 😠 Negative)
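A hedged sketch of what that flow could look like in `dashboard.py`, assuming the saved artifacts above and the `clean_tweet` helper from the preprocessing sketch (the actual UI code in the repo may differ):

```python
import joblib
import streamlit as st

# Load the persisted artifacts once at startup
model = joblib.load("src/sentiment_model.pkl")
tfidf = joblib.load("src/tfidf_vectorizer.pkl")

st.title("Tweet Sentiment Classifier")
tweet = st.text_area("Enter a tweet")

if st.button("Predict") and tweet:
    cleaned = clean_tweet(tweet)          # same preprocessing as training
    vector = tfidf.transform([cleaned])   # text -> TF-IDF features
    label = model.predict(vector)[0]      # assumes labels mapped to 0 = negative, 1 = positive
    st.write("😊 Positive" if label == 1 else "😠 Negative")
```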
## 🐳 Docker Integration
```bash
docker build -t sentiment-app .
docker run -p 8501:8501 sentiment-app
```
- Logistic Regression achieved ~85% accuracy on validation data
- Clean UI for sentiment prediction
- Fully automated CI/CD pipeline with Docker integration
- Built a complete ML workflow: from preprocessing → training → deployment
- Learned to ensure preprocessing consistency between training & inference
- Containerized the app for reproducibility
- Automated CI/CD with GitHub Actions
- Gained experience with MLOps fundamentals
```bash
# Clone repo
git clone https://github.com//sentiment-analysis.git
cd sentiment-analysis

# Install dependencies
pip install -r requirements.txt

# Run Streamlit app
streamlit run dashboard.py
```
Or run it with Docker Compose:

```bash
docker-compose up --build
```