This project explores Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) techniques for classifying protein sequences by structural type. By treating protein sequences like natural-language text, we apply NLP-inspired approaches to identify and classify structural types from sequence patterns.
Proteins are composed of sequences of amino acids, and their structure plays a critical role in their function. This project applies traditional ML algorithms, sequence-based neural networks, and state-of-the-art pretrained models to classify protein sequences into structural categories.
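
To make the "sequence as language" idea concrete, here is a minimal sketch of one common tokenization: splitting a sequence into overlapping k-mers that play the role of words. The function names `kmer_tokens` and `kmer_counts`, the choice of `k=3`, and the example sequence are illustrative assumptions, not necessarily the preprocessing used in this project.

```python
from collections import Counter
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split an amino-acid sequence into overlapping k-mers ("words").

    Hypothetical helper for illustration; k=3 is an assumed default.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    sequence = sequence.strip().upper()
    # A sequence shorter than k yields no tokens.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Bag-of-k-mers representation: how often each k-mer occurs."""
    return Counter(kmer_tokens(sequence, k))


if __name__ == "__main__":
    example = "MKVLAAGIVG"  # made-up sequence, not taken from the dataset
    print(kmer_tokens(example))
    print(kmer_counts(example).most_common(3))
```

Counts like these can be fed to classical models such as Naive Bayes or Logistic Regression in the same way word counts are used for text classification.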
- Naive Bayes
- XGBoost
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Bidirectional LSTM (BiLSTM)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Gated Recurrent Unit (GRU)
- ProtBERT
- ESM2
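
Sequence models such as the BiLSTM, CNN, RNN, and GRU need fixed-length numeric input. The sketch below shows one typical way to prepare it: map each of the 20 standard amino acids to an integer and pad or truncate to a fixed length. The vocabulary layout (`0` for padding, `21` for unknown residues), the `encode` helper, and `max_len` are assumptions for illustration; the project's actual encoding may differ.

```python
from typing import List

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

PAD_ID = 0                       # assumed id reserved for padding
UNK_ID = len(AMINO_ACIDS) + 1    # assumed id (21) for non-standard residues
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode(sequence: str, max_len: int) -> List[int]:
    """Integer-encode a sequence, truncating or right-padding to max_len."""
    ids = [AA_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    # Made-up sequences; "X" is not a standard residue and maps to UNK_ID.
    for seq in ["ACD", "MKXVL", "ACDEFGHIKL"]:
        print(seq, encode(seq, max_len=6))
```

The resulting integer lists can be stacked into a tensor and passed to an embedding layer as the first step of a neural model.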
| Approach | Model | Accuracy |
|---|---|---|
| Machine Learning | Naive Bayes | 86.53% |
| Machine Learning | XGBoost | 81.26% |
| Machine Learning | Logistic Regression | 90.45% |
| Machine Learning | KNN | 75.54% |
| Deep Learning | BiLSTM | 84.19% |
| Deep Learning | CNN | 82.43% |
| Deep Learning | GRU | 84.71% |
| Deep Learning | RNN | 50.67% |
| Transfer Learning | ProtBERT | 82.79% |
| Transfer Learning | ESM2 | 96.43% |
- Source: Kaggle - Protein Dataset
- Classes: 8
- Size: 67,728 sequences
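
With 8 classes, it usually helps to keep class proportions the same in the training and test sets. How the data was actually split is not stated here, so the following is only a sketch of a stratified split using the standard library; `stratified_split`, the 20% test fraction, and the seed are hypothetical choices.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sampling each class separately.

    Labels must be sortable (for example, strings) so the result is
    reproducible for a given seed.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take roughly test_fraction of each class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    toy_labels = ["alpha"] * 10 + ["beta"] * 5  # made-up class names
    train, test = stratified_split(toy_labels)
    print(len(train), len(test))
```

Scikit-learn's `train_test_split` with its `stratify` argument offers the same behavior if you prefer a library call.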
- Python
- NumPy
- Pandas
- Scikit-learn
- XGBoost
- PyTorch
- Transformers (Hugging Face)
- Author - Naman Arora
- Twitter - @naman_22a
Protein Sequences Classification is licensed under the GNU General Public License v3.0 (GPLv3).