🧬 Protein Sequences Classification

This project applies Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) techniques to the classification of structural protein sequences. Treating protein sequences like natural-language text, it uses NLP-inspired approaches to assign each sequence to a structural class based on its sequence patterns.
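
One common way to treat a protein sequence like text is to split it into overlapping k-mers, which then play the role of words. The sketch below illustrates that idea only; the function names, the choice of k = 3, and the example sequence are assumptions made for illustration and are not taken from the project's notebooks.

```python
from collections import Counter


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    sequence = sequence.strip().upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Bag-of-words style counts of the k-mers in one sequence."""
    return Counter(kmer_tokens(sequence, k))


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # made-up sequence, for illustration only
    print(kmer_tokens(example))
    print(kmer_counts(example).most_common(3))
```

Counts like these can serve as features for classical models, while the token list itself can be fed to sequence models.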

📌 Project Overview

Proteins are chains of amino acids, and a protein's structure plays a critical role in its function. This project applies traditional ML algorithms, sequence-based neural networks, and pretrained protein language models to classify protein sequences into structural categories.

🧠 Model Architectures

🔹 Machine Learning Models

Naive Bayes
XGBoost
Logistic Regression
K-Nearest Neighbors (KNN)

🔹 Deep Learning Models

Bidirectional LSTM (BiLSTM)
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Gated Recurrent Unit (GRU)
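
Sequence models such as these usually take fixed-length integer inputs, so each amino acid is typically mapped to an index and sequences are padded or truncated to a common length. The following is a minimal sketch of that preprocessing step under stated assumptions: the vocabulary of 20 standard amino acids, the PAD and UNK indices, and the encode helper are hypothetical and may differ from what data_preprocessing.ipynb actually does.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed special indices: 0 for padding, 1 for any unknown residue.
PAD, UNK = 0, 1
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, max_len: int) -> list[int]:
    """Map residues to integer ids, truncate to max_len, right-pad with PAD."""
    ids = [VOCAB.get(aa, UNK) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    batch = [encode(seq, max_len=8) for seq in ["MKTAY", "ACDEFGHIKLMN"]]
    for row in batch:
        print(row)
```

The resulting lists all have the same length, so a batch can be converted directly to a tensor and passed to an embedding layer.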

🔹 Transfer Learning Models

ProtBERT
ESM2

📊 Evaluation Metrics

🔹 Machine Learning Results

Model	Accuracy
Naive Bayes	86.53%
XGBoost	81.26%
Logistic Regression	90.45%
KNN	75.54%

🔹 Deep Learning Results

Model	Accuracy
BiLSTM	84.19%
CNN	82.43%
GRU	84.71%
RNN	50.67%

🔹 Transfer Learning Results

Model	Accuracy
ProtBERT	82.79%
ESM2	96.43%
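
For reference, accuracy is the fraction of sequences whose predicted class matches the true class. The sketch below shows that computation on made-up labels; the label values and the accuracy helper are illustrative assumptions, not code or data from the project.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Hypothetical class labels, for illustration only.
    y_true = ["A", "B", "C", "A"]
    y_pred = ["A", "B", "A", "A"]
    print(f"{accuracy(y_true, y_pred):.2%}")  # prints 75.00%
```

With several classes, accuracy alone can hide weak performance on rare classes, so per-class metrics may be worth checking alongside it.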

🧪 Dataset

Source: Kaggle - Protein Dataset
Classes: 8
Size: 67,728 sequences

🛠️ Tech Stack

Python
NumPy
Pandas
Scikit-learn
XGBoost
PyTorch
Transformers (Hugging Face)

📫 Stay in touch

Author - Naman Arora
Twitter - @naman_22a

🗒️ License

Protein Sequences Classification is licensed under the GNU General Public License v3 (GPLv3).
