naman22a/8-protein-sequence-classification

🧬 Protein Sequences Classification

This project applies Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) techniques to classify protein sequences by structural type. Because a protein sequence is a string over the amino-acid alphabet, it can be treated like natural-language text, so NLP-inspired approaches can be used to classify structural types from sequence patterns.


📌 Project Overview

Proteins are chains of amino acids, and a protein's structure plays a critical role in its function. This project applies traditional ML algorithms, sequence-based neural networks, and pretrained protein language models to classify protein sequences into structural categories.
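
To make the "sequence as language" idea concrete, one common way to turn a protein sequence into word-like tokens is to split it into overlapping k-mers. The sketch below is illustrative only: the helper names `kmers` and `kmer_counts`, the choice of k = 3, and the toy sequence are assumptions and are not taken from this repository.

```python
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    sequence = sequence.strip().upper()
    # Sequences shorter than k yield an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Bag-of-k-mers representation: how often each k-mer occurs."""
    return Counter(kmers(sequence, k))


if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    print(kmers("MKTAYIAK"))
    print(kmer_counts("MKTAYIAK").most_common(3))
```

Counts of this kind could serve as input features for classical models such as Naive Bayes or Logistic Regression; whether the project uses k-mers, and with which k, is not stated here.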


🧠 Model Architectures

🔹 Machine Learning Models

  • Naive Bayes
  • XGBoost
  • Logistic Regression
  • K-Nearest Neighbors (KNN)

🔹 Deep Learning Models

  • Bidirectional LSTM (BiLSTM)
  • Convolutional Neural Network (CNN)
  • Recurrent Neural Network (RNN)
  • Gated Recurrent Unit (GRU)
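
Sequence models such as these typically consume fixed-length integer arrays rather than raw strings. The following is a minimal sketch of one possible encoding step, assuming the 20 standard amino acids, a padding id of 0, and an "unknown" id of 1; the names `VOCAB` and `encode` and the default `max_len` are hypothetical and do not come from this repository.

```python
# The 20 standard amino acids. Index 0 is reserved for padding and
# index 1 for any other character (assumed conventions for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNK = 0, 1
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, max_len: int = 10) -> list[int]:
    """Map residues to integer ids, then truncate or right-pad to max_len."""
    ids = [VOCAB.get(aa, UNK) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    print(encode("MKTAY", max_len=8))
```

The resulting id lists can be stacked into a tensor and passed to an embedding layer; the actual preprocessing used for the models listed above may differ.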

🔹 Transfer Learning Models

  • ProtBERT
  • ESM2

📊 Evaluation Metrics

🔹 Machine Learning Results

| Model | Accuracy |
|---|---|
| Naive Bayes | 86.53% |
| XGBoost | 81.26% |
| Logistic Regression | 90.45% |
| KNN | 75.54% |

🔹 Deep Learning Results

| Model | Accuracy |
|---|---|
| BiLSTM | 84.19% |
| CNN | 82.43% |
| GRU | 84.71% |
| RNN | 50.67% |

🔹 Transfer Learning Results

| Model | Accuracy |
|---|---|
| ProtBERT | 82.79% |
| ESM2 | 96.43% |
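
For reference, accuracy is the fraction of sequences whose predicted class matches the true class. The snippet below is a small sketch of that calculation, assuming the reported figures are plain (unweighted) accuracy; the function name `accuracy` and the toy labels are hypothetical and are not results from this project.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    true_labels = ["classA", "classB", "classA", "classC"]
    predictions = ["classA", "classB", "classC", "classC"]
    print(f"{accuracy(true_labels, predictions):.2%}")
```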

🧪 Dataset


🛠️ Tech Stack

  • Python
  • NumPy
  • Pandas
  • Scikit-learn
  • XGBoost
  • PyTorch
  • Transformers (Hugging Face)

📫 Stay in touch

🗒️ License

Protein Sequences Classification is licensed under the GNU General Public License v3 (GPLv3).
