This project implements action recognition on videos using the Vision Transformer (ViT) model. It includes a Streamlit-based web application for uploading videos, predicting actions, and visualizing Grad-CAM heatmaps.
- Train and evaluate a Vision Transformer (ViT) model for video classification.
- Web interface for uploading videos and predicting actions.
- Grad-CAM visualizations for model interpretability.
action-recognition-vit
├── src
│   ├── models
│   │   └── vit.py          # Implementation of the Vision Transformer model
│   ├── training
│   │   ├── train.py        # Training script for the ViT model
│   │   └── dataset.py      # Dataset class for loading and preprocessing video data
│   ├── evaluation
│   │   └── evaluate.py     # Evaluation script for assessing model performance
│   ├── web
│   │   └── app.py          # Web application for user interaction
│   └── utils
│       └── helpers.py      # Utility functions for data processing and visualization
├── requirements.txt        # List of project dependencies
├── README.md               # Project documentation
└── .gitignore              # Files and directories to ignore in Git
Follow these steps to set up the project on your local machine:
Clone the repository to your local machine:
git clone https://github.com/your-repo/action-recognition-vit.git
cd action-recognition-vit
Create and activate a virtual environment to manage dependencies:
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
# On Windows:
.\.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
Install the required Python packages:
pip install --upgrade pip
pip install -r requirements.txt
To train the Vision Transformer model on your dataset, run the following command:
python src/training/train.py
After training, evaluate the model's performance using:
python src/evaluation/evaluate.py
The project includes a Streamlit-based web application for uploading videos and predicting actions.
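- Ensure the virtual environment is activated (Windows command shown; on macOS/Linux use source .venv/bin/activate):
  .\.venv\Scripts\activate
- Start the Streamlit app:
  streamlit run src/web/app.py
- Open the local URL that Streamlit prints in the terminal, or use the hosted app: https://action-recognition-using-vit.streamlit.app/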
- Upload Videos: Upload a video file in .mp4, .avi, or .mov format.
- Action Prediction: The app predicts the action in the video using the Vision Transformer model.
- Grad-CAM Visualizations: Visualize Grad-CAM heatmaps to understand which parts of the video influenced the model's predictions.
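The Grad-CAM computation behind the heatmaps can be sketched in a few lines. This README does not describe how the project applies Grad-CAM to the ViT, so the following is a minimal, dependency-free illustration under stated assumptions: the activations and gradients come from one transformer layer, the class token has already been removed, and the patch tokens form a `grid_h x grid_w` grid. The function name `grad_cam_heatmap` is hypothetical and is not taken from the project's code.

```python
def grad_cam_heatmap(activations, gradients, grid_h, grid_w):
    """Compute a Grad-CAM heatmap from patch-token activations and gradients.

    activations, gradients: nested lists of shape [num_tokens][channels],
    with the class token already removed (num_tokens == grid_h * grid_w).
    This layout is an assumption made for the sketch.
    """
    num_tokens = len(activations)
    channels = len(activations[0])
    if num_tokens != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")

    # Channel weights: the gradient averaged over all patch tokens.
    weights = [
        sum(gradients[t][k] for t in range(num_tokens)) / num_tokens
        for k in range(channels)
    ]

    # Weighted sum over channels for each token, followed by ReLU.
    cam = [
        max(0.0, sum(weights[k] * activations[t][k] for k in range(channels)))
        for t in range(num_tokens)
    ]

    # Scale to [0, 1]; an all-zero map is left unchanged.
    peak = max(cam)
    if peak > 0:
        cam = [value / peak for value in cam]

    # Reshape the flat token list into a grid_h x grid_w map.
    return [cam[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: a 2x2 patch grid with 2 channels.
acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
grads = [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]
heatmap = grad_cam_heatmap(acts, grads, 2, 2)
print(heatmap)
```

In a real pipeline the resulting grid would typically be upsampled to the frame resolution and overlaid on the frame as a colour map; that step is omitted here.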
- Upload a Video:
  - Use the sidebar to upload a video file.
  - Supported formats: .mp4, .avi, .mov.
- View Uploaded Video:
  - The uploaded video is displayed in the main interface.
- Prediction and Visualization:
  - The app extracts frames from the video and processes them through the ViT model.
  - The predicted action is displayed, and Grad-CAM heatmaps are generated for interpretability.
- Interact with Results:
  - View Grad-CAM heatmaps for each frame to understand the model's focus areas.
  - Upload another video to repeat the process.
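The frame extraction and prediction step can be illustrated with a small sketch. The README does not say how frames are chosen or how per-frame outputs are combined into a single action, so both parts below are assumptions: frames are sampled at evenly spaced positions, and the per-frame class probabilities are averaged. The helper names `sample_frame_indices` and `average_probabilities` are hypothetical.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Take the centre of each of num_samples equal-length segments.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def average_probabilities(frame_probs):
    """Average per-frame class probabilities; return (best_class, mean_probs)."""
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    mean_probs = [
        sum(probs[c] for probs in frame_probs) / num_frames
        for c in range(num_classes)
    ]
    best_class = max(range(num_classes), key=lambda c: mean_probs[c])
    return best_class, mean_probs


# Pick 4 frames from a 100-frame video.
indices = sample_frame_indices(total_frames=100, num_samples=4)
print(indices)

# Hypothetical per-frame softmax outputs for a 3-class model.
best, mean_probs = average_probabilities([[0.1, 0.7, 0.2], [0.2, 0.5, 0.3]])
print(best)
```

Sampling segment centres avoids always picking the first and last frames, which are often fades or title cards. Averaging probabilities is only one possible aggregation; majority voting over per-frame predictions is a common alternative.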
- Dependencies Not Installed: Ensure all dependencies are installed:
  pip install -r requirements.txt
- Streamlit App Not Starting: Ensure the virtual environment is activated and all dependencies are installed.
- CUDA Issues: If using a GPU, ensure PyTorch is installed with CUDA support:
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
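To check whether the installed PyTorch build can actually see a GPU, a short diagnostic such as the one below may help. It is a sketch and not part of the project: the function name `describe_torch_device` is hypothetical, and it relies only on the standard `torch.cuda.is_available()`, `torch.cuda.get_device_name()`, and `torch.version.cuda` attributes.

```python
def describe_torch_device():
    """Report which device PyTorch would use, or that PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the installed wheel was built against.
        name = torch.cuda.get_device_name(0)
        return f"cuda (built for CUDA {torch.version.cuda}): {name}"
    return "cpu (no usable CUDA device found)"


status = describe_torch_device()
print(status)
```

If this reports `cpu` on a machine with an NVIDIA GPU, the installed wheel is likely a CPU-only build, and reinstalling from the CUDA index shown above is the usual fix.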
Contributions are welcome! Please feel free to submit a pull request or open an issue for any suggestions or improvements.
This project is licensed under the MIT License. See the LICENSE file for details.

