This project automatically detects and masks sensitive information (SSNs, names, dates of birth, etc.) in scanned ID documents.
- OCR: Tesseract OCR (option: Google Vision API)
- NER: spaCy, HuggingFace Transformers
- Visual Layout: YOLOv5 (future enhancement)
- Image Processing: OpenCV, Pillow
- Backend (future): Flask
- Detect text regions (YOLO planned, Tesseract fallback used now)
- Extract text with OCR
- Identify sensitive info using Regex + NER
- Mask regions (black, blur, pixelation)
- Save outputs:
  - Masked image (/outputs)
  - JSON report (/reports)
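
The regex, masking, and reporting steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names (`find_sensitive`, `mask_black`, `build_report`), the two patterns, and the report layout are all hypothetical. It assumes OCR has already produced word boxes with `text`, `left`, `top`, `width`, and `height` fields, and it stands in a nested list of grayscale values for the image so it runs without OpenCV or Pillow. NER-based detection of names is not covered.

```python
import json
import re

# Hypothetical patterns for illustration; real documents may need
# stricter or locale-specific rules.
PATTERNS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "DOB": re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(19|20)\d{2}"),
}


def find_sensitive(words):
    """Return one hit per OCR word whose text fully matches a pattern.

    Each word is a dict with 'text', 'left', 'top', 'width', 'height'
    (an assumed shape for the OCR output).
    """
    hits = []
    for word in words:
        for label, pattern in PATTERNS.items():
            if pattern.fullmatch(word["text"]):
                box = [word["left"], word["top"], word["width"], word["height"]]
                hits.append({"label": label, "box": box})
                break  # one label per word is enough
    return hits


def mask_black(image, box):
    """Fill the box with 0 (black) in place, clipped to the image bounds.

    'image' is a list of rows of grayscale ints, a stand-in for a real
    image array.
    """
    left, top, width, height = box
    for y in range(max(top, 0), min(top + height, len(image))):
        row = image[y]
        for x in range(max(left, 0), min(left + width, len(row))):
            row[x] = 0


def build_report(hits):
    """Serialize labels and boxes only; the sensitive text itself is
    deliberately left out of the report."""
    return json.dumps({"masked_regions": hits}, indent=2)


if __name__ == "__main__":
    words = [
        {"text": "SSN:", "left": 0, "top": 0, "width": 2, "height": 1},
        {"text": "123-45-6789", "left": 2, "top": 0, "width": 4, "height": 1},
        {"text": "01/31/1990", "left": 1, "top": 2, "width": 3, "height": 1},
    ]
    image = [[255] * 6 for _ in range(3)]  # tiny all-white "scan"
    hits = find_sensitive(words)
    for hit in hits:
        mask_black(image, hit["box"])
    print(build_report(hits))
```

Matching per word keeps the sketch short, but a value that OCR splits across several words would be missed. Blur and pixelation would replace the fill in `mask_black` with a neighborhood operation over the same box.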
personal-info-masking/
├── data/ # input images
├── outputs/ # masked results
├── reports/ # json logs
├── src/ # source code
├── requirements.txt
└── README.md
pip install -r requirements.txt
python -m spacy download en_core_web_sm
cd src
python main.py

| Input | Masked Output |
|---|---|
| ![]() | ![]() |
- Add phone number + address masking
- Support cloud deployment
- Interactive web interface

