21 commits
- `8e6fc60` feat(script): Integrated script to preprocess fetched copyright conte… (smilingprogrammer, Jul 9, 2025)
- `cff1df9` feat(script): add script to fetch copyright from server (smilingprogrammer, Jul 9, 2025)
- `3431499` feat(script): Integrated script to declutter fetched copyright conten… (smilingprogrammer, Jul 9, 2025)
- `ba63bf5` feat(script): Resolve corrections in script for fetching of content f… (smilingprogrammer, Jul 12, 2025)
- `910e8bb` feat(script): added different files into one single script with their… (smilingprogrammer, Jul 13, 2025)
- `a09195d` feat(script): Resolve corrections in pipeline.yml scrit (smilingprogrammer, Jul 14, 2025)
- `197e694` feat(script): changed pipeline scripts location, and renamed folder t… (smilingprogrammer, Jul 21, 2025)
- `b1bbfdc` feat(script): resolve corrections in pipeline.yml scrit (smilingprogrammer, Jul 21, 2025)
- `3ed9383` feat(script): made huge update to the scripts regarding path and pipe… (smilingprogrammer, Jul 22, 2025)
- `40f45e6` feat(script): made huge update to the scripts regarding path and pipe… (smilingprogrammer, Jul 22, 2025)
- `22660f7` feat(script): declared global paths, implements argument, and direct … (smilingprogrammer, Jul 28, 2025)
- `d1e63cd` feat(script): implemented training script in pipeline (smilingprogrammer, Jul 29, 2025)
- `bff64a5` feat(script): fixed path conflict (smilingprogrammer, Jul 29, 2025)
- `c6de8df` feat(script): updated sql script to leave out ignored contents when f… (smilingprogrammer, Aug 4, 2025)
- `5245a38` feat(script): integrated both the testing phase in the .yml file and … (smilingprogrammer, Aug 6, 2025)
- `e12372b` feat(script): added declutter model & entity recognizer folders to ut… (smilingprogrammer, Aug 6, 2025)
- `98b03a6` feat(script): removed all requirements.txt relations (smilingprogrammer, Aug 6, 2025)
- `46e4c8d` feat(script): fixed preprocessing and decluttering to maintain datafr… (smilingprogrammer, Aug 8, 2025)
- `852b764` feat(script): added data for testing pipeline (smilingprogrammer, Aug 10, 2025)
- `f596515` feat(script): updated pipeline scripts to automate creation of PR (smilingprogrammer, Aug 16, 2025)
- `0322dcc` feat(docs): included a README.md for the folder (smilingprogrammer, Aug 25, 2025)
105 changes: 105 additions & 0 deletions .github/workflows/pipeline.yml
@@ -0,0 +1,105 @@
name: Safaa Model retraining

on:
  # push:
  #   branches: [testing]
  # pull_request:
  #   branches: [testing]
  workflow_dispatch:

jobs:
  safaa-model-retraining:
    runs-on: ubuntu-latest
    env:
      BASE_PATH: utility/retraining
      PYTHONPATH: ${{ github.workspace }}/Safaa/src
      # SAFAA_SECRET: ${{ secrets.MY_SECRET }}

    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
        with:
          persist-credentials: false
> **Review comment:** Do we need this for checking out the repo? This might be required for the PR creation step.

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install "spacy~=3.8.7" \
> **Review comment:** Look into actions/cache to cache pip dependencies, which will increase the speed of execution and reduce bandwidth too.
            "pandas~=2.2.3" \
            "psycopg2~=2.9.10" \
            "python-dotenv~=1.1.0" \
            "scikit-learn~=1.7.0"
> **Review comment (lines +31 to +35):** Can we add these dependencies in the project toml file? We can install them with the required flag?

> **Review comment:** Even if we can't, then we should add all the required dependencies in the pipeline while setting up Python. From the perspective of users using the scripts from source and training locally, we need to handle the dependencies anyway; maybe add a local requirements.txt which satisfies both utility and script_for_copyright.

      # - name: Install dependencies
      #   run: |
      #     pip install -r $BASE_PATH/requirements.txt

      # - name: Create .env file
      #   run: |
      #     echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env
      #     echo "DB_USER=${{ secrets.DB_USER }}" >> .env
      #     echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env
      #     echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env
      #     echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env

      # - name: Fetch Copyright content from server
      #   run: |
> **Review comment (lines +37 to +49):** If we don't need these, the comments can be removed.
      #     python $BASE_PATH/script_for_copyrights.py

      - name: Run utility steps
        run: |
          python $BASE_PATH/utility_scripts.py --preprocess
> **Review comment:** The function train_false_positive_detector_model() already calls `preprocessed_data = self.preprocess_data(data)`, so it already preprocesses the data. Do we need to add the preprocess step anyhow?

          python $BASE_PATH/utility_scripts.py --declutter
          python $BASE_PATH/utility_scripts.py --split
          python $BASE_PATH/utility_scripts.py --train
          python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt
> **Review comment (lines +54 to +58):** Hmm, should everything run under one stage? Preprocess comes under a Data Preprocessing step; train and test come under a Model Training and Testing step (split can also be part of that). Right now the --train flag trains the false positive detector; the NER also has train and test steps (maybe we can include those as well if that makes sense?).


      - name: Extract metrics
> **Review comment:** This should be a sub-step of the Model Training & Testing stage. We cannot have 10 different steps; we need to classify them under the basic MLOps stages (see "ML-Ops Principles").

        id: metrics
        run: |
          echo "accuracy=$(grep "Accuracy" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "precision=$(grep "Precision" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "recall=$(grep "Recall" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "f1=$(grep "F1 Score" test_metrics.txt | awk '{print $4}')" >> $GITHUB_OUTPUT

      - name: Upload trained model artifact
        uses: actions/upload-artifact@v4
        with:
          name: safaa-trained-model
          path: ${{ env.BASE_PATH }}/model
> **Review comment:** Are we also doing model versioning somewhere? If we keep overwriting, rolling back would be an issue. Also save the metrics as model_{version}_metrics.txt; it will be handy for rolling back to the correct version.


      - name: Move retrained model to original path
        run: |
          ORIGINAL_PATH="Safaa/src/safaa/models"
          mkdir -p "$ORIGINAL_PATH"
          cp $BASE_PATH/model/false_positive_detection_vectorizer.pkl Safaa/src/safaa/models/
          cp $BASE_PATH/model/false_positive_detection_model_sgd.pkl Safaa/src/safaa/models/
> **Review comment (lines +78 to +79):** Logically the agent.save() function already saves the model at the right path. Why do we need to cp everything altogether?


      - name: Set branch name
> **Review comment:** Follow the MLOps principles and see where we can add this as a sub-step. The major focus should be on classifying each step into the correct stage. :)

        id: vars
        run: echo "branch_name=$(date +'%Y%m%d-%H%M%S')" >> $GITHUB_OUTPUT

      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          branch: retrained-model-${{ steps.vars.outputs.branch_name }}
          commit-message: "Update retrained Safaa model"
          add-paths: |
            Safaa/src/safaa/models/false_positive_detection_vectorizer.pkl
            Safaa/src/safaa/models/false_positive_detection_model_sgd.pkl
> **Review comment (lines +91 to +93):** Use relative paths, not absolute.

          title: "Retrained model - Accuracy ${{ steps.metrics.outputs.accuracy }}, F1 ${{ steps.metrics.outputs.f1 }}"
          body: |
            This PR contains the newly retrained Safaa model.

            **Test results:**
            - Accuracy: ${{ steps.metrics.outputs.accuracy }}
            - Precision: ${{ steps.metrics.outputs.precision }}
            - Recall: ${{ steps.metrics.outputs.recall }}
            - F1 Score: ${{ steps.metrics.outputs.f1 }}

            The trained model is also available as a downloadable artifact from the workflow run.
          base: main
76 changes: 76 additions & 0 deletions utility/retraining/README.md
@@ -0,0 +1,76 @@
## Safaa Model Retraining Pipeline

This folder provides an automated pipeline for retraining the Safaa false positive detection model. It covers fetching data from a Fossology server instance (currently on localhost), preprocessing, model training, evaluation, and automated pull request creation with the updated model.

### Overview

The Safaa Model Retraining Pipeline is designed to:
- Fetch copyright data from a Fossology server instance.
- Preprocess and clean copyright data.
- Train a false positive detection model using Safaa.
- Evaluate model performance on test data.
- Automatically create pull requests with retrained models and performance metrics.

### Features

- **Fossology Server Integration**: Fetch copyright data directly from a local Fossology instance.
- **Automated Workflow**: GitHub Actions workflow for model retraining when triggered.
- **Data Pipeline**: Complete data preprocessing, including decluttering and splitting.
- **Model Training**: Train false positive detection models using the SafaaAgent training script.
- **Performance Metrics**: Automatic calculation of accuracy, precision, recall, and F1 score.
- **Version Control**: Automatic PR creation with model performance in the title.

### Project Structure

```
├── .github/
│   └── workflows/
│       └── pipeline.yml              # GitHub Actions workflow
├── utility/
│   └── retraining/
│       ├── script_for_copyrights.py  # Fossology server copyright fetch script
│       ├── utility_scripts.py        # Main retraining scripts
│       ├── data/                     # Data directory
│       │   └── copyrights_*.csv      # Fetched copyright data (copyrights_<timestamp> format)
│       └── model/                    # Trained model output
```

### How to fetch copyrights from a local Fossology instance

- To fetch copyrights from a local Fossology instance, create a `.env` file in your project root with the following database credentials:

```env
DB_NAME=your_database_name
DB_USER=your_database_user
DB_PASSWORD=your_database_password
DB_HOST=your_database_host
DB_PORT=your_database_port
```
- Start the Fossology instance locally.
- Run the `script_for_copyrights.py` script.
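
The credential handling in that script can be sketched with the standard library alone. This is a minimal, illustrative sketch assuming only the `.env` keys shown above; the actual `script_for_copyrights.py` uses `python-dotenv` and `psycopg2`, and the connection call in the comment is an assumption, not the script's confirmed code:

```python
def parse_env_lines(lines):
    """Parse simple KEY=VALUE pairs, skipping blank lines and # comments."""
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# With the parsed credentials, the real script connects via psycopg2
# (not imported here to keep the sketch stdlib-only), roughly:
#   conn = psycopg2.connect(dbname=env["DB_NAME"], user=env["DB_USER"],
#                           password=env["DB_PASSWORD"],
#                           host=env["DB_HOST"], port=env["DB_PORT"])

env = parse_env_lines([
    "DB_NAME=fossology",
    "# local instance",
    "DB_HOST=localhost",
    "DB_PORT=5432",
])
print(env["DB_NAME"])  # fossology
```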

### Pipeline Usage (GitHub Actions)

The workflow can be manually triggered from the GitHub Actions tab in your repository. It will:
1. Run the steps from data pre-processing through model training.
2. Evaluate the retrained model.
3. Create a PR with the updated model.

### Model Output

The trained model produces two files:
- `false_positive_detection_vectorizer.pkl`: Text vectorizer for feature extraction.
- `false_positive_detection_model_sgd.pkl`: Trained SGD classifier.
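
At inference time the two artifacts are loaded together. Below is a hedged sketch of that loading step; the `transform`/`predict` calls in the comment assume the usual scikit-learn vectorizer/classifier interfaces, not a confirmed Safaa API:

```python
import pickle

def load_artifacts(vectorizer_path, model_path):
    """Unpickle the vectorizer and classifier produced by the pipeline."""
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return vectorizer, model

# Illustrative usage (scikit-learn style, file names as emitted by the pipeline):
#   vec, clf = load_artifacts("false_positive_detection_vectorizer.pkl",
#                             "false_positive_detection_model_sgd.pkl")
#   labels = clf.predict(vec.transform(["Copyright (c) 2024 Example Corp"]))
```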

### Performance Metrics

The pipeline automatically calculates and reports:
- Accuracy
- Precision
- Recall
- F1 Score
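
For reference, all four metrics reduce to simple counts over the test labels. A stdlib sketch for the binary case (treating 1 as the positive, false-positive-copyright class), independent of the scikit-learn routines the pipeline actually uses:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from parallel lists of 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(m)  # every metric is 0.5 here: tp = fp = fn = tn = 1
```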

## Contact Information

- **Name**: Abdulsobur Oyewale
- **Email**: [[email protected]](mailto:[email protected])