# feat(pipeline): Data Pipeline for Safaa (PR #22, base: main)
## `.github/workflows/pipeline.yml` (new file, +105 lines)

```yaml
name: Safaa Model retraining

on:
  # push:
  #   branches: [testing]
  # pull_request:
  #   branches: [testing]
  workflow_dispatch:

jobs:
  safaa-model-retraining:
    runs-on: ubuntu-latest
    env:
      BASE_PATH: utility/retraining
      PYTHONPATH: ${{ github.workspace }}/Safaa/src
      # SAFAA_SECRET: ${{ secrets.MY_SECRET }}

    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
        with:
          persist-credentials: false

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install "spacy~=3.8.7" \
            "pandas~=2.2.3" \
            "psycopg2~=2.9.10" \
            "python-dotenv~=1.1.0" \
            "scikit-learn~=1.7.0"

      # - name: Install dependencies
      #   run: |
      #     pip install -r $BASE_PATH/requirements.txt
      # - name: Create .env file
      #   run: |
      #     echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env
      #     echo "DB_USER=${{ secrets.DB_USER }}" >> .env
      #     echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env
      #     echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env
      #     echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env

      # - name: Fetch Copyright content from server
      #   run: |
      #     python $BASE_PATH/script_for_copyrights.py

      - name: Run utility steps
        run: |
          python $BASE_PATH/utility_scripts.py --preprocess
          python $BASE_PATH/utility_scripts.py --declutter
          python $BASE_PATH/utility_scripts.py --split
          python $BASE_PATH/utility_scripts.py --train
          python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt

      - name: Extract metrics
        id: metrics
        run: |
          echo "accuracy=$(grep "Accuracy" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "precision=$(grep "Precision" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "recall=$(grep "Recall" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "f1=$(grep "F1 Score" test_metrics.txt | awk '{print $4}')" >> $GITHUB_OUTPUT

      - name: Upload trained model artifact
        uses: actions/upload-artifact@v4
        with:
          name: safaa-trained-model
          path: ${{ env.BASE_PATH }}/model

      - name: Move retrained model to original path
        run: |
          ORIGINAL_PATH="Safaa/src/safaa/models"
          mkdir -p "$ORIGINAL_PATH"
          cp $BASE_PATH/model/false_positive_detection_vectorizer.pkl Safaa/src/safaa/models/
          cp $BASE_PATH/model/false_positive_detection_model_sgd.pkl Safaa/src/safaa/models/

      - name: Set branch name
        id: vars
        run: echo "branch_name=$(date +'%Y%m%d-%H%M%S')" >> $GITHUB_OUTPUT

      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          branch: retrained-model-${{ steps.vars.outputs.branch_name }}
          commit-message: "Update retrained Safaa model"
          add-paths: |
            Safaa/src/safaa/models/false_positive_detection_vectorizer.pkl
            Safaa/src/safaa/models/false_positive_detection_model_sgd.pkl
          title: "Retrained model - Accuracy ${{ steps.metrics.outputs.accuracy }}, F1 ${{ steps.metrics.outputs.f1 }}"
          body: |
            This PR contains the newly retrained Safaa model.

            **Test results:**
            - Accuracy: ${{ steps.metrics.outputs.accuracy }}
            - Precision: ${{ steps.metrics.outputs.precision }}
            - Recall: ${{ steps.metrics.outputs.recall }}
            - F1 Score: ${{ steps.metrics.outputs.f1 }}

            The trained model is also available as a downloadable artifact from the workflow run.
          base: main
```

**Review comments**

- On `persist-credentials: false` (Checkout repo step): Do we need this for checking out the repo? This might be required for the PR creation step.
- On the `pip install "spacy~=3.8.7"` line: Look into …
- On lines +31 to +35 (the dependency list): Can we add these dependencies in the project …? Even if we can't, we should add all the required dependencies in the pipeline while setting up Python. From the perspective of users running the scripts from source and training locally, we need to handle the dependencies anyway; maybe add a local …
- On lines +37 to +49 (the commented-out steps): If we don't need these, the comments can be removed.
- On the `--preprocess` step: see `safaa/Safaa/src/safaa/Safaa.py` line 272 in 0c93457, function: …
- On lines +54 to +58 (the `Run utility steps` block): Umm, should everything run under one stage? Now …
- On the `Extract metrics` step: This should be a sub-step to …
- On the artifact upload step: Are we also doing model versioning somewhere? If we keep overwriting, rolling back would be an issue.
- On lines +78 to +79 (the `cp` commands): Logically the …
- On the `Set branch name` step: Follow MLOps principles and see where we can add this as a sub-step?
- On lines +91 to +93 (`add-paths`): Use relative paths, not absolute.
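The `Extract metrics` step depends on the exact output shape of `utility_scripts.py --test`: the value must be the 3rd whitespace-separated field on the `Accuracy`/`Precision`/`Recall` lines and the 4th on the `F1 Score` line. A minimal sketch of that parsing, using a *fabricated* `test_metrics.txt` in one format those `grep`/`awk` calls would accept (the real format comes from the script and is not shown in this PR):

```shell
# Fabricated sample output; the real format is produced by utility_scripts.py --test
cat > test_metrics.txt <<'EOF'
Accuracy : 0.95
Precision : 0.93
Recall : 0.94
F1 Score : 0.935
EOF

# Same extraction commands the workflow's "Extract metrics" step runs
accuracy=$(grep "Accuracy" test_metrics.txt | awk '{print $3}')
precision=$(grep "Precision" test_metrics.txt | awk '{print $3}')
recall=$(grep "Recall" test_metrics.txt | awk '{print $3}')
f1=$(grep "F1 Score" test_metrics.txt | awk '{print $4}')
echo "accuracy=$accuracy precision=$precision recall=$recall f1=$f1"
```

Note the asymmetry: `F1 Score` is two words, which shifts its value to field 4, so any change to the script's print format silently breaks these extractions.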
## `utility/retraining` README (new file, +76 lines)
## Safaa Model Retraining Pipeline

This folder provides an automated pipeline for retraining the Safaa false-positive detection model. It covers data fetching from a Fossology server instance (currently on localhost), preprocessing, model training, evaluation, and automated pull-request creation with the updated model.

### Overview

The Safaa Model Retraining Pipeline is designed to:
- Fetch copyright data from a Fossology server instance.
- Preprocess and clean the copyright data.
- Train a false-positive detection model using Safaa.
- Evaluate model performance on test data.
- Automatically create pull requests with retrained models and performance metrics.
### Features

- **Fossology Server Integration**: Fetch copyright data directly from a local Fossology instance.
- **Automated Workflow**: GitHub Actions workflow that retrains the model when triggered.
- **Data Pipeline**: Complete data preprocessing, including decluttering and splitting.
- **Model Training**: Train false-positive detection models using the SafaaAgent training script.
- **Performance Metrics**: Automatic calculation of accuracy, precision, recall, and F1 score.
- **Version Control**: Automatic PR creation with model performance in the title.

### Project Structure

```
├── .github/
│   └── workflows/
│       └── pipeline.yml                 # GitHub Actions workflow
├── utility/
│   └── retraining/
│       ├── script_for_copyrights.py     # Fossology server copyright fetch script
│       ├── utility_scripts.py           # Main retraining scripts
│       ├── data/                        # Data directory
│       │   └── copyrights_*.csv         # Fetched copyright data (copyrights_<timestamp> format)
│       └── model/                       # Trained model output
```
### How to fetch copyrights from a local Fossology instance

- To fetch copyrights from a local Fossology instance, create a `.env` file in your project root with the following database credentials:

```env
DB_NAME=your_database_name
DB_USER=your_database_user
DB_PASSWORD=your_database_password
DB_HOST=your_database_host
DB_PORT=your_database_port
```
- Start the Fossology instance locally.
- Run the `script_for_copyrights.py` script.
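Before running the fetch script, a quick shell check can confirm the `.env` file carries every key listed above. This is only a sketch; the credential values below are placeholders, not real defaults:

```shell
# Placeholder credentials for illustration only
cat > .env <<'EOF'
DB_NAME=fossology
DB_USER=fossy
DB_PASSWORD=fossy
DB_HOST=localhost
DB_PORT=5432
EOF

# Verify every expected key is present in .env
missing=0
for key in DB_NAME DB_USER DB_PASSWORD DB_HOST DB_PORT; do
  grep -q "^${key}=" .env || { echo "missing: $key"; missing=1; }
done
echo "missing=$missing"
```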
### Pipeline Usage (GitHub Actions)

The workflow can be manually triggered from the GitHub Actions tab in your repository. It will:
1. Run the preprocessing through model-training steps.
2. Run the evaluation.
3. Create a PR with the updated model.

### Model Output

The trained model produces two files:
- `false_positive_detection_vectorizer.pkl`: Text vectorizer for feature extraction
- `false_positive_detection_model_sgd.pkl`: Trained SGD classifier
### Performance Metrics

The pipeline automatically calculates and reports:
- Accuracy
- Precision
- Recall
- F1 Score
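Since F1 is the harmonic mean of precision and recall, the reported F1 can be cross-checked from the other two metrics. A small sketch with made-up values (not real pipeline output):

```shell
# Hypothetical precision/recall values for illustration
precision=0.93
recall=0.94

# F1 = 2PR / (P + R)
f1=$(awk -v p="$precision" -v r="$recall" 'BEGIN { printf "%.4f", 2 * p * r / (p + r) }')
echo "F1 = $f1"
```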
## Contact Information

- **Name**: Abdulsobur Oyewale
- **Email**: [[email protected]](mailto:[email protected])