# feat(pipeline): Data Pipeline for Safaa (PR #22, base: main)
## `.github/workflows/pipeline.yml` (new file, +105 lines)

```yaml
name: Safaa Model retraining

on:
  # push:
  #   branches: [testing]
  # pull_request:
  #   branches: [testing]
  workflow_dispatch:

jobs:
  safaa-model-retraining:
    runs-on: ubuntu-latest
    env:
      BASE_PATH: utility/retraining
      PYTHONPATH: ${{ github.workspace }}/Safaa/src
      # SAFAA_SECRET: ${{ secrets.MY_SECRET }}

    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
        with:
          persist-credentials: false

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install "spacy~=3.8.7" \
            "pandas~=2.2.3" \
            "psycopg2~=2.9.10" \
            "python-dotenv~=1.1.0" \
            "scikit-learn~=1.7.0"

      # - name: Install dependencies
      #   run: |
      #     pip install -r $BASE_PATH/requirements.txt
      # - name: Create .env file
      #   run: |
      #     echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env
      #     echo "DB_USER=${{ secrets.DB_USER }}" >> .env
      #     echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env
      #     echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env
      #     echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env

      # - name: Fetch Copyright content from server
      #   run: |
      #     python $BASE_PATH/script_for_copyrights.py

      - name: Run utility steps
        run: |
          python $BASE_PATH/utility_scripts.py --preprocess
          python $BASE_PATH/utility_scripts.py --declutter
          python $BASE_PATH/utility_scripts.py --split
          python $BASE_PATH/utility_scripts.py --train
          python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt

      - name: Extract metrics
        id: metrics
        run: |
          echo "accuracy=$(grep "Accuracy" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "precision=$(grep "Precision" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "recall=$(grep "Recall" test_metrics.txt | awk '{print $3}')" >> $GITHUB_OUTPUT
          echo "f1=$(grep "F1 Score" test_metrics.txt | awk '{print $4}')" >> $GITHUB_OUTPUT

      - name: Upload trained model artifact
        uses: actions/upload-artifact@v4
        with:
          name: safaa-trained-model
          path: ${{ env.BASE_PATH }}/model

      - name: Move retrained model to original path
        run: |
          ORIGINAL_PATH="Safaa/src/safaa/models"
          mkdir -p "$ORIGINAL_PATH"
          cp $BASE_PATH/model/false_positive_detection_vectorizer.pkl Safaa/src/safaa/models/
          cp $BASE_PATH/model/false_positive_detection_model_sgd.pkl Safaa/src/safaa/models/

      - name: Set branch name
        id: vars
        run: echo "branch_name=$(date +'%Y%m%d-%H%M%S')" >> $GITHUB_OUTPUT

      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          branch: retrained-model-${{ steps.vars.outputs.branch_name }}
          commit-message: "Update retrained Safaa model"
          add-paths: |
            Safaa/src/safaa/models/false_positive_detection_vectorizer.pkl
            Safaa/src/safaa/models/false_positive_detection_model_sgd.pkl
          title: "Retrained model - Accuracy ${{ steps.metrics.outputs.accuracy }}, F1 ${{ steps.metrics.outputs.f1 }}"
          body: |
            This PR contains the newly retrained Safaa model.

            **Test results:**
            - Accuracy: ${{ steps.metrics.outputs.accuracy }}
            - Precision: ${{ steps.metrics.outputs.precision }}
            - Recall: ${{ steps.metrics.outputs.recall }}
            - F1 Score: ${{ steps.metrics.outputs.f1 }}

            The trained model is also available as a downloadable artifact from the workflow run.
          base: main
```

**Review comments**

- On `persist-credentials: false` (Checkout repo step): Do we need this for checking out the repo? This might be required for the PR creation step.
- On the `pip install "spacy~=3.8.7"` line: Look into …
- On lines +31 to +35 (the dependency list): Can we add these dependencies in the project …? Even if we can't, we should add all the required dependencies in the pipeline while setting up Python. From the perspective of users running the scripts from source and training locally, we need to handle the dependencies anyway; maybe add a local …
- On lines +37 to +49 (the commented-out steps): If we don't need these, the comments can be removed.
- On the `--preprocess` step: see `safaa/Safaa/src/safaa/Safaa.py` line 272 in 0c93457, function: …
- On lines +54 to +58 (the `Run utility steps` block): Umm, should everything run under one stage? Now …
- On the `Extract metrics` step: This should be a sub-step to …
- On the artifact upload step: Are we also doing model versioning somewhere? If we keep overwriting, rolling back would be an issue.
- On lines +78 to +79 (the `cp` commands): Logically the …
- On the `Set branch name` step: Follow MLOps principles and see where we can add this as a sub-step?
- On lines +91 to +93 (`add-paths`): Use relative paths, not absolute.
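The `Extract metrics` step depends on the exact output shape of `utility_scripts.py --test`: the value must be the 3rd whitespace-separated field on the `Accuracy`/`Precision`/`Recall` lines and the 4th on the `F1 Score` line. A minimal sketch of that parsing, using a *fabricated* `test_metrics.txt` in one format those `grep`/`awk` calls would accept (the real format comes from the script and is not shown in this PR):

```shell
# Fabricated sample output; the real format is produced by utility_scripts.py --test
cat > test_metrics.txt <<'EOF'
Accuracy : 0.95
Precision : 0.93
Recall : 0.94
F1 Score : 0.935
EOF

# Same extraction commands the workflow's "Extract metrics" step runs
accuracy=$(grep "Accuracy" test_metrics.txt | awk '{print $3}')
precision=$(grep "Precision" test_metrics.txt | awk '{print $3}')
recall=$(grep "Recall" test_metrics.txt | awk '{print $3}')
f1=$(grep "F1 Score" test_metrics.txt | awk '{print $4}')
echo "accuracy=$accuracy precision=$precision recall=$recall f1=$f1"
```

Note the asymmetry: `F1 Score` is two words, which shifts its value to field 4, so any change to the script's print format silently breaks these extractions.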
## `utility/retraining` README (new file, +76 lines)
## Safaa Model Retraining Pipeline

This folder provides an automated pipeline for retraining the Safaa false-positive detection model. It covers data fetching from a Fossology server instance (currently on localhost), preprocessing, model training, evaluation, and automated pull-request creation with the updated model.

### Overview

The Safaa Model Retraining Pipeline is designed to:
- Fetch copyright data from a Fossology server instance.
- Preprocess and clean the copyright data.
- Train a false-positive detection model using Safaa.
- Evaluate model performance on test data.
- Automatically create pull requests with retrained models and performance metrics.
### Features

- **Fossology Server Integration**: Fetch copyright data directly from a local Fossology instance.
- **Automated Workflow**: GitHub Actions workflow that retrains the model when triggered.
- **Data Pipeline**: Complete data preprocessing, including decluttering and splitting.
- **Model Training**: Train false-positive detection models using the SafaaAgent training script.
- **Performance Metrics**: Automatic calculation of accuracy, precision, recall, and F1 score.
- **Version Control**: Automatic PR creation with model performance in the title.

### Project Structure

```
├── .github/
│   └── workflows/
│       └── pipeline.yml                 # GitHub Actions workflow
├── utility/
│   └── retraining/
│       ├── script_for_copyrights.py     # Fossology server copyright fetch script
│       ├── utility_scripts.py           # Main retraining scripts
│       ├── data/                        # Data directory
│       │   └── copyrights_*.csv         # Fetched copyright data (copyrights_<timestamp> format)
│       └── model/                       # Trained model output
```
### How to fetch copyrights from a local Fossology instance

- To fetch copyrights from a local Fossology instance, create a `.env` file in your project root with the following database credentials:

```env
DB_NAME=your_database_name
DB_USER=your_database_user
DB_PASSWORD=your_database_password
DB_HOST=your_database_host
DB_PORT=your_database_port
```
- Start the Fossology instance locally.
- Run the `script_for_copyrights.py` script.
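Before running the fetch script, a quick shell check can confirm the `.env` file carries every key listed above. This is only a sketch; the credential values below are placeholders, not real defaults:

```shell
# Placeholder credentials for illustration only
cat > .env <<'EOF'
DB_NAME=fossology
DB_USER=fossy
DB_PASSWORD=fossy
DB_HOST=localhost
DB_PORT=5432
EOF

# Verify every expected key is present in .env
missing=0
for key in DB_NAME DB_USER DB_PASSWORD DB_HOST DB_PORT; do
  grep -q "^${key}=" .env || { echo "missing: $key"; missing=1; }
done
echo "missing=$missing"
```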
### Pipeline Usage (GitHub Actions)

The workflow can be manually triggered from the GitHub Actions tab in your repository. It will:
1. Run the preprocessing through model-training steps.
2. Run the evaluation.
3. Create a PR with the updated model.

### Model Output

The trained model produces two files:
- `false_positive_detection_vectorizer.pkl`: Text vectorizer for feature extraction
- `false_positive_detection_model_sgd.pkl`: Trained SGD classifier
### Performance Metrics

The pipeline automatically calculates and reports:
- Accuracy
- Precision
- Recall
- F1 Score
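Since F1 is the harmonic mean of precision and recall, the reported F1 can be cross-checked from the other two metrics. A small sketch with made-up values (not real pipeline output):

```shell
# Hypothetical precision/recall values for illustration
precision=0.93
recall=0.94

# F1 = 2PR / (P + R)
f1=$(awk -v p="$precision" -v r="$recall" 'BEGIN { printf "%.4f", 2 * p * r / (p + r) }')
echo "F1 = $f1"
```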
## Contact Information

- **Name**: Abdulsobur Oyewale
- **Email**: [[email protected]](mailto:[email protected])