
Conversation

@smilingprogrammer

Description

This PR is part of an ongoing project to build a data pipeline for Safaa that automates most of the currently manual Safaa tasks.

Changes

  1. Created a new directory containing the ongoing work on a pipeline that automates most manual Safaa tasks.
  2. Added a script to preprocess the copyright content fetched from the Fossology server, and wired it into the pipeline through pipeline.yml.
  3. Integrated the existing decluttering script into the pipeline, and added a separate script (extra_decluter.py) that improves decluttering with regex (experimental).
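To illustrate the idea behind a regex-based extra declutter pass, here is a minimal sketch. The patterns and the `declutter` function name are hypothetical, for illustration only; they are not the actual contents of extra_decluter.py.

```python
import re

# Hypothetical patterns, illustrating the kind of noise a regex
# declutter pass could strip from fetched copyright strings.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),               # leftover HTML tags
    re.compile(r"[\x00-\x08\x0b-\x1f]"),  # stray control characters
    re.compile(r"\s{2,}"),                # runs of whitespace
]

def declutter(text: str) -> str:
    # Drop tags and control characters entirely.
    for pattern in NOISE_PATTERNS[:2]:
        text = pattern.sub("", text)
    # Collapse whitespace runs into a single space.
    text = NOISE_PATTERNS[2].sub(" ", text)
    return text.strip()
```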

How to test for script_for_copyright.py

  1. Start a Fossology server instance on localhost.
  2. Upload a project zip file to scan its copyright.
  3. Create a .env file to store your server variables: DB_NAME, DB_USER, DB_PASSWORD, DB_HOST, DB_PORT.
  4. Run the .py script directly from its folder.
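A minimal .env sketch; the variable names come from the list above, and every value here is a placeholder to adjust for your local Fossology setup:

```
# .env — placeholder values only; match them to your local Fossology install
DB_NAME=fossology
DB_USER=fossy
DB_PASSWORD=fossy
DB_HOST=localhost
DB_PORT=5432
```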

How to test for preprocessing_script.py

  1. Place copyrights.csv, obtained from the Fossology server, in the data directory (an example dataset is already included).
  2. Run the preprocessing script.
  3. Trigger it on GitHub Actions under the Pipeline Script workflow.

@smilingprogrammer changed the title from "Copyright pipeline" to "feat(pipeline): Data Pipeline for Safaa main PR" on Jul 9, 2025
Member

@Kaushl2208 left a comment


Hey @smilingprogrammer ,

There is a big gap between what we were planning to do and this version. I have left comments where I think things could be done differently and, perhaps, in a better way.

Also, regarding Safaa/src/safaa/pipeline_dir/data/copyrights.csv: why are we committing this CSV file?

Take a closer look at all the requested changes and try to align with the priorities.

@smilingprogrammer
Author


Okay, I will make adjustments based on the comments. As for the file, as indicated in the PR details, it is just a dummy example for trying out the functionality.

@smilingprogrammer
Author

Thank you very much for the feedback, it was really insightful. I have made corrections in the places you noted. For the two comments I have not marked resolved, I have a few small questions before pushing what I have in my local environment, and will ask them in the next meeting.

Thank you once again, very much appreciated.

Member

@Kaushl2208 left a comment


Hey @smilingprogrammer ,

I have requested some changes. Please take a look at them and let me know if you need anything clarified.

PS:

  1. We don't need the utility/retraining/model/* directories at all, so please remove everything from there, along with whatever dependency the pipeline holds on them.
  2. I can see 21 commits in the PR; please squash them. Follow the Contributing Guidelines.
  3. Don't forget to include the copyright-pulling step in the pipeline, otherwise it will be left as a manual step. All the required inputs can be provided as environment variables.
  4. Make the PR cleaner; there are a lot of unwanted changes in it.

Comment on lines +37 to +49
```yaml
# - name: Install dependencies
#   run: |
#     pip install -r $BASE_PATH/requirements.txt
# - name: Create .env file
#   run: |
#     echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env
#     echo "DB_USER=${{ secrets.DB_USER }}" >> .env
#     echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env
#     echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env
#     echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env

# - name: Fetch Copyright content from server
#   run: |
```
Member


If we don't need these lines, the comments can be removed.

Comment on lines +31 to +35
```shell
pip install "spacy~=3.8.7" \
    "pandas~=2.2.3" \
    "psycopg2~=2.9.10" \
    "python-dotenv~=1.1.0" \
    "scikit-learn~=1.7.0"
```
Member


Can we add these dependencies to the project's toml file, so we can install them with the required flag?
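A sketch of what that could look like as an optional-dependencies extra. The extra name `pipeline` is hypothetical, and whether this fits Safaa's actual pyproject.toml layout is an assumption; the versions mirror the snippet above:

```toml
# Hypothetical sketch — adjust to Safaa's actual pyproject.toml
[project.optional-dependencies]
pipeline = [
    "spacy~=3.8.7",
    "pandas~=2.2.3",
    "psycopg2~=2.9.10",
    "python-dotenv~=1.1.0",
    "scikit-learn~=1.7.0",
]
```

The pipeline could then install them with something like `pip install ".[pipeline]"` from the repository root.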

Member


Even if we can't, we should add all the required dependencies in the pipeline while setting up Python.

From the perspective of users running the scripts from source and training locally, we need to handle the dependencies anyway. Maybe add a local requirements.txt that satisfies both utility and script_for_copyright.


```yaml
- name: Install dependencies
  run: |
    pip install "spacy~=3.8.7" \
```
Member


Look into actions/cache to cache the pip dependencies; it will speed up execution and reduce bandwidth usage too.
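One low-friction option is actions/setup-python's built-in pip caching, which wraps actions/cache keyed on the hash of the requirements file. A sketch, assuming a requirements.txt at the repository root and an illustrative Python version:

```yaml
# Sketch — python-version and the requirements.txt path are assumptions
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"
    cache: "pip"
    cache-dependency-path: requirements.txt
```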

```yaml
- name: Checkout repo
  uses: actions/checkout@v4
  with:
    persist-credentials: false
```
Member


Do we need this for checking out the repo? It might be required for the PR-creation step, though.

Comment on lines +54 to +58
```shell
python $BASE_PATH/utility_scripts.py --preprocess
python $BASE_PATH/utility_scripts.py --declutter
python $BASE_PATH/utility_scripts.py --split
python $BASE_PATH/utility_scripts.py --train
python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt
```
Member


Hmm, should everything run under one stage? preprocess belongs to a Data Preprocessing step, while train and test belong to a Model Training and Testing step (split can be part of that as well).

Also, the --train flag currently trains the false-positive detector. The NER model also has train and test steps (maybe we can include those as well, if that makes sense?).
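The split described above could look roughly like this. A sketch only: the step names are illustrative, and $BASE_PATH follows the PR's existing workflow:

```yaml
# Sketch — one possible grouping of the existing flags into two steps
- name: Data preprocessing
  run: |
    python $BASE_PATH/utility_scripts.py --preprocess
    python $BASE_PATH/utility_scripts.py --declutter
- name: Model training and testing
  run: |
    python $BASE_PATH/utility_scripts.py --split
    python $BASE_PATH/utility_scripts.py --train
    python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt
```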

@@ -0,0 +1 @@
(binary model data — "moves" transition table with "Copyright" labels; not human-readable) No newline at end of file
Member


Unwanted file. Please remove it.

Comment on lines +4 to +9
```python
import os
from dotenv import load_dotenv
import psycopg2
import pandas as pd
from datetime import datetime
import argparse
```
Member


Make sure we have all the required dependencies in the env?

Comment on lines +107 to +108
```python
if __name__ == "__main__":
    fetch_copyright_data()
```
Member


Wait, where are we using this file in our pipeline? There should be a step in the pipeline itself that pulls the new data as soon as the workflow is dispatched, right?
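Such a step could be sketched roughly as follows. This is hypothetical: it assumes script_for_copyright.py reads the DB_* variables from the environment (rather than only from a .env file), and reuses the secret names from the commented-out workflow above:

```yaml
# Hypothetical step — assumes the script picks up DB_* from the environment
- name: Fetch copyright data from server
  env:
    DB_NAME: ${{ secrets.DB_NAME }}
    DB_USER: ${{ secrets.DB_USER }}
    DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
    DB_HOST: ${{ secrets.DB_HOST }}
    DB_PORT: ${{ secrets.DB_PORT }}
  run: python $BASE_PATH/script_for_copyright.py
```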

Comment on lines +59 to +64
```python
if args.preprocess:
    latest_file = find_latest_copyright_file(data_dir)
    raw_df = load_data(latest_file)
    raw_df['copyright'] = preprocess_data(agent, raw_df['copyright'])
    save_to_csv(raw_df, os.path.join(data_dir, "preprocessed_copyrights.csv"))
    print("✅ Preprocessing completed")
```
Member


We might not need this at all; the train step already preprocesses the data.

Comment on lines +99 to +103
```python
print("✅ Evaluation on test set:")
print(f"  Accuracy : {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall   : {recall:.4f}")
print(f"  F1 Score : {f1:.4f}")
```
Member


Maybe write it to a JSON file instead? Then we can store it in our artifacts along with the correct model version.
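One way that suggestion could be sketched. Illustrative only: the `save_metrics` helper, the `model_version` field, and the output path are hypothetical, not part of the PR's code:

```python
import json

def save_metrics(accuracy, precision, recall, f1,
                 model_version="0.0.0", path="test_metrics.json"):
    """Write evaluation metrics to a JSON file suitable for upload
    as a workflow artifact (names here are hypothetical)."""
    metrics = {
        "model_version": model_version,
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
    }
    with open(path, "w") as fh:
        json.dump(metrics, fh, indent=2)
    return metrics
```

The pipeline could then pick up test_metrics.json with actions/upload-artifact instead of parsing printed text.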

@smilingprogrammer
Author

smilingprogrammer commented Sep 30, 2025

Thanks for the feedback, @Kaushl2208. Will make the necessary adjustments.

