
Conversation

@smilingprogrammer

Description

This PR is part of an ongoing project to build a data pipeline for Safaa that automates most of the currently manual Safaa tasks.

Changes

  1. Created a new directory containing the ongoing work on a pipeline that automates most manual Safaa tasks.
  2. Added a script to preprocess the copyright content fetched from the Fossology server, and wired it into the pipeline through pipeline.yml.
  3. Integrated the existing decluttering script into the pipeline, and added a separate script (extra_decluter.py) that improves decluttering with regex (experimental).
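To illustrate the idea behind a regex-based extra declutter pass, here is a minimal sketch. The patterns and the `declutter` function name are hypothetical, for illustration only; they are not the actual contents of extra_decluter.py.

```python
import re

# Hypothetical patterns, illustrating the kind of noise a regex
# declutter pass could strip from fetched copyright strings.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),               # leftover HTML tags
    re.compile(r"[\x00-\x08\x0b-\x1f]"),  # stray control characters
    re.compile(r"\s{2,}"),                # runs of whitespace
]

def declutter(text: str) -> str:
    # Drop tags and control characters entirely.
    for pattern in NOISE_PATTERNS[:2]:
        text = pattern.sub("", text)
    # Collapse whitespace runs into a single space.
    text = NOISE_PATTERNS[2].sub(" ", text)
    return text.strip()
```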

How to test for script_for_copyright.py

  1. Start a Fossology server instance on localhost.
  2. Upload a project zip file to scan its copyright.
  3. Create a .env file to store your server variables: DB_NAME, DB_USER, DB_PASSWORD, DB_HOST, DB_PORT.
  4. Run the .py script directly from its folder.
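A minimal .env sketch; the variable names come from the list above, and every value here is a placeholder to adjust for your local Fossology setup:

```
# .env — placeholder values only; match them to your local Fossology install
DB_NAME=fossology
DB_USER=fossy
DB_PASSWORD=fossy
DB_HOST=localhost
DB_PORT=5432
```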

How to test for preprocessing_script.py

  1. Place copyrights.csv, obtained from the Fossology server, in the data directory (an example dataset is already included).
  2. Run the preprocessing script.
  3. Trigger it on GitHub Actions under the Pipeline Script workflow.

@smilingprogrammer changed the title from "Copyright pipeline" to "feat(pipeline): Data Pipeline for Safaa main PR" on Jul 9, 2025
Member

@Kaushl2208 left a comment


Hey @smilingprogrammer ,

There is a big gap between what we were planning to do and this version. I have left comments where I think things could be done differently and, perhaps, in a better way.

Also, regarding Safaa/src/safaa/pipeline_dir/data/copyrights.csv: why are we committing this CSV file?

Take a closer look at all the requested changes and try to align with the priorities.

@smilingprogrammer
Author


Okay, I will make adjustments based on the comments. As for the file, as indicated in the PR details, it is just a dummy example for trying out the functionality.

@smilingprogrammer
Author

Thank you very much for the feedback, it was really insightful. I have made corrections in the places you noted. For the two comments I have not marked resolved, I have a few small questions before pushing what I have in my local environment, and will ask them in the next meeting.

Thank you once again, very much appreciated.

Member

@Kaushl2208 left a comment


Hey @smilingprogrammer ,

I have requested some changes. Please take a look at them and let me know if you need anything clarified.

PS:

  1. We don't need the utility/retraining/model/* directories at all, so please remove everything from there, along with whatever dependency the pipeline holds on them.
  2. I can see 21 commits in the PR; please squash them. Follow the Contributing Guidelines.
  3. Don't forget to include the copyright-pulling step in the pipeline, otherwise it will be left as a manual step. All the required inputs can be provided as environment variables.
  4. Make the PR cleaner; there are a lot of unwanted changes in it.

Comment on lines +37 to +49
```yaml
# - name: Install dependencies
#   run: |
#     pip install -r $BASE_PATH/requirements.txt
# - name: Create .env file
#   run: |
#     echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env
#     echo "DB_USER=${{ secrets.DB_USER }}" >> .env
#     echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env
#     echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env
#     echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env

# - name: Fetch Copyright content from server
#   run: |
```
Member


If we don't need these lines, the comments can be removed.

Comment on lines +31 to +35
```shell
pip install "spacy~=3.8.7" \
    "pandas~=2.2.3" \
    "psycopg2~=2.9.10" \
    "python-dotenv~=1.1.0" \
    "scikit-learn~=1.7.0"
```
Member


Can we add these dependencies to the project's toml file, so we can install them with the required flag?
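A sketch of what that could look like as an optional-dependencies extra. The extra name `pipeline` is hypothetical, and whether this fits Safaa's actual pyproject.toml layout is an assumption; the versions mirror the snippet above:

```toml
# Hypothetical sketch — adjust to Safaa's actual pyproject.toml
[project.optional-dependencies]
pipeline = [
    "spacy~=3.8.7",
    "pandas~=2.2.3",
    "psycopg2~=2.9.10",
    "python-dotenv~=1.1.0",
    "scikit-learn~=1.7.0",
]
```

The pipeline could then install them with something like `pip install ".[pipeline]"` from the repository root.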

Member


Even if we can't, we should add all the required dependencies in the pipeline while setting up Python.

From the perspective of users running the scripts from source and training locally, we need to handle the dependencies anyway. Maybe add a local requirements.txt that satisfies both utility and script_for_copyright.


```yaml
- name: Install dependencies
  run: |
    pip install "spacy~=3.8.7" \
```
Member


Look into actions/cache to cache the pip dependencies; it will speed up execution and reduce bandwidth usage too.
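One low-friction option is actions/setup-python's built-in pip caching, which wraps actions/cache keyed on the hash of the requirements file. A sketch, assuming a requirements.txt at the repository root and an illustrative Python version:

```yaml
# Sketch — python-version and the requirements.txt path are assumptions
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"
    cache: "pip"
    cache-dependency-path: requirements.txt
```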

```yaml
- name: Checkout repo
  uses: actions/checkout@v4
  with:
    persist-credentials: false
```
Member


Do we need this for checking out the repo? It might be required for the PR-creation step, though.

Comment on lines +54 to +58
```shell
python $BASE_PATH/utility_scripts.py --preprocess
python $BASE_PATH/utility_scripts.py --declutter
python $BASE_PATH/utility_scripts.py --split
python $BASE_PATH/utility_scripts.py --train
python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt
```
Member


Hmm, should everything run under one stage? preprocess belongs to a Data Preprocessing step, while train and test belong to a Model Training and Testing step (split can be part of that as well).

Also, the --train flag currently trains the false-positive detector. The NER model also has train and test steps (maybe we can include those as well, if that makes sense?).
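The split described above could look roughly like this. A sketch only: the step names are illustrative, and $BASE_PATH follows the PR's existing workflow:

```yaml
# Sketch — one possible grouping of the existing flags into two steps
- name: Data preprocessing
  run: |
    python $BASE_PATH/utility_scripts.py --preprocess
    python $BASE_PATH/utility_scripts.py --declutter
- name: Model training and testing
  run: |
    python $BASE_PATH/utility_scripts.py --split
    python $BASE_PATH/utility_scripts.py --train
    python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt
```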

@@ -0,0 +1 @@
(binary model data — "moves" transition table with "Copyright" labels; not human-readable) No newline at end of file
Member


Unwanted file. Please remove it.

Comment on lines +4 to +9
```python
import os
from dotenv import load_dotenv
import psycopg2
import pandas as pd
from datetime import datetime
import argparse
```
Member


Make sure we have all the required dependencies in the env?

Comment on lines +107 to +108
```python
if __name__ == "__main__":
    fetch_copyright_data()
```
Member


Wait, where are we using this file in our pipeline? There should be a step in the pipeline itself that pulls the new data as soon as the workflow is dispatched, right?
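Such a step could be sketched roughly as follows. This is hypothetical: it assumes script_for_copyright.py reads the DB_* variables from the environment (rather than only from a .env file), and reuses the secret names from the commented-out workflow above:

```yaml
# Hypothetical step — assumes the script picks up DB_* from the environment
- name: Fetch copyright data from server
  env:
    DB_NAME: ${{ secrets.DB_NAME }}
    DB_USER: ${{ secrets.DB_USER }}
    DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
    DB_HOST: ${{ secrets.DB_HOST }}
    DB_PORT: ${{ secrets.DB_PORT }}
  run: python $BASE_PATH/script_for_copyright.py
```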

Comment on lines +59 to +64
```python
if args.preprocess:
    latest_file = find_latest_copyright_file(data_dir)
    raw_df = load_data(latest_file)
    raw_df['copyright'] = preprocess_data(agent, raw_df['copyright'])
    save_to_csv(raw_df, os.path.join(data_dir, "preprocessed_copyrights.csv"))
    print("✅ Preprocessing completed")
```
Member


We might not need this at all; the train step already preprocesses the data.

Comment on lines +99 to +103
```python
print("✅ Evaluation on test set:")
print(f"  Accuracy : {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall   : {recall:.4f}")
print(f"  F1 Score : {f1:.4f}")
```
Member


Maybe write it to a JSON file instead? Then we can store it in our artifacts along with the correct model version.
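One way that suggestion could be sketched. Illustrative only: the `save_metrics` helper, the `model_version` field, and the output path are hypothetical, not part of the PR's code:

```python
import json

def save_metrics(accuracy, precision, recall, f1,
                 model_version="0.0.0", path="test_metrics.json"):
    """Write evaluation metrics to a JSON file suitable for upload
    as a workflow artifact (names here are hypothetical)."""
    metrics = {
        "model_version": model_version,
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
    }
    with open(path, "w") as fh:
        json.dump(metrics, fh, indent=2)
    return metrics
```

The pipeline could then pick up test_metrics.json with actions/upload-artifact instead of parsing printed text.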

@smilingprogrammer
Author

smilingprogrammer commented Sep 30, 2025

Thanks for the feedback, @Kaushl2208. Will make the necessary adjustments.

