-
Notifications
You must be signed in to change notification settings - Fork 6
feat(pipeline): Data Pipeline for Safaa main PR #22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(pipeline): Data Pipeline for Safaa main PR #22
Conversation
…nt, and added it into the pipeline
…t, and added it into the pipeline, I also included an upgraded declutter script with regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @smilingprogrammer ,
There is a lot of gap in what we were planning to do and this version. I have left comments where I could understand things could be done differently and in a better way maybe.
Also: Safaa/src/safaa/pipeline_dir/data/copyrights.csv why are we commiting this CSV file?
Take a closer look to all the requested changes and try to align with the priorities.
Okay, will make adjustment to the comments. As for the file, as indicated in the PR details, it's just a dummy example to try the functionalities out. |
…rom the fossology serv
… various functionalities
|
Thank you very much for the feedbacks, it was really insightful. I have made corrections to the places you noted. For the 2 others i haven't marked resolved, i have just some little questions about it before pushing what i have for it in my local environment, and will be asking in the next meeting. Thank you once again, very much appreciated. |
27dab47 to
3ed9383
Compare
…dependency installation
…retraining script
…ility path for easier pipeline metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @smilingprogrammer ,
I have requested some changes, Please take a look into them and let me know if you need to understand anything better.
PS:
- We dont need:
utility/retraining/model/*directories at all. So please remove everything from there and whatever dependency that the pipeline holds. - I can see 21 commits in the PR, Please squash the commits. Follow: Contributing Guidelines
- Dont forget to include the pulling copyrights step in the pipeline. Else that will be left as a manual step. All the requred inputs can be provided as env variables.
- Make the PR cleaner, There are a lot of unwanted changes in there.
| # - name: Install dependencies | ||
| # run: | | ||
| # pip install -r $BASE_PATH/requirements.txt | ||
| # - name: Create .env file | ||
| # run: | | ||
| # echo "DB_NAME=${{ secrets.DB_NAME }}" >> .env | ||
| # echo "DB_USER=${{ secrets.DB_USER }}" >> .env | ||
| # echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env | ||
| # echo "DB_HOST=${{ secrets.DB_HOST }}" >> .env | ||
| # echo "DB_PORT=${{ secrets.DB_PORT }}" >> .env | ||
|
|
||
| # - name: Fetch Copyright content from server | ||
| # run: | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we dont need these, Comments can be removed.
| pip install "spacy~=3.8.7" \ | ||
| "pandas~=2.2.3" \ | ||
| "psycopg2~=2.9.10" \ | ||
| "python-dotenv~=1.1.0" \ | ||
| "scikit-learn~=1.7.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add these dependencies in the project toml file? We can install them with required flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we cant, Then we should add all the required dependencies in the pipeline while setting up python.
With the perspective users using the scripts from source and locally training --> We need to handle the dependencies anyways? maybe add a local requirements.txt which satisfies both utility and script_for_copyright
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| pip install "spacy~=3.8.7" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look into actions/cache to cache pip dependencies which will increase the the speed of execution and reduce bandwidth too,
| - name: Checkout repo | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| persist-credentials: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this for checking out the repo? This might be required for the PR creation step.
| python $BASE_PATH/utility_scripts.py --preprocess | ||
| python $BASE_PATH/utility_scripts.py --declutter | ||
| python $BASE_PATH/utility_scripts.py --split | ||
| python $BASE_PATH/utility_scripts.py --train | ||
| python $BASE_PATH/utility_scripts.py --test | tee test_metrics.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm, Should everything run under one stage?? preprocess comes under Data Preprocessing step, test and train comes under Model Training and Testing Step (split can also be part of it as well)
Now --train flag is training the false positive detector. NER also has a train and test steps (maybe we can include that as well if that makes sense?)
| @@ -0,0 +1 @@ | |||
| ��moves�x{"0":{},"1":{"Copyright":36194},"2":{"Copyright":36194},"3":{"Copyright":36194},"4":{"Copyright":36194,"":1},"5":{"":1}}�cfg��neg_key� No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unwanted file. Please remove
| import os | ||
| from dotenv import load_dotenv | ||
| import psycopg2 | ||
| import pandas as pd | ||
| from datetime import datetime | ||
| import argparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure we have all the required dependencies in the env?
| if __name__ == "__main__": | ||
| fetch_copyright_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, Where are we using this file in our pipeline?
There should be a step in the pipeline itself to pull the new data as soon as the workflow is dispatched right?
| if args.preprocess: | ||
| latest_file = find_latest_copyright_file(data_dir) | ||
| raw_df = load_data(latest_file) | ||
| raw_df['copyright'] = preprocess_data(agent, raw_df['copyright']) | ||
| save_to_csv(raw_df, os.path.join(data_dir, "preprocessed_copyrights.csv")) | ||
| print("✅ Preprocessing completed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might not need it at all. Train step preprocess data already
| print("✅ Evaluation on test set:") | ||
| print(f" Accuracy : {accuracy:.4f}") | ||
| print(f" Precision: {precision:.4f}") | ||
| print(f" Recall : {recall:.4f}") | ||
| print(f" F1 Score : {f1:.4f}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe write it in a json file? That we can use and store in our artifacts with the correct model version?
|
Thanks for the feedback @Kaushl2208 . Will make the necessary adjustments |
Description
This PR is an ongoing project on the Data Pipeline for Safaa that allows us to automate most manual Safaa tasks
Changes
extra_decluter.py) for improving our decluttering using regex (Experimental purpose).How to test for
script_for_copyright.py.envto store your server variables, which are [DB_NAME, DB_USER, DB_PASSWORD, DB_HOST, DB_PORT]How to test for
preprocessing_script.pycopyrights.csvobtained from the fossology server in the data directory (An example dataset is already available)Pipeline Script