
ETL Pipeline 📎

This is my private repository, inspired by a career-related challenge. It is still in the development stage. The project implements an ETL pipeline that extracts data from the NHTSA public APIs; the datasets come from several sources, served either through an API or as a ZIP file.

The project is written in Python, using the following modules:

  • Python 3.9
  • pyspark
  • pandas
  • configparser
  • json
  • sqlalchemy
  • concurrent.futures

Project Tree

📦NHTSA
 ┣ 📂Complaints
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜config.ini
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_complaints.py
 ┃ ┣ 📜load_complaints.py
 ┃ ┣ 📜main_complaints.py
 ┃ ┗ 📜transform_complaints.py
 ┣ 📂Investigations
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_investigations.py
 ┃ ┣ 📜load_investigations.py
 ┃ ┣ 📜main_investigations.py
 ┃ ┗ 📜transform_investigations.py
 ┣ 📂ManufacturerCommunications
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┣ 📂Ratings
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_ratings.py
 ┃ ┣ 📜load_ratings.py
 ┃ ┣ 📜main_ratings.py
 ┃ ┗ 📜transform_ratings.py
 ┗ 📂Recalls
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_recalls.py
 ┃ ┣ 📜load_recalls.py
 ┃ ┣ 📜main_recalls.py
 ┃ ┗ 📜transform_recalls.py

Each sub-folder corresponds to one NHTSA dataset source.

To start a pipeline, run `python .\main_*.py` (for example, `python .\main_complaints.py`).
`main_*.py` initializes the pipeline and runs its stages in order:
 - `extract_*.py` -> `transform_*.py` -> `load_*.py`

extract_*.py

- Retrieve the JSON from the API, or download the ZIP from its URL
- Save the raw sources to /in
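
A minimal sketch of this step (assuming `requests` for HTTP, which the README does not list; the URLs and file names are hypothetical, since the real ones come from each pipeline's config):

```python
import io
import json
import zipfile
from pathlib import Path

import requests  # assumption: not in the README's module list

IN_DIR = Path("in")  # landing folder for the raw sources

def extract_json(api_url: str, file_name: str) -> None:
    """Retrieve a JSON payload from the API and save it to /in."""
    response = requests.get(api_url, timeout=60)
    response.raise_for_status()
    (IN_DIR / file_name).write_text(json.dumps(response.json()))

def extract_zip(zip_url: str) -> None:
    """Download a ZIP source and unpack its contents into /in."""
    response = requests.get(zip_url, timeout=120)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(IN_DIR)
```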

transform_*.py

- Read the file from /in
- Load it into a Spark DataFrame via pyspark
- Apply the transformations to the data
- Convert the result to a pandas DataFrame
- Save the DataFrame to /out
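
A minimal sketch of this stage, assuming the raw source is JSON and the output is persisted as CSV (both are assumptions; the actual formats depend on the dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nhtsa-transform").getOrCreate()

def transform(in_path: str, out_path: str) -> None:
    # Read the raw JSON from /in into a Spark DataFrame
    df = spark.read.json(in_path)

    # Example transformation: replace nulls with empty strings,
    # as the roadmap describes (real transformations are dataset-specific)
    df = df.fillna("")

    # Convert to pandas and save under /out (CSV chosen here for illustration)
    df.toPandas().to_csv(out_path, index=False)
```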

load_*.py

- Take the DataFrame produced by transform_*.py
- Load it into the database
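
A minimal sketch using sqlalchemy and `pandas.DataFrame.to_sql` (the table name and connection string are hypothetical):

```python
import pandas as pd
from sqlalchemy import create_engine

def load(df: pd.DataFrame, table_name: str, connection_string: str) -> None:
    """Write the transformed DataFrame to a database table."""
    engine = create_engine(connection_string)
    # if_exists="replace" recreates the table on every run; adjust as needed
    df.to_sql(table_name, engine, if_exists="replace", index=False)
```

For the SQL Server target on the roadmap, the connection string would look something like `mssql+pyodbc://user:password@server/database?driver=ODBC+Driver+17+for+SQL+Server`, assuming the pyodbc driver is installed.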

.env

- Holds the configuration variables for each pipeline (the Complaints pipeline uses config/config.ini instead)
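
A minimal sketch of reading the settings with configparser, which is in the module list (the section and key names below are hypothetical):

```python
import configparser

# Load the pipeline settings from config/config.ini
config = configparser.ConfigParser()
config.read("config/config.ini")

api_url = config["api"]["url"]                       # hypothetical key
db_connection = config["database"]["connection_string"]  # hypothetical key
```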

Roadmap

  • Change the requests to multithreaded requests with concurrent.futures, to reduce extraction time (see the sketch after this list)
  • Create a .env file for each folder
  • Transform the data to the correct format
  • Replace null values with empty strings
  • Save the DataFrame locally
  • Check the data and its transformations
  • Upload the DataFrame to MS SQL Server
  • Create the DataFrames with indexes (some are still missing)
  • See if foreign keys can be created to link the tables in SQL
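
The first roadmap item is what concurrent.futures in the module list is for. A minimal sketch of parallelizing the API requests with a thread pool (the `fetch` helper and the URL list are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests  # assumption, as in the extract sketch

def fetch(url: str) -> dict:
    """Fetch one JSON payload from the API."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.json()

def fetch_all(urls: list[str], max_workers: int = 8) -> list[dict]:
    """Run the API requests in parallel instead of one by one."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```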
