This is my private repository, inspired by a career-related challenge; it is still in the development stage. The project implements an ETL pipeline that extracts data from NHTSA, a public API whose datasets come either from API endpoints or from ZIP files.
The language used is Python, with the following modules:
- Python 3.9
- pyspark
- pandas
- configparser
- json
- sqlalchemy
- concurrent.futures
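
Note that `configparser`, `json`, and `concurrent.futures` ship with the Python standard library; only the third-party packages need installing, e.g.:

```bash
pip install pyspark pandas sqlalchemy
```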
```
📦NHTSA
 ┣ 📂Complaints
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜config.ini
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_complaints.py
 ┃ ┣ 📜load_complaints.py
 ┃ ┣ 📜main_complaints.py
 ┃ ┗ 📜transform_complaints.py
 ┣ 📂Investigations
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_investigations.py
 ┃ ┣ 📜load_investigations.py
 ┃ ┣ 📜main_investigations.py
 ┃ ┗ 📜transform_investigations.py
 ┣ 📂ManufacturerCommunications
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┗ 📂out
 ┣ 📂Ratings
 ┃ ┣ 📂config
 ┃ ┃ ┗ 📜.env
 ┃ ┣ 📂in
 ┃ ┣ 📂out
 ┃ ┣ 📜extract_ratings.py
 ┃ ┣ 📜load_ratings.py
 ┃ ┣ 📜main_ratings.py
 ┃ ┗ 📜transform_ratings.py
 ┗ 📂Recalls
   ┣ 📂config
   ┃ ┗ 📜.env
   ┣ 📂in
   ┣ 📂out
   ┣ 📜extract_recalls.py
   ┣ 📜load_recalls.py
   ┣ 📜main_recalls.py
   ┗ 📜transform_recalls.py
```
Each sub-folder corresponds to one NHTSA dataset source.
To start a pipeline, run `python .\main_*.py`. Each `main_*.py` initializes the pipeline in its proper order: `extract_*.py` -> `transform_*.py` -> `load_*.py`.
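
Each `main_*.py` is a thin orchestrator. A minimal sketch for the Complaints pipeline (the `extract`/`transform`/`load` function names are assumptions about the modules' entry points):

```python
# main_complaints.py -- illustrative sketch of the orchestration
from extract_complaints import extract
from transform_complaints import transform
from load_complaints import load

def main():
    extract()    # download the raw sources into /in
    transform()  # clean the data and write the result to /out
    load()       # push the transformed DataFrame into the database

if __name__ == "__main__":
    main()
```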
- `extract_*.py` (sketched below)
  - Retrieve the JSON from the API or the ZIP file from the URL
  - Save the sources to `/in`
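
A minimal sketch of the extract step, using only the standard library. The NHTSA endpoint and query parameters shown are assumptions for illustration; the real script may use a different URL or a ZIP download:

```python
# extract_complaints.py -- illustrative sketch, not the repository's actual code
import json
import urllib.parse
import urllib.request

API_URL = "https://api.nhtsa.gov/complaints/complaintsByVehicle"  # assumed endpoint

def extract(make="honda", model="accord", model_year="2019"):
    # Query the public API and persist the raw JSON to /in for the transform step
    query = urllib.parse.urlencode({"make": make, "model": model, "modelYear": model_year})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as response:
        payload = json.load(response)
    with open("in/complaints.json", "w", encoding="utf-8") as f:
        json.dump(payload, f)
```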
- `transform_*.py` (sketched below)
  - Get the file from `/in`
  - Load it into a DataFrame via pyspark
  - Apply the transformations to the data
  - Convert to a pandas DataFrame
  - Write the DataFrame to `/out`
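
A minimal sketch of the transform step; the file paths and the example transformation are assumptions:

```python
# transform_complaints.py -- illustrative sketch
from pyspark.sql import SparkSession

def transform():
    spark = SparkSession.builder.appName("nhtsa-complaints").getOrCreate()
    # Read the raw JSON that the extract step dropped into /in
    df = spark.read.json("in/complaints.json", multiLine=True)
    # Example transformations: normalize column names, drop exact duplicates
    df = df.toDF(*[c.lower() for c in df.columns]).dropDuplicates()
    # Hand off as a pandas DataFrame and persist it to /out for the load step
    df.toPandas().to_csv("out/complaints.csv", index=False)
    spark.stop()
```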
- `load_*.py` (sketched below)
  - Get the DataFrame produced by `transform_*.py`
  - Upload it to a database
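
A minimal sketch of the load step; the connection string, table name, and file path are placeholders:

```python
# load_complaints.py -- illustrative sketch
import pandas as pd
from sqlalchemy import create_engine

def load():
    # Pick up the transformed DataFrame written to /out
    df = pd.read_csv("out/complaints.csv")
    # Placeholder SQL Server connection string -- adjust driver and credentials
    engine = create_engine(
        "mssql+pyodbc://user:password@localhost/nhtsa"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )
    df.to_sql("complaints", engine, if_exists="replace", index=False)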
- `config/` holds the configuration variables (example below)
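
The Complaints pipeline reads `config/config.ini` at startup with `configparser` (the other folders hold a `.env` instead); the section and key names below are assumptions:

```python
# Hypothetical config/config.ini contents:
#
#   [api]
#   url = https://api.nhtsa.gov/complaints/complaintsByVehicle
#
#   [database]
#   connection_string = mssql+pyodbc://user:password@localhost/nhtsa
import configparser

config = configparser.ConfigParser()
config.read("config/config.ini")

api_url = config["api"]["url"]
db_connection = config["database"]["connection_string"]
```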
To do:
- Switch the requests to multi-threaded requests to reduce extraction time (sketched below)
- Create a `.env` file for each folder
- Transform the data into the correct format
- Replace null values with empty strings
- Save the DataFrame locally
- Check the data and its transformations
- Upload the DataFrame to SQL Server
- Create DataFrames with indexes (some are missing)
- See if it is possible to create foreign keys to link the tables in SQL
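
For the multi-threading item above, a sketch of how the extraction requests could be parallelized with `concurrent.futures` (already in the module list); the `fetch` helper and the parameter sets are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import json
import urllib.parse
import urllib.request

API_URL = "https://api.nhtsa.gov/complaints/complaintsByVehicle"  # assumed endpoint

def fetch(params):
    # One API request per (make, model, modelYear) combination
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as response:
        return json.load(response)

def fetch_all(param_sets, max_workers=8):
    # Run the requests concurrently; I/O-bound work benefits from threads
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, p) for p in param_sets]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```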