
Data Integration Project

main.py is the file containing the solution.

Overview

This project involves integrating datasets from three distinct sources: Facebook, Google, and a website.

The source datasets are located in the sources folder of the project.

The integration is performed by a PySpark script that handles data cleaning, column renaming, joining, and coalescing.

Data Joining Strategy

1. Join Column: hash_key

The column used for joining the datasets is hash_key. This key is generated by concatenating the cleaned values of company_name, country_name, and city from each dataset and hashing the result. Each column is cleaned before hashing (see the key_cleaner(column) function). A sketch of this keying step is shown below.
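
A minimal sketch of the keying step, assuming key_cleaner lowercases, trims, and strips non-alphanumeric characters; the exact cleaning rules and hash function in main.py may differ:

```python
from pyspark.sql import functions as F
from pyspark.sql.column import Column

def key_cleaner(column: Column) -> Column:
    # Hypothetical cleaning: lowercase, trim, drop non-alphanumeric characters
    return F.regexp_replace(F.lower(F.trim(column)), r"[^a-z0-9]", "")

def with_hash_key(df):
    # Concatenate the cleaned key fields and hash them into a single join key
    return df.withColumn(
        "hash_key",
        F.sha2(
            F.concat_ws(
                "|",
                key_cleaner(F.col("company_name")),
                key_cleaner(F.col("country_name")),
                key_cleaner(F.col("city")),
            ),
            256,
        ),
    )
```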

Reason for Choice:

  • Uniqueness and Consistency: Hashing these fields together provides a unique identifier that is likely to be consistent across different datasets, assuming these fields are reliable and standardized.
  • Reduction of False Matches: Using multiple fields reduces the chances of false matches compared to using a single field.
  • Partitioning by hash_key: Repartitioning each DataFrame by hash_key before the join co-locates matching rows and reduces shuffle cost (see the sketch after this list).
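
A sketch of the repartition-and-join step; the DataFrame names and the full outer join are assumptions for illustration, since the README does not state which join type is used:

```python
# Repartition each DataFrame on the join key so matching rows are co-located
google_df = google_df.repartition("hash_key")
facebook_df = facebook_df.repartition("hash_key")
website_df = website_df.repartition("hash_key")

# Full outer join keeps records that appear in only some of the sources
joined_df = (
    google_df
    .join(facebook_df, on="hash_key", how="full_outer")
    .join(website_df, on="hash_key", how="full_outer")
)
```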

2. Resolving Data Conflicts

In the event of data conflicts after joining:

  • Preferential Source: Establish a hierarchy of trustworthiness among the sources. For example, prioritize Facebook data over Google or website data if it's considered more reliable.
  • Data Freshness: Use timestamps or other indicators to choose the most recent data.
  • Completeness: Prefer the source that provides the most complete information for the field in question.

In this project, preference was given to Google data because it appears the most consistent. However, the solution coalesces values from the other sources whenever the Google value is NULL.
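
A sketch of this preference rule using coalesce; the suffixed column names (_google, _facebook, _website) are assumptions about how the joined columns are disambiguated:

```python
from pyspark.sql import functions as F

# Prefer the Google value; fall back to Facebook, then the website, when it is NULL
resolved_df = joined_df.withColumn(
    "address",
    F.coalesce(
        F.col("address_google"),
        F.col("address_facebook"),
        F.col("address_website"),
    ),
)
```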

Reason for Approach:

  • Source Reliability: Different sources vary in accuracy and timeliness; if one source's value is empty (or NULL), we can fall back to another.
  • Recency of Information: More recent data is usually more accurate. Since these datasets carry no timestamps, we rely on consistency and completeness instead.
  • Completeness: More complete information is generally more useful for analysis and decision-making, which is the deciding factor here.

3. Handling Similar Data

When encountering similar data across sources:

  • Concatenation: For categorical fields like category, the values from the different sources are concatenated, separated by a delimiter (e.g., " | "). For future analysis it may be better to store each source's column separately; the concatenation is done purely for visibility (see the sketch after this list).
  • Coalescing Values: For fields like address, country_name, city, etc., use coalescing to keep the first non-null value in the order of source preference.
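
A sketch of the concatenation step for category; concat_ws skips NULLs, so sources with no category simply drop out of the result. The suffixed column names are assumptions, as above:

```python
from pyspark.sql import functions as F

# Combine category values from all sources into one " | "-delimited string
combined_df = joined_df.withColumn(
    "category",
    F.concat_ws(
        " | ",
        F.col("category_google"),
        F.col("category_facebook"),
        F.col("category_website"),
    ),
)
```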

Reason for Approach:

  • Richness of Data: Concatenating categorical fields provides a comprehensive view.
  • Priority to Completeness and Reliability: Coalescing ensures that the most complete and reliable information is retained, based on the established source hierarchy.

The joined result is stored in the destination folder, with each partition written to a separate file.
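
A sketch of the write step; Spark emits one part file per partition of the DataFrame, so each partition lands in its own file. The output format (CSV) and path are assumptions:

```python
# Each partition of result_df is written as a separate part-* file under destination/
result_df.write.mode("overwrite").csv("destination", header=True)
```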

