Titanic---Machine-Learning-from-Disaster

A data-driven analysis and machine learning model to predict passenger survival in the Titanic disaster using Python, Pandas, and Scikit-learn.

Titanic Survival Prediction

This project is a classic machine learning challenge from a Kaggle competition. The goal is to build a model that predicts whether a passenger on the RMS Titanic survived the infamous 1912 disaster, based on a given set of passenger data.

Competition Link: Titanic - Machine Learning from Disaster

Project Overview

The sinking of the Titanic is one of the most notorious shipwrecks in history. Although luck played a part in survival, some groups of people were more likely to survive than others. This project uses passenger data (e.g., name, age, gender, and socio-economic class) to build a machine learning model that predicts survival outcomes.

This end-to-end project demonstrates a complete data science pipeline, including:

Data Cleaning and Preprocessing
Exploratory Data Analysis (EDA)
Feature Engineering
Model Training and Evaluation
Submission Generation

Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn

Project Workflow

The project followed these steps, each documented in the notebook:

Data Loading & Initial Inspection: Loaded train.csv and test.csv datasets and examined their structure, datatypes, and initial statistics.
Data Cleaning: Handled missing values in critical columns like Age, Embarked, and Cabin.
Exploratory Data Analysis (EDA): Used visualizations to understand the relationships between different features and the survival outcome.
Feature Engineering: Created new, more informative features from existing ones to improve model performance (e.g., FamilySize, Title).
Model Building: Trained several classification algorithms on the prepared data.
Model Evaluation: Assessed model performance using cross-validation and accuracy metrics.
Submission: Used the best-performing model to make predictions on the test dataset and generated the submission file.
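
As a minimal sketch of the final submission step, the snippet below builds a two-column file with `PassengerId` and `Survived`, the layout the Kaggle competition expects. The IDs and predictions are hypothetical placeholders, not output from this project's model.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: PassengerId values from test.csv and a model's 0/1 predictions.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 1, 0])

# One row per test passenger, with the predicted label stored as an integer.
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# Without a path, to_csv returns the CSV text; pass a filename to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the real workflow, `predictions` would come from calling `predict` on the prepared test features.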

Data Dictionary

Variable	Definition	Key
`Survived`	Survival	0 = No, 1 = Yes
`Pclass`	Ticket class (Proxy for socio-economic status)	1 = 1st, 2 = 2nd, 3 = 3rd
`Sex`	Sex	`male`, `female`
`Age`	Age in years
`SibSp`	# of siblings / spouses aboard the Titanic
`Parch`	# of parents / children aboard the Titanic
`Ticket`	Ticket number
`Fare`	Passenger fare
`Cabin`	Cabin number
`Embarked`	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Exploratory Data Analysis (EDA) & Key Findings

Several key insights were drawn from the data that heavily influenced feature engineering and modeling:

Gender vs. Survival: Women had a significantly higher survival rate (~74%) than men (~19%), consistent with the "women and children first" protocol.
Passenger Class vs. Survival: First-class passengers had the highest survival rate (~63%), followed by second class (~47%) and third class (~24%). Wealth and status played a crucial role.
Age vs. Survival: Children (age < 16) had a higher survival rate than other age groups. The Age column had missing values that were imputed using the median age of passengers grouped by their Pclass and Sex.
Port of Embarkation: Passengers who embarked at Cherbourg (C) had a higher survival rate, likely because a larger share of them traveled first class.
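
The group-wise survival rates and the `Age` imputation described above can be sketched with pandas. This is an illustration on a tiny synthetic frame, not the notebook's actual code; the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for train.csv (values are illustrative only).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Survival rate per group: the mean of a 0/1 column is the fraction of survivors.
survival_by_sex = df.groupby("Sex")["Survived"].mean()

# Median Age of each passenger's (Pclass, Sex) group, aligned row by row with df.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
```

Using `transform` keeps the result the same length as the original frame, so the medians can be passed straight to `fillna`.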

Feature Engineering

To enhance the model's predictive power, the following features were created:

FamilySize: Combined SibSp and Parch to get the total number of family members on board, counting the passenger themselves. $$FamilySize = SibSp + Parch + 1$$
IsAlone: A binary feature derived from FamilySize to indicate if a passenger was traveling alone.
Title: Extracted titles (e.g., "Mr", "Mrs", "Miss", "Master") from the Name column. This served as a strong proxy for age, gender, and social status.
AgeGroup: Binned the Age feature into categories (e.g., Child, Teen, Adult, Senior) to better capture non-linear relationships with survival.
FarePerPerson: Calculated by dividing the Fare by FamilySize.
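
The features listed above could be derived roughly as follows. This is a sketch under assumptions: the title regex and the `AgeGroup` cut points are illustrative choices, since the README does not state the exact ones used in the notebook.

```python
import pandas as pd

# A few rows in the shape of the Kaggle data (names follow the "Last, Title. First" pattern).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Palsson, Master. Gosta Leonard",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 3, 0],
    "Parch": [0, 0, 1, 0],
    "Age": [22.0, 38.0, 2.0, 26.0],
    "Fare": [7.25, 71.2833, 21.075, 7.925],
})

# FamilySize = SibSp + Parch + 1 (the +1 counts the passenger).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone is 1 when the passenger has no family aboard.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# AgeGroup: assumed cut points (0-12, 13-19, 20-59, 60+), chosen for illustration.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 12, 19, 59, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
    include_lowest=True,
)

# FarePerPerson: Fare divided by FamilySize.
df["FarePerPerson"] = df["Fare"] / df["FamilySize"]
```

Rare titles are often merged into a single "Other" category before encoding; whether this project does so is not stated here.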

Modeling & Evaluation

Several classification models were trained and evaluated using 5-fold cross-validation, which gives a more reliable estimate of generalization than a single train/validation split and helps detect overfitting.

The models considered were:

Logistic Regression: A good baseline model for binary classification.
Support Vector Machine (SVM): Effective in high-dimensional spaces.
Random Forest Classifier: An ensemble of decision trees that is typically less prone to overfitting than a single tree and captures complex feature interactions.
Gradient Boosting Classifier: A powerful ensemble method that builds trees sequentially.

The primary metric for evaluation was Accuracy, as required by the Kaggle competition.
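
A 5-fold comparison of the four model families might look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the hyperparameters and scaling steps are assumptions rather than the notebook's settings, so the printed scores say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared feature matrix and the Survived labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling is placed inside the pipeline so it is re-fit on each training fold.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the class balance similar across the five splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reusing the same `cv` object for every model means all four are scored on identical splits, which makes the mean accuracies directly comparable.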

Results & Conclusion

The models were compared based on their mean cross-validation accuracy.

Model	Mean Cross-Validation Accuracy	Kaggle Score
Logistic Regression	0.795	[Your Score]
Support Vector Machine	0.812	[Your Score]
Random Forest Classifier	0.825	[Your Best Score]
Gradient Boosting	0.821	[Your Score]

The Random Forest Classifier provided the best and most stable performance, achieving a final Kaggle submission score of [Your Best Score].

The analysis confirms that Title, Sex, Pclass, and FamilySize were the most influential features in predicting survival. This project successfully demonstrates a complete machine learning pipeline, from raw data to a predictive model.
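
The README does not say how feature influence was measured. One common option with a Random Forest is the impurity-based `feature_importances_` attribute, sketched here on synthetic data with placeholder column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the column names are placeholders, not the Titanic features.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; larger values mean the feature was used for more useful splits.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance is a useful cross-check.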

Future Improvements

Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to find the optimal parameters for the best-performing models.
Advanced Ensembling: Combine the predictions of multiple models (stacking/blending) to potentially improve the final score.
More Feature Engineering: Explore the Ticket and Cabin features more deeply to extract potentially useful information.
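
As a starting point for the hyperparameter tuning idea, a small `GridSearchCV` run over a Random Forest could look like this. The parameter grid is an arbitrary example and the data is synthetic, so treat it as a sketch rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Example grid only; a real search would likely cover more values and parameters.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

# Every combination in the grid is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` takes a similar form but samples a fixed number of combinations, which scales better to large grids.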

How to Run this Project

To replicate this analysis, follow these steps:

Clone the repository:

```
git clone https://github.com/nayanj2221/Titanic---Machine-Learning-from-Disaster.git
cd Titanic---Machine-Learning-from-Disaster
```

Create a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the required libraries:
```
pip install -r requirements.txt
```

Run the Jupyter Notebook:

```
jupyter notebook "Titanic Survival Prediction.ipynb"
```
