Introduction
The Titanic dataset is one of the best-known datasets in data science education and competitions. The 1912 sinking provides a rich source of data for learning machine learning fundamentals through a real-world classification problem: predicting passenger survival from demographic and social factors.
In this tutorial, we'll walk through a complete data science pipeline, from initial data understanding to model deployment. The project follows a systematic 5-phase approach that can be adapted to other machine learning projects.
Project Repository
The complete data science pipeline with all phases is available on GitHub.
Project Overview
Our Titanic survival analysis follows a structured 5-phase methodology that ensures comprehensive coverage of the data science lifecycle:
Project Phases:
- Phase 1: Data Understanding - Initial exploration and data profiling
- Phase 2: Data Cleaning - Handling missing values, outliers, and duplicates
- Phase 3: Feature Engineering - Normalization, encoding, and feature selection
- Phase 4: Data Splitting - Train-test splits with stratification
- Phase 5: Modeling & Evaluation - Model comparison, tuning, and analysis
Dataset Context:
The RMS Titanic sank on April 15, 1912, during its maiden voyage. Our analysis aims to understand which factors contributed most to passenger survival, using features such as:
- Demographics: Age, gender, family relationships
- Socioeconomic: Passenger class, fare paid
- Logistics: Cabin location, embarkation port
- Family: Number of siblings/spouses and parents/children aboard
Phase 1: Data Understanding
Data Loading & Initial Preview
We begin our analysis by loading the Titanic dataset and conducting initial exploration:
import seaborn as sns
import pandas as pd
# Load Titanic dataset from seaborn
df = sns.load_dataset("titanic")
# First glimpse of the data
print("First 5 rows:")
print(df.head())
# Dataset dimensions
print(f"\nDataset shape: {df.shape}")
print(f"Total samples: {df.shape[0]}")
print(f"Total features: {df.shape[1]}")
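# Save initial dataset (create the output folder first so to_csv does not fail)
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/phase_1_titanic_dataset.csv", index=False)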
Initial Data Exploration
Understanding the structure and characteristics of our dataset:
# Comprehensive dataset information
print("Dataset Info:")
print(df.info())
# Statistical summary for numerical columns
print("\nNumerical Statistics:")
print(df.describe())
# Data types examination
print("\nData Types:")
print(df.dtypes)
# Check for missing values
print("\nMissing Values Count:")
print(df.isnull().sum())
# Unique values in categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"\n{col}: {df[col].unique()}")
Initial Data Insights:
- Dataset Size: 891 passengers and 15 columns (including the target)
- Target Variable: 'survived' (0 = No, 1 = Yes)
- Missing Data: Age (~20%), Deck (~77%), Embarked and Embark_town (~0.2% each)
- Data Types: Mix of numerical and categorical features
- Class Distribution: More non-survivors than survivors
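The missing-data percentages listed above can be reproduced with a small helper. The sketch below is illustrative only: `missing_report` is a hypothetical function name, and the frame is a tiny hand-made stand-in rather than the real dataset.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per affected column."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("percent", ascending=False)


# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, None, "C"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})
print(missing_report(toy))
```

Calling `missing_report(df)` on the frame loaded above would give the same kind of table for the real columns.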
Phase 2: Data Cleaning
Handling Missing Values
We implement different strategies for different types of missing data:
import pandas as pd
# Load data from previous phase
df = pd.read_csv("data/phase_1_titanic_dataset.csv")
# Check current missing values
print("Missing values before cleaning:")
print(df.isnull().sum())
# Numerical columns - use median for age
df['age'] = df['age'].fillna(df['age'].median())
# Categorical columns - use mode (most common value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
# Deck column - use 'Unknown' for missing values
df['deck'] = df['deck'].fillna('Unknown')
print("\nMissing values after cleaning:")
print(df.isnull().sum())
Outlier Detection and Removal
Using the Interquartile Range (IQR) method to handle fare outliers:
# Outlier detection using IQR method for fare
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1
# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Fare outlier bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
# Filter out outliers
original_size = len(df)
df = df[(df['fare'] >= lower_bound) & (df['fare'] <= upper_bound)]
new_size = len(df)
print(f"Removed {original_size - new_size} outliers ({((original_size - new_size)/original_size)*100:.1f}%)")
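To make the IQR rule concrete, here is a small worked example on made-up fares. The helper name `iqr_bounds` is hypothetical, and the clipping variant is shown only as an alternative to dropping rows; it is not what the tutorial does.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up fares: six ordinary values and one extreme one
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 12.0, 15.0, 100.0])
lower, upper = iqr_bounds(fares)
print(lower, upper)  # Q1 = 7.5, Q3 = 13.5, IQR = 6.0, so the fences are -1.5 and 22.5

# Option 1 (used in the tutorial): drop rows outside the fences
kept = fares[(fares >= lower) & (fares <= upper)]

# Option 2 (an alternative): cap extreme values at the fences instead of dropping them
capped = fares.clip(lower=lower, upper=upper)
print(len(kept), capped.max())
```

Dropping rows shrinks the dataset, while clipping keeps every passenger but flattens the extremes; which is preferable depends on whether the high fares are errors or genuine first-class tickets.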
Duplicate Removal
Ensuring data quality by removing duplicate records:
# Check for and remove duplicates
duplicates_before = df.duplicated().sum()
df = df.drop_duplicates()
duplicates_after = df.duplicated().sum()
print(f"Duplicates removed: {duplicates_before}")
print(f"Final dataset size: {len(df)} samples")
# Convert target to categorical
# (note: CSV does not store dtypes, so later phases read 'survived' back as integers)
df['survived'] = df['survived'].astype('category')
# Save cleaned dataset
df.to_csv("data/phase_2_titanic_dataset.csv", index=False)
Data Cleaning Results:
- Missing Values: All handled with appropriate strategies
- Age: Filled with median (28.0 years)
- Embarked: Filled with mode ('S' - Southampton)
- Deck: Missing values labeled as 'Unknown'
- Outliers: Extreme fare values removed using IQR method
- Duplicates: Any duplicate rows eliminated
Phase 3: Feature Engineering
Data Normalization
Rescaling numerical features to the [0, 1] range so that features measured on different scales contribute comparably:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load cleaned data
df = pd.read_csv("data/phase_2_titanic_dataset.csv")
# Min-Max Normalization (0-1 scaling)
# Formula: x' = (x - min(x)) / (max(x) - min(x))
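scaler = MinMaxScaler()
# Apply normalization to numerical features
# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it into a single column
# The scaler is refit for each column, so afterwards it only holds the fare parameters
df['age_norm'] = scaler.fit_transform(df[['age']]).ravel()
df['fare_norm'] = scaler.fit_transform(df[['fare']]).ravel()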
print("Normalization completed:")
print(f"Age range: [{df['age_norm'].min():.3f}, {df['age_norm'].max():.3f}]")
print(f"Fare range: [{df['fare_norm'].min():.3f}, {df['fare_norm'].max():.3f}]")
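The min-max formula in the comment above is simple enough to check by hand. The following is a minimal pure-Python sketch of that formula, not scikit-learn's implementation; `min_max_scale` is a hypothetical helper and the ages are made up.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero, so return zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [2.0, 20.0, 38.0, 74.0]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why a single extreme value can compress every other value toward 0.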
Categorical Encoding
Converting categorical variables to numerical format for machine learning:
from sklearn.preprocessing import LabelEncoder
# 1. Label Encoding for binary features
le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex']) # male=1, female=0
print(f"Sex encoding: {dict(zip(le.classes_, le.transform(le.classes_)))}")
# 2. One-Hot Encoding for multi-category features
# This creates binary columns for each category
df = pd.get_dummies(df, columns=['embarked', 'class'], drop_first=True)
print("\nNew columns after one-hot encoding:")
print([col for col in df.columns if 'embarked_' in col or 'class_' in col])
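To see what the two encodings produce, here is a toy example with three made-up passengers. The explicit `map` is an alternative to `LabelEncoder` that makes the 0/1 assignment visible in the code; it is shown as a sketch, not as the tutorial's method.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# Binary feature: an explicit mapping documents which value becomes 1
toy["sex_encoded"] = toy["sex"].map({"female": 0, "male": 1})

# Multi-category feature: drop_first=True drops the first category ('C'),
# which then acts as the reference level (both dummy columns are 0/False)
encoded = pd.get_dummies(toy, columns=["embarked"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True`, a passenger who embarked at 'C' is represented by both `embarked_Q` and `embarked_S` being false, so no information is lost by dropping that column.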
Feature Selection
Removing redundant and non-predictive features to simplify the model:
# Remove original columns that have been encoded or are not useful for prediction
columns_to_drop = [
    'sex',          # Replaced by sex_encoded
    'age',          # Replaced by age_norm
    'fare',         # Replaced by fare_norm
    'deck',         # Too many missing values
    'embark_town',  # Redundant with embarked
    'who',          # Redundant with sex and age
    'alive',        # Same as survived
    'adult_male'    # Derivable from sex and age
]
df = df.drop(columns=columns_to_drop)
print("Final feature set:")
print(list(df.columns))
# Save feature-engineered dataset
df.to_csv("data/phase_3_titanic_dataset.csv", index=False)
Feature Engineering Summary:
- Normalization: Age and fare scaled to [0,1] range
- Label Encoding: Binary sex variable converted to 0/1
- One-Hot Encoding: Embarked and class expanded to binary features
- Feature Reduction: Removed 8 redundant/non-predictive columns
- Final Features: 11 engineered features for modeling
Phase 4: Data Splitting
Train-Test Split with Stratification
Properly splitting the data while maintaining class balance:
import pandas as pd
from sklearn.model_selection import train_test_split
# Load feature-engineered data
df = pd.read_csv("data/phase_3_titanic_dataset.csv")
# Separate features (X) and target (y)
X = df.drop(columns=['survived']) # All features except target
y = df['survived'] # Target variable
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution:\n{y.value_counts(normalize=True)}")
# Stratified train-test split to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # For reproducibility
    stratify=y        # Maintain class distribution in both sets
)
print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Training target distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Test target distribution:\n{y_test.value_counts(normalize=True)}")
Data Splitting Results:
- Training Set: 80% of data for model training
- Test Set: 20% of data for final evaluation
- Stratification: Maintains survival rate balance in both sets
- Feature Count: 11 engineered features
- Random State: Set for reproducible results
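In this tutorial the min-max scaling happens in Phase 3, before the split, so the test rows influence the minimum and maximum used for scaling. A common variant is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering, along with the effect of `stratify`, on synthetic data; the column values and the 40% positive rate are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 100 rows, 40% positives (not the real Titanic data)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.uniform(1, 80, size=100),
    "fare": rng.uniform(0, 60, size=100),
})
y = pd.Series([1] * 40 + [0] * 60, name="survived")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())  # both splits keep the 40% positive rate

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # may fall slightly outside [0, 1]
```

Because the test rows are scaled with the training minimum and maximum, a test value can land just below 0 or above 1; that is expected and mirrors what happens with unseen data.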
Phase 5: Modeling & Evaluation
Baseline Model Development
Starting with a simple Logistic Regression as our baseline:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Create and train baseline model
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, y_train)
# Make predictions
y_pred_baseline = baseline_model.predict(X_test)
# Evaluate baseline performance
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Accuracy: {baseline_accuracy:.4f}")
print("\nBaseline Classification Report:")
print(classification_report(y_test, y_pred_baseline))
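Since the classes are imbalanced, it helps to know what accuracy a model gets by always predicting the majority class; any real model should beat that floor. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with an assumed 60/40 split, which is not the tutorial's actual class ratio.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 60% class 0, 40% class 1 (illustrative only)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))  # the features are ignored by this strategy

# Always predict the most frequent class seen during fit
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_demo, y_demo)
floor_accuracy = accuracy_score(y_demo, majority.predict(X_demo))
print(f"Majority-class accuracy: {floor_accuracy:.2f}")  # 0.60
```

In the tutorial's setting, the same check would be `DummyClassifier(strategy="most_frequent").fit(X_train, y_train)` scored on `X_test`, giving a reference point for the baseline accuracy printed above.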
Comprehensive Model Comparison
Evaluating multiple algorithms to find the best performer:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
# Define models for comparison
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(random_state=42)
}
# Store results for comparison
results = []
# Train and evaluate each model
for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    # Make predictions
    y_pred = model.predict(X_test)
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    # Store results (report keys are the class labels as strings: '0' and '1')
    results.append({
        'Model': name,
        'Accuracy': round(accuracy, 4),
        'Precision_0': round(report['0']['precision'], 4),
        'Recall_0': round(report['0']['recall'], 4),
        'Precision_1': round(report['1']['precision'], 4),
        'Recall_1': round(report['1']['recall'], 4),
        'F1_1': round(report['1']['f1-score'], 4)
    })
    print(f"\n{name}")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
# Create comparison DataFrame, sorted by F1 score for the survivor class
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1_1', ascending=False)
print("\nModel Comparison Summary:")
print(results_df)
Model Performance Summary:
Model | Accuracy | Precision (Survivors) | Recall (Survivors) | F1-Score (Survivors) |
---|---|---|---|---|
K-Nearest Neighbors | 0.7941 | 0.75 | 0.60 | 0.68 |
Logistic Regression | 0.7721 | 0.72 | 0.56 | 0.64 |
Random Forest | 0.7426 | 0.69 | 0.64 | 0.65 |
Decision Tree | 0.7426 | 0.68 | 0.62 | 0.64 |
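For readers unfamiliar with how the columns in this table relate, precision, recall, and F1 for the survivor class all derive from three confusion-matrix counts. The sketch below computes them by hand; the function name and the counts are hypothetical, chosen only to give round numbers, and are not taken from the tutorial's results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 survivors found, 10 false alarms, 20 survivors missed
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model with high precision but weak recall on survivors is penalized.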
Feature Importance Analysis
Understanding which features contribute most to survival prediction:
import matplotlib.pyplot as plt
import numpy as np
# Use Random Forest for feature importance analysis
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
# Get feature importances
feature_importance = rf_model.feature_importances_
feature_names = X_train.columns
# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)
print("Feature Importance Ranking:")
print(importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(importance_df)), importance_df['Importance'])
plt.yticks(range(len(importance_df)), importance_df['Feature'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance for Titanic Survival Prediction')
plt.gca().invert_yaxis()
plt.tight_layout()
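# Create the output folder first so savefig does not fail
import os
os.makedirs("results", exist_ok=True)
plt.savefig("results/feature_importance.png", dpi=300, bbox_inches='tight')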
plt.show()
Key Feature Insights:
- Sex (Gender): Most important predictor - "women and children first" policy
- Fare: Higher fares indicated better accommodations and survival chances
- Age: Younger passengers had higher survival rates
- Passenger Class: First-class passengers had better survival odds
- Family Size: Being alone vs. traveling with family affected survival
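The ranking above comes from the Random Forest's impurity-based importances, which are computed on the training data and can favor continuous features with many distinct values. Permutation importance on held-out data is a common cross-check. The following is a sketch on synthetic data generated with `make_classification`, so the feature indices it prints have no connection to the Titanic columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, of which 2 carry signal (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out rows and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Applied to the tutorial's model, the call would be `permutation_importance(rf_model, X_test, y_test, ...)`; if the two rankings disagree noticeably, the impurity-based ranking deserves a closer look.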
Model Deployment
Creating a user-friendly application for real-world use of our trained model:
import os
import tkinter as tk
from tkinter import messagebox
import joblib
import pandas as pd
# Save the Random Forest fitted for the feature importance analysis
# (note: K-Nearest Neighbors had the highest accuracy in the comparison table)
os.makedirs("deployment", exist_ok=True)
joblib.dump(rf_model, "deployment/random_forest_model.pkl")
class TitanicApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Titanic Survival Predictor")
        self.root.geometry("400x600")
        # Load the trained model
        self.model = joblib.load("deployment/random_forest_model.pkl")
        # Create input fields for all features, in the same order as the training columns
        self.entries = {}
        fields = [
            'pclass', 'sibsp', 'parch', 'alone', 'age_norm', 'fare_norm',
            'sex_encoded', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'
        ]
        for field in fields:
            label = tk.Label(root, text=field.replace('_', ' ').title())
            label.pack(pady=2)
            entry = tk.Entry(root)
            entry.pack(pady=2)
            self.entries[field] = entry
        # Prediction button
        predict_btn = tk.Button(root, text="Predict Survival",
                                command=self.predict, bg='#2563eb', fg='white')
        predict_btn.pack(pady=10)
        # Result display
        self.result_label = tk.Label(root, text="", font=("Arial", 14))
        self.result_label.pack(pady=10)
    def predict(self):
        try:
            # Collect input values (dicts preserve insertion order, so this matches `fields`)
            input_values = []
            for field in self.entries:
                value = float(self.entries[field].get())
                input_values.append(value)
            # Wrap the values in a DataFrame so the column names match those used in training
            X_input = pd.DataFrame([input_values], columns=list(self.entries))
            # Make prediction
            prediction = self.model.predict(X_input)[0]
            probability = self.model.predict_proba(X_input)[0]
            # Display result
            if prediction == 1:
                result = f"SURVIVED\nConfidence: {probability[1]:.2%}"
                self.result_label.config(text=result, fg='green')
            else:
                result = f"DID NOT SURVIVE\nConfidence: {probability[0]:.2%}"
                self.result_label.config(text=result, fg='red')
        except ValueError:
            messagebox.showerror("Error", "Please enter valid numbers for all fields")
# Run the application
if __name__ == "__main__":
    root = tk.Tk()
    app = TitanicApp(root)
    root.mainloop()
Desktop Application Features
The Tkinter application provides an intuitive interface for survival prediction:
- Form-Based Interface: One input field per model feature; values must be entered already encoded and normalized (for example, age_norm between 0 and 1)
- Real-Time Prediction: Instant results with confidence scores
- Model Integration: Loads the saved Random Forest model
- Error Handling: Validates input and provides helpful error messages
- Visual Feedback: Color-coded results (green=survived, red=not survived)
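Because the form expects encoded and normalized values, a small conversion helper could sit between raw passenger details and the model. The sketch below is an assumption-laden illustration: `encode_passenger` and `FIELD_ORDER` are hypothetical names, the age and fare ranges must come from the training data (the ones in the usage example are placeholders), and the `alone` rule (no siblings, spouses, parents, or children aboard) is assumed from the column's meaning.

```python
# Column order expected by the model, copied from the application's field list
FIELD_ORDER = [
    "pclass", "sibsp", "parch", "alone", "age_norm", "fare_norm",
    "sex_encoded", "embarked_Q", "embarked_S", "class_Second", "class_Third",
]


def encode_passenger(age, fare, sex, pclass, sibsp, parch, embarked,
                     age_range, fare_range):
    """Turn raw passenger values into the 11 model inputs, in training order.

    age_range and fare_range are (min, max) pairs taken from the training data.
    """
    age_min, age_max = age_range
    fare_min, fare_max = fare_range
    row = {
        "pclass": pclass,
        "sibsp": sibsp,
        "parch": parch,
        "alone": int(sibsp + parch == 0),        # assumed definition of 'alone'
        "age_norm": (age - age_min) / (age_max - age_min),
        "fare_norm": (fare - fare_min) / (fare_max - fare_min),
        "sex_encoded": 1 if sex == "male" else 0,  # female=0, male=1
        "embarked_Q": int(embarked == "Q"),
        "embarked_S": int(embarked == "S"),
        "class_Second": int(pclass == 2),
        "class_Third": int(pclass == 3),
    }
    return [float(row[name]) for name in FIELD_ORDER]


# Placeholder ranges; in practice, read them from the data used to fit the scaler
vector = encode_passenger(
    age=40, fare=13, sex="female", pclass=2, sibsp=0, parch=0, embarked="S",
    age_range=(0.0, 80.0), fare_range=(0.0, 65.0),
)
print(dict(zip(FIELD_ORDER, vector)))
```

With a helper like this, the form could ask for age in years and fare in pounds and leave the encoding to the code.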
Conclusion
This Titanic analysis walks through a complete data science workflow, and the same five phases carry over to other classification problems. The best model in our comparison, K-Nearest Neighbors, reached approximately 79% accuracy on the held-out test set.
Key Project Outcomes:
- Methodology: Structured 5-phase approach ensures comprehensive analysis
- Data Quality: Systematic cleaning and feature engineering prepared the data for modeling
- Model Selection: K-Nearest Neighbors emerged as the best performer
- Feature Insights: Gender, fare, and age were the most predictive factors
- Deployment: Created a user-friendly application for practical use
- Reproducibility: All phases documented with clear, executable code
Historical Insights:
Our analysis reveals important social and logistical factors that influenced survival on the Titanic:
- Social Hierarchy: Passenger class significantly affected survival chances
- Demographics: Women and children had priority in lifeboats
- Economic Factors: Higher fare correlated with better survival odds
- Family Dynamics: Family size impacted individual survival strategies
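- Embarkation Port: Port of embarkation was included as a feature, though any link to survival more plausibly reflects who boarded at each port (for example, the mix of passenger classes) than different emergency procedures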
Advanced Extensions:
This foundational project can be extended in numerous ways:
- Advanced Models: Neural networks, ensemble methods, or boosting algorithms
- Feature Engineering: Creating interaction terms or polynomial features
- Cross-Validation: More robust model evaluation with k-fold CV
- Hyperparameter Optimization: Grid search or Bayesian optimization
- Web Deployment: Flask/Django web application or cloud deployment
- Interpretability: LIME or SHAP for model explanation
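As a starting point for the cross-validation extension, the sketch below wraps scaling and a classifier in a pipeline and scores it with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification`, so the printed accuracies say nothing about the Titanic models; the point is the structure, in which the scaler is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Scaling lives inside the pipeline, so each fold fits it on its own training part
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier comparison between models than a single 80/20 split, and the same pipeline object can be passed to `GridSearchCV` for hyperparameter tuning.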