Titanic Dataset Analysis: A Complete Data Science Pipeline

Introduction

The Titanic dataset represents one of the most famous datasets in data science education and competition. This tragic maritime disaster from 1912 provides a rich source of data for understanding machine learning fundamentals while tackling a real-world classification problem: predicting passenger survival based on various demographic and social factors.

In this comprehensive tutorial, we'll walk through a complete data science pipeline from initial data understanding to model deployment. This project demonstrates best practices in data science methodology, following a systematic 5-phase approach that can be applied to any machine learning project.

Project Repository

The complete data science pipeline, covering all five phases, is available on GitHub.

Project Overview

Our Titanic survival analysis follows a structured 5-phase methodology that ensures comprehensive coverage of the data science lifecycle:

Project Phases:

Phase 1: Data Understanding - Initial exploration and data profiling
Phase 2: Data Cleaning - Handling missing values, outliers, and duplicates
Phase 3: Feature Engineering - Normalization, encoding, and feature selection
Phase 4: Data Splitting - Train-test splits with stratification
Phase 5: Modeling & Evaluation - Model comparison, tuning, and analysis

Dataset Context:

The RMS Titanic sank on April 15, 1912, during its maiden voyage. Our analysis aims to understand which factors contributed most to passenger survival, using features such as:

Demographics: Age and sex
Socioeconomic: Passenger class, fare paid
Logistics: Cabin location, embarkation port
Family: Number of siblings/spouses and parents/children aboard

Phase 1: Data Understanding

Data Loading & Initial Preview

We begin our analysis by loading the Titanic dataset and conducting initial exploration:

import seaborn as sns
import pandas as pd

# Load Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# First glimpse of the data
print("First 5 rows:")
print(df.head())

# Dataset dimensions
print(f"\nDataset shape: {df.shape}")
print(f"Total samples: {df.shape[0]}")
print(f"Total features: {df.shape[1]}")

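# Save initial dataset (create the output folder first so to_csv does not fail)
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/phase_1_titanic_dataset.csv", index=False)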

Initial Data Exploration

Understanding the structure and characteristics of our dataset:

# Comprehensive dataset information
print("Dataset Info:")
print(df.info())

# Statistical summary for numerical columns
print("\nNumerical Statistics:")
print(df.describe())

# Data types examination
print("\nData Types:")
print(df.dtypes)

# Check for missing values
print("\nMissing Values Count:")
print(df.isnull().sum())

# Unique values in categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"\n{col}: {df[col].unique()}")

Initial Data Insights:

Dataset Size: 891 passengers with 15 columns (14 features plus the target)
Target Variable: 'survived' (0 = No, 1 = Yes)
Missing Data: Age (~20%), Deck (~77%), Embarked and Embark_town (~0.2% each)
Data Types: Mix of numerical and categorical features
Class Distribution: More non-survivors than survivors
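The missing-data percentages above can be reproduced with a small helper. The sketch below is illustrative only: `missing_report` and the toy frame are hypothetical and not part of the original project. The same function could be called on `df`.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report["missing"] > 0].sort_values("percent", ascending=False)

# Toy data standing in for the Titanic frame (values are made up)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, None, "C"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

report = missing_report(toy)
print(report)
```

Columns without gaps (here `fare`) are left out of the report, which keeps the output short on wider datasets.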

Phase 2: Data Cleaning

Handling Missing Values

We implement different strategies for different types of missing data:

import pandas as pd

# Load data from previous phase
df = pd.read_csv("data/phase_1_titanic_dataset.csv")

# Check current missing values
print("Missing values before cleaning:")
print(df.isnull().sum())

# 🔹 Numerical Columns - Use median for age
df['age'] = df['age'].fillna(df['age'].median())

# 🔹 Categorical Columns - Use mode for most common value
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# 🔹 Deck column - Use 'Unknown' for missing values
df['deck'] = df['deck'].fillna('Unknown')

print("\nMissing values after cleaning:")
print(df.isnull().sum())

Outlier Detection and Removal

Using the Interquartile Range (IQR) method to handle fare outliers:

# Outlier detection using IQR method for fare
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1

# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Fare outlier bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")

# Filter out outliers
original_size = len(df)
df = df[(df['fare'] >= lower_bound) & (df['fare'] <= upper_bound)]
new_size = len(df)

print(f"Removed {original_size - new_size} outliers ({((original_size - new_size)/original_size)*100:.1f}%)")
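Dropping rows is not the only way to handle fare outliers. High fares belong mostly to first-class passengers, so removing them may discard informative records. As a hedged alternative, the sketch below caps values at the IQR bounds instead of deleting rows. The helper `iqr_bounds` and the toy fares are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) outlier bounds using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy fares with one extreme value (made-up sample)
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])

lower, upper = iqr_bounds(fares)

# Cap values at the bounds instead of removing the rows
capped = fares.clip(lower=lower, upper=upper)

print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(capped.tolist())
```

All six rows are kept; only the extreme fare is pulled down to the upper bound.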

Duplicate Removal

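Removing exact duplicate rows. Note that this version of the dataset has no passenger ID or name column, so rows that look identical may be different passengers who share the same attributes; dropping them is a judgment call: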

# Check for and remove duplicates
duplicates_before = df.duplicated().sum()
df = df.drop_duplicates()
duplicates_after = df.duplicated().sum()

print(f"Duplicates removed: {duplicates_before}")
print(f"Final dataset size: {len(df)} samples")

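# Convert target to categorical
# (in-memory only: the category dtype is not preserved by the CSV export below,
# so later phases read 'survived' back as integers 0/1)
df['survived'] = df['survived'].astype('category')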

# Save cleaned dataset
df.to_csv("data/phase_2_titanic_dataset.csv", index=False)

Data Cleaning Results:

Missing Values: All handled with appropriate strategies
Age: Filled with median (28.0 years)
Embarked: Filled with mode ('S' - Southampton)
Deck: Missing values labeled as 'Unknown'
Outliers: Extreme fare values removed using IQR method
Duplicates: Any duplicate rows eliminated

Phase 3: Feature Engineering

Data Normalization

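Scaling numerical features to a common [0, 1] range (min-max normalization), which mainly helps distance- and gradient-based models such as K-Nearest Neighbors, SVM, and Logistic Regression: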

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load cleaned data
df = pd.read_csv("data/phase_2_titanic_dataset.csv")

# Min-Max Normalization (0-1 scaling)
# Formula: x' = (x - min(x)) / (max(x) - min(x))
scaler = MinMaxScaler()

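# Apply normalization to numerical features
# (fit_transform returns a 2D array, so flatten it before assigning to a column)
# Note: fitting on the full dataset lets test-set min/max leak into the features;
# a stricter pipeline would fit the scaler on the training split only.
df['age_norm'] = scaler.fit_transform(df[['age']]).ravel()
df['fare_norm'] = scaler.fit_transform(df[['fare']]).ravel()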

print("Normalization completed:")
print(f"Age range: [{df['age_norm'].min():.3f}, {df['age_norm'].max():.3f}]")
print(f"Fare range: [{df['fare_norm'].min():.3f}, {df['fare_norm'].max():.3f}]")
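One caveat with scaling before the split is that the minimum and maximum are computed from rows that later end up in the test set. A minimal sketch of the stricter order (split first, fit the scaler on the training rows, then apply it to both sets) is shown below. The random ages are synthetic stand-ins, not Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic "age" column: 100 values between 1 and 80
rng = np.random.default_rng(0)
ages = rng.uniform(1, 80, size=(100, 1))

# Split first so the test rows never influence the scaler
train, test = train_test_split(ages, test_size=0.2, random_state=42)

# Learn min and max from the training rows only
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # may fall slightly outside [0, 1]

print(f"Train range: [{train_scaled.min():.3f}, {train_scaled.max():.3f}]")
print(f"Test range:  [{test_scaled.min():.3f}, {test_scaled.max():.3f}]")
```

Test values can land marginally below 0 or above 1 when the test set contains a value more extreme than anything in the training set. That is expected and is the honest behavior on unseen data.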

Categorical Encoding

Converting categorical variables to numerical format for machine learning:

from sklearn.preprocessing import LabelEncoder

# 🔹 1. Label Encoding for binary features
le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex'])  # male=1, female=0
print(f"Sex encoding: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# 🔹 2. One-Hot Encoding for multi-category features
# This creates binary columns for each category
df = pd.get_dummies(df, columns=['embarked', 'class'], drop_first=True)

print("\nNew columns after one-hot encoding:")
print([col for col in df.columns if 'embarked_' in col or 'class_' in col])
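To see what `drop_first=True` does, here is a small illustration on a made-up column. The toy frame is hypothetical; `dtype=int` is added only so the output prints as 0/1 rather than True/False.

```python
import pandas as pd

# Toy embarkation column with the three port codes
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# drop_first=True removes the first category in sorted order ("C"),
# which then becomes the implicit baseline (all dummy columns equal 0)
encoded = pd.get_dummies(toy, columns=["embarked"], drop_first=True, dtype=int)

print(encoded)
```

A passenger who embarked at "C" is represented by zeros in both `embarked_Q` and `embarked_S`, so no information is lost while one redundant column is avoided.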

Feature Selection

Removing redundant and non-predictive features to simplify the model:

# Remove original columns that have been encoded or are not useful for prediction
columns_to_drop = [
    'sex',          # Replaced by sex_encoded
    'age',          # Replaced by age_norm  
    'fare',         # Replaced by fare_norm
    'deck',         # Too many missing values
    'embark_town',  # Redundant with embarked
    'who',          # Redundant with sex and age
    'alive',        # Same as survived
    'adult_male'    # Derivable from sex and age
]

df = df.drop(columns=columns_to_drop)

print("Final feature set:")
print(list(df.columns))

# Save feature-engineered dataset
df.to_csv("data/phase_3_titanic_dataset.csv", index=False)

Feature Engineering Summary:

Normalization: Age and fare scaled to [0,1] range
Label Encoding: Binary sex variable converted to 0/1
One-Hot Encoding: Embarked and class expanded to binary features
Feature Reduction: Removed 8 redundant/non-predictive columns
Final Features: 11 engineered features for modeling

Phase 4: Data Splitting

Train-Test Split with Stratification

Properly splitting the data while maintaining class balance:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load feature-engineered data
df = pd.read_csv("data/phase_3_titanic_dataset.csv")

# Separate features (X) and target (y)
X = df.drop(columns=['survived'])  # All features except target
y = df['survived']                 # Target variable

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution:\n{y.value_counts(normalize=True)}")

# Stratified train-test split to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% for testing
    random_state=42,      # For reproducibility
    stratify=y            # Maintain class distribution in both sets
)

print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Training target distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Test target distribution:\n{y_test.value_counts(normalize=True)}")

Data Splitting Results:

Training Set: 80% of data for model training
Test Set: 20% of data for final evaluation
Stratification: Maintains survival rate balance in both sets
Feature Count: 11 engineered features
Random State: Set for reproducible results
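The effect of `stratify` is easiest to see on a toy target with a known class ratio. The sketch below uses made-up labels (60% class 0, 40% class 1) and checks that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40% positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # keep the 60/40 ratio in both splits
)

print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```

Without `stratify`, the test-set ratio would vary with the random seed, which matters more on small datasets such as this one.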

Phase 5: Modeling & Evaluation

Baseline Model Development

Starting with a simple Logistic Regression as our baseline:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create and train baseline model
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, y_train)

# Make predictions
y_pred_baseline = baseline_model.predict(X_test)

# Evaluate baseline performance
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Accuracy: {baseline_accuracy:.4f}")
print("\nBaseline Classification Report:")
print(classification_report(y_test, y_pred_baseline))
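Because there are more non-survivors than survivors, it can help to compare any accuracy figure against a trivial majority-class predictor. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; it is an optional sanity check, not part of the original pipeline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy target: 6 non-survivors, 4 survivors (features are irrelevant here)
X_toy = np.zeros((10, 1))
y_toy = np.array([0] * 6 + [1] * 4)

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

dummy_accuracy = accuracy_score(y_toy, dummy.predict(X_toy))
print(f"Majority-class accuracy: {dummy_accuracy:.2f}")
```

A real model is only useful to the extent that it beats this floor, which equals the share of the majority class.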

Comprehensive Model Comparison

Evaluating multiple algorithms to find the best performer:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Define models for comparison
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(random_state=42)
}

# Store results for comparison
results = []

# Train and evaluate each model
for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': round(accuracy, 4),
        'Precision_0': round(report['0']['precision'], 4),
        'Recall_0': round(report['0']['recall'], 4),
        'Precision_1': round(report['1']['precision'], 4),
        'Recall_1': round(report['1']['recall'], 4),
        'F1_1': round(report['1']['f1-score'], 4)
    })
    
    print(f"\n📌 {name}")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

# Create comparison DataFrame
import pandas as pd
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1_1', ascending=False)
print("\n📊 Model Comparison Summary:")
print(results_df)

Model Performance Summary (test set; no Support Vector Machine results were reported):

Model	Accuracy	Precision (Survivors)	Recall (Survivors)	F1-Score (Survivors)
K-Nearest Neighbors	0.7941	0.75	0.60	0.68
Logistic Regression	0.7721	0.72	0.56	0.64
Random Forest	0.7426	0.69	0.64	0.65
Decision Tree	0.7426	0.68	0.62	0.64
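The scores above come from a single train-test split, so small differences between models may be noise. A hedged way to check this is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with 11 features, matching the feature count used here
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=42
)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    KNeighborsClassifier(), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

Comparing the mean and spread across folds for each model gives a steadier ranking than one split.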

Feature Importance Analysis

Understanding which features contribute most to survival prediction:

import matplotlib.pyplot as plt
import numpy as np

# Use Random Forest for feature importance analysis
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Get feature importances
feature_importance = rf_model.feature_importances_
feature_names = X_train.columns

# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Feature Importance Ranking:")
print(importance_df)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(importance_df)), importance_df['Importance'])
plt.yticks(range(len(importance_df)), importance_df['Feature'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance for Titanic Survival Prediction')
plt.gca().invert_yaxis()
plt.tight_layout()
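# Create the output folder if needed, then save the figure
import os
os.makedirs("results", exist_ok=True)
plt.savefig("results/feature_importance.png", dpi=300, bbox_inches='tight')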
plt.show()

🎯 Key Feature Insights:

Sex (Gender): Most important predictor - "women and children first" policy
Fare: Higher fares indicated better accommodations and survival chances
Age: Younger passengers had higher survival rates
Passenger Class: First-class passengers had better survival odds
Family Size: Being alone vs. traveling with family affected survival
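Impurity-based importances from a Random Forest can favor continuous, high-cardinality features such as fare and age, and they do not show the direction of an effect. Permutation importance on held-out data is a common cross-check. The sketch below runs it on synthetic data, so the ranking it prints is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Features whose shuffling barely changes the score contribute little to predictions on unseen data, regardless of their impurity-based rank.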

Model Deployment

Creating a user-friendly application for real-world use of our trained model:

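import os
import tkinter as tk
from tkinter import messagebox
import joblib
import numpy as np
import pandas as pd

# Save the Random Forest model trained above for deployment
# (K-Nearest Neighbors scored higher in the comparison; Random Forest is the model used by this app)
os.makedirs("deployment", exist_ok=True)
joblib.dump(rf_model, "deployment/random_forest_model.pkl")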

class TitanicApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Titanic Survival Predictor")
        self.root.geometry("400x600")
        
        # Load the trained model
        self.model = joblib.load("deployment/random_forest_model.pkl")
        
        # Create input fields for all features
        self.entries = {}
        fields = [
            'pclass', 'sibsp', 'parch', 'alone', 'age_norm', 'fare_norm',
            'sex_encoded', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'
        ]
        
        for field in fields:
            label = tk.Label(root, text=field.replace('_', ' ').title())
            label.pack(pady=2)
            entry = tk.Entry(root)
            entry.pack(pady=2)
            self.entries[field] = entry
        
        # Prediction button
        predict_btn = tk.Button(root, text="Predict Survival", 
                               command=self.predict, bg='#2563eb', fg='white')
        predict_btn.pack(pady=10)
        
        # Result display
        self.result_label = tk.Label(root, text="", font=("Arial", 14))
        self.result_label.pack(pady=10)
    
    def predict(self):
        try:
            # Collect input values
            input_values = []
            for field in self.entries:
                value = float(self.entries[field].get())
                input_values.append(value)
            
            # Make prediction (use a DataFrame so column names match those seen during training)
            X_input = pd.DataFrame([input_values], columns=list(self.entries))
            prediction = self.model.predict(X_input)[0]
            probability = self.model.predict_proba(X_input)[0]
            
            # Display result
            if prediction == 1:
                result = f"SURVIVED\nConfidence: {probability[1]:.2%}"
                self.result_label.config(text=result, fg='green')
            else:
                result = f"DID NOT SURVIVE\nConfidence: {probability[0]:.2%}"
                self.result_label.config(text=result, fg='red')
                
        except ValueError:
            messagebox.showerror("Error", "Please enter valid numbers for all fields")

# Run the application
if __name__ == "__main__":
    root = tk.Tk()
    app = TitanicApp(root)
    root.mainloop()
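The form above expects already-engineered values such as `age_norm` and `embarked_Q`, which is awkward for an end user. One possible improvement is a helper that converts raw inputs into the model's feature row. The sketch below is hypothetical: `build_feature_row` is not part of the original app, the `age_range` and `fare_range` arguments are placeholders for the min and max observed during training, and deriving `alone` from `sibsp + parch == 0` is an assumption about how that column is defined.

```python
import pandas as pd

# Column order must match the order used when the model was trained
FEATURE_ORDER = [
    'pclass', 'sibsp', 'parch', 'alone', 'age_norm', 'fare_norm',
    'sex_encoded', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'
]

def build_feature_row(pclass, sex, age, fare, sibsp, parch, embarked,
                      age_range, fare_range):
    """Turn raw passenger details into a one-row feature DataFrame."""
    age_min, age_max = age_range
    fare_min, fare_max = fare_range
    row = {
        'pclass': pclass,
        'sibsp': sibsp,
        'parch': parch,
        'alone': int(sibsp + parch == 0),           # assumed definition
        'age_norm': (age - age_min) / (age_max - age_min),
        'fare_norm': (fare - fare_min) / (fare_max - fare_min),
        'sex_encoded': 1 if sex == 'male' else 0,   # female=0, male=1
        'embarked_Q': int(embarked == 'Q'),
        'embarked_S': int(embarked == 'S'),
        'class_Second': int(pclass == 2),
        'class_Third': int(pclass == 3),
    }
    return pd.DataFrame([row], columns=FEATURE_ORDER)

# Example call with placeholder ranges (not the real training ranges)
row = build_feature_row(
    pclass=3, sex='male', age=40.0, fare=10.0, sibsp=0, parch=0,
    embarked='S', age_range=(0.0, 80.0), fare_range=(0.0, 50.0)
)
print(row.T)
```

In practice the ranges would be saved alongside the model, for example by persisting the fitted scaler, so that deployment uses exactly the same transformation as training.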

Desktop Application Features

The Tkinter application provides an intuitive interface for survival prediction:

User-Friendly Interface: Simple form-based input for all features (age and fare must be entered as normalized 0-1 values)
Real-Time Prediction: Instant results with confidence scores
Model Integration: Uses the trained Random Forest model (K-Nearest Neighbors scored higher in the comparison above)
Error Handling: Validates input and provides helpful error messages
Visual Feedback: Color-coded results (green=survived, red=not survived)

Conclusion

This Titanic analysis demonstrates a complete data science workflow that can be adapted to other classification problems. Following the 5-phase approach, the best model in our comparison (K-Nearest Neighbors) reached approximately 79% accuracy on the held-out test set, while the deployed Random Forest reached about 74%.

🎯 Key Project Outcomes:

Methodology: Structured 5-phase approach ensures comprehensive analysis
Data Quality: Systematic cleaning and feature engineering improved model performance
Model Selection: K-Nearest Neighbors emerged as the best performer
Feature Insights: Gender, fare, and age were the most predictive factors
Deployment: Created a user-friendly application for practical use
Reproducibility: All phases documented with clear, executable code

📊 Historical Insights:

Our analysis reveals important social and logistical factors that influenced survival on the Titanic:

Social Hierarchy: Passenger class significantly affected survival chances
Demographics: Women and children had priority in lifeboats
Economic Factors: Higher fare correlated with better survival odds
Family Dynamics: Family size impacted individual survival strategies
Embarkation Port: Survival rates differed by port, which more plausibly reflects the passenger-class mix at each port than any difference in emergency procedures

🚀 Advanced Extensions:

This foundational project can be extended in numerous ways:

Advanced Models: Neural networks, ensemble methods, or boosting algorithms
Feature Engineering: Creating interaction terms or polynomial features
Cross-Validation: More robust model evaluation with k-fold CV
Hyperparameter Optimization: Grid search or Bayesian optimization
Web Deployment: Flask/Django web application or cloud deployment
Interpretability: LIME or SHAP for model explanation
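As one example of these extensions, hyperparameter optimization with grid search might look like the sketch below. It tunes K-Nearest Neighbors on synthetic data; the parameter grid is an arbitrary illustration, not a recommendation for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered feature matrix
X_demo, y_demo = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=42
)

# Candidate settings to try (illustrative values)
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validated search, scored by F1 on the positive class
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

The search should be run on the training split only, with the test set kept aside for a single final evaluation.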

Resources & Further Reading

Complete Project: All 5 phases with Jupyter notebooks and deployment code (GitHub)
Titanic Dataset: Original dataset information and historical context (Kaggle competition)
Scikit-learn: Machine learning library documentation and tutorials
Data Science Process: CRISP-DM methodology for data science projects


About Dr. Sinan

Dr. Sinan is a Research Scientist and Machine Learning Engineer with extensive experience in data science methodology and predictive modeling. He specializes in creating end-to-end machine learning solutions and educational content that bridges theory and practice.
