Building a Machine Learning Pipeline for Breast Cancer Detection

Introduction

Breast cancer is one of the most common cancers affecting women worldwide, with early detection being crucial for successful treatment outcomes. In this comprehensive tutorial, we'll walk through the complete process of building a machine learning pipeline for breast cancer detection using the Wisconsin Breast Cancer dataset.

This project demonstrates real-world application of machine learning in healthcare, covering everything from data preprocessing to model deployment. We'll explore multiple algorithms, perform hyperparameter tuning, and analyze feature importance to understand what makes our model effective.

Project Repository

The complete Jupyter notebook for this project is available on GitHub:

View Jupyter Notebook

Project Overview

Our machine learning pipeline includes the following key components:

Data Source: Wisconsin Breast Cancer Diagnostic Dataset
Problem Type: Binary Classification (Malignant vs Benign)
Models Tested: 5 algorithms (Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, SVM), plus a tuned Random Forest
Best Model: Tuned Random Forest with 97.37% accuracy on the held-out test set
Deployment: Tkinter-based desktop application

Data Exploration

Data Loading

We start by loading the Wisconsin Breast Cancer dataset directly from the UCI repository:

import warnings
warnings.filterwarnings("ignore")
import pandas as pd

# Load the UCI Breast Cancer Wisconsin dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
column_names = ['ID', 'Diagnosis'] + [
    # Mean values
    'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
    'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean',
    'symmetry_mean', 'fractal_dimension_mean',
    # Standard error values
    'radius_se', 'texture_se', 'perimeter_se', 'area_se',
    'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se',
    'symmetry_se', 'fractal_dimension_se',
    # Worst values
    'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
    'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst',
    'symmetry_worst', 'fractal_dimension_worst'
]
cancer_data = pd.read_csv(url, names=column_names, na_values='?')

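# Create the output directories used throughout this tutorial
import os
for folder in ("data", "metrices", "figures", "models"):
    os.makedirs(os.path.join("results", folder), exist_ok=True)

# Save the raw data
cancer_data.to_csv('results/data/raw_cancer_data.csv', index=False)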

Data Inspection

Next, we examine the structure and characteristics of our dataset:

# Basic dataset information
print(cancer_data.head(5))
print(f"Shape: {cancer_data.shape}")

# Check target variable categories
print(cancer_data.Diagnosis.unique())  # Output: ['M' 'B']

# Dataset statistics (info() prints its summary directly and returns None,
# so it must not be wrapped in print or an f-string)
cancer_data.info()
print(f"Description:\n{cancer_data.describe()}")

# Check for missing values
print(f"Missing values:\n{cancer_data.isnull().sum()}")

# Check for duplicates
print(f"Number of duplicated rows: {cancer_data.duplicated().sum()}")

Key Dataset Characteristics:

Shape: 569 samples × 32 columns (ID, Diagnosis, and 30 numeric features)
Target Classes: M (Malignant), B (Benign)
Missing Values: 0 (Clean dataset)
Duplicates: 0 rows
Data Types: All 30 features are numeric; the Diagnosis target is categorical (M/B)
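
The inspection above does not report how the two classes are distributed, which matters when reading the accuracy figures later on. Here is a minimal sketch of that check. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) rather than the `cancer_data` frame loaded from UCI, so it runs offline:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the WDBC data stands in for cancer_data.
# Its encoding is 0 = malignant, 1 = benign, the reverse of this tutorial's mapping.
data = load_breast_cancer()
labels = pd.Series(data.target).map({0: "M", 1: "B"})

counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(3)
print(pd.DataFrame({"count": counts, "share": shares}))
```

The dataset contains 357 benign and 212 malignant samples, so a classifier that always predicts "benign" would already reach roughly 63% accuracy. This is one reason we track precision, recall, and F1 for the malignant class rather than accuracy alone.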

Data Cleaning

We convert the categorical diagnosis labels to numerical format for machine learning:

# Convert diagnosis to binary format
cancer_data['Diagnosis'] = cancer_data['Diagnosis'].map({'M': 1, 'B': 0})  # Malignant=1, Benign=0

# Export cleaned data
cancer_data.to_csv("results/data/clean_breast_cancer_wisconsin_mapped.csv", index=False)
df = pd.read_csv("results/data/clean_breast_cancer_wisconsin_mapped.csv")
print(df.head(5))
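
One caveat with `.map()`: any label that is missing from the dictionary silently becomes NaN. The sketch below uses a small toy frame (not the real data) to show the failure mode, together with the kind of guard we might add after the mapping step:

```python
import pandas as pd

# Toy frame with one unexpected label ('?') to illustrate the failure mode
toy = pd.DataFrame({"Diagnosis": ["M", "B", "?", "B"]})
mapped = toy["Diagnosis"].map({"M": 1, "B": 0})

# .map() returns NaN for labels that are not keys of the dictionary
n_unmapped = int(mapped.isna().sum())
print(f"Unmapped labels: {n_unmapped}")

# On the real data, a guard like this would stop the pipeline early:
# assert cancer_data["Diagnosis"].isna().sum() == 0
```

Since the inspection step found only 'M' and 'B', the guard should pass on this dataset. It is mainly useful if the pipeline is later pointed at a different file.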

Model Development

Our model development process follows a systematic approach: baseline model → multiple model comparison → hyperparameter tuning.

Baseline Model

We start with a simple Logistic Regression model as our baseline:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset
X = df.drop(['Diagnosis', 'ID'], axis=1)
y = df['Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create and train baseline model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate baseline performance
y_pred = model.predict(X_test)
print("Baseline Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
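
The baseline above is fitted on unscaled features, whose ranges differ by orders of magnitude (for example, area versus smoothness). Logistic Regression usually converges faster and more reliably on standardized inputs. The following is a sketch of one way to add scaling, not the approach used in the notebook. The names `X_demo`, `y_demo`, and `scaled_logreg` are illustrative, and the data comes from scikit-learn's bundled copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of WDBC; flip the labels so malignant = 1 as in this tutorial
data = load_breast_cancer()
X_demo, y_demo = data.data, 1 - data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Inside a pipeline, the scaler is fitted on the training split only
# and then applied unchanged to the test split
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

scaled_acc = accuracy_score(y_te, scaled_logreg.predict(X_te))
print(f"Scaled Logistic Regression accuracy: {scaled_acc:.4f}")
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the training step, and the same object can later be passed to cross-validation or grid search.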

Multiple Models Comparison

We compare five different machine learning algorithms to identify the best performer:

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Define models to compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # random_state is fixed so that repeated runs give the same scores
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC()
}

# Store metrics for comparison
metrics_list = []

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    # Extract metrics for malignant class (class '1')
    precision = report['1']['precision']
    recall = report['1']['recall']
    f1 = report['1']['f1-score']
    
    metrics_list.append({
        'Model': name,
        'Accuracy': round(acc, 4),
        'Precision (Malignant)': round(precision, 4),
        'Recall (Malignant)': round(recall, 4),
        'F1-Score (Malignant)': round(f1, 4)
    })

# Create comparison DataFrame
df_metrics = pd.DataFrame(metrics_list)
df_metrics = df_metrics.sort_values(by='F1-Score (Malignant)', ascending=False).reset_index(drop=True)
print(df_metrics)

# Save metrics
df_metrics.to_csv("results/metrices/breast_cancer_model_metrics.csv", index=False)

Model Comparison Results:

Model	Accuracy	Precision (Malignant)	Recall (Malignant)	F1-Score (Malignant)
Random Forest	0.9649	1.0000	0.9048	0.9500
Logistic Regression	0.9386	0.9730	0.8571	0.9114
Decision Tree	0.9211	0.9231	0.8571	0.8889
K-Nearest Neighbors	0.9123	0.9706	0.7857	0.8684
Support Vector Machine	0.9035	1.0000	0.7381	0.8493
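
To make the table easier to read, it helps to see how the four metrics follow from raw prediction counts. The sketch below rebuilds the Random Forest row from a confusion matrix. The counts are inferred from the reported metrics and the 114-sample test split, not taken from the notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Inferred counts for the Random Forest row (an assumption, not notebook output):
# 42 malignant cases (38 caught, 4 missed) and 72 benign cases (none flagged)
tp, fn, fp, tn = 38, 4, 0, 72

y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

acc = accuracy_score(y_true, y_hat)    # (tp + tn) / total
prec = precision_score(y_true, y_hat)  # tp / (tp + fp)
rec = recall_score(y_true, y_hat)      # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)           # 2 * tp / (2 * tp + fp + fn)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

These counts reproduce the row exactly: accuracy 0.9649, precision 1.0, recall 0.9048, and F1 0.95. On a test set this small, each misclassified sample moves accuracy by just under 0.9 percentage points.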

Hyperparameter Tuning

Since Random Forest performed best, we optimize its hyperparameters using GridSearchCV:

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    # 'auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Set up GridSearchCV
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                          cv=5, n_jobs=-1, scoring='f1', verbose=2)

# Fit grid search
grid_search.fit(X_train, y_train)

# Best parameters and performance
print("Best parameters:", grid_search.best_params_)
print("Best F1 Score:", grid_search.best_score_)

# Evaluate tuned model
best_rf = grid_search.best_estimator_
y_pred_tuned = best_rf.predict(X_test)
print(classification_report(y_test, y_pred_tuned, target_names=['Benign', 'Malignant']))

Optimal Hyperparameters:

bootstrap: False
max_depth: None
max_features: sqrt
min_samples_leaf: 1
min_samples_split: 5
n_estimators: 100

Cross-validation F1 Score: 0.956
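
Before launching a grid search, it is worth checking how many model fits it implies. The sketch below counts the candidates for a grid of the same shape as the one above. The name `demo_grid` is illustrative:

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the tutorial's grid: 3 x 3 x 2 x 2 x 2 x 2 value choices
demo_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5
# GridSearchCV fits every candidate once per fold, then refits the winner once
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

The grid yields 144 candidates and 720 fits with 5-fold cross-validation. That is manageable for 569 samples, but the count grows multiplicatively with every added parameter, so on larger datasets a randomized search may be the more practical choice.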

Results Analysis

Performance Metrics

We add the tuned Random Forest to the comparison table and re-rank the models:

# Add tuned model to comparison
report = classification_report(y_test, y_pred_tuned, output_dict=True)
acc = best_rf.score(X_test, y_test)

df_metrics.loc[len(df_metrics.index)] = {
    'Model': 'Tuned Random Forest',
    'Accuracy': round(acc, 4),
    'Precision (Malignant)': round(report['1']['precision'], 4),
    'Recall (Malignant)': round(report['1']['recall'], 4),
    'F1-Score (Malignant)': round(report['1']['f1-score'], 4),
}

# Final comparison
df_metrics_final = df_metrics.sort_values(by='F1-Score (Malignant)', ascending=False)
print(df_metrics_final)

Final Model Performance:

Accuracy: 97.37% (Best)
Precision: 100% (No false positives)
Recall: 90.48% (Catches most malignant cases)
F1-Score: 95.00% (Excellent balance)
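
These figures come from a single test split of 114 samples, so they can shift noticeably with a different random split. As a complementary check, here is a sketch of a stratified 5-fold cross-validation over all 569 samples using the hyperparameters reported above. It uses scikit-learn's bundled copy of the data, and `rf_demo` is an illustrative name rather than the notebook's `best_rf`:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = load_breast_cancer()
X_demo, y_demo = data.data, 1 - data.target  # malignant = 1

# Hyperparameters copied from the "Optimal Hyperparameters" list
rf_demo = RandomForestClassifier(
    n_estimators=100, max_depth=None, max_features="sqrt",
    min_samples_leaf=1, min_samples_split=5, bootstrap=False, random_state=42,
)

# Each fold keeps roughly the same malignant/benign ratio as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {f1_scores.round(3)}")
print(f"Mean F1: {f1_scores.mean():.3f} +/- {f1_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much of the difference between two models could be due to the split alone.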

Feature Importance Analysis

Understanding which features drive our model's predictions:

import seaborn as sns

# Get feature importances
importances = best_rf.feature_importances_
features = X.columns
feature_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_df.sort_values(by='Importance', ascending=False, inplace=True)

# Create visualization
plt.figure(figsize=(8, 5))
sns.barplot(x='Importance', y='Feature', data=feature_df.head(10), palette='viridis')
plt.title('Top 10 Feature Importances - Tuned Random Forest')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig("results/figures/feature_importances_breast_cancer_tuned_rf.png", dpi=600)
plt.show()

print("Top 10 Most Important Features:")
print(feature_df.head(10))

Top Contributing Features:

concave_points_worst: Most discriminative feature
perimeter_worst: Tumor boundary characteristics
concave_points_mean: Average concave point measurements
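radius_worst: Mean of the three largest radius values
area_worst: Mean of the three largest area values

Notice that the "worst" features (in this dataset, the mean of the three largest values measured for each image) dominate the ranking, which suggests that the most extreme cell nuclei in a sample carry much of the signal for malignancy.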
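
Impurity-based importances such as `feature_importances_` are computed on the training data and tend to split credit among strongly correlated features (radius, perimeter, and area are closely related here). Permutation importance on held-out data is a common complement. The sketch below is an illustration on scikit-learn's bundled copy of the dataset with a default Random Forest, not a re-analysis of the tuned model:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, 1 - data.target  # malignant = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out split and record the drop in F1
perm = permutation_importance(
    rf_demo, X_te, y_te, scoring="f1", n_repeats=5, random_state=42
)
perm_df = pd.DataFrame(
    {"Feature": X_demo.columns, "Mean F1 drop": perm.importances_mean}
).sort_values("Mean F1 drop", ascending=False)
print(perm_df.head(10))
```

If the two rankings broadly agree, that strengthens the interpretation. If a feature ranks high on impurity but near zero on permutation, its information is probably also available through a correlated feature.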

Model Comparison Visualization

Below is the performance comparison between our baseline and best model:

# Create comparison plot
baseline = df_metrics[df_metrics['Model'] == 'Logistic Regression'].iloc[0]
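# Select the tuned model by name; iloc[0] could return another model if F1 scores tie
best = df_metrics_final[df_metrics_final['Model'] == 'Tuned Random Forest'].iloc[0]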

compare_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Baseline (LogReg)': [
        baseline['Accuracy'], baseline['Precision (Malignant)'],
        baseline['Recall (Malignant)'], baseline['F1-Score (Malignant)']
    ],
    'Best (Tuned RF)': [
        best['Accuracy'], best['Precision (Malignant)'],
        best['Recall (Malignant)'], best['F1-Score (Malignant)']
    ]
})

# Plot comparison
compare_df.set_index('Metric').plot(kind='bar', figsize=(8, 5), colormap='Set2')
plt.title('Baseline vs Best Model Performance')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.ylim(0.7, 1.05)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.savefig("results/figures/bar_graph_breast_cancer_model_comparison.png", dpi=600)
plt.show()

Model Deployment

The final step involves saving our trained model and creating a deployment-ready application:

import pickle

# Save the optimized model
with open("results/models/breast_cancer_tuned_model.pkl", "wb") as f:
    pickle.dump(best_rf, f)

# Load model for inference (example)
with open("results/models/breast_cancer_tuned_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
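
A quick sanity check after saving is to confirm that the restored model predicts exactly what the original did. The sketch below does the round trip in memory on a small stand-in model (`rf_demo`, an illustrative name), so nothing is written to disk:

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X_demo, y_demo = data.data, 1 - data.target  # malignant = 1
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Serialize to bytes and back, the in-memory equivalent of dump() followed by load()
restored = pickle.loads(pickle.dumps(rf_demo))

same_predictions = np.array_equal(rf_demo.predict(X_demo), restored.predict(X_demo))
print("Round trip preserves predictions:", same_predictions)
```

Two practical notes: pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and a model should be loaded with the same scikit-learn version that saved it.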

Desktop Application

The project includes a Tkinter-based desktop application that provides a user-friendly interface for medical professionals to input patient data and receive instant predictions.

Key Features:

Input form for all 30 diagnostic features
Real-time prediction with confidence scores
Easy-to-interpret results display
Professional medical interface design
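
The application code is not shown in this article, so the following is only a sketch of the prediction step such an interface might call once the form is filled in. The helper `predict_case` and its arguments are hypothetical, and the model is a stand-in trained on scikit-learn's bundled copy of the dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, 1 - data.target  # malignant = 1
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)


def predict_case(model, feature_values, feature_names):
    """Hypothetical helper: return (label, confidence) for one case."""
    missing = [name for name in feature_names if name not in feature_values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Build a one-row frame in the column order the model was trained on
    row = pd.DataFrame(
        [[float(feature_values[name]) for name in feature_names]],
        columns=feature_names,
    )
    proba = model.predict_proba(row)[0]
    # Look up the malignant column via classes_ rather than assuming its position
    p_malignant = float(proba[list(model.classes_).index(1)])
    label = "Malignant" if p_malignant >= 0.5 else "Benign"
    confidence = max(p_malignant, 1.0 - p_malignant)
    return label, confidence


# Simulate form input with the first row of the dataset
form_input = X_demo.iloc[0].to_dict()
label, confidence = predict_case(rf_demo, form_input, list(X_demo.columns))
print(f"{label} ({confidence:.1%} confidence)")
```

Keeping this logic in a plain function, separate from the Tkinter widgets, makes it testable without a display. The "confidence" here is simply the fraction of trees voting for the predicted class, which is not necessarily a calibrated probability.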

Conclusion

This project demonstrates a complete machine learning workflow for medical diagnosis, reaching 97.37% accuracy on a held-out test set of 114 samples. Key takeaways include:

🎯 Key Insights:

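Model Selection: Random Forest had the highest accuracy and F1-score of the five algorithms compared. The other models were trained on unscaled features, which tends to disadvantage SVM, K-Nearest Neighbors, and Logistic Regression
Feature Importance: "Worst" measurements proved most discriminative for malignancy detection
Hyperparameter Tuning: GridSearchCV raised test accuracy from 96.49% to 97.37%, which corresponds to one additional correct prediction out of 114 test samples
Medical Relevance: 100% precision means no benign case was flagged as malignant on the test set. In a diagnostic setting recall matters at least as much, because a false negative is a missed cancer
Deployment: The tuned model is serialized with pickle and wrapped in a Tkinter desktop application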
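
The balance between precision and recall is not fixed by the model. It depends on the probability threshold used to call a case malignant, and `predict()` uses 0.5 by default. The sketch below compares two thresholds on a stand-in model trained on scikit-learn's bundled copy of the dataset. The value 0.3 is purely illustrative and not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_demo, y_demo = data.data, 1 - data.target  # malignant = 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba corresponds to class 1 (malignant)
p_malignant = rf_demo.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):  # illustrative values, not tuned
    y_hat = (p_malignant >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision may fall. In practice the threshold should be chosen on a validation split rather than on the test set.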

🏥 Clinical Implications:

While this model shows excellent performance, it's important to note that machine learning should augment, not replace, clinical expertise. The model serves as a valuable decision support tool that can:

Assist clinicians in prioritizing cases for review
Provide second opinions for borderline cases
Support screening programs in resource-limited settings
Contribute to standardized diagnostic criteria

Resources & Further Reading

Code Repository

Complete Jupyter notebook with all code and data

View on GitHub

Dataset

Wisconsin Breast Cancer Diagnostic Dataset

UCI Repository

Documentation

Scikit-learn Random Forest Guide


About Dr. Sinan

Dr. Sinan is a Research Scientist and Machine Learning Engineer specializing in AI applications in healthcare. With extensive experience in developing ML solutions for medical diagnosis, he focuses on creating interpretable and clinically-relevant models.
