Event Details

Breast Cancer Prediction Using Machine Learning Algorithms

Presenter: Zeeshan Ali Shahzad
Supervisor:

Date: Thu, February 15, 2024
Time: 11:00:00 - 00:00:00
Place: Zoom - Please see below

Zoom Details:

Join Zoom Meeting:

https://uvic.zoom.us/j/89577894429?pwd=MHBrclBTUmwwMU1uVGpMNWxMWEUvUT09

Meeting ID: 895 7789 4429

Password: 259673

One tap mobile

+17789072071,,89577894429# Canada

+16475580588,,89577894429# Canada

Dial by your location

        +1 778 907 2071 Canada

        +1 647 558 0588 Canada

Meeting ID: 895 7789 4429

Find your local number: https://uvic.zoom.us/u/kiW9uNm3I

Note: Please log in to Zoom via SSO and your UVic Netlink ID

Abstract: 

Breast cancer is a pressing global health issue, and its prevalence is steadily increasing worldwide. The rise in cases is a cause for concern because it affects the physical and emotional well-being of individuals and places a significant burden on the healthcare system. Early detection and timely intervention are critical to combating this disease: predicting and diagnosing breast cancer at its earliest stages can profoundly improve patient outcomes and potentially save many lives. In recent years, Machine Learning (ML) has become increasingly important in healthcare. This study examines the utility of supervised ML models for breast cancer prediction using the publicly available Breast Cancer Wisconsin (Diagnostic) dataset from the University of California Irvine (UCI) ML repository. Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors (KNN) classifiers are implemented in Python using Jupyter Notebook.
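
As a rough illustration of the setup described above, the following sketch loads the dataset and instantiates the six classifiers with scikit-learn. It is not the presenter's code. It assumes that the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn stands in for the UCI download, that Gaussian Naive Bayes is the Naive Bayes variant, and that default hyperparameters are acceptable; none of these choices are stated in the abstract.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn ships a copy of the Breast Cancer Wisconsin (Diagnostic) data.
data = load_breast_cancer()
X, y = data.data, data.target

# The six classifiers named in the abstract. Hyperparameters are assumptions:
# defaults everywhere, except a higher max_iter so Logistic Regression
# converges on unscaled features.
models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),  # assumed variant; the abstract does not say
    "KNN": KNeighborsClassifier(),
}

# Report the dataset size and the class balance.
print("samples, features:", X.shape)
print("class counts:", dict(zip(data.target_names, np.bincount(y))))
```

The printed class counts show that the two target labels are not equally represented.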

The proposed methodology comprises several steps aimed at accurate breast cancer prediction. First, the dataset is cleaned by removing duplicates and handling null or missing values. Next, the Synthetic Minority Oversampling Technique (SMOTE) is applied to balance the target labels. Principal Component Analysis (PCA) is then used to reduce the dimensionality of the dataset, with the number of components varied (n = 2, 5, 10, 15). To assess the impact of the train/test ratio on model performance, the ML models are trained and tested on five data splits: 80/20, 70/30, 50/50, 30/70, and 20/80.
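
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `smote_oversample` is a hypothetical helper that reproduces only the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), the standardization step is an assumption not mentioned in the abstract, and the order of operations simply mirrors the order in which the steps are listed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_oversample(X, y, k=5, seed=0):
    """Minimal SMOTE-style oversampling for a binary target (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = int(counts.max() - counts.min())
    X_min = X[y == minority]
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Pick a random minority sample and one of its k neighbours per new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place each synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # assumption: scaling is not in the abstract
X_bal, y_bal = smote_oversample(X, y)

COMPONENTS = (2, 5, 10, 15)             # PCA component counts from the abstract
TEST_SIZES = (0.2, 0.3, 0.5, 0.7, 0.8)  # 80/20, 70/30, 50/50, 30/70, 20/80

# One (X_train, X_test, y_train, y_test) tuple per configuration.
splits = {}
for n in COMPONENTS:
    X_pca = PCA(n_components=n).fit_transform(X_bal)
    for test_size in TEST_SIZES:
        splits[(n, test_size)] = train_test_split(
            X_pca, y_bal, test_size=test_size, stratify=y_bal, random_state=0
        )

print("balanced class counts:", np.bincount(y_bal))
print("configurations:", len(splits))
```

In practice, the `SMOTE` class from the imbalanced-learn package would normally replace the hand-written helper. A common alternative design fits the oversampler and PCA on the training portion only, so that no information from the test set influences them.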

Model performance is evaluated using accuracy, precision, recall, F1-score, and execution time. The results show that SVM and Logistic Regression outperform the other models. SVM achieves an accuracy of 98.2% with an execution time of 9.99 ms on an 80/20 split using 10 features, while Logistic Regression achieves an accuracy of 97.9% with an execution time of 8.42 ms on a 50/50 split using 15 features.
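
The five metrics can be computed along the following lines. This is a sketch only: `evaluate` is a hypothetical helper, "execution time" is assumed to mean the combined fit and predict time (the abstract does not define it), and SMOTE and PCA are left out for brevity, so the numbers it prints are not expected to match the figures reported above.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a model, predict on the test set, and return the five metrics."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed_ms = (time.perf_counter() - start) * 1000.0  # assumed: fit + predict
    # Precision, recall, and F1 use label 1 as the positive class, which is
    # "benign" in scikit-learn's encoding; the study's convention is not stated.
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "time_ms": elapsed_ms,
    }


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # the 80/20 split
)

# The two best-performing models from the abstract, with assumed settings.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, metrics in results.items():
    print(name + ": " + ", ".join(f"{k}={v:.3f}" for k, v in metrics.items()))
```

Timings measured this way depend on the hardware and on the library versions, so they are only comparable within a single run.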