In this section, the methods, techniques, and approaches used to identify optimal algorithms for breast cancer prediction are discussed. The architecture that gives an overall description of this study is presented, and its components are described sequentially. The machine learning algorithms and data analysis tools used in this study are also presented.
Dataset Description
The breast cancer dataset used for this study was collected from Hiwot Fana Specialized University Hospital and Tikur Anbessa Specialized Hospital. It consists of secondary data collected from patient cards and contains both nominal and numeric data types. In total, the dataset contains 1164 instances and 13 attributes plus a target class, which is the stage. These attributes are tumor size, metastasis, lymph node, age, number of children, marital status, habits, occupation, residence, previous surgery, ECOG, breastfeeding status, and comorbid illness. Of the total instances, 406 were collected from Hiwot Fana Specialized University Hospital and 758 from Tikur Anbessa Specialized Hospital. The data were recorded between 2019 and 2024.
The Proposed Architecture
The main objective of this study is to identify the most effective and optimal algorithm for predicting breast cancer stages. We therefore applied various machine learning classifiers, such as random forest, logistic regression, and decision tree, as well as hybrid machine learning approaches involving random forest, gradient boosting, decision tree, and support vector machine classifiers. These algorithms were applied to the breast cancer dataset, and the results were evaluated to determine the model that provided the highest accuracy. The proposed method begins with data acquisition, followed by pre-processing, which involves tasks such as data cleaning, handling missing values, attribute selection, and feature selection. The prepared data is used to create a model that predicts breast cancer stages. To evaluate the model's performance, we provided the model with held-out labeled data: the cleaned and labeled data was split using the train_test_split method and 10-fold cross-validation, with 80% of the data (the training set) used to build the machine learning model and the remaining 20% (the test set) used to assess how well the model works. After testing the models, the obtained results were compared to select the algorithm that provides the highest accuracy and is most predictive of breast cancer stages. These comparisons were based on metrics such as recall, precision, and F-measure. The proposed architecture is presented in Fig. 1.
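As a minimal sketch of this workflow (a scikit-learn implementation is assumed, and the file name breast_cancer.csv and its column layout are hypothetical, since the paper does not publish its code):

```python
# Minimal sketch of the proposed workflow, assuming a scikit-learn
# implementation and a hypothetical "breast_cancer.csv" file whose
# columns are the 13 encoded attributes plus the "Stage" target.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("breast_cancer.csv")
X = data.drop(columns=["Stage"])  # the 13 predictor attributes
y = data["Stage"]                 # target class: breast cancer stage

# 80/20 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train one candidate classifier and evaluate it on the held-out 20%.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```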
Breast Cancer Dataset Preparation
Data preparation is the set of methods that initialize and prepare the data to serve as input for a given algorithm (García et al., 2015). It can be considered a mandatory step that converts scattered data into a form that fits the pattern extraction process; in other words, it mainly deals with converting scattered data into appropriate formats that facilitate the creation of a model. To facilitate data analysis, the collected data was encoded, saved in CSV format, and structured in a way that was easy to understand. In this study, after the dataset was prepared, data pre-processing tasks such as data cleaning were carried out, and the data was organized in a form suitable for the classification algorithms.
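For illustration, one common way to encode such nominal fields and save them in CSV format might look as follows (a sketch assuming pandas; the sample records are hypothetical):

```python
import pandas as pd

# Hypothetical raw records with nominal fields, as collected from
# patient cards; the values mirror those listed in Table 1.
raw = pd.DataFrame({
    "Metastasis": ["M0", "M1", "M0"],
    "Residence": ["Urban", "Rural", "Urban"],
    "Stage": ["II", "III", "I"],
})

# Encode each nominal column as integer codes so that the
# classification algorithms can consume the data.
encoded = raw.apply(lambda col: col.astype("category").cat.codes)

# Save the encoded data in CSV format for later analysis.
encoded.to_csv("breast_cancer_encoded.csv", index=False)
```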
Data Preprocessing
Data preprocessing is a fundamental step in any data analysis. It comprises key activities such as data cleaning, integration, transformation, and data reduction (Khan et al., 2019). Data obtained from different sources often contains missing values, noisy data, or conflicting information. Therefore, the selected dataset should be pre-processed so that it can be reliably classified by the classification model (Vujović, 2021). In this study, the following data preprocessing activities were carried out:
Data Cleaning
Data collected from different sources is prone to errors, incompleteness, and inconsistency for various reasons (Ziafat & Shakeri, 2014). Consequently, filling in missing values and reducing noisy data in the dataset are required. For example, if the age of a patient was not filled in, we derived it from the patient's date of birth if the patient was still alive. Noisy or irrelevant data introduced during data entry or labeling was also handled at this stage. For instance, the attribute metastasis takes the values M0 and M1, but during data entry some records were entered as "Mo" instead of "M0"; we therefore replaced "Mo" with "M0". Additionally, data cleaning is necessary to remove duplicate data in the dataset, so we carried out a data cleaning procedure on the gathered breast cancer dataset and removed duplicate instances. The total gathered breast cancer dataset contains 13 attributes plus the target class, with 1164 instances.
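A sketch of these cleaning steps, assuming pandas and hypothetical file and column names (the paper does not specify its tooling):

```python
import pandas as pd

# Hypothetical loading of the raw merged records from both hospitals.
data = pd.read_csv("breast_cancer_raw.csv")

# Fix the entry error noted in the text: "Mo" typed instead of "M0".
data["Metastasis"] = data["Metastasis"].replace({"Mo": "M0"})

# Fill missing ages from date of birth where available ("DateOfBirth"
# is a hypothetical column name; 2024 approximates the study's end).
dob = pd.to_datetime(data["DateOfBirth"], errors="coerce")
derived_age = (pd.Timestamp("2024-12-31") - dob).dt.days // 365
data["Age"] = data["Age"].fillna(derived_age)

# Remove duplicate instances from the dataset.
data = data.drop_duplicates()
```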
Feature Selection
Finding the ideal feature subset is a key component of data preprocessing. Feature selection can eliminate superfluous and redundant characteristics, increasing the model's accuracy (Chen et al., 2023). To find the optimal algorithm and improve the accuracy of the model, feature selection is very important. In this study, a feature selection process was conducted on the dataset to improve accuracy and reduce its dimensionality. Attribute selection focuses on choosing a subgroup of attributes from all attributes, minimizing irrelevant attributes while improving performance. The breast cancer patient dataset contains many attributes that have no effect when creating a model, as well as very crucial attributes that play a significant role in determining the stages of breast cancer. Attributes without a significant effect on staging include the patient's medical record number, name, phone number, region, birthplace, and house number. All the collected records are from women with breast cancer. In total, 13 attributes and the target class were selected to create the classification model; all of these attributes are needed as input values for classifying breast cancer into stages, though they do not have equal significance in determining the stage.
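A sketch of this selection step under stated assumptions (pandas and scikit-learn; the file, column names, and the mutual-information criterion are all hypothetical, as the paper does not name a specific selection method):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("breast_cancer_encoded.csv")  # hypothetical file

# Drop identifying fields with no predictive value for staging
# (the column names follow the examples given in the text).
identifiers = ["MedicalRecordNumber", "Name", "PhoneNumber",
               "Region", "Birthplace", "HouseNumber"]
data = data.drop(columns=identifiers, errors="ignore")

# Rank the remaining attributes by mutual information with the stage
# label -- one plausible relevance measure, since the paper does not
# name a specific selection criterion.
X, y = data.drop(columns=["Stage"]), data["Stage"]
scores = pd.Series(mutual_info_classif(X, y), index=X.columns)
print(scores.sort_values(ascending=False))
```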
Table 1
Description of selected attributes according to their importance
S/N | Attributes | Data type | Possible Values | Description |
1. | Tumor size | Nominal | T0, T1, T2, T3, T4 | It refers to the size of the tumor |
2. | Metastasis | Nominal | M0, M1 | The presence or absence of metastasis |
3. | Lymph Node | Nominal | N0, N1, N2, N3 | Lymph Nodes values |
4. | Age | Numerical | 1, 2, 3, 4, 5… | Age of the patient |
5. | Children | Numerical | 1, 2, 3… | Number of children the patient has |
6. | Marital Status | Nominal | Single, Married, Divorced, Widowed | Patient marital status |
7. | Habits | Nominal | Smoking, Alcohol, Khat, other | Habits the patients have |
8. | Occupation | Nominal | Housewife, Employee | Patient's occupation (lifestyle) |
9. | Residence | Nominal | Urban, rural | Place where patient lives |
10. | Previous Surgery | Nominal | Yes, No | Past history of the patient |
11. | ECOG | Nominal | I, II…V | Level of functioning in terms of daily activities and physical abilities during cancer clinical trials |
12. | Breast Feeding | Nominal | Yes, No | Breast feeding status |
13. | Comorbid illness | Nominal | Yes, No | Other diseases the patient may have |
14. | Stage | Nominal | Stage 0, I, II, III, IV | Target class; determined by the values of tumor size, metastasis, lymph node, and the other attributes |
Table 1 above shows the selected attributes used to determine the stages of breast cancer. The four machine learning approaches, namely Random Forest, Decision Tree, Logistic Regression, and the hybrid machine learning model, were experimented on the prepared cancer dataset and used to predict the stages of breast cancer based on the values of these selected attributes.
Machine Learning Algorithms
This paper examines and evaluates the most prominent machine learning algorithms used to predict breast cancer stages based on the values of the selected attributes. The experimented machine learning algorithms are described in detail as follows:
Random Forests
Random forest is a versatile machine learning approach for various tasks, including classification, regression, and prediction, that creates a large number of decision trees during training (Chaurasia et al., 2021). In this study, random forest is used as an ensemble method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' classifications (or, for regression, their mean prediction). RF is also used to determine the importance of the features that predict the stages of breast cancer. The following steps are followed when working with the random forest algorithm (Srivenkatesh, 2020):
Step 1: First, choose random samples from the given dataset.
Step 2: Next, the algorithm builds a decision tree for each sample and obtains a prediction result from each tree.
Step 3: In this step, voting is performed over the predicted results.
Step 4: Finally, the prediction with the most votes is selected as the final prediction result.
The above-listed steps were also followed in this study during experimentation with random forest on the breast cancer dataset.
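A minimal sketch of these steps with scikit-learn's RandomForestClassifier, continuing from the train-test split sketch above (the hyperparameter values are assumptions, not values reported by the paper):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Continues from the train-test split sketch above.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Each tree votes; the forest predicts the majority class (Step 4).
stage_predictions = rf.predict(X_test)

# Feature importances, used here to rank the attributes for staging.
importance = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False))
```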
Logistic Regression
Logistic regression, a very powerful modeling tool, is a generalization of linear regression. It is used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). Both simple and multiple logistic regression assess the association between independent variables (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable. It is used primarily for predicting binary or multiclass dependent variables. In this study, the 13 independent attributes served as predictors, and the dependent attribute, stage, was predicted from their values.
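A brief sketch of this setup with scikit-learn, continuing from the split sketch above (the solver settings are assumptions):

```python
from sklearn.linear_model import LogisticRegression

# Multiclass logistic regression over the 13 independent attributes,
# predicting the dependent variable (stage). max_iter is a
# hypothetical convergence setting, not a value from the paper.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Estimated probability of each stage for every test instance.
stage_probabilities = logreg.predict_proba(X_test)
```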
Decision Tree
A decision tree is a classification algorithm that concludes the value of a dependent (target) attribute given the values of the independent (input) attributes (Bhargava et al., 2013). These authors noted that the decision-tree classification algorithm is useful in several fields, including data and text mining, information extraction, machine learning, and pattern recognition. We used a DT to predict the stages of breast cancer because of its simplicity and compatibility with our dataset: since our data combines nominal, numerical, and textual values, the DT algorithm is appropriate for analyzing it and can handle missing values. Its performance was also better according to previous studies. Thus, the decision tree method is appropriate for this prediction task, since our dataset can be correctly classified with the fewest possible nodes.
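A minimal sketch, again assuming scikit-learn and continuing from the split above:

```python
from sklearn.tree import DecisionTreeClassifier

# A single decision tree; max_depth is a hypothetical regularization
# choice to keep the tree small, not a value taken from the paper.
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print("Depth:", dt.get_depth(), "| leaves:", dt.get_n_leaves())
```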
Hybrid Machine Learning Algorithms
This is a method of combining various machine learning algorithms or approaches to produce a more reliable and accurate model. In hybrid machine learning, one algorithm can compensate for the drawbacks of another, leveraging the strengths of the different methods to improve overall performance. In this study, hybrid machine learning approaches involving random forest, decision tree, and logistic regression were used.
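Since the text does not specify how the algorithms are combined, the following sketch shows one plausible realization: a soft-voting ensemble of the three named classifiers in scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# A soft-voting ensemble of the three named algorithms -- one
# plausible realization of the hybrid, since the exact combination
# scheme is not specified in the text.
hybrid = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the predicted class probabilities
)
hybrid.fit(X_train, y_train)
```

Stacking, in which a meta-learner is trained on the base models' outputs, would be another common realization of such a hybrid.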
The procedures followed when using these ML algorithms are summarized in the proposed architecture: breast cancer data is collected from the two hospitals; preprocessing tasks such as cleaning and handling missing values are performed; feature selection then takes place to select the top features that predict the stages of breast cancer; and the resulting quality data is split into 80% for training and 20% for testing. The training dataset was used to initialize and train the model, after which predictions were made on the test set. Finally, performance evaluation was carried out using metrics such as accuracy, precision, recall, and F1-score.
Performance Evaluation
We used various methods to evaluate the developed model for predicting the stages of breast cancer. When comparing classification algorithms, the value of each metric must be properly understood (Tharwat, 2018). The confusion matrix is one of the methods used to evaluate the performance of the models. It is an evaluation tool for classification problems where the output can be two or more classes. A confusion matrix is a table with two dimensions, "Actual" and "Predicted", and its cells record the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)". Accuracy is the most common performance metric for classification algorithms; it is defined as the number of correct predictions as a ratio of all predictions made.
Precision, originally used in document retrieval, can be defined as the fraction of items returned by the ML model that are correct. Sensitivity (recall) is defined as the fraction of actual positives returned by the ML model. The F1 score is the harmonic mean of precision and sensitivity. Therefore, the evaluation metrics accuracy, precision, recall, and F1-score were used. The formulas used to evaluate the performance of these algorithms are as follows (Tharwat, 2018):
$$Precision=\frac{TP}{TP+FP}\quad(1)$$

$$Recall\:(Sensitivity)=\frac{TP}{TP+FN}\quad(2)$$

$$F1\:score=\frac{2PR}{P+R}\quad(3)$$

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\quad(4)$$

where P and R denote precision and recall, respectively.
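As a sketch of how these metrics can be computed with scikit-learn's metrics utilities (continuing from the hybrid model sketch above):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Continues from the hybrid model sketch above. The confusion matrix
# tabulates actual versus predicted stages; the report derives the
# per-class precision, recall, and F1 from its TP/FP/FN/TN counts.
y_pred = hybrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```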
We used two test modes (split-test and 10-fold cross-validation) with several metrics, including recall, F1 score, accuracy, and precision, to analyze our experimental results, which are presented in Figs. 2 and 3. We used the split-test method because of its simplicity in model performance evaluation: it divides the dataset into a training set and a test set, is easy to implement and understand, and allows a quick evaluation of the model's performance on unseen data, giving an immediate sense of how well the model generalizes. 10-fold cross-validation is generally preferred because of its robustness and reliability: it provides a more comprehensive evaluation of the model's performance and makes efficient use of the available data, since each data point is used for both training and testing. This evaluation mode is appropriate for and compatible with the size of our dataset. Furthermore, the same criteria were used to evaluate all the ML models, ensuring the validity of this comparison.
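A sketch of the 10-fold mode with scikit-learn, continuing from the earlier sketches (the stratified fold choice is an assumption, not stated in the text):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 10-fold cross-validation: each instance is used for both training
# and testing across the folds, as described in the text.
# Stratification preserves the stage distribution in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(hybrid, X, y, cv=cv, scoring="accuracy")
print("Mean 10-fold accuracy:", scores.mean())
```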