In this section, the methods, techniques, and approaches used to identify optimal algorithms for breast cancer prediction are discussed. The architecture that gives an overall description of this study is presented, and its components are described sequentially. The machine learning algorithms and data analysis tools used in this study are also presented.
Dataset Description
The breast cancer dataset used for this study was collected from Hiwot Fana Specialized University Hospital and Tikur Anbessa Specialized Hospital. It consists of secondary data collected from patient cards and contains both nominal and numeric data types. In total, the dataset contains 1164 instances and 13 attributes plus a target class, which is the stage. These attributes are tumor size, metastasis, lymph node, age, number of children, marital status, habits, occupation, residence, previous surgery, ECOG, breastfeeding status, and comorbid illness. Of the total instances, 406 were collected from Hiwot Fana Specialized University Hospital and 758 from Tikur Anbessa Specialized Hospital. The data were recorded between 2019 and 2024.
The Proposed Architecture
The main objective of this study is to identify the most effective and optimal algorithm for predicting breast cancer stages. We therefore applied various machine learning classifiers, such as random forest, logistic regression, and decision tree, as well as hybrid machine learning approaches involving random forest, gradient boosting, decision tree, and support vector machine classifiers. These algorithms were applied to the breast cancer dataset, and the results were evaluated to determine the model that provided the highest accuracy. The proposed method begins with data acquisition, followed by pre-processing, which involves tasks such as data cleaning, handling missing values, attribute selection, and feature selection. The prepared data is used to create a model that predicts breast cancer stages. To evaluate the model's performance, we provided the model with held-out labeled data: the cleaned and labeled data was split using the train_test_split method and 10-fold cross-validation, with 80% of the data (the training set) used to build the machine learning model and the remaining 20% (the test set) used to assess how well the model works. After testing the models, the obtained results were compared to select the algorithm that provides the highest accuracy and is most predictive of breast cancer stages. These comparisons were based on metrics such as recall, precision, and F-measure. The proposed architecture is presented in Fig. 1.
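As a minimal sketch of this workflow (a scikit-learn implementation is assumed, and the file name breast_cancer.csv and its column layout are hypothetical, since the paper does not publish its code):

```python
# Minimal sketch of the proposed workflow, assuming a scikit-learn
# implementation and a hypothetical "breast_cancer.csv" file whose
# columns are the 13 encoded attributes plus the "Stage" target.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("breast_cancer.csv")
X = data.drop(columns=["Stage"])  # the 13 predictor attributes
y = data["Stage"]                 # target class: breast cancer stage

# 80/20 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train one candidate classifier and evaluate it on the held-out 20%.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```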
Breast Cancer Dataset Preparation
Data preparation is the set of methods that initialize and prepare the data to serve as input for a given algorithm (García et al., 2015). It can be considered a mandatory step that converts scattered data into a form that fits the pattern extraction process; in other words, it mainly deals with converting scattered data into appropriate formats that facilitate the creation of a model. To facilitate data analysis, the collected data was encoded, saved in CSV format, and structured in a way that was easy to understand. In this study, after the dataset was prepared, data pre-processing tasks such as data cleaning were carried out, and the data was organized in a form suitable for the classification algorithms.
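For illustration, one common way to encode such nominal fields and save them in CSV format might look as follows (a sketch assuming pandas; the sample records are hypothetical):

```python
import pandas as pd

# Hypothetical raw records with nominal fields, as collected from
# patient cards; the values mirror those listed in Table 1.
raw = pd.DataFrame({
    "Metastasis": ["M0", "M1", "M0"],
    "Residence": ["Urban", "Rural", "Urban"],
    "Stage": ["II", "III", "I"],
})

# Encode each nominal column as integer codes so that the
# classification algorithms can consume the data.
encoded = raw.apply(lambda col: col.astype("category").cat.codes)

# Save the encoded data in CSV format for later analysis.
encoded.to_csv("breast_cancer_encoded.csv", index=False)
```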
Data Preprocessing
Data preprocessing is a fundamental step in any data analysis. It comprises key activities such as data cleaning, integration, transformation, and data reduction (Khan et al., 2019). Data obtained from different sources often contains missing values, noisy data, or conflicting information. Therefore, the selected dataset should be pre-processed so that it can be reliably classified by the classification model (Vujović, 2021). In this study, the following data preprocessing activities were carried out:
Data Cleaning
Data collected from different sources is prone to errors, incompleteness, and inconsistency for various reasons (Ziafat & Shakeri, 2014). Consequently, filling in missing values and reducing noisy data in the dataset are required. For example, if the age of a patient was not filled in, we derived it from the patient's date of birth if the patient was still alive. Noisy or irrelevant data introduced during data entry or labeling was also handled at this stage. For instance, the attribute metastasis takes the values M0 and M1, but during data entry some records were entered as "Mo" instead of "M0"; we therefore replaced "Mo" with "M0". Additionally, data cleaning is necessary to remove duplicate data in the dataset, so we carried out a data cleaning procedure on the gathered breast cancer dataset and removed duplicate instances. The total gathered breast cancer dataset contains 13 attributes plus the target class, with 1164 instances.
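A sketch of these cleaning steps, assuming pandas and hypothetical file and column names (the paper does not specify its tooling):

```python
import pandas as pd

# Hypothetical loading of the raw merged records from both hospitals.
data = pd.read_csv("breast_cancer_raw.csv")

# Fix the entry error noted in the text: "Mo" typed instead of "M0".
data["Metastasis"] = data["Metastasis"].replace({"Mo": "M0"})

# Fill missing ages from date of birth where available ("DateOfBirth"
# is a hypothetical column name; 2024 approximates the study's end).
dob = pd.to_datetime(data["DateOfBirth"], errors="coerce")
derived_age = (pd.Timestamp("2024-12-31") - dob).dt.days // 365
data["Age"] = data["Age"].fillna(derived_age)

# Remove duplicate instances from the dataset.
data = data.drop_duplicates()
```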
Feature Selection
Finding the ideal feature subset is a key component of data preprocessing. Feature selection can eliminate superfluous and redundant characteristics, increasing the model's accuracy (Chen et al., 2023). To find the optimal algorithm and improve the accuracy of the model, feature selection is very important. In this study, a feature selection process was conducted on the dataset to improve accuracy and reduce its dimensionality. Attribute selection focuses on choosing a subgroup of attributes from all attributes, minimizing irrelevant attributes while improving performance. The breast cancer patient dataset contains many attributes that have no effect when creating a model, as well as very crucial attributes that play a significant role in determining the stages of breast cancer. Attributes without a significant effect on staging include the patient's medical record number, name, phone number, region, birthplace, and house number. All the collected records are from women with breast cancer. In total, 13 attributes and the target class were selected to create the classification model; all of these attributes are needed as input values for classifying breast cancer into stages, though they do not have equal significance in determining the stage.
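A sketch of this selection step under stated assumptions (pandas and scikit-learn; the file, column names, and the mutual-information criterion are all hypothetical, as the paper does not name a specific selection method):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("breast_cancer_encoded.csv")  # hypothetical file

# Drop identifying fields with no predictive value for staging
# (the column names follow the examples given in the text).
identifiers = ["MedicalRecordNumber", "Name", "PhoneNumber",
               "Region", "Birthplace", "HouseNumber"]
data = data.drop(columns=identifiers, errors="ignore")

# Rank the remaining attributes by mutual information with the stage
# label -- one plausible relevance measure, since the paper does not
# name a specific selection criterion.
X, y = data.drop(columns=["Stage"]), data["Stage"]
scores = pd.Series(mutual_info_classif(X, y), index=X.columns)
print(scores.sort_values(ascending=False))
```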
Table 1
Description of selected attributes according to their importance
S/N | Attributes | Data type | Possible Values | Description |
1. | Tumor size | Nominal | T0, T1, T2, T3, T4 | It refers to the size of the tumor |
2. | Metastasis | Nominal | M0, M1 | The presence or absence of metastasis |
3. | Lymph Node | Nominal | N0, N1, N2, N3 | Lymph Nodes values |
4. | Age | Numerical | 1, 2, 3, 4, 5… | Age of the patient |
5. | Children | Numerical | 1, 2, 3… | Number of children the patient has |
6. | Marital Status | Nominal | Single, Married, Divorced, Widowed | Patient marital status |
7. | Habits | Nominal | Smoking, Alcohol, Khat, other | Habits the patients have |
8. | Occupation | Nominal | Housewife, Employee | Patient's occupation (lifestyle) |
9. | Residence | Nominal | Urban, rural | Place where patient lives |
10. | Previous Surgery | Nominal | Yes, No | Past history of the patient |
11. | ECOG | Nominal | I, II…V | Level of functioning in terms of daily activities and physical abilities during cancer clinical trials |
12. | Breast Feeding | Nominal | Yes, No | Breast feeding status |
13. | Comorbid illness | Nominal | Yes, No | Other diseases the patient may have |
14. | Stage | Nominal | Stage 0, I, II, III, IV | Target class; determined by the values of tumor size, metastasis, lymph node, and the other attributes |
Table 1 above shows the selected attributes used to determine the stages of breast cancer. The four machine learning approaches, namely Random Forest, Decision Tree, Logistic Regression, and the hybrid machine learning model, were experimented on the prepared cancer dataset and used to predict the stages of breast cancer based on the values of these selected attributes.
Machine Learning Algorithms
This paper examines and evaluates the most prominent machine learning algorithms used to predict breast cancer stages based on the values of the selected attributes. The experimented machine learning algorithms are described in detail as follows:
Random Forests
Random forest is a versatile machine learning approach for various tasks, including classification, regression, and prediction, that creates a large number of decision trees during training (Chaurasia et al., 2021). In this study, random forest is used as an ensemble method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' classifications (or, for regression, their mean prediction). RF is also used to determine the importance of the features that predict the stages of breast cancer. The following steps are followed when working with the random forest algorithm (Srivenkatesh, 2020):
Step 1: First, choose random samples from the given dataset.
Step 2: Next, the algorithm builds a decision tree for each sample and obtains a prediction result from each tree.
Step 3: In this step, voting is performed over the predicted results.
Step 4: Finally, the prediction with the most votes is selected as the final prediction result.
The above-listed steps were also followed in this study during experimentation with random forest on the breast cancer dataset.
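A minimal sketch of these steps with scikit-learn's RandomForestClassifier, continuing from the train-test split sketch above (the hyperparameter values are assumptions, not values reported by the paper):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Continues from the train-test split sketch above.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Each tree votes; the forest predicts the majority class (Step 4).
stage_predictions = rf.predict(X_test)

# Feature importances, used here to rank the attributes for staging.
importance = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False))
```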
Logistic Regression
Logistic regression, a very powerful modeling tool, is a generalization of linear regression. It is used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). Both simple and multiple logistic regression assess the association between independent variables (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable. It is used primarily for predicting binary or multiclass dependent variables. In this study, the 13 independent attributes served as predictors, and the dependent attribute, stage, was predicted from their values.
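A brief sketch of this setup with scikit-learn, continuing from the split sketch above (the solver settings are assumptions):

```python
from sklearn.linear_model import LogisticRegression

# Multiclass logistic regression over the 13 independent attributes,
# predicting the dependent variable (stage). max_iter is a
# hypothetical convergence setting, not a value from the paper.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Estimated probability of each stage for every test instance.
stage_probabilities = logreg.predict_proba(X_test)
```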
Decision Tree
A decision tree is a classification algorithm that concludes the value of a dependent (target) attribute given the values of the independent (input) attributes (Bhargava et al., 2013). These authors noted that the decision-tree classification algorithm is useful in several fields, including data and text mining, information extraction, machine learning, and pattern recognition. We used a DT to predict the stages of breast cancer because of its simplicity and compatibility with our dataset: since our data combines nominal, numerical, and textual values, the DT algorithm is appropriate for analyzing it and can handle missing values. Its performance was also better according to previous studies. Thus, the decision tree method is appropriate for this prediction task, since our dataset can be correctly classified with the fewest possible nodes.
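A minimal sketch, again assuming scikit-learn and continuing from the split above:

```python
from sklearn.tree import DecisionTreeClassifier

# A single decision tree; max_depth is a hypothetical regularization
# choice to keep the tree small, not a value taken from the paper.
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print("Depth:", dt.get_depth(), "| leaves:", dt.get_n_leaves())
```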
Hybrid Machine Learning Algorithms
This is a method of combining various machine learning algorithms or approaches to produce a more reliable and accurate model. In hybrid machine learning, one algorithm can compensate for the drawbacks of another, leveraging the strengths of the different methods to improve overall performance. In this study, hybrid machine learning approaches involving random forest, decision tree, and logistic regression were used.
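Since the text does not specify how the algorithms are combined, the following sketch shows one plausible realization: a soft-voting ensemble of the three named classifiers in scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# A soft-voting ensemble of the three named algorithms -- one
# plausible realization of the hybrid, since the exact combination
# scheme is not specified in the text.
hybrid = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the predicted class probabilities
)
hybrid.fit(X_train, y_train)
```

Stacking, in which a meta-learner is trained on the base models' outputs, would be another common realization of such a hybrid.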
The procedures followed when using these ML algorithms are summarized in the proposed architecture: breast cancer data is collected from the two hospitals; preprocessing tasks such as cleaning and handling missing values are performed; feature selection then takes place to select the top features that predict the stages of breast cancer; and the resulting quality data is split into 80% for training and 20% for testing. The training dataset was used to initialize and train the model, after which predictions were made on the test set. Finally, performance evaluation was carried out using metrics such as accuracy, precision, recall, and F1-score.
Performance Evaluation
We used various methods to evaluate the developed model for predicting the stages of breast cancer. When comparing classification algorithms, the value of each metric must be properly understood (Tharwat, 2018). The confusion matrix is one of the methods used to evaluate the performance of the models. It is an evaluation tool for classification problems where the output can be two or more classes. A confusion matrix is a table with two dimensions, "Actual" and "Predicted", and its cells record the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)". Accuracy is the most common performance metric for classification algorithms; it is defined as the number of correct predictions as a ratio of all predictions made.
Precision, originally used in document retrieval, can be defined as the fraction of items returned by the ML model that are correct. Sensitivity (recall) is defined as the fraction of actual positives returned by the ML model. The F1 score is the harmonic mean of precision and sensitivity. Therefore, the evaluation metrics accuracy, precision, recall, and F1-score were used. The formulas used to evaluate the performance of these algorithms are as follows (Tharwat, 2018):
$$Precision=\frac{TP}{TP+FP}\quad(1)$$

$$Recall\:(Sensitivity)=\frac{TP}{TP+FN}\quad(2)$$

$$F1\:score=\frac{2PR}{P+R}\quad(3)$$

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\quad(4)$$

where P and R denote precision and recall, respectively.
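As a sketch of how these metrics can be computed with scikit-learn's metrics utilities (continuing from the hybrid model sketch above):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Continues from the hybrid model sketch above. The confusion matrix
# tabulates actual versus predicted stages; the report derives the
# per-class precision, recall, and F1 from its TP/FP/FN/TN counts.
y_pred = hybrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```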
We used two test modes (split-test and 10-fold cross-validation) with several metrics, including recall, F1 score, accuracy, and precision, to analyze our experimental results, which are presented in Figs. 2 and 3. We used the split-test method because of its simplicity in model performance evaluation: it divides the dataset into a training set and a test set, is easy to implement and understand, and allows a quick evaluation of the model's performance on unseen data, giving an immediate sense of how well the model generalizes. 10-fold cross-validation is generally preferred because of its robustness and reliability: it provides a more comprehensive evaluation of the model's performance and makes efficient use of the available data, since each data point is used for both training and testing. This evaluation mode is appropriate for and compatible with the size of our dataset. Furthermore, the same criteria were used to evaluate all the ML models, ensuring the validity of this comparison.
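A sketch of the 10-fold mode with scikit-learn, continuing from the earlier sketches (the stratified fold choice is an assumption, not stated in the text):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 10-fold cross-validation: each instance is used for both training
# and testing across the folds, as described in the text.
# Stratification preserves the stage distribution in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(hybrid, X, y, cv=cv, scoring="accuracy")
print("Mean 10-fold accuracy:", scores.mean())
```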