Study population
This study included a total of 1518 subjects, identified from a hospital-based breast cancer registry at University Malaya Medical Centre, Kuala Lumpur, from January 1993 to June 2021. Ethics approval was obtained from the Medical Ethics Committee, University Malaya Medical Centre, Kuala Lumpur (MREC ID NO: #733.22). The study was conducted on female patients who had breast-conserving surgery (BCS) as their primary surgical treatment for breast cancer. Subjects were excluded if they had mastectomy as the primary surgical intervention for breast cancer.
Study variables
Patients’ demographics, clinicopathological characteristics, treatment methods and oncological outcomes are summarised in Table 1. Demographic and clinicopathological variables such as age, ethnicity, marital status, menopausal status, age at initial diagnosis, tumour histology, presence of lymphovascular invasion, Bloom and Richardson tumour grade, tumour size group, nodal status, receptor status (oestrogen receptor, progesterone receptor and HER2), luminal type and stage (AJCC 7th Edition) were collected. The treatment methods documented were type of BCS, initial BCS margin status, re-excision surgery, type of axillary surgery, chemotherapy, radiotherapy, intraoperative radiotherapy, hormonal therapy and targeted therapy. The study endpoints were locoregional recurrence (LR), distant metastasis and survival status. LR was defined as tumour recurrence in the ipsilateral breast or chest wall with tumour histology similar to that of the primary tumour, and/or tumour recurrence in the ipsilateral axillary, infraclavicular, supraclavicular or internal mammary lymph nodes [13]. Distant metastasis was defined as tumour metastasis identified at any distant site. Patient survival status at the end of follow-up [14] and cause of death were obtained from death certificates and the National Registration Department.
Table 1: Demographic, clinicopathological characteristics, treatments and oncological outcomes of BCS patients (n=1518)
| Characteristics | Category | Frequency, n (%) |
|---|---|---|
| Ethnicity | Malay | 429 (28.3) |
| | Chinese | 852 (56.1) |
| | Indian | 197 (13.0) |
| | Others | 40 (2.6) |
| Age at initial diagnosis | 30 and below | 57 (3.8) |
| | 31-40 | 234 (15.4) |
| | 41-50 | 548 (36.1) |
| | 51-60 | 408 (26.8) |
| | 61-70 | 206 (13.6) |
| | 71 and above | 65 (4.3) |
| Menopausal status | Premenopausal | 762 (55.9) |
| | Natural menopause | 502 (36.9) |
| | Surgical menopause | 98 (7.2) |
| Marital status | Married | 1,218 (85.9) |
| | Not married | 200 (14.1) |
| Staging | Stage 0 | 131 (9.0) |
| | Stage 1 | 662 (45.3) |
| | Stage 2 | 611 (41.8) |
| | Stage 3 | 52 (3.6) |
| | Stage 4 | 6 (0.4) |
| Tumour histology | Invasive ductal | 1,222 (80.5) |
| | Invasive lobular | 32 (2.1) |
| | Ductal carcinoma in situ | 123 (8.1) |
| | Others | 137 (9.0) |
| | Phyllodes/Sarcoma | 4 (0.3) |
| Lymphovascular invasion | Yes | 341 (29.0) |
| | No | 833 (71.0) |
| Tumour grading | Grade 1 | 216 (18.2) |
| | Grade 2 | 592 (49.9) |
| | Grade 3 | 379 (31.9) |
| Tumour size group | <2cm | 707 (49.1) |
| | 2-5cm | 682 (47.4) |
| | >5cm | 51 (3.5) |
| Nodal status | 0 | 1,085 (74.1) |
| | 1-3 | 345 (23.5) |
| | 4-9 | 24 (1.6) |
| | 10 or above | 11 (0.8) |
| ER status | Positive | 999 (70.4) |
| | Low Positive | 4 (0.3) |
| | Negative | 416 (29.3) |
| PR status | Positive | 806 (63.2) |
| | Negative | 470 (36.8) |
| HER2 status | Positive | 331 (25.1) |
| | Negative | 777 (58.8) |
| | Equivocal | 213 (16.1) |
| Luminal type | Luminal A | 590 (55.2) |
| | Luminal B | 204 (19.1) |
| | HER2 enriched | 104 (9.7) |
| | Triple negative | 171 (16.0) |
| Type of BCS | WLE | 821 (79.9) |
| | HWLB | 194 (18.9) |
| | Oncoplastic | 12 (1.2) |
| Margin involvement | Clear | 751 (77.9) |
| | Involved | 90 (9.3) |
| | Close | 123 (12.8) |
| Re-excision | Yes | 59 (5.4) |
| | No | 1,029 (94.6) |
| Axillary Surgery | None | 186 (12.8) |
| | SLNB | 168 (11.6) |
| | Axillary Dissection | 1,089 (74.8) |
| | SLNB to Axillary dissection | 11 (0.8) |
| Chemotherapy | Yes | 797 (54.0) |
| | No | 680 (46.0) |
| Radiotherapy | Yes | 1,274 (87.4) |
| | No | 184 (12.6) |
| Intraoperative Radiotherapy | Yes | 56 (5.1) |
| | No | 1,032 (94.9) |
| Hormonal Therapy | Yes | 992 (68.2) |
| | No | 463 (31.8) |
| Targeted Therapy | Yes | 23 (2.1) |
| | No | 1,063 (97.9) |
| Locoregional recurrence | Yes | 141 (13.0) |
| | No | 947 (87.0) |
| Distant metastasis | Yes | 110 (10.1) |
| | No | 979 (89.9) |
| Survival status* | Alive | 1,134 (78.7) |
| | Dead | 307 (21.3) |
*Malaysian only
Unknown values for menopausal status (n=156), marital status (n=100), staging (n=56), lymphovascular invasion (n=344), tumour grading (n=331), tumour size (n=78), nodal status (n=53), ER status (n=99), PR status (n=242), HER2 status (n=197), luminal type (n=449), type of BCS (n=491), margin involvement (n=554), re-excision (n=430), chemotherapy (n=41), radiotherapy (n=60), IORT (n=430), hormonal therapy (n=63), targeted therapy (n=432), locoregional recurrence (n=430) and distant metastasis (n=429) are not presented in the table.
Figure 1 shows a flowchart of the number of participants included at each step of data collection and analysis. Data on baseline demographics, clinicopathological characteristics, treatment methods and oncological outcomes (local recurrence, distant metastasis and overall survival) were collected using SPSS Statistics, Version 28.0 (IBM).
Machine Learning
Several analytical techniques were combined to develop an accurate and effective machine learning model. Numerical data were standardised with the standard scaler technique, which transforms each variable to a distribution with zero mean and a standard deviation of one [15]. In addition, a one-hot encoder was applied to convert categorical variables into a corresponding numerical form. The dataset was then partitioned into a training set and a testing set at a ratio of 7:3. The training set was used to train and fine-tune the machine learning model, while the testing set was used to evaluate the performance of the constructed model. Evaluating on held-out data made it possible to check that the model was not overfitted to the training set and could generalise to new, unseen data [16].
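The preprocessing and splitting steps described above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the study's actual code: the toy data, the column names, the binary outcome, the random seeds and the stratified split are all hypothetical. The sketch also fits the scaler and encoder on the training portion only, which is a choice made here to keep test data out of the fitted statistics and may differ from the order used in the study.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age_at_diagnosis": rng.integers(25, 80, size=n),
    "tumour_size_cm": rng.uniform(0.5, 6.0, size=n),
    "ethnicity": rng.choice(["Malay", "Chinese", "Indian", "Others"], size=n),
    "tumour_grade": rng.choice(["Grade 1", "Grade 2", "Grade 3"], size=n),
})
y = rng.integers(0, 2, size=n)  # hypothetical binary outcome

numeric_cols = ["age_at_diagnosis", "tumour_size_cm"]
categorical_cols = ["ethnicity", "tumour_grade"]

# Partition into training and testing sets at a ratio of 7:3.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42, stratify=y
)

# Standard scaling for numerical columns, one-hot encoding for categorical ones.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training set only, then apply the same transform to the test set.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```

After fitting, the scaled training columns have zero mean and unit standard deviation, and each categorical variable is expanded into one indicator column per category.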
In this study, the XGBoost (Extreme Gradient Boosting) machine learning model was selected to construct the predictive model, as it is well suited to handling complex, non-linear relationships between input and output variables. To fine-tune the model on the training dataset, a grid search was used, which systematically evaluated different combinations of hyperparameters to identify the set that maximised model performance on the training data. The grid search was combined with 5-fold cross-validation to help prevent overfitting and to improve generalisation to new data. The optimal hyperparameters identified by the grid search included 550 estimators, a maximum tree depth of 9 and a gamma value of 0.5. The model was then trained with these hyperparameters and its performance was evaluated on the unseen testing dataset [17]. Lastly, Shapley values were used to calculate the impact of each feature on the model's output. This technique helps to explain the model's decision-making process and provides insight into the importance of each feature, allowing us to identify areas for improvement [18].