Study design
The present study was designed to evaluate the predictive value of a combination of parameters based on clinical data (e.g., age, family history of breast cancer, clinical T-stage, clinical N-stage, unifocal or multifocal tumor), histopathological findings (grade, Ki-67 IHC expression), molecular markers measured on pretreatment biopsy (e.g., GGIr, or isolated component of the GGIr: CDC2, CDC20, KPNA2, and MYBL2 gene expressions), and FDG-PET/CT imaging features (e.g., tumor SUVmax, lymph node SUVmax, metabolic tumor volume, radiomics features...). All features were collected in a large database to be processed by a machine learning algorithm.
Eligibility criteria were patients with stage II-III triple negative breast cancers (TNBC) scheduled for neoadjuvant chemotherapy, and with a baseline PET. Patients with distant metastases and patients with bilateral cancer were not included.
We first determined the ability of these parameters alone or in combination, to predict pCR (primary endpoint). Then, we tested if the predicted pCR was a surrogate marker of patient’s outcome (secondary endpoint).
The Institutional Review Board approved the study and stated that no informed consent was needed, considering the non-interventional design of this retrospective analysis (IRB # 00003835, French ethics committee Paris-Saint-Louis, # 2013-27NICB; NCT02600442).
Histo-pathological features and gene expression profiling
Breast cancers were diagnosed based on a ultrasound-guided core-needle biopsy. An experienced pathologist determined tumor type and histological grade was determined using the modified Scarff-Bloom-Richardson (SBR) grading for invasive carcinoma.
Tumors were defined as triple negative based on immuno-histochemical staining, using specific antibodies and an automated immunostainer (Ventana XT; Tucson, AZ, USA). Tumors were considered ER and PR negative if less than 10% of tumor cells expressed ER and PR. HER2 was over-expressed if HER2 immunostaining was uniform, with intense membrane staining of >30% of invasive tumor cells, following the recommendations of the period of analysis.
Ki67 score was analyzed by immunohistochemistry using MIB-1 antibody (Dako, Glostrup, Denmark) and quantified automatically by the image analysis software Hamamatsu NDP Analyze. The threshold for Ki67 positivity was 14% stained cells, whatever the staining intensity (18).
Total RNA extracted from frozen biopsy was used for molecular analysis. TP53 functional status was determined using a highly efficient yeast functional assay (FASAY) as previously published (19).
Gene expression analysis of Ki67, CDC2, CDC20, KPNA2 and MYBL2 were performed by RT-quantitative PCR. GGIr score assessment was obtained by combining the expression of the 4 genes (CDC2, CDC20, KPNA2 and MYBL2), covering all cell cycle phases as described (20). We analyzed the predictive value of Ki67 mRNA expression and of the reduced Genomic Grade Index (GGIr) as a continuous value for the association with pCR.
FDG-PET/CT Imaging acquisition
Patients fasted for 6 hours, and blood glucose level had to be less than 7 mmol/L. FDG (5 MBq/kg) was administered and imaging (from mid-thigh level to the base of the skull with the arms raised) started almost 60 minutes later. The Gemini XL PET/CT scanner (Philips Medical systems) was used. CT data was acquired first (120 kV; 100 mAs; no contrast-enhancement). PET emission data was acquired in a 3-dimensional mode, with 2 min. per bed position. The attenuation-corrected images were normalized for injected dose and body weight, and subsequently converted into Standardized Uptake Values (SUV), defined as: [tracer concentration (kBq/mL)] / [injected activity (kBq)/patient body weight (g)]. SUVmax was measured in the primary tumor and in the axillary lymph nodes if present.
Imaging processing and radiomics features
From the baseline PET/CT imaging data, the primary breast tumor of each patient was segmented in 3D through a semi-automatic segmentation method using 42% of SUVmax (Figure 1). The segmentation was performed by an experimented nuclear physician using the SOPHiA DDM for radiomics platform (Research Use Only; SOPHiA GENETICS SA; Switzerland). Radiomics features describing the tumor through its size, shape, voxel intensity distribution, and texture were then extracted following the IBSI standards (21). Metabolic tumor volume (MTV) was determined using the SOPHiA platform.
Neoadjuvant Chemotherapy Regimen
Some patients (the oldest treated) received EC-D (4 cycles of Epirubicin 75 mg/m² d1 plusCyclophosphamide 750 mg/m² d1 administered every 3 weeks, followed by 4 cycles ofDocetaxel 100 mg/m² d1 qw3). Patients from the more recent period received Epirubicin 75 mg/m² d1 plusCyclophosphamide 1200 mg/m² d1 every 2 weeks (SIM) for 6 cycles. After surgery, patients who received SIM chemotherapy received 3 cycles of Docetaxel (75mg/m2 d1 plus cyclophosphamide 750mg/m2 d1) every 3 weeks. The shift towards the use of dose dense, dose intense cyclophosphamide-anthracyclins (SIM) in the treatment of TNBC patients at Saint-Louis hospital, aimed at increasing pCR rates based on our previous data (22).
Pathology Assessment, follow-up and Event-free Survival
Pathologic complete response (pCR) was defined as no evidence of residual invasive cancer in breast tissues and lymph nodes (4). Absence of carcinoma in situ was not mandatory.
During neoadjuvant chemotherapy, patients underwent clinical examination every two cycles. After surgery, patients had follow-up visits every 4 months for two years, then twice yearly. Events included local, regional, or distant recurrences or death. Event-free survival (EFS) was defined as the period between the date of surgery and the date of the first event or the last follow-up.
Multimodal Data Aggregation
Clinical, histo-pathological, genomic, PET and radiomics features were aggregated in a large database. Categorical features were one-hot-encoded and numerical features were standardized to achieve a null mean and unit variance. A batch-effect correction for genomic expression data was achieved, inspired by the mean-only ComBat adjustment approach (23). The association between each feature independently with pCR outcome was assessed using non-parametric tests (Wilcoxon rank-sum test for continuous covariates, Fisher’s exact test for binary covariates).
A hierarchical clustering method was applied to reduce the number of radiomic features, with a bootstrap approach to determine the optimal number of clusters given the stability of the partitions. Six groups of radiomic features were defined, and only one feature per group was selected. Relevant non-radiomic features were selected based on their completion rate (less than 50%) and univariable feature-outcome analyses combined with clinical expertise. A single imputation by chained equations was then carried out to handle missing values in predictors using the MICE algorithm with randomized decision trees and 10 iterations (24).
Machine Learning Model Development
Several machine learning algorithms were then trained, including logistic regression models (with either LASSO, Ridge, or Elastic-net regularization), binary decision trees, support vector machines (with linear kernel) and random forests. Due to the small cohort size, a nested leave-pair-out cross-validation (LPOCV) approach (60 random pairs in the inner resampling, 756 (all) random pairs in the outer resampling) was used to correctly estimate the predictive performance of the models and select the best one (25). A grid search was applied with the Area Under the ROC Curve (AUC) as optimization criterion. Additional Materials give the grid of hyperparameters which was explored for each model. The uncertainty of the estimated predictive metrics (95% confidence intervals) and the comparison of performances obtained using various sets of data modalities (P-values) were quantified through a 10000-sample bootstrapping with outcome stratification over the pairs of patients used for the nested LPOCV.
The best ML model, according to the estimated predictive performances, was finally trained using a “standard” leave-pair-out cross-validation procedure. Global interpretability tools ensured a correct understanding, validation, and justification of the prediction model. In addition, Shapley Additive exPlanations (SHAP) values enabled to explain each patient-specific predicted probability of non-pCR.
EFS was estimated using the Kaplan-Meier method. The log-rank test was used to compare EFS among patients predicted pCR vs. non-pCR by the best ML model.
All statistical analyses were implemented in Python (version 3.8.0), using the scikit-learn (version 1.1.1) and lifelines (version 0.27.7) libraries.