Ethics approval
This study was approved by the ethics committee of each participating hospital. Informed consent was waived because of the retrospective nature of the study and the analysis used anonymous clinical and pathological data.
Study design
The overall study design is outlined in Fig. 1. We developed and validated an MRI-based radiomic model to predict DFS in osteosarcoma patients using datasets from multicentres. The generalization ability of radiomic model was validated on two independent external test sets. This work was carried out following the Methodological Radiomics Score (METRICS), a quality scoring tool for radiomics research endorsed by EuSoMII [19]. In addition, we investigated the biological significance of radiomic features using H&E- and IHC-stained WSIs from patients in the external test sets. Ten types of cell-level morphological features were automatically extracted from each nucleus in H&E WSIs and aggregated into a total of 150 patient-level features. Furthermore, hypoxia-related and immune-related TME biomarkers were obtained through IHC analysis. Spearman’s rank correlation analysis was conducted to evaluate the relationship between radiomic features and histopathological features.
Patient cohort and data acquisition
This retrospective study screened 896 osteosarcoma patients from 14 tertiary centres between January 2004 and January 2024. Inclusion criteria were as follows: 1) patients with histopathologically confirmed osteosarcoma based on the surgical specimen; 2) patients who underwent surgery with or without NAC; 3) patients who underwent contrast-enhanced MRI scans one or two weeks prior to treatment; 4) patients who had not received prior antitumor therapy before MRI examinations; and 5) patients without concurrent malignancies. Exclusion criteria were as follows: 1) patients lacking contrast-enhanced MRI images or with insufficient image quality (e.g., artifacts, low signal-to-noise ratio, and low resolution); 2) patients with less than six months of follow-up without disease progression. Finally, 270 patients were included. Figure 2 dispalys the patient recruitment pathway.
The distribution of case numbers across different centers was as follows: Centre 1 (n = 120), Centre 2 (n = 34), Centre 3 (n = 12), Centre 4 (n = 56), Centre 5 (n = 10), Centre 6 (n = 9), Centre 7 (n = 8), Centre 8 (n = 6), Centre 9 (n = 5), Centre 10 (n = 5), Centre 11 (n = 2), Centre 12 (n = 1), Centre 13 (n = 1), and Centre 14 (n = 1). The division of different datasets was based on the geographical distribution of centers. The training set for developing the predictive model included 166 patients from Centres 1 to 3 (101 males and 65 females; mean age, 21.1 years ± 13.2). Independent test set 1 comprised 56 patients from Centre 4 (39 males and 17 females; mean age, 21.4 years ± 11.5), while independent test set 2 comprised 48 patients from Centres 5 to 14 (27 males and 21 females; mean age, 21.4 years ± 12.0).
Baseline clinical, radiological, and pathological data were collected, including age, gender, tumor location, tumor stage, maximum tumor diameter, tumor volume, presence of pathological fracture, and pathological subtypes. All MRI images were acquired from the Picture Archiving and Communication System. Detailed MRI acquisition parameters can be found in the Supplementary Table S1. Tumor volume was calculated as follows: V = π/6 × L × W × H, where L, W, and H represent the length, width, and height of the tumor measured on MRI, respectively [20]. The tumor was staged according to American Joint Committee on Cancer (AJCC) Staging System for Bone Tumors (Eighth Edition) [21]. The presence of pathological fracture at diagnosis was determined by consensus among three experienced radiologists (with more than 10 years of experience). The endpoint was DFS, which was defiend as the time from surgery to disease progression or death. Data collection and evaluation were completed in January 2024.
MRI image normalization, tumor segmentation, and radiomic feature extraction
Contrast-enhanced fat-suppressed T1-weighted images (CE-FS-T1WI) were processed using the N4 bias correction algorithm in 3D Slicer software to mitigate MRI field distortions caused by radiofrequency field inhomogeneity and inherent MR equipment effects. A grayscale normalization algorithm was applied to standardize MRI intensity values to [0, 255], reducing grayscale variations among MRI images from different patients, acquisition times, and parameter settings.
Tumors in the training set were independently segmented by two radiologists (each with 5 years of experience in musculoskeletal imaging) using ITK-SNAP software (www.itksnap.org). All regions of interest (ROI) were manually delineated slice by slice to encompass each gross tumor volume on CE-FS-T1WI. Due to potential inter-observer variability in tumor segmentation, segmentations were evaluated by a senior radiologist specializing in musculoskeletal imaging with 15 years of experience. Final segmentation in the test sets was produced by a single reader, to better reflect clinical practice. Finally, the voxels of the ROI were resampled using a cubic spline interpolation algorithm (1*1*1 mm3), ensuring standardized analysis.
Using the open-source Pyradiomics v3.0 package, radiomic features were extracted from original images and preprocessed images using Laplacian of Gaussian (Log) filter and wavelet transformation. The majority of these features adhere to the definitions outlined by the Imaging Biomarker Standardization Initiative (IBSI) [22]. They encompass four categories: 1) Shape-based features (n = 14); 2) Intensity histogram-based features (n = 72); 3) Matrix texture features (n = 300), including the Grey-level co-occurrence matrix (GLCM), Grey-level run length matrix (GLRLM), Grey-level size zone matrix (GLSZM), Grey-level dependence matrix (GLDM), and Neighbourhood grey-tone difference matrix (NGTDM); and 4) Wavelet features (n = 744). A total of 1130 radiomic features were extracted from CE-FS-T1WI (see Supplementary Materials for specific feature names).
Radiomic model development
Prior to feature selection, radiomic feature values were normalized to [0, 1] using the MinMax algorithm. Subsequently, Pearson correlation analysis was applied to initially identify a minimally redundant feature set (with a correlation coefficient threshold set at 0.9). Further feature selection was further performed using Relief algorithm. Various machine learning classifiers, including Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Adaptive Boosting (AdaBoost), Multilayer Perceptron (MLP), and Naive Bayes (NB), were applied. Five-fold cross-validation was utilized for feature selection and algorithm optimization to construct the top-performing radiomic model. The entire procedure was conducted on the training set, while the external test sets were used exclusively for model validation. Standardization parameters obtained from the training set were applied to preprocess the independent test set data to evaluate the model's generalization performance.
The potential association of radiomic signature (ie, Radscore) and DFS was assessed in the training set and then validated in the external test sets 1 and 2 using Kaplan-Meier survival analysis. The patients were divided into high- and low-risk subgroups according to the predictive Radscores; the threshold of which was identified by using X-tile software. The difference in the survival curves of the high- and low-risk subgroups was evaluated by using a weighted log-rank test.
Clinical models were established by employing univariate and multivariate Cox proportional hazards regression analysis to screen clinical characteristics. Likewise, utilizing the multivariate Cox regression method, the combined model was constructed by integrating significant clinical features and Radscores. The predictive performance of the radiomic model was compared with that of the clinical model, tumor volume, and combined model.
Radiology-pathology correlation analysis: Interpretability of the radiomic model
H&E WSI-derived morphological nuclear features
We collected 41 patients from the external test set 1, each with available H&E-stained slides of the osteosarcoma. The slides were digitalized into SVS format using the Digital Pathology Slide Scanner KF-PRO-400 (Ningbo, Zhejiang, China) at a 40x magnification. All the WSIs were manually checked by a senior pathologist for artifacts, and the images free of all types of artifacts were chosen. We pre-processed each WSI with a pixel threshold to filter out white background, where pixels with a mean value of the RGB channels greater than 210 were considered as background. We then divided all the non-background areas into non-overlapping 2048*2048 patches and leveraged an unsupervised segmentation algorithm [23] to segment cell nuclei on the patches. For each segmented nucleus, we extracted 10 types of morphological nuclear features, including nuclear area, lengths of the major and minor axes of cell nucleus, the ratio of major axis length to minor axis length (major, minor, and ratio), mean pixel values of nucleus in RGB three channels, and mean, maximum, and minimum distances (distant, distMax, and distMin) to its neighboring nuclei in the Delaunay triangulation graph [24] (see Supplementary Table S3). Finally, for each type of morphological nuclear feature, a 10-bin histogram and five statistic measurements (i.e., mean, standard deviation, skewness, kurtosis, and entropy) were used to aggregate the cell-level features and generate a 150-dimensional imaging feature for each patient (see Supplementary Table S4). All feature extraction was implemented using MATLAB R2021a.
IHC-derived hypoxia-immune-related biomarkers
Forty patients from the external test sets with available formalin-fixed paraffin-embedded (FFPE) tissue samples were included. FFPE samples were cut into 3-mm thick sections, which were then processed for IHC staining (see Appendix E1 of Supplementary Materials for details). The detected immune-related and hypoxia-related IHC biomarkers included CD3 (pan T cells), CD8 (cytotoxic T cells), CD68 (macrophages), FOXP3 (Treg cells), and Carbonic Anhydrase IX (CAIX) (Supplementary Figure S1). IHC evaluation was performed independently by two pathologists (with 5 and 10 years of experience, respectively) who were blinded to the outcome data. In cases of disagreement between the primary pathologists, a third pathologist was consulted to reach consensus. In detail, the tissue sections were initially screened at low power (100x) using an upright microscope (BX53; Olympus, Japan), and five most representative fields were selected. The density of immune and hypoxic cells was then assessed at 200× magnification within tumor regions. The cells stained for nuclear (FOXP3) and membrane (CD3, CD8, CD68, and CAIX) biomarkers were quantified and expressed as the number of cells per unit area (cells/mm²).
Correlation analysis
The 150 patient-level cytomorphological features derived from H&E WSIs, along with five hypoxia-immune-related biomarkers derived from IHC, were directly analyzed to assess Spearman’s rank correlation with the selected radiomic features. Correlation coefficients below 0.3 were considered weak correlations, 0.3 to 0.7 as moderate correlations, and > 0.7 as strong correlations [25].
Statistical Analysis
We compared two groups using the t-test for continuous variables and the chi-square test or Fisher's exact test for categorical variables. The predictive performance of the radiomic model was evaluated using time-dependent receiver operating characteristic (ROC) curves, calibration curve, decision curve analysis (DCA), and clinical impact curve (CIC). Time-dependent ROC analysis was conducted using the "timeROC" package. The DeLong test was employed to compare the area under the curve (AUC) between models. Calibration curve was generated using the "rms" package. Hosmer-Lemeshow test was carried out using the "ResourceSelection" package to examine the goodness of fit between model-predicted and observed probability. The DCA and CIC were plotted using the "rmda" package in R. The DCA illustrates the net benefit of using the predictive models compared to other strategies (such as always treating or never treating) at various probability thresholds [26]. The CIC shows the potential impact of adopting the predictive model on clinical decision outcomes across various probability thresholds [27]. The Shapley Additive Explanations (SHAP) algorithm was used to visually quantify the contribution of each feature within the radiomic model using “shap” package [28]. Statistical significance was set at a P value < 0.05. Statistical analyses were conducted using R version 4.4.0 and Python 3.7.6.
Role of funding source
The funders played no role in the study design, data collection, data analysis, data interpretation, and writing of the report.