Study design and population
The evaluation of the technical performance of single-slice body composition measurements was based on both prospective and retrospective cohorts. All studies received appropriate approvals from the West of Scotland Research Ethics Committee (REC reference: 21/WS/0066) and were performed in accordance with the principles of the Declaration of Helsinki. All participants provided informed consent to participate, including those who had previously taken part in ethically approved studies and consented to the use of their data.
Two primary software tools were used in this study for body composition analysis. The first tool, OsiriX MD (Pixmeo, Switzerland), is an FDA-cleared, commercially available medical imaging software used as the reference standard in this study. It provides general-purpose tools for medical image analysis, including both manual and semi-automated segmentation capabilities. This will hereafter be referred to as the "reference software." The second tool, VitruvianScan™ (Perspectum Ltd., UK), is a specialized single-slice (2D) body composition MRI analysis software. It incorporates semi-automated segmentation tools specifically designed for adipose and muscle tissue quantification at the L3 vertebral level. This will hereafter be referred to as the "study software." Both software packages were used to analyze the same MRI datasets, allowing for comparison of their performance in body composition assessment.
In the prospective study, 12 participants were recruited. Participants were scanned on Philips 1.5T (reference scanner) and Philips 3T scanners (Fig. 1A). For scan-rescan repeatability, each participant was scanned twice under the same conditions, with a maximum 30-minute interval between the scans. Between scans, the participant left the MRI table and was then re-localized and re-registered with a new blinding code. Cross-scanner reproducibility involved scanning each participant on the same day with both scanners, with a maximum 1-hour interval between the scans (Fig. 1A). Inter- and intra-analyst variability assessments were conducted on data acquired from the Philips 1.5T (reference) scanner. Adults aged 18 years or older, with a BMI between 20 and 35 kg/m² and a range of body types (from muscular to frail), were included. Exclusion criteria comprised any MRI contraindication (e.g. pacemaker, shrapnel injury, severe claustrophobia) [32].
In addition, retrospective data from 36 participants were selected to ensure comprehensive representation across a wide range of body types, scanner manufacturers (Siemens, Philips, GE), and field strengths (1.5T and 3T) (Fig. 1B). To achieve balanced representation, six participants were chosen for each combination of scanner manufacturer and field strength, with equal distribution across sex (3 females, 3 males) and BMI categories (lean, normal weight, obese). This dataset was used for inter-device assessments, comparing the performance of the study software with the reference software. Additionally, inter-reader assessment was performed by having trained analysts and radiologists independently analyze the same images using the study software to compare performance across different reader groups (Fig. 1B). Inclusion criteria ensured a diverse dataset, covering a range of body types (BMI, sex), scanner models (Siemens, Philips, GE at both 1.5T and 3T), and known height for all participants.
Data acquisition
In the prospective testing, all individuals were scanned using Philips Ingenia 1.5T and Philips dStream 3T. Data acquisitions were performed at the Leeds General Infirmary (The Leeds Teaching Hospitals NHS Trust), United Kingdom.
The MRI data were acquired using a 3D SPGR (Spoiled Gradient Echo) sequence with the Dixon option in the axial plane, a flip angle of 10° (1.5T) or 6° (3T), and minimum repetition time (TR). The maximum field of view (FOV) was set in both the right-left (RL) and anterior-posterior (AP) directions. Imaging was performed with an in-plane resolution of 2 × 2 mm² and a slice thickness of 6 mm, with a reconstructed slice thickness set to 3 mm (interpolation on). Parallel imaging techniques were utilized with acceleration factors of 2 in the phase-encoding direction and 2 in the slice-encoding direction with SENSE. A single average was used for acquisitions at 1.5T, while two averages were used for acquisitions at 3T. The number of slices was adjusted to achieve an acquisition time of approximately 12 seconds, with data collected in a single breath-hold and centred on the L3 vertebra.
The retrospective MRI data had been acquired using a 3D SPGR Dixon sequence. The in-plane resolution was between 1 × 1 mm² and 4 × 4 mm². The acquired slice thickness was between 1 mm and 10 mm. The flip angle was required to be no greater than 16° at 1.5T and 12° at 3T.
Data processing
Anonymised MR data were analyzed using both the study software and the reference software, as described earlier. Throughout all analyses, the radiologists and analysts remained blinded to the subject identities, scanner manufacturers/models, field strengths, scan session details, and any analyses or results produced by other radiologists or analysts. The analyses were performed by trained analysts with a minimum of 2 years of experience in MRI quantitative analysis and/or radiologists with a minimum of 5 years of experience in abdominal MRI. Analysts and radiologists underwent comprehensive training on the study software and had prior experience performing single-slice analyses at the L3 vertebral level.
Body composition analysis involved manual segmentations on a single L3 slice to obtain cross-sectional area (CSA) measurements (in cm²) for four tissue classes: visceral adipose tissue (VAT), subcutaneous adipose tissue (SAT), skeletal muscle, and psoas muscle (Fig. 2).
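As an illustration of how a CSA value follows from a segmentation mask, the minimal sketch below counts the segmented voxels and multiplies by the in-plane voxel area. The function name, the list-of-rows mask representation, and the toy values are assumptions made for this example; they do not describe how either software package computes or exports its measurements.

```python
def cross_sectional_area_cm2(mask, pixel_spacing_mm):
    """Return the area in cm² covered by a binary 2D mask.

    mask: rows of 0/1 (or booleans), one entry per in-plane voxel.
    pixel_spacing_mm: (row_spacing, column_spacing) in millimetres.
    """
    n_voxels = sum(1 for row in mask for value in row if value)
    row_mm, col_mm = pixel_spacing_mm
    area_mm2 = n_voxels * row_mm * col_mm
    return area_mm2 / 100.0  # 100 mm² = 1 cm²


# Toy example (hypothetical): 50 segmented voxels at 2 × 2 mm resolution.
toy_mask = [[1] * 10 for _ in range(5)]
toy_area = cross_sectional_area_cm2(toy_mask, (2.0, 2.0))
print(toy_area)  # 50 voxels × 4 mm² = 200 mm² = 2.0 cm²
```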
The assessment of observer variability included prospective intra- and inter-analyst evaluations as well as a retrospective inter-reader comparison (analysts vs. radiologists). For the prospective intra-analyst variability assessment, the same analyst analyzed a sub-sample of scans from the reference scanner twice (analysis 1 and analysis 2), presented in a random order and under blinded conditions, with a maximum of 4 days between the reads. Three analysts processed the same set of cases twice, with the anonymized cases presented randomly each time. For the inter-analyst assessment, three analysts analyzed the same sub-sample of scans from the reference scanner under blinded conditions. For the retrospective inter-reader variability assessment (analysts vs. radiologists), a “gold standard” measurement was created from the median of the successful measurements made by the three radiologists using the study software. This “gold standard” was then used to evaluate the performance of each of the analysts.
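The median-based reference described above can be sketched as follows. This is a minimal illustration only: encoding an unsuccessful measurement as None, the case identifiers, and all numeric values are hypothetical, and the sketch does not reproduce the study's actual pipeline.

```python
from statistics import median


def gold_standard(radiologist_values):
    """Median of the successful radiologist measurements for one case.

    Unsuccessful measurements are assumed to be encoded as None.
    Returns None when no radiologist produced a measurement.
    """
    successful = [v for v in radiologist_values if v is not None]
    if not successful:
        return None
    return median(successful)


# Hypothetical VAT CSA values (cm²) from three radiologists per case.
radiologist_reads = {
    "case_01": [101.2, 98.7, 103.5],
    "case_02": [55.0, None, 57.0],  # one unsuccessful measurement
}
reference = {case: gold_standard(v) for case, v in radiologist_reads.items()}

# Each analyst is then compared against the reference, case by case.
analyst_reads = {"case_01": 100.0, "case_02": 58.0}
differences = {case: analyst_reads[case] - reference[case] for case in reference}
```

With two successful measurements, the median is the mean of the two values, so the reference remains defined even when one radiologist's measurement is missing.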
The retrospective inter-device assessment compared results from the study software and the reference software on the same set of images. Three radiologists used the study software, while one of these radiologists (designated as the reference radiologist) used the reference software. The median results from the three radiologists using the study software were compared with the results from the reference radiologist using the reference software. To generate results using the reference software, the reference radiologist was presented with a single axial slice at the center of the L3 vertebra. Initially, a region-growing tool within the software, utilizing image-dependent threshold values, was used to generate coarse masks for the four tissue classes. Subsequently, the reference radiologist edited these masks using a brush tool, where necessary, to generate the final output masks. The regions of interest (ROIs) were then exported, and the CSA measurements were parsed from the resulting output file.
Within the study software, pre-processing was performed to generate a signal fat fraction (sFF) map calculated on a voxel-wise basis:
\[ \mathrm{sFF} = \frac{F}{F + W} \]
where F represents the signal in the fat image generated from the 3D SPGR acquisition, and W represents the signal in the corresponding water image. Analysts were presented with sagittal views of the fat, water, and calculated sFF maps, with a line indicating the central slice (Fig. 2a). This central slice was identical to the slice used by the radiologist in the reference device. Upon loading the central slice, the analysts used a circular brush tool to segment regions corresponding to a specific tissue class. During this interactive process, as brush strokes were applied to the image to generate an output segmentation mask, voxels within the bounds of the brush were only added to the segmentation mask if specific criteria were met in the intensity values of the corresponding voxels in the sFF image (Fig. 3). For VAT and SAT classes, voxels were included where sFF ≥ 0.5. For skeletal muscle and psoas classes, voxels were included where sFF < 0.5.
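The threshold-gated brush described above can be sketched in a few lines. The sketch assumes the conventional definition sFF = F / (F + W); the function names, the class labels, the handling of zero-signal voxels, and the toy images are all hypothetical and are not taken from the study software.

```python
import math

FAT_CLASSES = {"VAT", "SAT"}                # included where sFF >= 0.5
MUSCLE_CLASSES = {"skeletal_muscle", "psoas"}  # included where sFF < 0.5


def signal_fat_fraction(fat, water):
    """Voxel-wise sFF = F / (F + W); zero-signal voxels map to NaN (assumption)."""
    return [
        [f / (f + w) if (f + w) > 0 else math.nan for f, w in zip(f_row, w_row)]
        for f_row, w_row in zip(fat, water)
    ]


def apply_brush(mask, sff, centre, radius, tissue_class):
    """Add voxels under a circular brush to mask if they pass the sFF gate."""
    if tissue_class not in FAT_CLASSES | MUSCLE_CLASSES:
        raise ValueError(f"unknown tissue class: {tissue_class}")
    r0, c0 = centre
    for r, row in enumerate(sff):
        for c, value in enumerate(row):
            inside = (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2
            if not inside or math.isnan(value):
                continue
            if tissue_class in FAT_CLASSES:
                passes = value >= 0.5
            else:
                passes = value < 0.5
            if passes:
                mask[r][c] = 1
    return mask


# Toy 2 × 2 fat and water images (hypothetical signal values).
fat = [[90.0, 10.0], [80.0, 20.0]]
water = [[10.0, 90.0], [20.0, 80.0]]
sff = signal_fat_fraction(fat, water)

# One brush stroke of radius 1 voxel centred on the top-left voxel.
vat_mask = apply_brush([[0, 0], [0, 0]], sff, (0, 0), 1, "VAT")
psoas_mask = apply_brush([[0, 0], [0, 0]], sff, (0, 0), 1, "psoas")
```

In this toy case the same stroke yields complementary masks for the two classes, because a voxel under the brush is accepted only when its sFF falls on the correct side of 0.5 for the class being segmented.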
After the analysts and radiologists had completed their analyses and exported the measurements, the data were evaluated using an automated statistical pipeline.
Statistical analysis
For the prospective testing of repeatability, reproducibility, and intra/inter-analyst variability, the following statistical analyses were performed: Bland-Altman analysis (bias and 95% limits of agreement, LoA) [33, 34], standard deviation (SD) of differences, mean coefficient of variation (CoV, in %), intraclass correlation coefficient (ICC, two-way mixed effects, absolute agreement, single measurement), and repeatability coefficient (RC) or reproducibility coefficient (RDC). The RC/RDC, an estimate of the maximum difference that only 5% of measurement pairs will exceed asymptotically, was calculated according to standard formulae [35, 36]. Agreement was classified as follows: ICC > 0.9, excellent; 0.75–0.9, good; 0.5–0.75, moderate; < 0.5, poor [37]. These statistical analyses were also performed for the retrospective evaluation of inter-device and inter-reader (analyst vs. radiologist) variability. In addition, Pearson correlation coefficients and corresponding p-values were calculated for inter-reader assessments. The statistical analyses were performed for all CSAs (cm²): VAT, SAT, skeletal muscle, and psoas muscle; the VAT/SAT ratio (no units); and the VAT, SAT, skeletal muscle, and psoas indices (cm²/m²). Inter-device variability assessments were conducted only for CSAs. The indices were calculated by dividing each body composition CSA measurement (cm²) by height squared (m²). These height-standardized measurements can be used for comparison between individuals of different heights.
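The agreement metrics listed above can be sketched with the Python standard library alone. The exact formulae used in the study are those of the cited references and are not restated here, so the forms below are assumptions: RC = 1.96 × √2 × wSD for paired data, a per-pair CoV averaged across subjects, and the single-measurement absolute-agreement ICC computed from two-way ANOVA mean squares. All function names and numeric values are hypothetical.

```python
import math
from statistics import mean, stdev


def bland_altman(a, b):
    """Bias, SD of differences, and 95% limits of agreement for paired data."""
    diffs = [x - y for x, y in zip(a, b)]
    bias, sd = mean(diffs), stdev(diffs)
    return {"bias": bias, "sd": sd, "loa": (bias - 1.96 * sd, bias + 1.96 * sd)}


def repeatability_coefficient(a, b):
    """RC = 1.96 * sqrt(2) * wSD, with wSD² = mean(d²) / 2 (assumed form)."""
    diffs = [x - y for x, y in zip(a, b)]
    wsd = math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))
    return 1.96 * math.sqrt(2) * wsd


def mean_cov_percent(a, b):
    """Mean of per-pair CoV (%), each pair's SD being |d| / sqrt(2)."""
    return mean(
        abs(x - y) / math.sqrt(2) / ((x + y) / 2) * 100 for x, y in zip(a, b)
    )


def icc_absolute_single(ratings):
    """Two-way, absolute-agreement, single-measurement ICC.

    ratings: one row per subject, each holding k measurements.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in ratings)
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def height_index(csa_cm2, height_m):
    """CSA indexed to height squared, in cm²/m²."""
    return csa_cm2 / height_m ** 2


# Hypothetical scan-rescan CSA values (cm²) for four participants.
scan_1 = [100.0, 120.0, 80.0, 150.0]
scan_2 = [102.0, 118.0, 81.0, 149.0]
agreement = bland_altman(scan_1, scan_2)
rc = repeatability_coefficient(scan_1, scan_2)
cov = mean_cov_percent(scan_1, scan_2)
icc = icc_absolute_single([list(pair) for pair in zip(scan_1, scan_2)])
```

Because the ICC here measures absolute agreement, a constant offset between two otherwise perfectly correlated sets of measurements lowers it, whereas a consistency-type ICC would ignore such an offset.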
In the analysis of participant demographics, values are presented as mean ± standard deviation (SD) for normally distributed data and as median with interquartile range (IQR) for non-normally distributed data. Descriptive statistics were also calculated for all prospective body composition measurements with all study subjects pooled.
The required sample size for the performance testing of the MRI body composition measurements was estimated based on Bland-Altman estimates using previously described methodology [38].
All statistical analyses were performed using scripts written in Python (v3.10).