Although automatic AI-based cartilage segmentation has been reported, manual segmentation remains the gold standard, offering superior accuracy and precision. The goal of this study was to examine the intra- and interrater reliability of knee cartilage MR relaxation metrics derived from manual cartilage segmentation. Our findings suggest good to excellent intra- and interrater reliability for the quantification of knee cartilage MR relaxation metrics, with slightly lower interrater than intrarater reliability. Consistent with the literature [14, 28], tibial cartilage exhibited the highest R1ρ and R2 relaxation rates, followed by femoral cartilage, whereas patellar cartilage exhibited the lowest.
In our study, reliability was greater in the patellar, trochlear, and central femoral condylar cartilages than in the tibial and posterior femoral condylar cartilages. Because our segmentations were based on anatomical identification, the reproducibility of these landmarks was crucial in defining the segmentation boundaries. The patellar and femoral trochlear cartilages, which form the articulating surfaces of the patellofemoral joint, are easily visualized on MR images and have distinct cartilage-to-bone boundaries (Fig. 1). Additionally, the patellar cartilage, followed by the trochlear cartilage, is thicker than the other knee cartilages. Previous studies have suggested that thicker cartilage is associated with lower measurement variability [29, 30], which is consistent with the present results. In contrast, the posterior femoral condylar cartilage exhibited the greatest variability, which may be attributable to observer-dependent placement of the artificial anterior and posterior boundaries of this subregion. Nevertheless, the overall high intra- and interrater reliability (ICCs ranging from 0.75 to 0.99) suggests that subtle differences in cartilage selection in specific regions do not meaningfully alter the mean R1ρ and R2 relaxation rates.
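The argument above rests on how a per-ROI mean relaxation rate is formed from a segmentation mask. As a minimal sketch, assuming a flattened relaxation-time map in milliseconds and a matching binary mask, the hypothetical helper `mean_relaxation_rate` below converts each voxel's time T to a rate R = 1000 / T (in 1/s, so R1ρ = 1/T1ρ and R2 = 1/T2) and averages over the masked voxels. Whether the study averaged voxelwise rates or inverted a mean relaxation time is not stated here; the voxelwise convention is an assumption.

```python
from typing import Sequence


def mean_relaxation_rate(t_values_ms: Sequence[float],
                         mask: Sequence[bool]) -> float:
    """Mean relaxation rate (1/s) over the voxels selected by a binary ROI mask.

    Hypothetical helper: t_values_ms is a flattened relaxation-time map
    (e.g., T1rho or T2 in milliseconds) and mask is the flattened manual
    segmentation of the same slice. Each voxel's time T is converted to a
    rate R = 1000 / T (1/s) before averaging; this voxelwise convention is
    an assumption, not necessarily the study's exact procedure.
    """
    if len(t_values_ms) != len(mask):
        raise ValueError("map and mask must have the same number of voxels")
    # Keep only voxels inside the ROI that have a physically valid (positive) time.
    rates = [1000.0 / t for t, inside in zip(t_values_ms, mask) if inside and t > 0]
    if not rates:
        raise ValueError("mask selects no valid voxels")
    return sum(rates) / len(rates)
```

Calling the helper once per rater (or per session) on the same map, each with that observer's mask, yields the per-ROI means whose agreement the ICC then summarizes; voxels added or dropped at a boundary shift the mean only in proportion to their share of the ROI.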
Given the reproducibility of our cartilage segmentation, the mean R1ρ and R2 relaxation rates were expected to show good to excellent intra- and interrater reliability, and our results are consistent with the literature. William et al. [30] reported intrarater ICCs ranging from 0.80 to 0.98 for T2 mapping in the knee joint, whereas we observed intrarater ICCs ranging from 0.91 to 0.99 across the tibiofemoral and patellofemoral joints. Welsch et al. [31] reported an average interrater ICC of 0.90 for T2 mapping in the knee joint; our interrater ICCs, ranging from 0.75 to 0.99, were slightly lower in some subregions but still indicated good to excellent agreement. Collectively, the evidence suggests that manual segmentation of specific MR slices and well-defined subregions yields highly reliable relaxation metrics.
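The reliability values compared above are intraclass correlation coefficients. This section does not state which ICC form was used, so the sketch below is illustrative only: the hypothetical function `icc_2_1` computes the two-way random-effects, absolute-agreement, single-measure form commonly written ICC(2,1), from a table in which each row is a subject (or ROI) and each column is a rater (interrater) or repeat session (intrarater).

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings[i][j] is the value (e.g., a mean R2 rate) for subject/ROI i
    measured by rater or session j. Illustrative sketch only.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 subjects and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subjects mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Absolute agreement is chosen here because a systematic offset between raters (for example, one rater consistently including more boundary voxels) should lower reliability; a consistency form such as ICC(3,1) would ignore that offset.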
Although AI-based segmentation eliminates the burden of manual segmentation, its overall reliability and accuracy remain unclear. Norman et al. [32] used a deep learning model to automatically segment double-echo steady-state (DESS) images and reported Dice coefficients ranging from 0.77 to 0.88 for cartilage. Similarly, using knee MR images, Gatti and Maly [16] reported Dice coefficients ranging from 0.88 to 0.91 for different cartilage compartments, with even better results for healthy cartilage. The Dice coefficient quantifies the spatial overlap between two segmentations, ranging from 0 (no overlap) to 1 (perfect match), with values of 0.8 to 0.9 generally considered good and useful [13]. Although the reliability of AI-based segmentation appears comparable with that of manual segmentation, it is important to note that AI-based segmentation has been validated against manual segmentation; errors in the manual reference are therefore compounded by those of the AI model, potentially doubling the overall error. Importantly, reliability does not necessarily imply accuracy. Because the true cartilage boundaries of the samples in this study are unknown, it remains unclear how accurately the measured relaxation metrics reflect the true tissue properties. Future studies will need to address this critical gap for both manual and AI-based segmentation to improve the assessment of degenerative joint diseases.
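To make the Dice definition concrete, the minimal sketch below treats each segmentation as a set of voxel coordinates and computes Dice = 2|A ∩ B| / (|A| + |B|). The function name `dice_coefficient` and the convention of returning 1.0 when both segmentations are empty are illustrative assumptions, not taken from the cited studies.

```python
from typing import Iterable, Tuple

Voxel = Tuple[int, int, int]  # (slice, row, column) index of a voxel


def dice_coefficient(seg_a: Iterable[Voxel], seg_b: Iterable[Voxel]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two segmentations given as voxel sets."""
    a, b = set(seg_a), set(seg_b)
    if not a and not b:
        # Assumed convention: two empty segmentations agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```

For example, two four-voxel masks that share three voxels give 2 × 3 / (4 + 4) = 0.75. Because Dice rewards overlap relative to total size, a fixed number of disputed boundary voxels lowers it more for thin structures than for thick ones.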
Although this study was limited to two raters and five image sets, the results likely reflect the reliability of deriving knee cartilage MR relaxation metrics from manual segmentation of the respective ROIs. However, only healthy subjects were included. Individuals with OA may have osteophytes (bone spurs) and cartilage defects that could complicate accurate delineation of the cartilage surface. Additionally, this study did not examine the zonal behavior of cartilage relaxation metrics, which have been shown to differ across cartilage layers owing to differences in loading [31, 33]. Future research should include subjects with OA and assess the zonal behavior of cartilage relaxation metrics to improve our understanding of cartilage health and pathology.