The proposed software was designed for use in clinical settings; therefore, we prioritized shortening the measurement time and improving efficiency. The results showed that analyzing one image with the proposed software required approximately 6 s. After the automatic analysis, the user manually checks the measurement points for each vertebral body and makes any necessary corrections. Although the total time depended on the proficiency level of the user, most images were analyzed in less than 1 min. Based on this result, we believe the proposed software is suitable for use in clinical settings.
The correlation analysis showed that, among the vertebral body height ratios, C/A had the lowest correlation coefficient between the AI models and each rater in both the validation and test sets. This study focused on vertebral body height ratios and did not evaluate the position of the measurement point itself because we assumed that the measurement point at the vertebral body center varied the most between examiners. Setting this point is easier when the left and right endplates overlap and are clearly visible in the vertebral body. However, if the vertebral body is tilted and rotated owing to the limb position or scoliosis, the endplate on the side farther from the cassette appears thinner, making it more difficult to set the measurement point at the center of the vertebral body. Furthermore, preliminary observations during software development showed that increasing the number of vertebral bodies used for learning could improve the correlation, in the validation set, between the AI-calculated height ratio of each vertebral body and that of the training data creator. However, the correlation coefficient of each vertebral body height ratio showed almost no increase after 2000 vertebral bodies had been annotated, which suggests that a further increase in the number of learned vertebral bodies may not necessarily improve software accuracy. The rates of discrepancies of 0.2 or more between the AI- and human-based calculations of vertebral body height ratios were 0.8% and 2.4% in the validation and test sets, respectively. For vertebral bodies whose accurate lateral images cannot be captured owing to scoliosis or lateral bending, the endplates and vertebral arches of adjacent vertebral bodies may be misrecognized owing to the difficulty of determining the edges of vertebral bodies. Vertebral bodies with a deviation of 0.2 or more are relatively easy to recognize through observation after the automatic analysis. 
Moreover, given the low deviation rate, we believe their impact on the detection efficiency is limited.
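The discrepancy rate discussed above can be illustrated with a minimal sketch. It assumes paired height ratios (AI-based and human-based) for the same vertebral bodies; the function names `discrepancy_rate` and `flag_for_review` and the sample values are hypothetical and are not taken from the proposed software.

```python
from typing import List, Sequence


def flag_for_review(ai_ratios: Sequence[float],
                    human_ratios: Sequence[float],
                    threshold: float = 0.2) -> List[int]:
    """Return indices of vertebral bodies whose AI- and human-based
    height ratios differ by at least `threshold` (0.2 as in the text)."""
    if len(ai_ratios) != len(human_ratios):
        raise ValueError("ratio lists must have the same length")
    return [i for i, (a, h) in enumerate(zip(ai_ratios, human_ratios))
            if abs(a - h) >= threshold]


def discrepancy_rate(ai_ratios: Sequence[float],
                     human_ratios: Sequence[float],
                     threshold: float = 0.2) -> float:
    """Fraction of paired ratios that deviate by at least `threshold`."""
    if not ai_ratios:
        raise ValueError("at least one pair of ratios is required")
    flagged = flag_for_review(ai_ratios, human_ratios, threshold)
    return len(flagged) / len(ai_ratios)


if __name__ == "__main__":
    # Illustrative values only; they are not data from this study.
    ai = [0.95, 0.80, 0.55, 0.90]
    human = [0.93, 0.82, 0.85, 0.88]
    print(flag_for_review(ai, human))
    print(discrepancy_rate(ai, human))
```

In such a workflow, the flagged indices would point to the vertebral bodies that the user should inspect first during the manual check.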
QM involves a cumbersome process that requires significant time to manually set measurement points: more than 10 min are required for each X-ray image, and approximately 20 min for images including the thoracic and lumbar spine. Thus far, software that semi-automatically evaluates vertebral body height using a statistical decomposition method has been reported. [27–29] However, these programs require manual detection of vertebral bodies. Additionally, they require approximately 35 + 10 s to interpret each vertebral body, resulting in approximately 7 min for each case. In contrast, the proposed software performs fully automatic detection of vertebral bodies and requires approximately 6 s to interpret an image for one case, potentially offering significant advantages in daily clinical practice. Suri and colleagues' system is capable of analyzing the vertebral body heights across the entire spine from CT, MRI, and radiography within 2 s. [20] However, for radiography, their analysis targets only the vertebrae from T10 to L5. Vertebral fractures can occur not only in the lower thoracic spine but also in the entire thoracic spine below T4, and such fractures are not clinically rare. [21] While our software may take slightly longer than their system in terms of measurement time, we believe it offers value by providing a more comprehensive evaluation of vertebral body compression in both the thoracic and lumbar spine from radiography, the most widely used modality globally. Additionally, while they validated the performance of their system in comparison with radiologists, a unique aspect of our study is that we also conducted this comparison among external evaluators with significant clinical experience in the test set. It is noteworthy that a certain degree of discrepancy was observed between the vertebral height ratios of the software and those of the external evaluators, and a similar level of discrepancy was observed among the external evaluators. 
Addressing this might require efforts such as promoting a unified approach to evaluating vertebral heights across various vertebrae, potentially advocated by international bodies. Attempts have been made to reduce measurement discrepancies among evaluators through tutorial-based training programs. [17] In our study, both the training data creators and external evaluators referred to Genant et al.'s evaluation method [14]; however, they did not receive comprehensive training. Had they undergone such training, consistency might have improved. However, as the results of that study suggest, a certain degree of divergence persists even after tutorials. [17] Therefore, it is important to acknowledge that achieving complete agreement among evaluators is realistically impossible, which is a limitation that must be recognized not only in the use of this software but also more broadly in evaluations using QM. Additionally, software that uses a deep CNN to detect fractures from thoracolumbar vertebral body images has also been developed,[18] with its fracture detection performance confirmed to be comparable to that of orthopedic specialists. In addition to aiming to detect fractures rather than quantitatively evaluate them, it primarily differs from the proposed software in that old cases more than one month after injury and those with SQ grade 1 were excluded from its learning data. In contrast, the proposed software may falsely recognize old or deformed vertebral bodies as vertebral fractures. However, for diagnosing osteoporosis, the presence of vertebral fractures rather than the onset time is important; therefore, we believe that this method is significant as it has the potential to improve the diagnostic rate of osteoporosis.
Although severe vertebral fractures are easy to detect, diagnosing minor vertebral fractures requires training,[17, 30] and such fractures are often overlooked in routine clinical practice. The proposed software automates the process of QM, thereby enabling faster and more efficient evaluations, and is expected to improve the diagnostic rate of minor vertebral body fractures. In particular, as two-thirds of vertebral fractures are asymptomatic[4, 5] and crushing is often mild[5], the proposed software is expected to contribute toward improving the overall diagnostic rate of vertebral fractures and osteoporosis. Additionally, owing to its short analysis time, the proposed software may be suitable for use in mass osteoporosis screenings and clinical research, which can facilitate early treatment and fracture prevention and thus help maximize the effects of therapeutic drugs.
Currently, the mainstream evaluation method is SQ, which differs from the QM automated by the proposed software in evaluation complexity. SQ only assigns a grade to each vertebral body, whereas QM requires setting six measurement points on each vertebral body and calculating the vertebral body height ratios. By automating these steps, the proposed software makes high-throughput testing easier and may be more useful in daily clinical practice. In addition, QM allows subdividing and optimizing the vertebral body height ratio threshold for diagnosing fractures based on age, gender, and race. The reference range for this threshold is set based on research results for various races and ages[10, 12, 13, 31]. However, in recent years, several studies have reported that vertebral body morphology differs based on race,[32–34] gender,[33] and age.[34] Additionally, several studies have shown slight vertebral body crushing even in relatively young people[21, 35]; therefore, we believe that it is necessary to understand the reference range of vertebral body morphology across a wide range of generations, genders, and races, rather than solely focusing on the elderly population. To apply these research results in clinical practice, rapid QM is necessary, which is enabled by the proposed software.
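The QM calculation described above, in which six measurement points per vertebral body yield the height ratios, can be sketched as follows. This is a minimal illustration under stated assumptions: heights are taken as the Euclidean distance between the superior and inferior points at the anterior (A), central (C), and posterior (P) positions, and the ratio set C/A, C/P, and A/P is assumed for illustration (only C/A is named in this section). The class and function names are hypothetical and do not describe the implementation of the proposed software.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) image coordinates, e.g. in pixels


@dataclass(frozen=True)
class VertebraPoints:
    """Six measurement points on one vertebral body (hypothetical layout)."""
    anterior_superior: Point
    anterior_inferior: Point
    central_superior: Point
    central_inferior: Point
    posterior_superior: Point
    posterior_inferior: Point


def _height(top: Point, bottom: Point) -> float:
    """Euclidean distance between a superior and an inferior point."""
    return hypot(top[0] - bottom[0], top[1] - bottom[1])


def height_ratios(v: VertebraPoints) -> Dict[str, float]:
    """Compute anterior (A), central (C), and posterior (P) heights
    and return the assumed ratio set C/A, C/P, and A/P."""
    a = _height(v.anterior_superior, v.anterior_inferior)
    c = _height(v.central_superior, v.central_inferior)
    p = _height(v.posterior_superior, v.posterior_inferior)
    if a <= 0 or p <= 0:
        raise ValueError("anterior and posterior heights must be positive")
    return {"C/A": c / a, "C/P": c / p, "A/P": a / p}


if __name__ == "__main__":
    # Illustrative coordinates only; they are not data from this study.
    example = VertebraPoints(
        anterior_superior=(0.0, 0.0), anterior_inferior=(0.0, 20.0),
        central_superior=(15.0, 2.0), central_inferior=(15.0, 18.0),
        posterior_superior=(30.0, 0.0), posterior_inferior=(30.0, 25.0),
    )
    print(height_ratios(example))
```

Because the ratios are dimensionless, the same function applies whether the points are given in pixels or millimeters, and a threshold chosen for a given age, gender, or race group could then be compared against each ratio.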
However, several recent studies have stated that qualitative evaluation methods that emphasize the presence or absence of endplate damage are more useful for diagnosing new vertebral body fractures and assessing subsequent fracture risk. Additionally, qualitative assessment methods have been reported to be strongly associated with low bone mineral density[15, 36] and with the development of vertebral[15] and non-vertebral osteoporotic fractures[15, 37]. In addition, it has been noted that QM may increase the false-positive rate by misrecognizing vertebral body deformities that are not fragility fractures (e.g., Schmorl's nodes and Scheuermann's disease) as fractures.[31, 38] When evaluating endplate damage through QM, the deformation of the vertebral body owing to endplate damage is mainly reflected in the height of the central vertebral body;[14] however, the degree of deformation may be underestimated by using the midpoint. Conversely, even if endplate damage occurs, the morphology of the endplate becomes smooth over time because of remodeling, and it may no longer be recognized as endplate damage through qualitative evaluation methods.[39] With the proposed software, after the vertebral body height ratio is automatically calculated by AI, each vertebral body is enlarged and displayed, and the measurement point position can be manually confirmed and corrected. At this point, it is possible to improve the diagnostic accuracy of vertebral body fractures by combining QM with various qualitative evaluations, such as evaluation of endplate damage and exclusion of vertebral body deformity.[40]
This study had several limitations. The first concerns the ground truth in the training data. As it may be difficult to establish the ground truth for a measurement point using only a plain X-ray image, it would be ideal if CT and MRI scans were performed simultaneously. However, deep learning for this software required more than 3,000 vertebral bodies, and it was impossible to simultaneously capture plain X-ray, CT, and MRI images. In this study, the training data were created based on the measurement point creation method proposed by Genant et al.,[14] and were verified by an external expert to address issues of subjectivity and accuracy. The correlation and consistency of the measurement results with the external verifiers are reported in the Results, and the correlation and consistency between the external verifiers were at a similar level. Each external evaluator was an expert with sufficient clinical experience, and a certain degree of ambiguity in setting measurement points using only lateral X-ray images is considered inevitable. Second, the degree of collapse of the vertebral bodies in the training data was not uniform. Although training on equal numbers of vertebral bodies with various degrees of crushing is considered ideal, in this study, vertebral bodies with SQG0 accounted for more than 90% of the training data. In fact, in the validation set, as vertebral body collapse progressed, the consistency between the ratios calculated by the training data creator and the AI models decreased. As the training data were randomly selected from clinical images over a certain period, discrepancies in the proportions between SQ grades were inevitable. If the number of crushed vertebral bodies, such as those classified as SQG2 and SQG3, can be increased, it may be possible to improve the measurement accuracy for crushed vertebral bodies. 
However, considering the aforementioned preliminary observations on improving measurement accuracy by increasing the number of vertebral bodies used for learning, the improvement achievable in this way may be limited. Third, there is a concern about the inaccuracy of measurement points in scoliosis cases. In scoliosis, the vertebral body deviates not only laterally but also rotationally, making it difficult to set the posterior measurement points in addition to the central measurement point. Additionally, scoliosis cases with a Cobb angle of 15° or more were excluded from the training, validation, and test data. Therefore, the performance of the proposed software in scoliosis cases, which are technically challenging, could not be evaluated. Developing a technology that can simulate 3D evaluations of the vertebral body and set measurement points more accurately by combining frontal X-ray, CT, and MRI images may allow for obtaining better-quality training data.