Subjects
Our Institutional Review Board approved this retrospective study and waived the requirement for informed consent. By searching our electronic medical records, we identified 232 adult patients <40 years old who were diagnosed with ankylosing spondylitis (AS) at a tertiary hospital from February 2002 to March 2020 and underwent lower spine or abdominopelvic X-ray imaging. As a control group, we selected age- and gender-matched patients who complained of lower back pain but were not diagnosed with AS, numbering approximately 1.5 times the AS patients; thus, the final study cohort had similar numbers of radiographs in the normal and abnormal groups. A board-certified abdominal radiologist (C.A., with 10 years of experience in reading plain X-rays) reviewed the radiographs and excluded 502 images because of low image quality (n = 32) or the absence of radiologic findings of sacroiliitis in patients with ankylosing spondylitis (n = 470). The final study subjects were randomly split into training and test sets at a ratio of 8:2, while preserving the proportion of positive cases and ensuring that no patient's images were assigned to different sets (Fig. 1).
Data
Image preprocessing
The X-ray images were downloaded in the Digital Imaging and Communications in Medicine (DICOM) format following de-identification and converted to the Joint Photographic Experts Group (JPEG) format. Each converted image was then resized so that its longer dimension measured 1,024 pixels, with the original aspect ratio maintained, and the shorter dimension was padded to 1,024 pixels.
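This resize-and-pad ("letterbox") preprocessing can be sketched as follows. This is not the authors' actual pipeline code; it is a minimal illustration using Pillow, with the function name and the black padding fill chosen here for illustration.

```python
from PIL import Image


def letterbox_resize(img: Image.Image, target: int = 1024, fill: int = 0) -> Image.Image:
    """Scale an image so its longer side equals `target` pixels while
    preserving the aspect ratio, then pad the shorter side to `target`."""
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    # Center the resized image on a square canvas filled with `fill`.
    canvas = Image.new(img.mode, (target, target), fill)
    canvas.paste(resized, ((target - new_w) // 2, (target - new_h) // 2))
    return canvas
```

Padding, rather than stretching, keeps anatomical proportions intact, which matters for a detector that must localize the sacroiliac joints.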
Image annotation
Our goal was to automatically detect both SIJs on a plain radiograph and classify them as either normal or abnormal (i.e., the absence or presence of sacroiliitis). Therefore, two types of ground truth were required: the coordinates of a bounding box surrounding a SIJ and the presence or absence of sacroiliitis findings. First, the bounding boxes were drawn manually by the radiologist (C.A.) and a machine learning researcher (D.K.) in consensus using ImageJ software [9]. Second, regarding the presence or absence of sacroiliitis findings in each SIJ, two board-certified musculoskeletal radiologists (M.Y.C. and S.H.L., with 7 and 8 years of experience, respectively) reviewed the X-ray images and graded the severity in consensus according to the New York criteria: grade 0, normal; grade 1, suspicious (some blurring of the joint margins); grade 2, minimal sclerosis with some erosion; grade 3, definite sclerosis or severe erosion with joint space widening; grade 4, complete ankylosis [10]. They were unaware of the patients' diagnosis and demographic information during the image review session. For the ground truth, grades 0/1 and grades 2/3/4 were considered the absence and presence of sacroiliitis, respectively.
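The binarization of the New York grades described above amounts to a simple thresholding rule, which could be written as follows (the function name is illustrative, not from the paper):

```python
def binarize_grade(grade: int) -> int:
    """Map a New York sacroiliitis grade (0-4) to the binary ground truth:
    grades 0-1 -> 0 (sacroiliitis absent), grades 2-4 -> 1 (present)."""
    if grade not in (0, 1, 2, 3, 4):
        raise ValueError(f"grade must be an integer 0-4, got {grade!r}")
    return int(grade >= 2)
```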
Deep Learning
Training
We used Python 3 with the TensorFlow 2 Object Detection application programming interface (API) [11]. Among the available models, we used EfficientDet-D4 with a 1,024 × 1,024 input resolution, pre-trained on the COCO 2017 dataset. EfficientDet is currently a state-of-the-art architecture for object detection; it builds on the EfficientNet backbone and incorporates a novel bi-directional feature pyramid network and new compound scaling rules [12, 13].
We fine-tuned the pre-trained EfficientDet-D4 model on our training dataset with a batch size of 4. To avoid overfitting, we performed random data augmentation during training: horizontal flip with a probability of 0.5, brightness adjustment with a scale range of 0.9–1.1, contrast adjustment with a scale range of 0.9–1.1, and rescaling with a scale range of 0.8–1.2. The rescaled image was then either cropped or padded to maintain its original size. The loss function comprised the weighted sigmoid focal loss for classification and the smooth L1 loss for localization. We used the Adam optimizer with learning rate warm-up, a heuristic that increases the learning rate linearly over a warm-up period [14]. We optimized the learning rate parameters through 5-fold cross-validation and random grid search. The warm-up learning rate, learning rate base, and warm-up steps used for our final model were 1.0000001e-05, 0.00019999998, and 2,500, respectively; that is, the learning rate was increased linearly from 1.0000001e-05 to 0.00019999998 over the first 2,500 steps. For the other hyperparameters, we used the default configuration provided by the TensorFlow Object Detection API. The optimal number of iterations was 70,000 (i.e., 17,500 epochs); training our model took approximately five hours on two Titan RTX graphics processing units.
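The linear warm-up schedule described above can be sketched as a small function. This is a simplified illustration using the values reported here; the full schedule used by the TensorFlow Object Detection API also decays the rate after warm-up, which is omitted in this sketch.

```python
def warmup_lr(step: int,
              warmup_lr_init: float = 1.0000001e-05,
              base_lr: float = 0.00019999998,
              warmup_steps: int = 2500) -> float:
    """Linear learning-rate warm-up: ramp from `warmup_lr_init` to
    `base_lr` over `warmup_steps`, then hold `base_lr`."""
    if step >= warmup_steps:
        return base_lr
    frac = step / warmup_steps
    return warmup_lr_init + frac * (base_lr - warmup_lr_init)
```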
Evaluation
We fit the final model on the entire training dataset and validated the trained model on the test dataset. For each image, the model produced up to ~100 candidate detections, each consisting of bounding box coordinates, a predicted class (presence or absence of sacroiliitis), and a confidence score; we kept the single bounding box with the highest score for each SIJ. Detections with a score lower than 0.5 were discarded.
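This post-processing step, keeping the highest-scoring detection per SIJ above a confidence threshold, can be sketched as follows. The detection dictionaries and the `side` key are assumptions for illustration, not the paper's actual data structures.

```python
def select_detections(detections, score_threshold=0.5):
    """Keep the highest-scoring detection for each SIJ side,
    discarding any detection whose confidence is below the threshold.
    Each detection is a dict with keys 'side', 'box', and 'score'."""
    best = {}
    for det in detections:
        if det["score"] < score_threshold:
            continue  # discard low-confidence detections
        side = det["side"]
        if side not in best or det["score"] > best[side]["score"]:
            best[side] = det
    return best
```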
In the qualitative analysis, the radiologist (C.A.) determined whether each drawn bounding box correctly contained a SIJ with an appropriate size. In the quantitative analysis, we used mean average precision (mAP) at a given Intersection over Union (IoU) threshold as the evaluation metric. IoU measures the overlap between a ground-truth bounding box and a predicted box, and mAP is the average of the maximum precision scores across all recall values. A threshold can be predefined to determine whether a prediction counts as correct; for example, mAP at 0.5 IoU is the score when at least a 50% overlap with the ground-truth bounding box is considered a correct prediction. In addition, we calculated the sensitivity, specificity, accuracy, precision, negative predictive value (NPV), and F1-score for the diagnosis of sacroiliitis.
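The evaluation metrics above can be made concrete with a short sketch: IoU for localization and the diagnostic metrics computed from a confusion matrix. This is a generic illustration, not the authors' evaluation code; boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint).
    inter = (max(0, min(ax2, bx2) - max(ax1, bx1))
             * max(0, min(ay2, by2) - max(ay1, by1)))
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```

A prediction with IoU of at least 0.5 against its ground-truth box would count as correct under the mAP@0.5 criterion described above.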