Data preparation and development of the MMDLS program.
The overall program consists of two main parts. The first part involves the input of multimodal data, comprising three datasets: clinical skin images, multi-IHC images, and modified 2019 EULAR/ACR scores. Initially, we collected 884 cases with a diagnosis of suspected lupus from 25 hospitals in China. The final diagnosis of every case was confirmed by two senior dermatologists together with follow-up clinical information. Any case with incomplete data or with a final diagnosis outside the 13 classifications was excluded. The final dataset included 386 cases with 580 clinical skin images, 3380 multi-IHC images and clinical data (Extended Data Table 1).
For the clinical skin image dataset, we retrospectively collected images of skin lesions on the face, trunk, and limbs captured by smartphones. For each case, we ensured that the images were clear and representative for the training and testing procedures. Then, we performed multi-IHC staining of CD4, CD8, CD11b, and CD19 on the skin sections and acquired the multi-IHC images at 100× magnification. In addition, the 2019 EULAR/ACR score was included in this study because the diagnosis of SLE in particular demands a comprehensive evaluation of the overall clinical data. To widen the differences among LE subtypes and other similar skin diseases, the 2019 EULAR/ACR score was modified and applied to all SLE and non-SLE cases to evaluate and quantify systemic involvement.
The second part includes the algorithm and construction of the MMDLS. To prepare the MMDLS, we randomly selected 310 of the 386 cases for training, and the remaining 76 cases served as an independent testing dataset (training vs. testing = 8:2). Notably, for pattern recognition, a suitable feature encoder plays a significant role in the diagnosis results. Currently, advanced feature encoders include VGG-16, Inception-V3, ResNet-18 and EfficientNet-B3, and these encoders are widely used in feature extraction and pattern classification26. We selected the most efficient feature extractor for each image modality according to training time, accuracy, and number of parameters. EfficientNet-B3 exhibited the highest accuracy on multi-IHC images (top-1 accuracy=77.21%, Extended Data Figure 1) and the smallest parameter size (40.88 MB). For clinical skin images, ResNet-18 achieved the highest accuracy (top-1 accuracy=70.76%, Extended Data Figure 1). In addition, ResNet-18 showed the shortest training time of only 428 seconds, whereas the other networks required at least 700 seconds to achieve similar results. We therefore employed EfficientNet-B3 and ResNet-18 for deep feature extraction from multi-IHC images and clinical skin images, respectively. The 2019 EULAR/ACR scores were then integrated into the diagnostic system by binding neurons to the fully connected layer. The comparison of the different encoders is detailed in Extended Data Figure 1, and the overall workflow is shown in Figure 2.
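One way to picture the integration of the score with the two image encoders is feature concatenation ahead of a fully connected classification layer. The following is a minimal sketch under that assumption, not the study's implementation: the function names, the tiny feature dimensions, the random weights and the example score of 12 are all hypothetical placeholders.

```python
import math
import random

def fuse(ihc_features, skin_features, eular_acr_score):
    """Concatenate two image feature vectors with one scalar clinical score."""
    return list(ihc_features) + list(skin_features) + [float(eular_acr_score)]

def linear(x, weights, bias):
    """Fully connected layer: one output logit per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits to class probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
N_CLASSES = 13                              # 13 diagnostic categories

# Hypothetical stand-ins for encoder outputs (real encoders emit far larger vectors).
ihc = [rng.random() for _ in range(6)]      # placeholder for multi-IHC image features
skin = [rng.random() for _ in range(4)]     # placeholder for clinical skin image features
score = 12                                  # hypothetical modified 2019 EULAR/ACR score

x = fuse(ihc, skin, score)
W = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(N_CLASSES)]  # untrained weights
b = [0.0] * N_CLASSES
probs = softmax(linear(x, W, b))

print(len(x), len(probs))
```

In this sketch the score occupies a single extra input neuron of the final layer; in practice a scalar on a different scale from the image features would usually be normalized first, which is again an assumption rather than something stated in the text.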
Comparison of MMDLS performance with single-modal DLS and dual-modal DLS
To obtain a qualified MMDLS, we conducted 10 repeated experiments with the training dataset, and the corresponding testing dataset was used to evaluate the capacity of each MMDLS. The 76-case testing dataset that yielded the highest classification accuracy was chosen to generate the final MMDLS and to perform the comparison experiment (Fig. 3). To verify the superiority of the MMDLS, we compared its overall performance with that of single-modal DLS (clinical skin image modality or multi-IHC image modality) and dual-modal DLS (i.e., clinical skin image and 2019 EULAR/ACR score; multi-IHC image and 2019 EULAR/ACR score; and clinical skin image and multi-IHC image).
First, the efficiencies of the different DLS models were compared based on the average sensitivity, specificity, and precision on the testing dataset (Fig. 3a). All three evaluation indices improved steadily as modalities were combined. Of all the models, the MMDLS developed from the fusion of the 3 modalities showed the best performance (Ave-Sen=0.9116, Ave-Spe=0.9921, Ave-Pre=0.9281), revealing the advanced efficiency of the MMDLS in classification. The multiclassification models exhibited specificities ranging from 0.9751 to 0.9921, and the differences among these models were insignificant. The average sensitivity and precision of the MMDLS showed the greatest improvement over the single- and dual-modal DLS models, suggesting a low misdiagnosis rate and high accuracy of the MMDLS in the multiclassification task.
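The averaged indices can be illustrated with a minimal sketch, assuming one-vs-rest counting over a multiclass confusion matrix followed by an unweighted (macro) mean. The text does not state the exact averaging scheme, so this is one plausible reading, and the 3×3 matrix is a toy example rather than study data.

```python
def one_vs_rest_metrics(cm):
    """Macro-averaged sensitivity, specificity and precision.

    cm[i][j] is the number of cases with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    sens, spec, prec = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k cases predicted as something else
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes predicted as class k
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        prec.append(tp / (tp + fp) if tp + fp else 0.0)

    def mean(values):
        return sum(values) / len(values)

    return mean(sens), mean(spec), mean(prec)

# Toy confusion matrix (hypothetical counts, three classes only).
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
ave_sen, ave_spe, ave_pre = one_vs_rest_metrics(cm)
print(f"Ave-Sen={ave_sen:.4f} Ave-Spe={ave_spe:.4f} Ave-Pre={ave_pre:.4f}")
```

Under this counting, each class treats every case of the other classes as a potential true negative, so with 13 categories the true-negative count dominates and specificity stays high for every model. This is consistent with the narrow specificity range reported above.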
To visually assess the differences, we further applied ROC curves to evaluate the classification efficiency of all DLS models (Fig. 3b). The ROC curve is widely used in diagnostic tests and is summarized by the area under the ROC curve (AUC); the AUC results were consistent with the previous indices. In the comparison of the different DLS models, the MMDLS again had the highest AUC (0.9956). From the perspective of deep learning, we visualized the advantages of the multimodal strategy with box plots generated from the top-1 accuracy values of the repeated experiments (n=10). The box plot simultaneously highlights the diagnostic accuracy and the stability of the different models. As shown in Figure 3c, among the single-modal DLS, the top-1 accuracy of the multi-IHC image modality was greater and its distribution was more concentrated (IHC: top-1 acc=77.13±1.13%) than that of the clinical skin image modality. In other words, outliers and variable performance were more frequently noted in the clinical skin image modality. Given that the multi-IHC image modality showed better diagnostic accuracy and better stability as a single-modal DLS, we then fused the 2019 EULAR/ACR score with the multi-IHC image modality. The average top-1 accuracy increased by 3.6 percentage points (IHC+ACR: top-1 acc=80.73±0.45%). Overall, the MMDLS exhibited the highest accuracy and the best stability (CLI+IHC+ACR: top-1 acc=91.08±0.44%). This finding indicates that single- or dual-modal models are less adequate than the MMDLS for decision-making.
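Both summaries used in this paragraph can be sketched in a few lines. The example below assumes a binary (one class versus the rest) AUC computed as the probability that a positive case scores higher than a negative one, and a sample standard deviation for the "mean ± SD" of repeated runs. The scores, labels and run accuracies are hypothetical, and the text does not specify which multiclass AUC averaging or which SD estimator was used.

```python
import statistics

def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for one class versus the rest.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(scores, labels)

# Hypothetical top-1 accuracies from 10 repeated experiments.
top1_runs = [0.905, 0.912, 0.915, 0.908, 0.911, 0.917, 0.906, 0.913, 0.910, 0.909]
mean_acc = statistics.mean(top1_runs)
sd_acc = statistics.stdev(top1_runs)   # sample SD; the estimator choice is an assumption

print(f"AUC={example_auc:.4f}")
print(f"top-1 acc={100 * mean_acc:.2f}±{100 * sd_acc:.2f}%")
```

A small SD across runs corresponds to a short box in the box plot, which is the sense in which the paragraph treats a concentrated distribution as a sign of stability.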
Classification efficiency of 13 categories by proposed MMDLS
In this section, we employed the same testing dataset mentioned in the comparison of MMDLS with single-modal DLS and dual-modal DLS. To identify the special features of different models in specific diseases, we compared the classification parameters for 13 categories in detail.
Here, the F1 score was used to compare the diagnostic and classification efficiency and accuracy of the modalities for each subtype (Fig. 4a, Extended Data Table 2). For the clinical skin image modality, the classification efficiency for the 4 LE subtypes was unsatisfactory (average F1 score=0.4583), largely because LE skin lesions resemble other skin diseases in appearance. In contrast, the multi-IHC image modality exhibited a clear advantage in diagnosing the LE subtypes, especially DLE and P-SCLE, for which the F1 score increased by 0.4645 (101.84%) and 0.4728 (224.61%), respectively. Although the diagnostic efficiency of the multi-IHC modality was generally superior to that of the clinical skin image modality, the clinical skin image modality had its own advantages in the EM, Ros, DM, and Ecz groups. With the MMDLS, the F1 scores of all 13 categories increased dramatically. Compared with the clinical skin image modality, the F1 scores of the MMDLS for DLE, A-SCLE, P-SCLE and SLE improved markedly (DLE: from 0.4561 to 0.9844, A-SCLE: from 0.5287 to 0.8644, P-SCLE: from 0.2105 to 0.7961, SLE: from 0.6378 to 0.8765, Fig. 4a). The F1 scores of EAC, Vas, DM, LP and EM also grew substantially from clinical skin images to the MMDLS, with increases ranging from 0.1248 (15.08%) to 0.2773 (39.46%). Overall, the MMDLS showed the best performance in the diagnosis and classification of all 13 categories (average F1 score=0.9128).
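The percentages in parentheses read as relative gains over the clinical skin image baseline, that is, the absolute F1 increase divided by the baseline F1. The short check below uses only figures quoted in this paragraph; the helper name is hypothetical, and the relative-gain reading is an interpretation that the numbers happen to support.

```python
def relative_gain_percent(baseline, increase):
    """Absolute increase expressed as a percentage of the baseline value."""
    return 100.0 * increase / baseline

# Clinical skin image F1 scores for the four LE subtypes, as quoted in the text.
clinical_f1 = {"DLE": 0.4561, "A-SCLE": 0.5287, "P-SCLE": 0.2105, "SLE": 0.6378}

# Absolute F1 increases reported for the multi-IHC image modality.
ihc_increase = {"DLE": 0.4645, "P-SCLE": 0.4728}

for subtype, delta in ihc_increase.items():
    gain = relative_gain_percent(clinical_f1[subtype], delta)
    print(f"{subtype}: +{delta:.4f} ({gain:.2f}%)")

# Unweighted mean F1 of the four LE subtypes for the clinical skin image modality.
mean_le_f1 = sum(clinical_f1.values()) / len(clinical_f1)
print(f"average F1 over LE subtypes: {mean_le_f1:.4f}")
```

A relative gain above 100%, as for DLE and P-SCLE, means the score more than doubled, which is possible here only because the baseline F1 scores were low.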
The classification is visualized with the t-SNE plots shown in Figure 4b-e. In t-SNE plots, the more compact and well separated the point clouds are, the easier it is to distinguish the diseases. For the clinical skin image modality, the t-SNE plot showed uniformly scattered points, and most of the 13 categories could not be accurately grouped. Although the points of the multi-IHC image modality showed a clustering tendency, some groups could not be properly separated from other diseases, especially A-SCLE and parts of P-SCLE, SLE, EM and Ecz. After the inclusion of the 2019 EULAR/ACR score, the t-SNE plot showed a more obvious clustering into approximately 13 groups, but the differentiation of SLE, A-SCLE and P-SCLE remained challenging. In the MMDLS, the data of the 13 categories were positioned relatively far from each other, demonstrating the advanced diagnostic ability of the algorithm. The MMDLS performed well in distinguishing the four LE subtypes, especially DLE and SLE. Our results showed that the clustering improved steadily with the fusion of multimodal information. However, although we made progress compared to previous studies21,27, the differentiation of A-SCLE and P-SCLE remains challenging.
A confusion matrix was also constructed to further demonstrate the performance of the MMDLS (Fig. 4f). First, in the clinical skin image modality, the classification efficiency for P-SCLE was unsatisfactory, and it was easily misjudged as DLE (57%), SLE (14%), or Ecz (16%). In contrast, in the multi-IHC image modality, the misdiagnosis rate of P-SCLE was significantly reduced from 88% to 27% (Extended Data Figure 2). Finally, in the MMDLS, the accuracy for P-SCLE increased to 71%, which is far higher than that noted in other DLS models. The diagnostic accuracy for DLE in the multi-IHC image modality was also greater than that obtained with the clinical skin image modality (approximately 41% higher). In the MMDLS, the accuracy for DLE reached 98%. As shown in Figure 4f, the MMDLS exhibited the greatest ability to distinguish among all 13 groups, achieving an average accuracy of 92.54%.
Comparison of MMDLS to dermatologists and pathologists.
To further evaluate the performance of the MMDLS in the real world, the 76 cases in the testing dataset with the best diagnostic accuracy were selected, and we retrospectively collected the clinical diagnosis and pathological diagnosis of each case. Here, the clinical diagnosis refers to the first-impression diagnosis based on the features of the skin lesions and the clinical manifestations. The pathological diagnosis was made by pathologists based on HE staining. The MMDLS output was likewise restricted to a single diagnosis for each case. The gold standard for diagnosis, which acts as the true label, is the diagnosis made by two senior dermatologists who reached a consensus based on all clinical data.
To compare the MMDLS performance with that of the dermatologists and pathologists, we evaluated the average sensitivity, specificity, precision and F1 score (Fig. 5a). Given the nature of multiclassification models, the specificities did not differ significantly and are not discussed further. Compared with the clinical diagnosis and pathological diagnosis, the MMDLS achieved marked increases in sensitivity (Sen: dermatologists=0.4033, pathologists=0.6516, MMDLS=0.9542), indicating that the MMDLS can effectively reduce misdiagnosis. Regarding precision, the clinical diagnosis (Pre=0.5022) and pathological diagnosis (Pre=0.6876) had relatively low precision, whereas the precision of the MMDLS reached 0.9622, further illustrating its high diagnostic accuracy. The classification ability of the MMDLS was thus superior to that of dermatologists and pathologists (F1 score: dermatologists=0.4471, pathologists=0.6691, MMDLS=0.9582), and the MMDLS maintained an overall performance advantage over both groups (Fig. 5a).
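The F1 score is the harmonic mean of precision and sensitivity. As a quick consistency check using only the figures quoted in this paragraph, the sketch below recomputes F1 from each pair of averaged values. It assumes the harmonic mean is applied to the averaged precision and sensitivity; if the reported F1 was instead averaged over per-class F1 scores, small discrepancies are expected.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# (precision, sensitivity, reported F1) as quoted in the text.
reported = {
    "dermatologists": (0.5022, 0.4033, 0.4471),
    "pathologists": (0.6876, 0.6516, 0.6691),
    "MMDLS": (0.9622, 0.9542, 0.9582),
}

recomputed = {}
for reader, (pre, sen, f1_reported) in reported.items():
    recomputed[reader] = f1_score(pre, sen)
    print(f"{reader}: recomputed F1={recomputed[reader]:.4f}, reported F1={f1_reported:.4f}")
```

The recomputed values agree with the reported ones to about three decimal places. The slight difference for dermatologists is the kind of gap a per-class averaging order would produce.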
In addition, the accuracy of the classification results was further evaluated with confusion matrices, which intuitively display the performance of the MMDLS, dermatologists and pathologists across the 13 classifications (Fig. 5b). Notably, in clinical practice, a single definite diagnosis cannot be made for many cases, or it is difficult for clinicians to determine the subtype. Dermatologists usually diagnose lupus skin lesions as "LE" instead of clearly classifying the lesion. Therefore, we added several prediction labels to the clinical and pathological diagnoses: undefined diagnosis (UD), other skin diseases that do not belong to the 13 categories (OSD), and uncertain lupus subtype (ULE). We used this method to obtain a diagnosis that was as close as possible to the actual clinical diagnosis. As shown in Figure 5b, up to 62.96% of LE patients could not be accurately diagnosed by clinical diagnosis. Some cases were misdiagnosed as uncertain lupus subtype or other skin diseases. In other cases, multiple "probable" diagnoses were taken into account. Although pathological diagnosis is regarded as the gold standard for diagnosing LE skin lesions, only 44.44% of LE skin samples could be correctly classified by pathologists. In contrast, in both the diagnosis and classification of LE, the MMDLS exhibited the best performance, achieving an average accuracy of 88.98%. With the exception of A-SCLE, the correct diagnosis rate for the other groups approached or even reached 100%.
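Adding prediction-only labels makes the confusion matrix rectangular: the true labels come from the diagnostic categories, while the predicted labels may additionally be UD, OSD or ULE. A minimal sketch of counting such a matrix follows, assuming each case carries exactly one true and one predicted label. The six example cases are invented for illustration and are not study data.

```python
from collections import Counter

LE_SUBTYPES = {"DLE", "A-SCLE", "P-SCLE", "SLE"}   # LE rows of the matrix
EXTRA_LABELS = {"UD", "OSD", "ULE"}                # appear as predictions only

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; predictions may use the extra labels."""
    return Counter(zip(true_labels, predicted_labels))

def fraction_correct(counts, classes):
    """Share of cases from the given true classes whose prediction matches."""
    total = sum(n for (t, _), n in counts.items() if t in classes)
    correct = sum(n for (t, p), n in counts.items() if t in classes and t == p)
    return correct / total

# Hypothetical gold-standard labels and first-impression clinical diagnoses.
truth = ["DLE", "DLE", "SLE", "SLE", "P-SCLE", "A-SCLE"]
clinical = ["ULE", "DLE", "SLE", "ULE", "OSD", "UD"]

counts = confusion_counts(truth, clinical)
le_accuracy = fraction_correct(counts, LE_SUBTYPES)
print(f"correctly classified LE cases: {100 * le_accuracy:.2f}%")
```

Because a prediction of UD, OSD or ULE can never match a true label, every such case counts against the reader. A generic "LE" impression therefore lowers the measured accuracy of the clinical diagnosis even when lupus itself was recognized.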