Although the manufacturers of the CAD4TB software currently in use in Tanzania state that it was developed only for analysing chest X-rays for the detection of TB, some of the uploaded images were not chest X-rays. CAD4TB scores range from 0 to 100; if a non-chest X-ray image was uploaded, the software returned a negative score. Hence, any score outside this range was regarded as an outlier in this study. Figure 2 shows some of these negative outlier scores. All three vans produced several negative scores, meaning that the data contained non-chest X-ray images. This suggests that van operators may need more training on how the AI-based CAD4TB software works and on its limitations, in particular that it cannot detect cases on which the algorithm was not trained.
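The outlier rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_outliers` and the sample scores are hypothetical and are not taken from the study data; the valid range of 0 to 100 is the one stated above.

```python
from typing import Iterable, List, Tuple

# Valid CAD4TB score range as described in the text.
MIN_SCORE = 0.0
MAX_SCORE = 100.0


def split_outliers(scores: Iterable[float]) -> Tuple[List[float], List[float]]:
    """Separate scores inside [0, 100] from out-of-range outliers."""
    valid: List[float] = []
    outliers: List[float] = []
    for score in scores:
        if MIN_SCORE <= score <= MAX_SCORE:
            valid.append(score)
        else:
            outliers.append(score)
    return valid, outliers


# Hypothetical scores for illustration only (not study data).
example_scores = [12.5, -1.0, 47.3, 98.7, -1.0, 0.0]
valid, outliers = split_outliers(example_scores)
print(len(valid), "valid scores;", len(outliers), "outliers")
```

Counting the outliers per van in this way would give a rough indication of how often non-chest images were uploaded at each site.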
Figure 3 shows the distribution of the tests conducted in all three vans, excluding the Tabora van. Table 1 describes the data, showing the total number of screens for each van, the number of radiologically suggestive cases and the number of cases confirmed by GeneXpert, which served as the ground truth. The last two columns of Table 1 give the number of suggestive cases at CAD4TB scores of 50 (the predefined threshold) and 30 (the threshold suggested by this paper). The Dar es Salaam van had no X-ray results recorded in the registry, yet it screened more people than any other van. With reference to Figure 1, the absence of X-ray results may mean that many patients were sent directly for GeneXpert testing, incurring costs that could have been avoided. It may also mean that the X-ray results were not properly recorded. At the other two sites, the number of individuals who had an X-ray equalled the number subjected to GeneXpert diagnosis.
The purpose of introducing CAD4TB into the TB screening program was to lower the workload on human readers and to increase the screening rate while reducing the number of cases unnecessarily subjected to GeneXpert. The pattern in Figure 3 may mean that human readers did not take into account the CAD4TB results, or even their own manual analysis, and referred all cases to GeneXpert. This can be seen in Table 1, in which only about 2.2% of the screened cases (95 of 4,353) were radiologically suggestive in the Mbeya van and only about 30% (723 of 2,425) in the Mwanza van. To meet the NTLP call and the WHO target of ending TB by 2030, the findings in Figure 2 and Table 1 suggest that van operators should be trained on the screening protocol, which is to start with manual reading and obtain a second opinion from CAD4TB before referring suggestive cases to GeneXpert for confirmation.
Table 1: Screening data description
| Region | Total screens | Radiologically suggestive | GeneXpert confirmed | CAD4TB score >50 | CAD4TB score >30 |
|---|---|---|---|---|---|
| Dar es Salaam | 5,145 | No results | 32 | 27 | 31 |
| Mbeya | 4,353 | 95 | 6 | 5 | 5 |
| Mwanza | 2,425 | 723 | 9 | 7 | 7 |
| Total | 11,923 | 818 | 47 | 39 | 43 |
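As a cross-check on the proportions quoted from Table 1, the share of screened people whom human readers flagged as radiologically suggestive can be recomputed directly from the counts. The sketch below uses only the Table 1 figures for Mbeya and Mwanza (Dar es Salaam has no X-ray results); the function and variable names are illustrative.

```python
# Counts copied from Table 1 (Dar es Salaam has no X-ray results recorded).
table1 = {
    "Mbeya": {"screens": 4353, "suggestive": 95},
    "Mwanza": {"screens": 2425, "suggestive": 723},
}


def suggestive_share(screens: int, suggestive: int) -> float:
    """Percentage of screened people flagged as suggestive by human readers."""
    return 100.0 * suggestive / screens


shares = {van: suggestive_share(**row) for van, row in table1.items()}

# Combined share across the two vans that recorded X-ray results.
total_screens = sum(row["screens"] for row in table1.values())
total_suggestive = sum(row["suggestive"] for row in table1.values())
combined = suggestive_share(total_screens, total_suggestive)

for van, pct in shares.items():
    print(f"{van}: {pct:.1f}% radiologically suggestive")
print(f"Combined: {combined:.1f}% radiologically suggestive")
```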
Comparison between GeneXpert and X-ray test results.
This section examines the performance of human readers against the gold standard, GeneXpert. Although chest X-ray can be used to screen for pulmonary TB, GeneXpert remains the confirmatory test. To assess the performance and efficiency of chest X-ray reading without CAD4TB, we compared human readings of chest X-rays against GeneXpert. This comparison was performed for Mbeya and Mwanza only, because the Dar es Salaam registry did not contain X-ray results. Its aim was to compare the ability of human X-ray readers with that of CAD4TB, taking the GeneXpert results as the ground truth. Figure 4a shows the results of manual X-ray analysis in Mbeya: of the 95 cases that human readers marked as suggestive, only 6 were confirmed by GeneXpert. The situation was more severe in Mwanza, where only 9 of the 723 patients whose chest X-rays were suggestive were confirmed by GeneXpert, as shown in Figure 4b. In these two regions alone, 818 people would therefore have been subjected to GeneXpert, with only 15 cases confirmed positive, had only radiologically suggestive cases been referred. In practice, all individuals who attended the screening were subjected to GeneXpert, as can be seen in Figure 3. Either way, even if only radiologically suggestive cases had been referred to GeneXpert, this added unnecessary cost and time to obtaining the TB results.
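The gap between suggestive and confirmed cases can be expressed as the fraction of radiologically suggestive cases that GeneXpert confirmed. The following sketch computes this from the counts reported above; the function name `confirmation_rate` is illustrative and is not a term used in the study.

```python
# Suggestive (human reading) and GeneXpert-confirmed counts reported above.
human_reading = {
    "Mbeya": {"suggestive": 95, "confirmed": 6},
    "Mwanza": {"suggestive": 723, "confirmed": 9},
}


def confirmation_rate(suggestive: int, confirmed: int) -> float:
    """Percentage of radiologically suggestive cases confirmed by GeneXpert."""
    return 100.0 * confirmed / suggestive


rates = {van: confirmation_rate(**row) for van, row in human_reading.items()}

# Pooled rate over both vans.
all_suggestive = sum(row["suggestive"] for row in human_reading.values())
all_confirmed = sum(row["confirmed"] for row in human_reading.values())
pooled = confirmation_rate(all_suggestive, all_confirmed)

for van, pct in rates.items():
    print(f"{van}: {pct:.1f}% of suggestive cases confirmed")
print(f"Pooled: {pooled:.1f}% of suggestive cases confirmed")
```

On these counts, fewer than one in ten suggestive readings was confirmed at either site.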
Establishing a threshold value for the CAD4TB
As stated earlier, CAD4TB provides scores from 0 to 100, and the higher the score, the more suggestive the image is of TB. Our study revealed that a considerable number of positive cases were missed when a threshold value of 50 was used. As shown in Figure 5, eight (8) GeneXpert-confirmed positive cases were left out; these are marked with an orange cross in the figure.
To address this uncertainty, we analysed the distribution and variability of CAD4TB scores against GeneXpert results, as shown in Figure 6. Among GeneXpert-positive patients, the CAD4TB scores were normally distributed between a minimum of 31.96 and a maximum of 98.71, a range that covered 44 of the 47 positive cases (approximately 94%). Among GeneXpert-negative patients, the CAD4TB scores were normally distributed within the range of 0.08 to 39.13. Data points outside these ranges were identified as outliers.
Nevertheless, since this study aimed to assess the reliability of the CAD4TB software, it is worth noting that, in the analysis above, the CAD4TB results agreed with the GeneXpert results 99% of the time.
Comparing the two thresholds, a score of 30 selected more of the radiologically suggestive cases than a score of 50, as shown in Figure 7. However, a large number of radiologically suggestive patients were still left out at either threshold. The high number of cases that were suggestive on radiology reading but left out by CAD4TB shows a strong misalignment between CAD4TB and manual X-ray readings. On the other hand, there was very strong alignment between the CAD4TB results at a score of 30 and the GeneXpert results.
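The effect of lowering the threshold can be illustrated with a small sketch. The scores below are hypothetical values chosen for illustration and are not study data; the function name `suggestive_at` is likewise an assumption. A case is treated as suggestive when its score is strictly above the threshold, matching the ">50" and ">30" column headings of Table 1.

```python
from typing import List


def suggestive_at(scores: List[float], threshold: float) -> List[float]:
    """Return the scores strictly above the given CAD4TB threshold."""
    return [s for s in scores if s > threshold]


# Hypothetical CAD4TB scores for illustration only (not study data).
scores = [12.0, 31.96, 45.5, 50.0, 72.3, 98.71]

at_50 = suggestive_at(scores, 50)
at_30 = suggestive_at(scores, 30)

print(f"Suggestive at >50: {len(at_50)} of {len(scores)}")
print(f"Suggestive at >30: {len(at_30)} of {len(scores)}")
```

Any case selected at the higher threshold is also selected at the lower one, so lowering the threshold can only add cases, never remove them.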
CAD4TB performance
This section compares the performance of CAD4TB against that of human readers of chest X-rays and against GeneXpert, using the threshold score of 30 established from our analysis of the collected data in the previous section.
a) CAD4TB comparison with positive confirmed GeneXpert results
This section examines the performance of CAD4TB against the gold standard, GeneXpert, for the Mwanza, Mbeya and Dar es Salaam vans. Figure 8 shows the performance achieved with the manufacturer's default score of 50, while Figure 9 shows the performance achieved with a score of 30, as per our recommendation. Figure 7a shows that when a threshold score of 50 was used, CAD4TB missed 5 GeneXpert-confirmed positive cases in Dar es Salaam, 2 in Mwanza and 1 in Mbeya, consistent with Table 1. With reference to Figure 8b, two (2), three (3) and one (1) negative cases in Dar es Salaam, Mwanza and Mbeya, respectively, were unnecessarily subjected to GeneXpert as a result of the CAD4TB analysis. When a score of 30 was used, only one (1) case was missed in Dar es Salaam, compared with five (5) at a score of 50, as shown in Figure 9a. However, the number of missed cases in Mwanza and Mbeya remained the same. Overall, across Table 1, Figure 7 and Figure 8, CAD4TB at a score of 30 detected 91.5% of all positive cases detected by GeneXpert. For a fair comparison between human readers and CAD4TB against the gold standard of GeneXpert, we compared only the two vans, Mbeya and Mwanza, that had manual reading, CAD4TB and GeneXpert results. Our findings showed that human readers marked 12% of all screened cases in Mwanza and Mbeya as suggestive, while CAD4TB at a score of 30 marked only 0.63% of all cases as suggestive, compared with the 0.69% of cases confirmed by GeneXpert.
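The overall detection rates at the two thresholds can be recomputed from Table 1 as a worked check. The sketch below assumes that the "CAD4TB >50" and "CAD4TB >30" columns count GeneXpert-confirmed cases scoring above each threshold, which is the reading that reproduces the 91.5% figure; the function names are illustrative.

```python
from typing import Dict

# GeneXpert-confirmed cases per van, and those scoring above each threshold
# (assumed interpretation of the last two columns of Table 1).
confirmed = {"Dar es Salaam": 32, "Mbeya": 6, "Mwanza": 9}
above_50 = {"Dar es Salaam": 27, "Mbeya": 5, "Mwanza": 7}
above_30 = {"Dar es Salaam": 31, "Mbeya": 5, "Mwanza": 7}


def detection_rate(detected: Dict[str, int], total: Dict[str, int]) -> float:
    """Percentage of GeneXpert-confirmed cases flagged by CAD4TB."""
    return 100.0 * sum(detected.values()) / sum(total.values())


def total_missed(detected: Dict[str, int], total: Dict[str, int]) -> int:
    """Number of GeneXpert-confirmed cases not flagged by CAD4TB."""
    return sum(total.values()) - sum(detected.values())


rate_50 = detection_rate(above_50, confirmed)
rate_30 = detection_rate(above_30, confirmed)

print(f"Threshold 50: {rate_50:.1f}% detected, {total_missed(above_50, confirmed)} missed")
print(f"Threshold 30: {rate_30:.1f}% detected, {total_missed(above_30, confirmed)} missed")
```

Under this reading, the threshold of 50 misses eight confirmed cases in total, in line with the eight cases marked in Figure 5, and the threshold of 30 reduces that number to four.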