In this study, we demonstrate that DL algorithms can effectively make binary decisions, distinguishing between normal and abnormal conditions, based on retrospectively collected 2D fetal kidney US images. The top-performing model trained and evaluated in this work achieved an AUC of approximately 90% and an accuracy greater than 80% on an independent held-out test set, showing promise in its ability to generalize to unseen US images containing CAKUT. A novel facet of this work is the flexible interpretation of multi-class performance from binary predictions. We generated a highly accurate 2-class DL model from which we could interpret 3-class labels (with the UTD and MCDK labels grouped into the ‘abnormal’ metaclass), thereby investigating which abnormal labels were more reliably predicted despite our binary problem formulation. Interestingly, the adapted 2-class confusion matrix suggested that UTD instances are more accurately identified (85% ± 8%) than MCDK instances (67% ± 12%). It is important to note that the held-out test set contained only 11 MCDK instances compared with 43 UTD instances, suggesting that, with a greater number of instances available for training, UTD images become more visually distinguishable from normal US images.
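The metaclass interpretation described above can be sketched in a few lines: a binary model's predictions are scored against 3-class ground truth by collapsing the subclass labels into the ‘abnormal’ metaclass, and per-subclass sensitivity is then read off from the binary outputs. The label encodings and arrays below are illustrative placeholders, not values from this study.

```python
import numpy as np

# Hypothetical encoding: 0 = normal, 1 = MCDK, 2 = UTD (3-class ground truth).
# The binary model only ever predicts 0 = normal or 1 = abnormal.
y_true_3class = np.array([0, 0, 1, 2, 2, 1, 2, 0])   # illustrative ground truth
y_pred_binary = np.array([0, 1, 1, 1, 0, 1, 1, 0])   # illustrative model output

# Collapse the 3-class labels into the binary metaclass:
# MCDK (1) and UTD (2) both map to 'abnormal' (1).
y_true_binary = (y_true_3class > 0).astype(int)

# Overall binary accuracy against the collapsed labels.
accuracy = (y_pred_binary == y_true_binary).mean()

def subclass_recall(subclass_label):
    """Of all true images of one subclass, what fraction did the
    binary model correctly flag as abnormal?"""
    mask = y_true_3class == subclass_label
    return y_pred_binary[mask].mean()

recall_mcdk = subclass_recall(1)
recall_utd = subclass_recall(2)
```

Because the binary model never distinguishes MCDK from UTD, the off-diagonal structure among abnormal subclasses is unknowable; only the per-subclass detection rate (sensitivity) is recoverable this way.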
Significant advancements have been made in obstetric ultrasonography over the past four decades, enabling physicians to safely and effectively assess fetal conditions and accurately diagnose structural anomalies. However, despite improvements in US technology, ultrasonograms still rely heavily on the skill of the operator, which can contribute to under- or over-diagnosis, lead to medical malpractice concerns, and hinder accessibility in disadvantaged areas. It has been recognized that DL algorithms have the potential to serve as practical aids for performers of fetal ultrasonography, facilitating accurate image acquisition and diagnosis16. In a recent example, Xie et al. reported the feasibility of utilizing AI models to diagnose fetal brain abnormalities17, and our prior work demonstrated the use of DL models for the diagnosis of cystic hygroma from fetal US imagery12. While the detection of fetal congenital abnormalities represents a critical objective in fetal ultrasonography, there is a paucity of reports demonstrating the application of AI to CAKUT diagnosis and, of those reported examples, only limited transfer learning-based experiments exist18–20. To the best of our knowledge, this study represents the first large-scale initiative to investigate US images of CAKUT using a DL algorithm trained from scratch and leveraging XAI techniques to interpret model predictions.
There are several limitations and strengths to this study. First, this study only leverages data collected from a single center and thus, the sample size for developing and validating DL models is relatively small. As part of our mitigation strategy, we implemented k-fold cross-validation with numerous repetitions, with final model evaluation on a fully held-out test dataset. This approach is widely recognized as effective in mitigating the challenges associated with small datasets21. A strength of this work is our investigation of the hierarchical grouping of CAKUT through a complementary binary (‘normal’ vs. ‘abnormal’) and multi-class (‘normal’ vs. ‘MCDK’ vs. ‘UTD’) experimental design, enabling a fair comparison between prediction paradigms with varying numbers of classes. This introduces a novel concept of how variable grouping of instances into classes or metaclasses can mitigate a lack of data within any one class, and explores the impact on model performance; consequently, this concept could be extended to other data representation hierarchies (including unsupervised learning, where arbitrary class labels may be discarded in favor of embedding similarity).
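The repeated stratified k-fold scheme referenced above can be sketched as follows, assuming a held-out test set has already been set aside before any splitting. Each class's indices are shuffled and dealt round-robin into k folds, so every fold preserves the dataset's class balance; the function and parameter names are our own, not from this study's codebase.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=5, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated stratified k-fold CV.

    A minimal sketch: within each repeat, the indices of each class are
    shuffled independently and dealt round-robin into k folds, so each
    fold mirrors the overall class balance of `labels`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            for i, idx in enumerate(shuffled):
                folds[i % k].append(idx)
        # Each fold takes a turn as the validation split.
        for f in range(k):
            val = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, val

# Illustrative usage: an imbalanced toy dataset (10 normal, 5 abnormal).
splits = list(repeated_stratified_kfold([0] * 10 + [1] * 5, k=5, repeats=2))
```

Averaging validation metrics over many such repeats reduces the variance of performance estimates on small datasets, while the untouched held-out test set provides the final unbiased assessment.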
Another limitation of this study is the lack of a standardized data acquisition process for fetal ultrasound images, which raises concerns about data quality and consistency between normal and abnormal images. In a retrospective study, we are limited by the quality and consistency of the available data and, understandably, DL model performance depends on both the quality and quantity of the imagery available for training and evaluation. While the versatile comprehension of AI through emergent model explainability methods (e.g., GradCAM & HiResCAM) can potentially aid in the analysis of these diverse data, the absence of a standardized method of data acquisition introduces risks. Furthermore, clinicians may not have a comprehensive understanding of the impact of data quality and consistency on model training, which increases the likelihood of introducing bias or class leakage into the dataset. This potential data leakage compromises the reliability and generalizability of the results obtained from the study. Therefore, it is essential to acknowledge the limitations stemming from the method of data acquisition and consider the possible implications for the accuracy and robustness of the models.
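For readers unfamiliar with the cited explainability methods, the core difference between GradCAM and HiResCAM can be stated in a few lines of NumPy, given a convolutional layer's activation maps and the gradients of the predicted class score with respect to them. This is a conceptual sketch of the published formulations, not the pipeline used in this study; the array names are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: channel weights are the spatially averaged gradients,
    applied to each activation map before summing over channels."""
    # activations, gradients: (C, H, W) arrays from the target conv layer.
    weights = gradients.mean(axis=(1, 2))                  # (C,) pooled weights
    cam = (weights[:, None, None] * activations).sum(axis=0)
    return np.maximum(cam, 0)                              # ReLU

def hires_cam(activations, gradients):
    """HiResCAM: element-wise product before the channel sum, skipping
    the spatial pooling that can blur or misplace Grad-CAM attributions."""
    cam = (gradients * activations).sum(axis=0)
    return np.maximum(cam, 0)
```

When gradients vary spatially within a channel, the two maps diverge: Grad-CAM's pooled weights can cancel opposing gradient signs, whereas HiResCAM retains the per-pixel attribution before summing.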
The utilization of DL in ultrasonography has the potential to mitigate risks associated with the inherent nature of this imaging technique. Ultrasonography is heavily reliant on the operator, resulting in images that lack reproducibility and introduce subjectivity during acquisition and interpretation22. This dependence on the operator can lead to inconsistencies in image quality and diagnostic outcomes, posing challenges for physicians and limiting the broader utilization of ultrasound in medically underserved regions22.
Although DL-based models have demonstrated remarkable success across various domains in recent years, it is important to recognize that these models are not infallible, and they may occasionally produce erroneous predictions despite achieving high performance metrics. Several factors could contribute to the observed erroneous predictions. One possible explanation lies in the limitations of the training data itself. Despite efforts to curate high-quality and representative datasets, the presence of biased or noisy data can adversely affect the model's ability to generalize accurately. Additionally, DL models are highly complex and often consist of numerous interconnected layers, making it challenging to interpret their decision-making process and pinpoint the source of errors. To better interpret the predictions of such a model, it is worth considering more intuitive model visualization techniques such as the recently proposed CAManim method. Furthermore, DL models rely heavily on the optimization of loss functions during training, which may lead to overfitting or the exploitation of statistical patterns that do not truly generalize to real-world scenarios (e.g., the regular occurrence of blackout regions from US scans). This can result in models that perform well at the task they were trained on, but fail when applied to unseen or ambiguous inputs, leading to unexpected and erroneous outputs. Our observations highlight the need for further research and development to enhance the reliability and robustness of DL models, particularly in critical domains where erroneous predictions can have severe consequences. With sufficient safeguards and human-in-the-loop review, we may improve our understanding of the underlying causes of these errors and strive towards increasingly trustworthy and accurate DL-based models that can be effectively deployed in real-world applications.
Future directions in evaluating the use of DL models in fetal US diagnostics should include prospective studies conducted with well-defined data collection protocols, established in consultation with a panel of both AI and subject-matter experts who collectively possess a deep understanding of the clinical challenges as well as the impact of data quality and consistency on model training and evaluation. Data collection protocols can help mitigate the risks of data leakage, bias, and class leakage, and can ensure that the collected images are representative, consistent, and of high quality, leading to a more reliable and generalizable dataset. Such a protocol may also ensure that multi-institutional data collection remains compatible. The inclusion of AI experts in the development of the protocol will also enhance the understanding of the specific requirements and challenges associated with training machine learning models for fetal ultrasound diagnosis. Consequently, a prospective study with a robust data collection protocol will contribute to the advancement of accurate and trustworthy AI-based diagnostic tools in this domain.