5.1. AI-based CTG interpretation
This review discusses the current literature on machine learning techniques for the interpretation of intrapartum CTG signals. To do this, we analysed the current state of the art and identified several techniques that have been applied. These included AIs using the following base algorithms:
5.2. Support Vector Machine (SVM)
SVM is a good choice of classifier for the CTG databases that were employed, as their records can be divided into normal and abnormal states. However, without an adequate hyperplane and margin to differentiate the two classes, the system will lose precision while learning the difference between normal and abnormal states [14, 20, 40, 57]. This could have clinical implications: the loss of precision could result in unnecessary intervention when borderline normal fetuses are falsely classified as abnormal, or in missed diagnoses when borderline abnormal fetuses are misclassified as normal, resulting in fetal or neonatal morbidity or mortality. In addition to the resultant morbidity and mortality, this would open the hospital or the legal manufacturer of the AI to litigation (depending on which party is legally liable for malpractice)[58–60]. Though this limitation applies to all ML techniques, the inherently binary nature of SVM classification places it at greater risk.
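The two kinds of error described above can be made explicit with a short sketch. It assumes that "positive" means the abnormal class; the function name and the counts are hypothetical and are not taken from any study in this review.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion counts.

    'Positive' is assumed to mean the abnormal class.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Share of abnormal traces correctly flagged; a low value means missed diagnoses.
        "sensitivity": tp / (tp + fn),
        # Share of normal traces correctly cleared; a low value means unnecessary intervention.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=30, tn=420, fn=10)
print(metrics)
```

With these invented counts the overall accuracy is 92%, yet one in five abnormal traces is still missed, which is why a single accuracy figure may understate the clinical risk.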
While the mean accuracy of the SVM was the weakest in the CTU-UHB database (84.19%), it performed well when applied to the UCI Machine Learning Repository (mean accuracy: 94.00%). After pre-processing, Nagendra et al. achieved the highest accuracy of all SVM-based classifiers (98%) with a more specific feature vector [35]. Had extra variables been employed, they could have carried meaningful information that would influence the overall outcome and introduce the possibility of a third class[35]. A good example of applying SVM while maintaining more than two classes is Harimoorthy et al., where three different diseases with common symptoms were investigated [61]. However, that system was modified to employ an improved radial SVM (SVM-radial). Therefore, regarding SVM’s relevance for future development: if the task at hand has two separable classes, SVM can be a suitable choice. However, if the task requires multi-class classification, then an alternate form of SVM or a different classification technique is recommended.
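One common way to extend a binary classifier to several classes is a one-vs-rest scheme, sketched below. This is an illustration of the general idea only, not the method used by any study cited here; the hyperplane weights, the two features, and the function names are all hypothetical.

```python
def linear_score(weights, bias, x):
    """Score of x relative to the hyperplane w.x + b = 0 (larger means more confident)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def one_vs_rest_predict(models, x):
    """Pick the class whose 'this class vs. the rest' binary score is highest."""
    return max(models, key=lambda label: linear_score(*models[label], x))


# Hypothetical, hand-picked hyperplanes over two made-up features.
# A trained system would learn one (weights, bias) pair per class.
MODELS = {
    "normal": ([1.0, 1.0], -1.0),
    "suspect": ([-1.0, 1.0], 0.0),
    "pathological": ([0.0, -1.0], 0.5),
}

print(one_vs_rest_predict(MODELS, [2.0, 1.0]))
print(one_vs_rest_predict(MODELS, [-2.0, 1.0]))
print(one_vs_rest_predict(MODELS, [0.0, -2.0]))
```

Each class keeps its own binary decision boundary, so the two-class nature of the underlying classifier is preserved while the overall system returns one of three labels.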
5.3. Decision Trees (DT) and Random Forests (RF)
In CTG interpretation, these technologies can be used to predict fetal state using signal changes and the probability that a particular branch of the DT/RF model matches the signal [62]. This gives both DT and RF an advantage over SVM, as developers can incorporate more categories to represent the fetal state (e.g. three: abnormal, suspect, normal). When compared with each other across the CTU-UHB and UCI databases, DT performed poorly whilst RF performed extremely well. This suggests that RF may be the superior technique for CTG classification of fetal state.
Interestingly, DT and RF can be combined with SVM to create a broader system that introduces more variables, but this has been shown to compromise accuracy [46]. Fergus et al. employed a deep learning approach for the random forest. In doing so, they were able to incorporate more information, which enhanced their accuracy (98.12%)[36]. This suggests that a larger training dataset and an appropriate classifier could make RF models clinically relevant, which is useful in guiding the development of future DT/RF models.
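The way a forest turns several trees into a three-class prediction with an associated probability can be sketched as a simple vote. The trees below are hand-written rules rather than learned ones, and the feature names and thresholds are invented for illustration; they are not clinical thresholds.

```python
from collections import Counter


# Each "tree" is a hand-written rule standing in for a learned decision tree.
# Feature names and cut-offs are hypothetical and carry no clinical meaning.
def tree_variability(f):
    if f["variability"] < 3:
        return "abnormal"
    return "suspect" if f["variability"] < 5 else "normal"


def tree_decelerations(f):
    if f["decelerations"] >= 3:
        return "abnormal"
    return "suspect" if f["decelerations"] >= 1 else "normal"


def tree_combined(f):
    low_var, any_decel = f["variability"] < 5, f["decelerations"] >= 1
    if low_var and any_decel:
        return "abnormal"
    return "suspect" if (low_var or any_decel) else "normal"


def forest_predict(trees, features):
    """Return the majority label and the fraction of trees voting for each label."""
    votes = Counter(tree(features) for tree in trees)
    label, _ = votes.most_common(1)[0]
    fractions = {k: v / len(trees) for k, v in votes.items()}
    return label, fractions


TREES = [tree_variability, tree_decelerations, tree_combined]
print(forest_predict(TREES, {"variability": 4, "decelerations": 0}))
```

The vote fractions play the role of the branch-matching probability mentioned above, and adding a class only requires that the trees be allowed to return one more label.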
5.4. Neural Networks (NN)
In this study, we identified that NN-based AIs were often utilised, with 21 different models developed, thirteen of which had accuracies above 90%. When we compared the mean performance metrics for NN with the other techniques in both databases, NN performed very well. Additionally, two studies reported a 99.9% accuracy, though it is uncertain whether this was tested on a training dataset or a separate validation dataset[36, 37]. If the latter, this indicates that the system can associate each class to an almost perfect score and that these models hold considerable promise clinically if they can maintain a high standard in clinical evaluations.
Another advantage of neural networks is that their architecture can be altered to suit the application. Artificial neural networks can be built as deep or shallow networks[21–25, 27–29, 31, 33, 37–39, 41, 44, 47, 54, 56]. Convolutional neural networks, which have an image-processing focus, look at the shape of the signal, which helps predict accelerations[43, 46, 49, 50], and recurrent neural networks can perform time-based predictions using trends of previous samples of a signal to analyse current and future samples of the same signal[32, 38].
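The time-based behaviour of a recurrent network can be illustrated with a single recurrent unit followed by a sigmoid output. This is a minimal sketch with untrained, hypothetical weights, not a reconstruction of any model in the reviewed studies; the input is assumed to be a normalised signal.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_scores(samples, w_in, w_rec, w_out, b_out):
    """Run a one-unit recurrent cell over a signal and return one score per sample."""
    h = 0.0  # hidden state carried from one sample to the next
    scores = []
    for x in samples:
        # The new state mixes the current sample with the history held in h.
        h = math.tanh(w_in * x + w_rec * h)
        scores.append(sigmoid(w_out * h + b_out))
    return scores


# Hypothetical weights; a real model would learn these from data.
WEIGHTS = dict(w_in=1.0, w_rec=0.5, w_out=2.0, b_out=0.0)

quiet_then_rise = rnn_scores([0.0, 0.0, 1.0], **WEIGHTS)
sustained_rise = rnn_scores([1.0, 1.0, 1.0], **WEIGHTS)
print(quiet_then_rise[-1], sustained_rise[-1])
```

Both sequences end on the same sample, yet the final scores differ because the hidden state remembers what came before, which is the property that suits recurrent models to trend-based signals.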
As NNs improve with time, hyperparameters will future-proof the concept and increase the relevance of the technique in applications like CTG interpretation. Using hyperparameters, the network structure can be optimised before training begins or bias is introduced into the model[63]. This could improve the accuracy and reliability of NN-based models through the incorporation of hidden layers, or even denser, bigger layers via the addition of hidden units. Furthermore, hyperparameters permit different optimizers, regularisation, activation functions, enhanced learning, and the reduction of overfitting (dropout)[64]. All these capabilities highlight the promise of future NN applications in CTG.
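As a rough illustration of how such hyperparameters define a set of candidate network structures before any training takes place, the sketch below enumerates a small grid. The parameter names and values are assumptions chosen for the example and are not drawn from the cited studies.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3],
    "units_per_layer": [16, 64],
    "dropout": [0.0, 0.5],
    "optimizer": ["sgd", "adam"],
}


def grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(SEARCH_SPACE))
print(len(configs), "candidate configurations")
print(configs[0])
```

Each dictionary would then be used to build and train one candidate network, with the best configuration chosen on held-out data rather than on the training set.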
5.5. Custom Algorithms
The category of custom algorithms comprised the models that used novel techniques and either combinations or modified versions of SVM, DT, RF, or NN. Modified versions of NNs can be beneficial as they are less computationally demanding than traditional NNs without compromising the accuracy of the model. Although these models did not score the highest accuracies, they offer a good option for clinical settings where computational power and IT resources are commonly limited, for example low- and middle-income countries, regional hospitals, and small clinics, where access to up-to-date or high-powered computers is limited. That said, combinations of different AI techniques can help increase accuracy. Fergus et al. used Fisher's Linear Discriminant Analysis, SVM and RF to produce their final model, which performed at 96% accuracy and required little computational power[41].
5.6. Clinical Implications
As mentioned, most of the included studies used publicly available databases, which are an inadequate representation of common practice and the general population. Also, these databases make no distinction between maternal, fetal, or placental conditions (e.g. placental insufficiency or fetal growth restriction), which can influence fetal risk of hypoxic damage. However, these databases provide sufficient samples to justify developing and testing modern machine learning techniques. In any CTG trace, two streams of signals are detected: one corresponding to the fetal heart rate and the other to the mother’s uterine contractions. Additionally, an event tracker may be provided to the mother to mark on the CTG trace when she detects a fetal movement. The samples of these signals are stored as values, which makes them easy to load as a table for processing and to set up as an input for the models we have discussed. However, for interpretation, the signals are given to medical staff and parents as a time-dependent graph printout[65]. Such graph printouts are a good dataset for time-series models such as recurrent neural networks, as Warrick et al. and Tang et al. set out to test [32, 38]. Alternatively, the printouts have been treated as 2D image inputs to convolutional neural networks, where various features and sequences have been successfully observed by the models.
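The tabular form of the two signal streams, and their conversion into fixed-length sequences for a time-series model, can be sketched as follows. The column names, the sample values, and the window settings are all invented for illustration and do not reflect the layout of the CTU-UHB or UCI databases.

```python
import csv
import io

# Tiny made-up table: one row per sample, one column per signal stream.
RAW = """time_s,fhr_bpm,uc
0,140,10
1,142,12
2,141,15
3,138,20
4,135,18
5,137,14
"""


def load_trace(text):
    """Parse the table into a list of (fetal heart rate, uterine contraction) pairs."""
    rows = csv.DictReader(io.StringIO(text))
    return [(float(r["fhr_bpm"]), float(r["uc"])) for r in rows]


def sliding_windows(trace, size, step):
    """Cut the trace into overlapping fixed-length segments for a sequence model."""
    return [trace[i:i + size] for i in range(0, len(trace) - size + 1, step)]


trace = load_trace(RAW)
windows = sliding_windows(trace, size=4, step=2)
print(len(trace), "samples ->", len(windows), "windows")
```

Each window keeps the samples in time order, so it can be fed to a recurrent model directly, whereas an image-based model would instead receive a rendered plot of the same segment.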
Another point that must be raised is that almost all the studies in this review focused on a two-class (normal and abnormal) or three-class (normal, suspect, and pathological) classifier system. The latter system emulates the current clinical standard of the FIGO classification system, where CTGs are also designated as either normal, suspect, or pathological[65]. That said, CTG classification does not necessarily correlate with neonatal outcome [9]. However, the capabilities of AI in clinical decision support presently supersede this and, as such, there is potential, particularly with NN-based models, to go one step further and provide a clinical diagnosis. Indeed, in cardiology wards, many commercial electrocardiography (ECG) telemetry units are connected to a central screen which can detect when a patient has a sinus rhythm or is experiencing arrhythmias such as tachycardia, ventricular fibrillation, and atrial flutter, to name a few, and then alert clinicians[66–69]. In the same way, AI-based CTG interpretation could go further to identify and mark CTG features such as accelerations, early and late decelerations, and variability, and potentially identify or aid in managing fetal or maternal issues that can be associated with fetal heart rate, such as congenital abnormalities, fetal compromise, chorioamnionitis, fetal immaturity at gestational age, acute and chronic hypoxemia, and fetal growth restriction[70–72]. In doing so, AI will help change existing point-of-care CTG interpretation to a more constant and longitudinal assessment.
One well-known clinical decision support system in the field of fetal monitoring is INFANT [73]. The system was built to suggest decisions for clinicians to carry out during the intrapartum period. Though the system was still in development [74], the team deemed the clinical trial a failure, as the system failed to perform better than the professional staff [75]. The system’s patent shows that it runs a time-based neural network (RNN), where each sample relies on the previous sample to predict the current state of the fetus, while employing a sigmoid-based classifier [76]. Similar systems in this review have been shown to score 90% accuracy in separating different fetal states with minimal pre-processing of the samples. In addition, all the information found regarding the INFANT system focused on the signal analysis, while neglecting the innovative side of the machine learning technique.
Based on the findings of this review, AI has demonstrated the potential to distinguish different fetal states with high accuracy. However, none of these systems has been proven to provide diagnostic-level detail for clinical implementation, as the training datasets are skewed by a lack of understanding of the complexity of the clinical challenge. At the current stage of development, the models have only been tested internally and thus have not been validated externally. As such, there is no evidence that the models will perform at the same accuracy in a clinical setting. Although the validation is promising and justifies further development, more concrete clinical trials are required before considering whether implementation is feasible.
5.7. The evolving technological paradigm in FHR monitoring
There is no doubt that CTG has significant clinical utility for obstetricians, midwives, and pregnant women. It is a technology that has come to represent the standard-of-care for non-invasive electronic fetal monitoring. However, with technological advancements and the growing interest in fetal electrocardiography (fECG), the future of electronic fetal monitoring is likely to place decreasing reliance on CTG as the industry shifts to fECG. This is because fECG can provide more accurate information than CTG and does not have the physical challenges of losing signal due to transducer placement, limiting maternal mobility, inconvenient attachments, limitations with use for women with high BMIs, etc [7, 77–79]. Indeed, large industry players such as Philips Healthcare [80] and GE Healthcare [81] are already introducing fECG-based devices, and numerous small-to-medium sized enterprises are developing technologies using a similar foundation.
Whilst CTGs are unlikely to be replaced immediately, there will likely be a slow, phased transition towards the implementation of fECG technologies. This is a consideration for those developing AI models for CTG interpretation as their target user will shift towards rural clinics and developing markets over the next decade or so[82].
5.8. Avenues for Future Research Direction
With the growing interest in digital technologies and artificial intelligence in healthcare, fetal monitoring and CTG interpretation is certainly an area that is promising, yet still requires a lot of work. Based on the findings of this review, the authors believe that there are several opportunities for improvement.
Firstly, there is the need for more accurate and reliable models to be developed and evaluated. Though many of the models performed extremely well in their pre-clinical validation, it must be anticipated that clinical performance may be considerably lower, and this should be factored in the development. Along this line is the need for the models to be evaluated in a clinical scenario under trial circumstances. As such, clinical evaluation studies should be pursued to determine if the high accuracies reported by the studies included in this review are indeed representative of how the model will perform in a clinical setting. In turn, such studies could assist in the implementation of promising models as potential adopters will become more aware of the technologies[83].
Secondly, as previously discussed, there is an opportunity to expand the endpoints of the AI tools beyond simple classification as per FIGO guidelines alone, to providing more comprehensive clinical decision support and, in the future, perhaps fully automated interpretation. Hopefully, this will enable a more complete overview of the fetus and establish an individualised risk model for hypoxic injury. However, to achieve this, developers must develop a better understanding of CTG traces and their link with metabolic acidaemia in the fetus. This will also require an understanding of the underlying maternal, fetal, and placental factors which contribute towards evolving antenatal fetal risk; the link between CTG abnormalities and neonatal outcomes; the link between observed changes on CTG and the subsequent clinical management; and the needs of the clinicians who will use this information in their day-to-day practice. On the last point, the needs of the clinician are highly important, as alarm fatigue and other human factors considerations can play a significant role in the design of the model and the broader software package [84].
Lastly, with growing interest in alternative fetal monitoring technologies such as fECG, the development of AIs for fECG may soon prove to be a promising new research direction, particularly as fECG can provide more detailed information about the fetus's wellbeing [85].
5.9. Limitations of this study
The findings of this review should be interpreted considering the following limitations.
Firstly, none of the papers included in this review applied these AIs in a clinical setting or externally validated them through clinical evaluation. This could reflect the readiness or adequacy of these AIs for clinical evaluation, a slow rate of progress in this field, or, more likely, that the techniques are not yet mature, given that only 3 studies were published prior to 2010. An associated limitation is that most studies used one of two databases (CTU-UHB or UCI Machine Learning Repository) as the source of data for training and validating their respective AI models. Though these databases form a good starting point for AI developers, this raises potential issues with the external validity and generalisability of the models to the general population, particularly as the UCI Machine Learning Repository did not present any inclusion/exclusion criteria or demographic data for patients, whilst the CTU-UHB database only included patients with gestational ages above 36 weeks who experienced a Stage 2 labour duration of < 30 minutes.
Secondly, being of a primarily pre-clinical nature, the studies included in this review were deemed to be at high risk of bias due to their inherent design and methodological shortcomings. For this reason, meta-analysis was ruled inappropriate, but this limitation should be considered when interpreting the findings.
Lastly, whilst the databases were largely the same across all studies, differences in skill, experience, and training of the developer can influence the accuracy and reliability of the AI model.
Despite these limitations, it must be highlighted that the included studies have demonstrated promise and have outlined the merit in further evaluating their ability to interpret CTG accurately and reliably.