In this study, we evaluated the performance of GNNs on medical data using the use case of sepsis prediction from blood count data. When GNNs are applied to similarity graphs, they achieve a performance similar to that of ensemble-based machine learning algorithms (XGBoost, RUSBoost, Random Forest) and neural networks. The reason for this similar performance is that nodes representing complete blood counts only sample information from similar blood count measurements (measurement-centric graphs). GNNs, the neural network, and the ensemble-based algorithms outperformed shallow algorithms (decision tree and logistic regression) due to their more expressive representations of the underlying information. However, this performance increase comes with a higher computational complexity, which requires more training time compared to shallow algorithms. The increased computational complexity can be compensated by exploiting modern hardware (e.g., multiple threads or GPUs). In fact, training XGBoost on a GPU (NVIDIA A6000) requires even less time than training the shallow algorithms. GNNs were also trained on a GPU but still required much more training time due to the high computational complexity of the underlying sampling, transformation, and aggregation steps (see Introduction).
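The effect described above can be illustrated with a minimal sketch of the aggregation step at the core of GNN message passing. All data and the adjacency matrix here are hypothetical toy values, not taken from the study; the sketch only shows why, on a measurement-centric similarity graph, a node's representation stays close to its own features when its neighbours are similar measurements.

```python
import numpy as np

# Hypothetical toy data: rows are blood count measurements, columns are features.
X = np.array([[1.0, 2.0],
              [1.1, 1.9],
              [5.0, 4.0]])

# Adjacency of a similarity graph (with self-loops): nodes 0 and 1 are
# similar measurements and therefore connected; node 2 stands alone.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# One mean-aggregation step (the sampling/aggregation part of GNN message
# passing, before any learned transformation): each node averages the
# features of its neighbours.
deg = A.sum(axis=1, keepdims=True)
H = (A @ X) / deg

print(H)  # nodes 0 and 1 are smoothed towards each other; node 2 is unchanged
```

Because the aggregated representation of each node remains close to the original measurement, the downstream classification problem resembles the tabular one solved by the ensemble methods, which is consistent with the similar performance observed.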
In addition to computational time, tree-based ensemble algorithms (XGBoost, RUSBoost, Random Forest) are more robust against noise than the decision tree, GNNs, and the neural network. This increased robustness might result from the aggregation of multiple tree-based learners (ensembles). It is noteworthy that the neural network and GNN required less training time with noisier features, which is due to faster convergence to new (but worse) local minima during training.
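The variance-reduction argument behind ensemble robustness can be made concrete with a small simulation. The setup is hypothetical and not part of the study: each base learner is modelled as the true value plus independent noise, a crude stand-in for single decision trees fitted on noisy features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each of 50 base learners predicts the true value 1.0
# plus independent Gaussian noise (std 0.5).
true_value = 1.0
n_learners, n_trials = 50, 10_000
predictions = true_value + rng.normal(0.0, 0.5, size=(n_trials, n_learners))

single_tree_error = predictions[:, 0].var()       # variance of one noisy learner
ensemble_error = predictions.mean(axis=1).var()   # variance of the averaged ensemble

# Averaging n independent learners shrinks the prediction variance
# roughly by a factor of 1/n.
print(single_tree_error, ensemble_error)
```

Under these idealised independence assumptions the ensemble variance drops by roughly the number of learners; correlated trees in real ensembles reduce the gain, but the direction of the effect matches the observed robustness of XGBoost, RUSBoost, and Random Forest.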
Afterwards, we evaluated the slope and importance of different features for the final classification of each model. Tree-based algorithms (decision tree, Random Forest, RUSBoost, XGBoost) showed similar feature variation curves, which results from a similar prediction mechanism (the use of one or multiple decision trees). These mechanisms differ from those of non-tree-based algorithms (logistic regression, neural networks, GNNs), which are based on linear transformations with (neural network and GNNs) or without (logistic regression) some kind of non-linearity (e.g., sigmoid or rectified linear unit). Additionally, tree-based algorithms create harder decision boundaries than non-tree-based algorithms. Our approach for increasing the interpretability of machine learning models assumes that all features are independent of each other. In reality, however, features depend on each other (e.g., red blood cells and hemoglobin). This simplification might skew the synthetic feature inputs for specific combinations. Future approaches could integrate existing feature dependencies to prevent distortions in the synthetic dataset.
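A feature variation curve of the kind analysed above can be sketched as follows. The model and all values are hypothetical stand-ins (a simple linear-logistic classifier, not one of the trained models): one feature is swept over a grid while the remaining features are held at a baseline, which is exactly where the independence assumption discussed above enters.

```python
import numpy as np

# Hypothetical linear-logistic "model" standing in for a trained classifier.
weights = np.array([1.5, -0.8, 0.3])

def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ weights)))

# Baseline sample: all features held at a reference value (here: zero mean).
baseline = np.zeros(3)

# Feature variation curve for feature 0: sweep it over a grid while keeping
# the other features fixed -- this implicitly assumes feature independence.
grid = np.linspace(-3, 3, 7)
X_synth = np.tile(baseline, (len(grid), 1))
X_synth[:, 0] = grid
curve = predict_proba(X_synth)

print(curve)  # predicted probability as a function of feature 0
```

For a tree-based model the same sweep would yield a step-shaped curve (hard decision boundaries), whereas the smooth sigmoid here reflects the soft boundaries of the non-tree-based models; dependent feature pairs such as red blood cells and hemoglobin would make some grid points unrealistic.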
Finally, we tested the performance of a GNN (Graph Attention Network) on patient-centric graphs (i.e., graphs which integrate measurements of the same patient). Exploiting time series information through the patient-centric graphs improved the classification performance over all previous models and achieved an AUROC of up to 0.9565 with Graph Attention Networks. The reason for this improvement is that a GNN on a patient-centric graph inherently reduces patient-specific fluctuations in the dataset. However, the performance improvement is also associated with the exploitation of a real-world bias in the underlying dataset. About two thirds of the sepsis cases are not part of a sequence of examinations (i.e., they represent only a single measurement for a patient). The other third of the sepsis cases are part of examination sequences, and sepsis is diagnosed at the last position in most cases (92.14 %). This highlights the benefit of regularly monitored patient data as baseline information for machine learning algorithms.
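The construction of such a patient-centric graph can be sketched with a few lines of code. The records below are hypothetical (patient identifiers and timestamps are invented for illustration); the sketch connects consecutive measurements of the same patient, so that single-measurement patients remain isolated nodes, mirroring the two thirds of sepsis cases without an examination sequence.

```python
# Hypothetical records: (patient_id, timestamp) per blood count measurement.
records = [("p1", 1), ("p1", 2), ("p1", 3), ("p2", 1), ("p3", 1), ("p3", 2)]

# Group measurement indices by patient.
by_patient = {}
for idx, (pid, ts) in enumerate(records):
    by_patient.setdefault(pid, []).append((ts, idx))

# Connect consecutive measurements of the same patient (earlier -> later);
# patients with a single measurement (here: p2) contribute no edges.
edges = []
for pid, seq in by_patient.items():
    seq.sort()  # chronological order within each patient
    for (_, a), (_, b) in zip(seq, seq[1:]):
        edges.append((a, b))

print(edges)
```

A GNN operating on this edge list aggregates each measurement with the patient's other measurements, which is the mechanism that smooths out patient-specific fluctuations described above.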
The fact that most sepsis cases occur only at the last positions can be exploited by biased feature attributes (feature-induced bias) and/or a biased underlying graph structure (structure-induced bias). When incorporating positional encodings, we represent later positions (i.e., measurements) with higher feature attributes and earlier ones with lower feature attributes (feature-induced bias). With a specific graph structure (reverse directed, Fig. 4 E), the underrepresented sepsis cases do not integrate feature information from control cases (structure-induced bias). However, the control cases can still share information among each other, which reduces potential fluctuations. Although the control cases can also integrate information from sepsis cases, the attention mechanism reduces their influence (Fig. 4 H, Supplementary Table 2). In the directed and undirected graphs, control cases also share information with each other to reduce potential fluctuations. However, sepsis cases additionally integrate information from control cases, which reduces the differences between the two groups. This integration of information from control cases is partially compensated by the attention mechanism, which lowers the influence of control cases on sepsis cases (Fig. 4 G & I, Supplementary Table 2), but it cannot be fully compensated due to the high number of control cases in contrast to sepsis cases. Thereby, the reverse directed patient-graphs achieve a much higher classification performance (AUROC of up to 0.9565) compared to the directed (AUROC of up to 0.9094) and undirected graphs (AUROC of up to 0.8902).
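The structure-induced bias can be made explicit with a toy example of the three edge orientations. The edges below are hypothetical (one patient with three measurements, the last being the sepsis case) and the sketch only checks which nodes receive messages under each orientation, following the convention that a node aggregates from the sources of its incoming edges.

```python
# Hypothetical chronological edges within one patient (earlier -> later);
# node 2 is the last (sepsis) measurement.
forward = [(0, 1), (1, 2)]

directed = forward                               # messages flow to later nodes
reverse_directed = [(b, a) for a, b in forward]  # messages flow to earlier nodes
undirected = forward + reverse_directed          # messages flow both ways

def receives_from(edges, node):
    # In message passing, a node aggregates from the sources of its in-edges.
    return sorted(src for src, dst in edges if dst == node)

# In the reverse directed graph the last (sepsis) node has no incoming edges,
# so it does not mix in information from the earlier control measurements.
print(receives_from(directed, 2), receives_from(reverse_directed, 2))
```

This isolation of the last node is the structure-induced bias: in the directed and undirected variants the sepsis node would additionally average in control information, shrinking the difference between the two groups.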
We can use undirected and reverse directed patient-graphs for retrospective analyses (e.g., after a patient has died or recovered). Such applications might help to evaluate the success of a treatment (e.g., with specific antibiotics) or to identify potential causes of a disease (e.g., an infection after a specific event). However, we cannot use the undirected and reverse directed patient-graphs to diagnose sepsis at the current time point since they incorporate information from subsequent measurements (i.e., information not available at the current time point). Therefore, we can only use the directed patient-centric graphs with and without positional encodings, which achieved a lower classification performance compared to the GAT on the undirected and reverse directed graphs. However, the performance of the directed patient-centric graph with positional encodings (AUROC of up to 0.8902) is still better than that of the standard ML algorithms (AUROC of up to 0.8806), which did not use time series information.
To sum up, we compared the classification performance of different graph learning and other machine learning algorithms on sepsis blood count data and revealed different classification mechanisms in the trained models. Furthermore, we evaluated the performance of Graph Attention Networks on several patient-centric graphs and reached an outstanding AUROC of up to 0.9565 for retrospective use cases.
We would suggest the following directions for future research:
I. Integration of additional features,
II. Integration of more samples,
III. Diagnosis of further diseases, and
IV. Integration of time series information into other machine learning models.
The integration of more features (I.) could include information from other laboratory measurements (e.g., specific biomarkers), vital signs of patients (e.g., body temperature and pulse rate), predisposing factors (e.g., genetic polymorphisms25 or chronic medical conditions like diabetes26), and previously administered drugs. These features might help to provide a more holistic view of a patient's health status. Furthermore, sparse information like the existence of predisposing factors or previously administered drugs could be represented as a graph structure. However, data with more features must be collected for all patients, which could increase measurement times and costs. Furthermore, specific information like administered drugs could contain knowledge clinicians might have only in retrospect. The integration of more samples (II.) into the dataset is time-consuming but could reduce the impact of outliers. One promising direction might be the integration of samples from electronic health records like MIMIC-IV27, the Amsterdam University Medical Center Database28, the high time resolution ICU dataset (HiRID)29, and the eICU Collaborative Research Database30. Additionally, complete blood count data could enable diagnosing further diseases (III.) like thrombosis31 or leukemia32. For the classification of further diseases, labels for the respective diseases are required. However, this labeling process might be time-consuming and requires domain experts like clinicians. Within the scope of this study, we only evaluated the performance of Graph Attention Networks on patient-centric graphs to integrate time series information. However, future studies could compare the performance of other Graph Neural Networks (e.g., GraphSAGE6), one-dimensional Convolutional Neural Networks33, Long Short-Term Memory networks34, and Transformer architectures12 (IV.).
Furthermore, studies could investigate how state-of-the-art machine learning models like XGBoost3 could integrate features from connected nodes (e.g., previous or future measurements) to exploit time series or other graph-structured information.
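One simple way to realise this idea is feature augmentation: the features of a connected node (e.g., the patient's previous measurement) are appended as additional columns before training a tabular model. The matrix and the predecessor mapping below are hypothetical illustrations, not part of the study.

```python
import numpy as np

# Hypothetical feature matrix: one row per measurement, ordered per patient.
X = np.array([[1.0, 2.0],
              [1.5, 2.5],
              [2.0, 3.0]])

# Hypothetical mapping: node index -> index of the previous measurement
# of the same patient (nodes without a predecessor are omitted).
prev_of = {1: 0, 2: 1}

# Append the previous measurement's features as extra columns so that a
# tabular model such as XGBoost can see one hop of graph context; nodes
# without a predecessor fall back to their own features.
prev_feats = np.stack([X[prev_of.get(i, i)] for i in range(len(X))])
X_aug = np.hstack([X, prev_feats])

print(X_aug.shape)  # twice as many columns as the original matrix
```

The augmented matrix could then be passed to any standard tabular learner; deeper graph context could be added analogously by aggregating features over multiple hops, at the cost of more columns.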