In this study, we evaluated the performance of GNNs on medical data using the use case of sepsis prediction from blood count data. When GNNs are applied to similarity graphs, they achieve a performance similar to that of ensemble-based machine learning algorithms (XGBoost, RUSBoost, Random Forest) and neural networks. The reason for this similar performance is that nodes representing complete blood counts only sample information from similar blood count measurements (measurement-centric graphs). GNNs, the neural network, and the ensemble-based algorithms outperformed shallow algorithms (decision tree and logistic regression) due to their more expressive representations of the underlying information. However, this performance increase comes with a higher computational complexity, which requires more training time compared to shallow algorithms. The increased computational complexity can be compensated by exploiting modern hardware (e.g., multiple threads or GPUs). In fact, training XGBoost on a GPU (NVIDIA A6000) requires even less time than training the shallow algorithms. GNNs were also trained on a GPU but still required much more training time due to the high computational complexity of the underlying sampling, transformation, and aggregation steps (see Introduction).
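The effect described above can be illustrated with a minimal sketch of the aggregation step at the core of GNN message passing. All data and the adjacency matrix here are hypothetical toy values, not taken from the study; the sketch only shows why, on a measurement-centric similarity graph, a node's representation stays close to its own features when its neighbours are similar measurements.

```python
import numpy as np

# Hypothetical toy data: rows are blood count measurements, columns are features.
X = np.array([[1.0, 2.0],
              [1.1, 1.9],
              [5.0, 4.0]])

# Adjacency of a similarity graph (with self-loops): nodes 0 and 1 are
# similar measurements and therefore connected; node 2 stands alone.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# One mean-aggregation step (the sampling/aggregation part of GNN message
# passing, before any learned transformation): each node averages the
# features of its neighbours.
deg = A.sum(axis=1, keepdims=True)
H = (A @ X) / deg

print(H)  # nodes 0 and 1 are smoothed towards each other; node 2 is unchanged
```

Because the aggregated representation of each node remains close to the original measurement, the downstream classification problem resembles the tabular one solved by the ensemble methods, which is consistent with the similar performance observed.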
In addition to computational time, tree-based ensemble algorithms (XGBoost, RUSBoost, Random Forest) are more robust against noise than the decision tree, GNNs, and the neural network. This increased robustness might result from the aggregation of multiple tree-based learners (ensembles). It is noteworthy that the neural network and GNN required less training time with noisier features, which is due to faster convergence to new (but worse) local minima during training.
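The variance-reduction argument behind ensemble robustness can be made concrete with a small simulation. The setup is hypothetical and not part of the study: each base learner is modelled as the true value plus independent noise, a crude stand-in for single decision trees fitted on noisy features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each of 50 base learners predicts the true value 1.0
# plus independent Gaussian noise (std 0.5).
true_value = 1.0
n_learners, n_trials = 50, 10_000
predictions = true_value + rng.normal(0.0, 0.5, size=(n_trials, n_learners))

single_tree_error = predictions[:, 0].var()       # variance of one noisy learner
ensemble_error = predictions.mean(axis=1).var()   # variance of the averaged ensemble

# Averaging n independent learners shrinks the prediction variance
# roughly by a factor of 1/n.
print(single_tree_error, ensemble_error)
```

Under these idealised independence assumptions the ensemble variance drops by roughly the number of learners; correlated trees in real ensembles reduce the gain, but the direction of the effect matches the observed robustness of XGBoost, RUSBoost, and Random Forest.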
Afterwards, we evaluated the slope and importance of different features for the final classification of each model. Tree-based algorithms (decision tree, Random Forest, RUSBoost, XGBoost) showed similar feature variation curves, which results from a similar prediction mechanism (the use of one or multiple decision trees). These mechanisms differ from those of non-tree-based algorithms (logistic regression, neural networks, GNNs), which are based on linear transformations with (neural network and GNNs) or without (logistic regression) some kind of non-linearity (e.g., sigmoid or rectified linear unit). Additionally, tree-based algorithms create harder decision boundaries than non-tree-based algorithms. Our approach for increasing the interpretability of machine learning models assumes that all features are independent of each other. In reality, however, features depend on each other (e.g., red blood cells and hemoglobin). This simplification might skew the synthetic feature inputs for specific combinations. Future approaches could integrate existing feature dependencies to prevent distortions in the synthetic dataset.
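A feature variation curve of the kind analysed above can be sketched as follows. The model and all values are hypothetical stand-ins (a simple linear-logistic classifier, not one of the trained models): one feature is swept over a grid while the remaining features are held at a baseline, which is exactly where the independence assumption discussed above enters.

```python
import numpy as np

# Hypothetical linear-logistic "model" standing in for a trained classifier.
weights = np.array([1.5, -0.8, 0.3])

def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ weights)))

# Baseline sample: all features held at a reference value (here: zero mean).
baseline = np.zeros(3)

# Feature variation curve for feature 0: sweep it over a grid while keeping
# the other features fixed -- this implicitly assumes feature independence.
grid = np.linspace(-3, 3, 7)
X_synth = np.tile(baseline, (len(grid), 1))
X_synth[:, 0] = grid
curve = predict_proba(X_synth)

print(curve)  # predicted probability as a function of feature 0
```

For a tree-based model the same sweep would yield a step-shaped curve (hard decision boundaries), whereas the smooth sigmoid here reflects the soft boundaries of the non-tree-based models; dependent feature pairs such as red blood cells and hemoglobin would make some grid points unrealistic.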
Finally, we tested the performance of a GNN (Graph Attention Network) on patient-centric graphs (i.e., graphs which integrate measurements of the same patient). Exploiting time series information through the patient-centric graphs improved the classification performance over all previous models and achieved an AUROC of up to 0.9565 with Graph Attention Networks. The reason for this improvement is that a GNN on a patient-centric graph inherently reduces patient-specific fluctuations in the dataset. However, the performance improvement is also associated with the exploitation of a real-world bias in the underlying dataset. About two thirds of the sepsis cases are not part of a sequence of examinations (i.e., they represent only a single measurement for a patient). The other third of the sepsis cases are part of examination sequences, and sepsis is diagnosed at the last position in most cases (92.14 %). This highlights the benefit of regularly monitored patient data as baseline information for machine learning algorithms.
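The construction of such a patient-centric graph can be sketched with a few lines of code. The records below are hypothetical (patient identifiers and timestamps are invented for illustration); the sketch connects consecutive measurements of the same patient, so that single-measurement patients remain isolated nodes, mirroring the two thirds of sepsis cases without an examination sequence.

```python
# Hypothetical records: (patient_id, timestamp) per blood count measurement.
records = [("p1", 1), ("p1", 2), ("p1", 3), ("p2", 1), ("p3", 1), ("p3", 2)]

# Group measurement indices by patient.
by_patient = {}
for idx, (pid, ts) in enumerate(records):
    by_patient.setdefault(pid, []).append((ts, idx))

# Connect consecutive measurements of the same patient (earlier -> later);
# patients with a single measurement (here: p2) contribute no edges.
edges = []
for pid, seq in by_patient.items():
    seq.sort()  # chronological order within each patient
    for (_, a), (_, b) in zip(seq, seq[1:]):
        edges.append((a, b))

print(edges)
```

A GNN operating on this edge list aggregates each measurement with the patient's other measurements, which is the mechanism that smooths out patient-specific fluctuations described above.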
The fact that most sepsis cases occur only at the last positions can be exploited by biased feature attributes (feature-induced bias) and/or a biased underlying graph structure (structure-induced bias). When incorporating positional encodings, we represent later positions (i.e., measurements) with higher feature attributes and earlier ones with lower feature attributes (feature-induced bias). With a specific graph structure (reverse directed, Fig. 4 E), the underrepresented sepsis cases do not integrate feature information from control cases (structure-induced bias). However, the control cases can still share information among each other, which reduces potential fluctuations. Although the control cases can also integrate information from sepsis cases, the attention mechanism reduces their influence (Fig. 4 H, Supplementary Table 2). In the directed and undirected graphs, control cases also share information with each other to reduce potential fluctuations. However, sepsis cases additionally integrate information from control cases, which reduces the differences between the two groups. This integration of information from control cases is partially compensated by the attention mechanism, which lowers the influence of control cases on sepsis cases (Fig. 4 G & I, Supplementary Table 2), but it cannot be fully compensated due to the high number of control cases in contrast to sepsis cases. Thereby, the reverse directed patient-graphs achieve a much higher classification performance (AUROC of up to 0.9565) compared to the directed (AUROC of up to 0.9094) and undirected graphs (AUROC of up to 0.8902).
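The structure-induced bias can be made explicit with a toy example of the three edge orientations. The edges below are hypothetical (one patient with three measurements, the last being the sepsis case) and the sketch only checks which nodes receive messages under each orientation, following the convention that a node aggregates from the sources of its incoming edges.

```python
# Hypothetical chronological edges within one patient (earlier -> later);
# node 2 is the last (sepsis) measurement.
forward = [(0, 1), (1, 2)]

directed = forward                               # messages flow to later nodes
reverse_directed = [(b, a) for a, b in forward]  # messages flow to earlier nodes
undirected = forward + reverse_directed          # messages flow both ways

def receives_from(edges, node):
    # In message passing, a node aggregates from the sources of its in-edges.
    return sorted(src for src, dst in edges if dst == node)

# In the reverse directed graph the last (sepsis) node has no incoming edges,
# so it does not mix in information from the earlier control measurements.
print(receives_from(directed, 2), receives_from(reverse_directed, 2))
```

This isolation of the last node is the structure-induced bias: in the directed and undirected variants the sepsis node would additionally average in control information, shrinking the difference between the two groups.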
We can use undirected and reverse directed patient-graphs for retrospective analyses (e.g., after a patient has died or recovered). Such applications might help to evaluate the success of a treatment (e.g., with specific antibiotics) or to identify potential causes of a disease (e.g., an infection after a specific event). However, we cannot use the undirected and reverse directed patient-graphs to diagnose sepsis at the current time point since they incorporate information from subsequent measurements (i.e., information not available at the current time point). Therefore, we can only use the directed patient-centric graphs with and without positional encodings, which achieved a lower classification performance compared to the GAT on the undirected and reverse directed graphs. However, the performance of the directed patient-centric graph with positional encodings (AUROC of up to 0.8902) is still better than that of the standard ML algorithms (AUROC of up to 0.8806), which did not use time series information.
To sum up, we compared the classification performance of different graph learning and other machine learning algorithms on sepsis blood count data and revealed different classification mechanisms in the trained models. Furthermore, we evaluated the performance of Graph Attention Networks on several patient-centric graphs and reached an outstanding AUROC of up to 0.9565 for retrospective use cases.
We would suggest the following directions for future research:
I. Integration of additional features,
II. Integration of more samples,
III. Diagnosis of further diseases, and
IV. Integration of time series information into other machine learning models.
The integration of more features (I.) could include information from other laboratory measurements (e.g., specific biomarkers), vital signs of patients (e.g., body temperature and pulse rate), predisposing factors (e.g., genetic polymorphisms25 or chronic medical conditions like diabetes26), and previously administered drugs. These features might help to provide a more holistic view of a patient's health status. Furthermore, sparse information like the existence of predisposing factors or previously administered drugs could be represented as a graph structure. However, data with more features must be collected for all patients, which could increase measurement times and costs. Furthermore, specific information like administered drugs could contain knowledge clinicians might have only in retrospect. The integration of more samples (II.) into the dataset is time-consuming but could reduce the impact of outliers. One promising direction might be the integration of samples from electronic health records like MIMIC-IV27, the Amsterdam University Medical Center Database28, the high time resolution ICU dataset (HiRID)29, and the eICU Collaborative Research Database30. Additionally, complete blood count data could enable diagnosing further diseases (III.) like thrombosis31 or leukemia32. For the classification of further diseases, labels for the respective diseases are required. However, this labeling process might be time-consuming and requires domain experts like clinicians. Within the scope of this study, we only evaluated the performance of Graph Attention Networks on patient-centric graphs to integrate time series information. However, future studies could compare the performance of other Graph Neural Networks (e.g., GraphSAGE6), one-dimensional Convolutional Neural Networks33, Long Short-Term Memory networks34, and Transformer architectures12 (IV.).
Furthermore, studies could investigate how state-of-the-art machine learning models like XGBoost3 could integrate features from connected nodes (e.g., previous or future measurements) to exploit time series or other graph-structured information.
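One simple way to realise this idea is feature augmentation: the features of a connected node (e.g., the patient's previous measurement) are appended as additional columns before training a tabular model. The matrix and the predecessor mapping below are hypothetical illustrations, not part of the study.

```python
import numpy as np

# Hypothetical feature matrix: one row per measurement, ordered per patient.
X = np.array([[1.0, 2.0],
              [1.5, 2.5],
              [2.0, 3.0]])

# Hypothetical mapping: node index -> index of the previous measurement
# of the same patient (nodes without a predecessor are omitted).
prev_of = {1: 0, 2: 1}

# Append the previous measurement's features as extra columns so that a
# tabular model such as XGBoost can see one hop of graph context; nodes
# without a predecessor fall back to their own features.
prev_feats = np.stack([X[prev_of.get(i, i)] for i in range(len(X))])
X_aug = np.hstack([X, prev_feats])

print(X_aug.shape)  # twice as many columns as the original matrix
```

The augmented matrix could then be passed to any standard tabular learner; deeper graph context could be added analogously by aggregating features over multiple hops, at the cost of more columns.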