All methods were carried out in accordance with relevant guidelines and regulations. This study was approved by the ethics committee of Rennes Academic Hospital (approval number: 20.93). The requirement for informed consent was waived by the local ethics committee in accordance with the reference methodology MR-004 issued by the French national data privacy regulatory commission.
Software. Data extraction, manipulation, statistical analyses, and modelling were performed with “RStudio Server”, version 1.3.959 (RStudio PBC, 2009–2020). Specialized packages and functions were used for specific analyses: “dplyr”, version 1.0.0, for data manipulation; “purrr”, version 0.3.4, for data simplification; and “missForest”, version 1.4, for missing data imputation. Random forests were built with “randomForest”, version 4.6-14, and artificial neural networks with “neuralnet”, version 1.44.2. “pROC”, version 1.16.2, was used to generate receiver operating characteristic (ROC) curves and calculate the area under the curve (AUC) for each model.
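For reference, a minimal sketch of the corresponding R session setup (package versions as listed above) is:

    # Load the packages used throughout the analysis.
    library(dplyr)         # data manipulation
    library(purrr)         # simplification of list-columns
    library(missForest)    # non-parametric imputation of missing values
    library(randomForest)  # random forest classifiers
    library(neuralnet)     # feed-forward artificial neural networks
    library(pROC)          # ROC curves and AUC computation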
Setting. Data were collected retrospectively from patients admitted to the adult emergency department (ED) of Rennes Academic Hospital, France.
Patient selection. All post-emergency hospitalized patients aged ≥18 years, admitted between March 20, 2020 and May 5, 2020 and examined for COVID-19 with chest computed tomography (CT) and reverse transcription polymerase chain reaction (RT-PCR), were included in the study. Patients who objected to the use of their data for research purposes were excluded.
Data collection. Data were automatically collected from “eHOP”, a local clinical data warehouse in which health data are integrated and de-identified in real time (14). Structured data, such as laboratory results, were retrieved directly from the data warehouse. Text fields were structured using regular expressions (15).
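As an illustration of this structuring step, the sketch below extracts a body temperature from a free-text triage note with a regular expression; the data-frame, field names, and pattern are hypothetical and do not reproduce the actual eHOP extraction rules (stringr, used here for convenience, is not among the packages listed above).

    library(dplyr)
    library(stringr)

    # Hypothetical example: capture a temperature such as "38.5" reported in
    # the free-text triage note and store it as a numeric column.
    # notes: one row per admission, with a character column triage_text.
    notes <- notes %>%
      mutate(temperature = as.numeric(
        str_extract(triage_text, "3[5-9]\\.[0-9]|4[0-2]\\.[0-9]")
      ))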
Data pre-processing. In the raw data-frame, every value was associated with a unique identifier (ID) corresponding to a patient admission. This data-frame contained multiple rows per ID (Figure 1, step 1). Variables collected more than once during the patient journey appeared as lists (Figure 1, step 2). These lists were simplified according to the variable type (Figure 1, step 3).
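A minimal sketch of these three steps follows; the column names are hypothetical (crp stands for any repeated numeric measurement, fever for any repeated binary finding).

    library(dplyr)
    library(purrr)

    # raw_df: one row per collected value, with an ID column (hypothetical).
    # Steps 1-2: collapse the multiple rows of each admission into a single
    # row per ID; variables measured several times become list-columns.
    nested <- raw_df %>%
      group_by(ID) %>%
      summarise(across(everything(), ~ list(.x)), .groups = "drop")

    # Step 3: simplify each list-column according to the variable type, e.g.
    # keep the first non-missing value of a numeric measurement and flag
    # whether a binary finding was ever recorded.
    simplified <- nested %>%
      mutate(
        crp   = map_dbl(crp,   ~ if (any(!is.na(.x))) na.omit(.x)[1] else NA_real_),
        fever = map_lgl(fever, ~ any(.x, na.rm = TRUE))
      )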
Predicted variable. The predicted variable for each patient was the presence of COVID-19. This variable, named “COVID”, was coded “true” when RT-PCR and/or chest CT results were positive for COVID-19, and “false” otherwise. Chest CT scans were coded “positive” when typical COVID-19 findings were identified by radiologists. Patients were allocated to the “COVID” and “NOT-COVID” groups accordingly.
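With hypothetical column names for the two test results (rt_pcr and chest_ct), the outcome could be coded as follows:

    # COVID is true when the RT-PCR and/or the chest CT is positive.
    simplified <- simplified %>%
      mutate(
        COVID = (rt_pcr == "positive") | (chest_ct == "positive"),
        group = if_else(COVID, "COVID", "NOT-COVID")
      )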
Predictor variables. All clinico-biological variables present in our local database were collected. Student’s t-tests were used to compare numeric variables between groups, and chi-square tests were used to compare binary variables. A p-value < 0.05 was considered statistically significant. Variables with a p-value < 0.2 were considered variables of interest and selected to build the machine learning models.
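A sketch of this screening step is shown below; it assumes that simplified holds one row per admission with numeric or binary patient-level variables, and the column names are illustrative.

    # Compare each candidate variable between the COVID and NOT-COVID groups:
    # Student's t-test for numeric variables, chi-square test for binary ones.
    screen_variable <- function(x, outcome) {
      if (is.numeric(x)) {
        t.test(x ~ outcome)$p.value
      } else {
        chisq.test(table(x, outcome))$p.value
      }
    }

    candidates <- setdiff(names(simplified), c("ID", "COVID", "group"))
    p_values   <- sapply(simplified[candidates], screen_variable, outcome = simplified$COVID)

    # Variables with p < 0.2 are retained for the machine learning models.
    selected <- candidates[p_values < 0.2]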
Data split. Data were randomly divided into two parts: the train data-frame and the test data-frame. The train data-frame, corresponding to 80% of the whole data-frame, was used to build the models. Model performance was evaluated on the test data-frame, corresponding to the remaining 20%.
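A minimal version of this split (the random seed is arbitrary):

    # Randomly assign 80% of admissions to the train data-frame and the
    # remaining 20% to the test data-frame.
    set.seed(2020)
    train_idx <- sample(seq_len(nrow(simplified)), size = floor(0.8 * nrow(simplified)))
    train_df  <- simplified[train_idx, ]
    test_df   <- simplified[-train_idx, ]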
Missing data imputation. Imputation of missing values was performed with missForest, independently on the train and test data-frames, before the training process.
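For illustration, missForest can be run separately on the two data-frames; in this sketch the identifier and outcome columns are kept out of the imputation and the outcome, which has no missing values, is re-attached afterwards (an assumption of the sketch, not a detail reported here).

    library(missForest)

    # missForest expects a data.frame of numeric and factor columns
    # (logical or character variables should be converted to factors first).
    train_imp <- missForest(as.data.frame(train_df[, selected]))$ximp
    test_imp  <- missForest(as.data.frame(test_df[, selected]))$ximp

    # Re-attach the outcome, coded 0/1 for the models below.
    train_imp$COVID <- as.numeric(train_df$COVID)
    test_imp$COVID  <- as.numeric(test_df$COVID)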
Model training. Three types of models were constructed: binary logistic regressions, random forests, and artificial neural networks. Each type was trained with three sets of variables: clinico-biological variables alone, clinico-biological variables with chest CT results, and clinico-biological variables with RT-PCR results.
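The sketch below trains the three model types on one predictor set; the formula, the single hidden layer of five units, and the other settings are illustrative choices rather than the exact configurations of the study, and the outcome is assumed to be coded 0/1 with numeric predictors.

    library(randomForest)
    library(neuralnet)

    # One predictor set, e.g. the selected clinico-biological variables.
    form <- as.formula(paste("COVID ~", paste(selected, collapse = " + ")))

    # Binary logistic regression.
    glm_fit <- glm(form, data = train_imp, family = binomial)

    # Random forest (outcome converted to a factor for classification).
    rf_fit <- randomForest(update(form, factor(COVID) ~ .), data = train_imp)

    # Artificial neural network with one hidden layer of five units.
    nn_fit <- neuralnet(form, data = train_imp, hidden = 5, linear.output = FALSE)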
Performance measurement. The area under the ROC curve is commonly used to evaluate and compare classifiers in machine learning and in biomedical and bioinformatics applications (16). In this study, each model’s predictions were compared to the “COVID” variable in the test data-frame, and ROC curves were constructed accordingly. The AUC was the primary outcome used to evaluate model performance.
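For instance, for the logistic regression of the previous sketch:

    library(pROC)

    # Predicted probabilities on the test data-frame.
    pred_prob <- predict(glm_fit, newdata = test_imp, type = "response")

    # ROC curve of the predictions against the observed COVID variable,
    # and the corresponding AUC.
    roc_obj <- roc(response = test_imp$COVID, predictor = pred_prob)
    plot(roc_obj)   # ROC curve
    auc(roc_obj)    # area under the curve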