Study population and study design
The prospective study described herein was conducted at the A.U.O. S. Giovanni di Dio e Ruggi D’Aragona, University of Salerno, Campania Region, Southern Italy, during April 2020. One hundred fifty-two subjects were enrolled and stratified into four classes: (i) subjects without symptoms referable to viral infection (e.g., temperature higher than 37 °C, cough, muscle pain, tiredness, breathing difficulties) and with a negative serological evaluation against the SARS-CoV-2 Spike protein (see below), defined as controls (CTRL); (ii) asymptomatic subjects with an immunoglobulin IgM or IgG level above the diagnostic cut-off, defined as asymptomatic (AS); (iii) patients with mild symptomatology requiring ordinary hospitalization, defined as mild symptomatic patients (MI); and (iv) patients with severe symptomatology requiring ICU hospitalization, defined as severe symptomatic patients (SE).
The study was approved by the ethics committee CE Campania Sud (IRB n.9/2020, prot. 0061907/20) and a written consent form was signed by each participant or their legal representative. The recruited patients (or a legal representative) completed a questionnaire addressing anamnestic and demographic characteristics. A complete clinical evaluation was obtained for all patients.
Statistical analysis of demographic and clinical data
Study data were collected and managed using the REDCap electronic data capture tools [18] hosted at the INFN (Istituto Nazionale di Fisica Nucleare), University of Salerno (Italy). Statistical analysis was performed using RStudio ver. 1.2.5042 [19]. Data are presented as mean ± standard deviation for continuous variables and as number (percentage) for categorical variables. Demographic and clinical data were tested for normality with the Kolmogorov-Smirnov test. Since the data were normally distributed, Student's t-test and analysis of variance (ANOVA) with Tukey's post-hoc test were used to compare the results among the classes.
Sample collection
Human tissue collection strictly adhered to the guidelines outlined in the Declaration of Helsinki IV edition [20]. All patients were asked to respect a 12-h fast before blood collection.
Blood samples were collected using BD Vacutainer (Becton Dickinson, Oxfordshire, UK) blood collection tubes (red top, no additives). After centrifugation, serum samples were immediately frozen and stored at −80 °C until analysis.
Quantification of serum anti-SARS-CoV-2 antibodies
The quantitative assays for antibody detection were performed using the MAGLUMI™ 2000 Plus 2019-nCov IgM and IgG assays (Snibe, Shenzhen, China).
The test was considered positive for an IgG or IgM level higher than 1.1 AU/mL. The precision of the test around the threshold level, expressed as CV%, was 5.05% at 0.61 AU/mL and 3.31% at 1.96 AU/mL.
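The positivity rule and the CV% figure can be illustrated with a minimal Python sketch. Only the 1.1 AU/mL cut-off comes from the text; the function names and the replicate values in the usage lines are hypothetical, and the CV% formula (sample standard deviation divided by the mean) is the conventional definition, assumed here rather than taken from the assay documentation.

```python
from statistics import mean, stdev

CUTOFF_AU_ML = 1.1  # positivity threshold stated in the text


def is_seropositive(igm, igg, cutoff=CUTOFF_AU_ML):
    """Return True when either antibody level exceeds the cut-off."""
    return igm > cutoff or igg > cutoff


def cv_percent(replicates):
    """Coefficient of variation (%): sample SD divided by the mean, times 100."""
    return stdev(replicates) / mean(replicates) * 100.0


# Hypothetical values, for illustration only
positive = is_seropositive(igm=0.4, igg=1.5)   # IgG above the cut-off
precision = cv_percent([9.0, 10.0, 11.0])      # three made-up replicates
```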
Metabolite extraction, derivatization and analysis
Metabolome extraction, purification, and derivatization were carried out using the MetaboPrep GC kit (Theoreo, Montecorvino Pugliano, Italy) according to the manufacturer's instructions. Instrumental analyses were performed with a GC-MS system (GC-2010 Plus gas chromatograph and QP2010SE mass spectrometer; Shimadzu Corp., Kyoto, Japan). The analytical details are reported in Troisi et al. [21–26].
Metabolite identification was performed according to Troisi et al. [23]. Briefly, the maximum tolerance for the linear index difference was set to 10, while the minimum matching for the NIST library search was set to 85% (Level 2 identification according to the Metabolomics Standards Initiative [MSI]) [27]. Metabolites that emerged as the most relevant in separating cases from controls (see below) were further confirmed using external standards (MSI level = 1).
Dataset preparation
Within each total ion count (TIC) chromatogram, more than 290 signal peaks were detected in each specimen. Chromatograms were first aligned by means of parametric time warping (PTW) using the PTW package [28]. Some of the peaks were not investigated further because they were not consistently found in at least 80% of samples, were too low in concentration, or were of too poor spectral quality to be confirmed as metabolites. A total of 229 endogenous metabolites were detected consistently. The aligned chromatograms were tabulated with one sample per row and one metabolite area ratio (with respect to the internal standard area) per column. Each value was transformed by taking the natural log and then scaled by mean-centering and dividing by the standard deviation of that column (i.e., autoscaled) [29].
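As an illustration of the transformation described above, the following Python sketch applies a natural-log transform followed by autoscaling to a small table of area ratios. It is not the code used in the study; the function name and the example values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption. Columns are assumed to be strictly positive and not constant.

```python
import math
from statistics import mean, stdev


def log_autoscale(table):
    """Natural-log transform, then autoscale each column.

    `table` is a list of rows (one per sample); each row holds positive
    metabolite / internal-standard area ratios (one per metabolite).
    """
    logged = [[math.log(v) for v in row] for row in table]
    n_cols = len(logged[0])
    scaled = [[0.0] * n_cols for _ in logged]
    for j in range(n_cols):
        col = [row[j] for row in logged]
        mu, sd = mean(col), stdev(col)  # sample SD (n - 1), an assumption
        for i, value in enumerate(col):
            scaled[i][j] = (value - mu) / sd
    return scaled


# Hypothetical area ratios: 3 samples x 2 metabolites
ratios = [[0.5, 2.0], [1.0, 4.0], [2.0, 8.0]]
scaled = log_autoscale(ratios)
```

After this step every column has zero mean and unit standard deviation, so metabolites with very different abundances contribute on a comparable scale to the multivariate models.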
Feature selection
To reduce the dimensionality of the dataset and focus the analysis on the most relevant metabolites, feature selection was performed using a genetic algorithm, a heuristic search that mimics mechanisms of natural evolution such as inheritance, mutation, selection, and crossover [30]. In genetic algorithms for feature selection, “mutation” means switching features on and off and “crossover” means interchanging used features. Feature selection was performed by means of the “Optimize Selection (Evolutionary)” algorithm implemented in RapidMiner Studio ver. 9.6.0 (RapidMiner GmbH, Boston, MA, USA) [31]. The selected features were used to train the classification models.
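The roles of selection, crossover, and mutation can be sketched with a toy genetic algorithm in Python. This is not the RapidMiner operator and makes no claim about its internals; the function names, parameters (population size, mutation rate, tournament size), and the fitness function are all hypothetical. In practice the fitness of a feature mask would be the cross-validated performance of a classifier trained on the switched-on features.

```python
import random


def evolve_feature_mask(n_features, fitness, generations=30, pop_size=20,
                        p_mutation=0.05, seed=0):
    """Toy genetic algorithm returning the best on/off feature mask found."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry over the best mask so far
        while len(next_gen) < pop_size:
            # Selection: two tournaments of size 3 pick the parents
            mother = max(rng.sample(population, 3), key=fitness)
            father = max(rng.sample(population, 3), key=fitness)
            # Crossover: each feature flag is inherited from either parent
            child = [m if rng.random() < 0.5 else f
                     for m, f in zip(mother, father)]
            # Mutation: switch individual features on or off
            child = [(not bit) if rng.random() < p_mutation else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
    return best


# Hypothetical fitness: reward features 0-2, penalize every other feature
informative = {0, 1, 2}


def toy_fitness(mask):
    return sum(1 if i in informative else -1
               for i, on in enumerate(mask) if on)


best_mask = evolve_feature_mask(10, toy_fitness)
```

Because the best mask is copied unchanged into each new generation, the best fitness never decreases from one generation to the next in this sketch.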
Partial Least Squares Discriminant Analysis (PLS-DA)
PLS-DA was performed in order to find the combination of metabolites that best separated the different classes on the basis of a specific metabolomic profile.
PLS-DA is a supervised method that uses multivariate regression techniques to extract, by means of linear combinations of the original variables, the information able to predict class membership. PLS regression was performed by means of the MetaboAnalystR [32] package, which uses the plsr function from the R pls package [33]. Classification and cross-validation were performed using the wrapper function from the caret package [34]. A permutation test was performed to verify the significance of class discrimination. For each permutation, a PLS-DA model was built between the data and the permuted class labels, using the optimal number of components determined by cross-validation for the model based on the original class assignment. Two types of test statistics were used to measure class discrimination. The first was based on prediction accuracy during training. The second made use of the separation distance based on the Between/Within distance ratio (B/W). If the observed test statistic fell within the distribution obtained from the permuted class assignments, class discrimination could not be considered statistically significant [35].
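The logic of the permutation test with the B/W statistic can be sketched in Python as follows. This is a simplified illustration, not the MetaboAnalystR implementation: it computes the between/within sum-of-squares ratio on a fixed one-dimensional score vector and does not refit a PLS-DA model for each permutation, as the procedure described above does. The function names, the example scores, and the (hits + 1) / (n_perm + 1) p-value estimate are assumptions.

```python
import random
from statistics import mean


def bw_ratio(scores, labels):
    """Between-group / within-group sum of squares for 1-D scores."""
    grand = mean(scores)
    between = within = 0.0
    for cls in set(labels):
        group = [s for s, lab in zip(scores, labels) if lab == cls]
        centre = mean(group)
        between += len(group) * (centre - grand) ** 2
        within += sum((s - centre) ** 2 for s in group)
    return between / within


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """Fraction of label permutations with a B/W ratio >= the observed one."""
    rng = random.Random(seed)
    observed = bw_ratio(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between scores and classes
        if bw_ratio(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical component scores for two well-separated classes
scores = [-2.1, -1.8, -2.4, -1.9, 2.0, 2.3, 1.7, 2.2]
labels = ["CTRL"] * 4 + ["SE"] * 4
p_value = permutation_p_value(scores, labels)
```

A small p-value indicates that the observed separation is rarely matched when class labels are assigned at random.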
The “Metacost” algorithm [36] was used to correct the imbalance effect for each class, which was expected to be minimal but higher in comparison with CTRL. A cost matrix was built based on the number of samples in each class.
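One way to derive a cost matrix from class sizes is sketched below in Python. The exact construction used in the study is not specified in this section, so the inverse-frequency weighting, the function name, and the class counts are all assumptions for illustration. The sketch covers only the cost matrix, not the MetaCost relabeling procedure itself.

```python
def cost_matrix(class_counts):
    """Misclassification costs weighted by inverse class frequency.

    cost[true][pred] is 0 on the diagonal; otherwise it is the largest
    class size divided by the size of the true class, so that errors on
    rarer classes cost more.
    """
    largest = max(class_counts.values())
    return {
        true: {pred: 0.0 if pred == true else largest / n_true
               for pred in class_counts}
        for true, n_true in class_counts.items()
    }


# Hypothetical class sizes (per-class counts are not given in this section)
counts = {"CTRL": 50, "AS": 40, "MI": 25, "SE": 10}
costs = cost_matrix(counts)
```

With these made-up counts, misclassifying an SE sample costs five times as much as misclassifying a CTRL sample.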
Identification of metabolites related to the SARS-CoV-2 infection symptomatology
Two separate selection strategies were used to find the most relevant metabolites. First, the importance of each metabolite in class separation was evaluated using the variable importance in projection (VIP) scores [37] calculated for each metabolite used in the PLS-DA classification model. Second, metabolites were selected based on their fold change (FC) and t-test-based p-values. Metabolites that showed both an FC > 2 or FC < −2 and a p-value lower than the false discovery rate (FDR)-adjusted cut-off were selected. The molecular identity of the metabolites of interest (i.e., metabolites with a VIP score > 2.0 [38], or falling in the areas of interest of the FC versus p-value diagram) was determined by comparing the corresponding mass spectrum with a mass spectrum library [39]. These identified metabolites were further confirmed using external standards, according to level 1 of the Metabolomics Standards Initiative (MSI) [27]. The ontology of the selected metabolites is reported in Supplementary S1.
Metabolite occurrences across the feature selection strategies were summarized in an UpSet diagram [40].
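The FC and p-value filter can be sketched in Python as follows. Several choices here are assumptions rather than details from the study: the signed fold-change convention (the ratio when the case mean is larger, the negative inverse ratio otherwise), the Benjamini-Hochberg step-up rule for the FDR cut-off, the inclusive comparison at the cut-off, the 0.05 level, and all function names and example values. The p-values are taken as given inputs.

```python
def signed_fold_change(mean_case, mean_ctrl):
    """Ratio when case >= control, otherwise the negative inverse ratio."""
    ratio = mean_case / mean_ctrl
    return ratio if ratio >= 1 else -1.0 / ratio


def bh_cutoff(p_values, alpha=0.05):
    """Largest p-value passing the Benjamini-Hochberg rule (0.0 if none)."""
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * alpha:
            cutoff = p
    return cutoff


def select_metabolites(records, alpha=0.05, fc_limit=2.0):
    """records: list of (name, mean_case, mean_ctrl, p_value) tuples."""
    cutoff = bh_cutoff([r[3] for r in records], alpha)
    return [name for name, case, ctrl, p in records
            if abs(signed_fold_change(case, ctrl)) > fc_limit and p <= cutoff]


# Hypothetical metabolites: (name, case mean, control mean, t-test p-value)
records = [
    ("A", 6.0, 2.0, 0.001),   # FC = 3, significant
    ("B", 1.0, 4.0, 0.010),   # FC = -4, significant
    ("C", 3.0, 2.0, 0.020),   # FC = 1.5, below the FC limit
    ("D", 5.0, 1.0, 0.300),   # FC = 5, not significant
]
selected = select_metabolites(records)
```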
Next, the metabolites were investigated by metabolomic pathway analysis using the interactive pathways explorer iPath ver. 3.0 [41]. This application allows the visualization and interpretation of metabolomic data in the context of human metabolism by analyzing networks of genes and compounds, identifying enriched pathways, and visualizing changes in metabolite data. iPath uses data from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [42].