Study Population
The study was approved by the institutional review board (IRB) of Hamad Medical Corporation under study protocol (IRB protocol #16245/16). Participants with normal body mass index (lean healthy control, LHC) or with obesity only (OBO) were recruited at the Qatar Metabolic Institute and provided written informed consent. For this study, we included men and women with BMI ≤ 25 kg/m2 for the LHC group and BMI ≥ 35 kg/m2 for the OBO group, and who are generally healthy and free of any chronic illness. Fasting blood samples were collected, centrifuged at 1200 × g at 4°C for 10 min to isolate serum which was aliquoted and stored at -80°C until measurements.
Statistical Analysis
Means, standard deviations, and p-values were computed for study participants in the LHC and OBO groups. Missing values (less than 0.5% of all parameters) were substituted with the median for the corresponding variables. Subsequent to this imputation, various statistical tests were employed based on the type and distribution of the values. For categorical variables, the chi-square test was utilized, while the Wilcoxon test was performed for continuous variables that did not adhere to a normal distribution. To assess the normality of variables, the Shapiro–Wilk test was employed. For those variables demonstrating normal distribution, Student’s t-test was applied, providing a comprehensive and robust data analysis.
Metabolomics Profiling and Quality Control
Serum samples from the LHC and OBO groups were subjected to untargeted, ultrahigh-performance liquid chromatography–tandem mass spectroscopy (UPLC–MS/MS) and were curated by Metabolon on serum metabolites. After normalizing samples across batches to correct for minor instrument variations by scaling each batch’s medians to one and scaling the rest of the data points proportionally, batch-normalized data were generated. Within ten super classes, 1069 metabolites were profiled. These super classes included lipids (n = 360, 33.68%), amino acids (n = 203,18.99%), xenobiotics (n = 169, 15.81%), peptides (n = 47, 4.40%), nucleotides (n = 38, 3.55%), cofactors and vitamins (n = 31, 2.90%), carbohydrates (n = 20, 1.87%), partially characterized molecules (n = 17, 1.59%), energy related metabolites (n = 10, 0.94%), and unknown (n = 174, 16.28%). After removing the unknown metabolites, 895 metabolites were used for further analysis. To ensure data quality, data was pre-processed using the criteria suggested by Wei et al. 12; metabolites and samples with more than 20% missing data were removed. After removing 184 metabolites, 711 metabolites were available for further analysis. Data imputation was performed by replacing values of metabolite below detection limit with the minimum detectable value for that metabolite.
Identification of outliers was conducted through the application of principal component analysis (PCA). Specifically, samples with first five principal component values falling outside the range of (µ ± 2.5 SD) were deemed outliers. Subsequently, one sample from the LHC group and three samples from the OBO group were identified as outliers and removed from the dataset (refer to Supplementary Figure S1).
To mitigate the potential impact of outliers on the data, a Winsorization procedure was employed. This involved setting metabolite values above the 90th percentile to the 90th percentile and values below the 10th percentile were replaced with the 10th percentile for each metabolite. This approach helps to normalize extreme values and ensure a more robust dataset for downstream analyses.
Following the aforementioned quality control steps, a total of 711 metabolites and 36 samples were retained for subsequent analyses. Among these, 19 samples belonged to the LHC group, while 17 samples belonged to the OBO. These refined datasets were utilized for further exploration and interpretation of metabolic patterns in the subsequent stages of our analysis.
Univariate Statistical Analysis
Logistic regression was used for assessing discrepancies among individual metabolites, with adjustments made for age, sex, and BMI to minimize their influence on outcomes. Owing to the limited sample size, the analysis did not incorporate false discovery rate (FDR) control. The 'glm' function within the R stats package (R Core Team, version 4.2.1, 2022, Vienna, Austria) facilitated the computation of p-values and effect sizes in logistic regression. The effect size was instrumental in determining the direction of change in metabolite concentration between LHC and OBO: a positive effect size indicated a higher concentration in LHC compared to OBO, and conversely.
Pathway Enrichment Analysis
The pathway enrichment analysis was conducted using MetaboAnalystR v3.0 13. Metabolite sets were obtained by providing HMDB (Human Metabolome Database) IDs, which were supplied by Metabolon for matching the compounds in the metabolite sets. The auto-normalization option was employed to preprocess the data, transforming it to zero mean and unit variance.
Quantitative enrichment analysis (QEA) 14 was then applied, utilizing metabolite concentrations to construct a generalized linear model for the estimation of the Q-stat (a statistical measure) for each metabolite set. The Q-stat serves as an indicator of the correlation between compound concentrations and the specified phenotype. This approach enables the identification of metabolite sets that exhibit a significant change in concentration in only a few compounds or those with numerous compounds displaying small, correlated changes.
The enrichment ratio (ER) was calculated as the ratio of the Q-stat for the given data to its expected value by chance. An ER greater than one for a particular metabolite set indicates differential metabolite concentrations, signifying potential biological relevance. This comprehensive approach was employed to uncover meaningful insights into the association between metabolite concentrations and the observed phenotypic changes.
The visualization of KEGG pathway maps was accomplished through the annotation process using the online tool Pathview (https://pathview.uncc.edu/) 15, 16. Pathview facilitated the annotation of KEGG pathway maps by filling the metabolite nodes with colors that depict the ratio of normalized metabolite concentrations between LHC and OBO samples. Specifically, the color scheme employed in the visualization was designed to convey information about the relative metabolite concentrations in OBO compared to LHC. The color assignment was as follows: red represented higher concentrations of metabolites, yellow indicated similar concentrations, and blue signified lower concentrations in OBO as compared to LHC.
This approach not only provided a visually intuitive representation of the metabolite data but also allowed for a quick and comprehensive assessment of the relative abundance of metabolites across the LHC and OBO groups, contributing to a more insightful interpretation of the pathway maps.
RNA isolation and quality control
Whole blood (2.5 mL) was collected into PaXgene Blood RNA Tubes (PreAnalytix). The tubes were inverted 8–10 times then placed at room temperature for 2–3 h and then frozen at − 80°C for storage. The samples were thawed overnight, then total RNA was isolated with a PAXgene Blood RNA Kit including the DNase Set (Qiagen). The concentrations and purity of the RNA samples were evaluated spectrophotometrically (Nanodrop ND-1000, Thermo, Wilmington, DE USA). The RNA isolation process was validated by analyzing the integrity of several RNAs with the RNA 6000 Nano Chip Kit (Agilent). The presence of the small RNA fraction was confirmed by the Agilent Small RNA Kit (Agilent).
Transcriptomic Profiling
A Clariom D Assay-Human system (Applied Biosystems) containing 47,231 expressions and 770 control probes was used for transcriptomic analysis from the whole blood samples. In brief, 100 ng of total RNA was reverse-transcribed using GeneChip™ WT PLUS Reagent Kit (Applied biosystems); 5.5 µg of ss-cDNA was fragmented and labeled then hybridized for 16 h to the GeneChip™ cartridge array in a GeneChip Hybridization Oven 640. Arrays were washed and stained in a GeneChip Fluidics Station 450 and scanned in a GeneChip Scanner 3000 7G. The signal values were evaluated using the Affymetrix® GeneChip™ Command Console software.
The CEL files containing the transcriptional signatures were further analyzed using the affy v1.74.0 17, oligo v1.60.0 18, pd.clariom.d.human v3.14.1 19, affycoretools 1.68.1 20 packages in R. The raw transcriptional data was read through the ‘read.celfiles’ function in the oligo package. The quality control steps were performed using the ‘rma’ function in the oligo package; these include background correction, quantile normalization, and a log transformation of the transcript counts. The background transcript was filtered out using the ‘getMainProbes’ function in the affy coretools package. The ‘annotateEset’ function in the affy package was used by passing the transcriptional matrix along with the pd.clariom.d.human for mapping the probe IDs to their corresponding ENTREZ IDs and Gene Symbols. By selecting only those probe IDs which can map to a corresponding ENTREZ ID, we obtained the final transcriptional matrix comprising 32,304 genes (including PCGs as well as non-coding genes) with their corresponding expression values for 21 profiles (including 11 LHC and 10 OBO subjects). Next, the differences between LHC (control) and OBO (case) subjects was deduced using a Bayesian statistical framework, namely LIMMA (limma package v3.52.4 in R) 21, which identifies differentially expressed PCGs (DE-PCGs) at a p-value threshold of 0.05 (with FDR correction).
Single sample gene set enrichment.
To obtain an estimate of enrichment of the metabolic pathways, identified through our metabolomic profiling framework, for each sample, we used the Gene Set Variation Analysis (GSVA) function in 22–24 GSVA is a non-parametric, unsupervised technique, used to estimate gene set enrichment scores as a function of genes inside and outside the gene set analogously to a competitive gene set test. In our scenario, the gene set comprises the list of DE-PCGs which are participating as kinases/enzymes in the metabolic pathways. Mathematically, this can be represented as G(pi) = (g1, …, gk), where G(pi) represents the gene set corresponding to ith metabolic pathway and consists of k DE-PCGs with respect to the phenotype of interest. Each of these G(pi) is passed to GSVA along with the transcriptomic expression matrix keeping all other parameters to default settings to obtain the single sample enrichment scores for each G(pi). Once the enrichment score for G(pi) for each subject, we perform the Wilcoxon-ranksum test to identify the differentially enriched pathways w.r.t. whole blood transcriptomics when comparing the LHC to OBO patients.