Ethics Statement. Our study was performed in accordance with the guidelines of the Helsinki Declaration of the World Medical Association. The national-level ethics committee (Hungarian Scientific and Research Ethics Committee of the Medical Research Council (ETTTUKEB-50302-2/2017/EKU)) officially approved the study. All recruited patients provided informed consent to participate in the study. Clinicopathological information was collected and patient identifiers were then removed, so that patients could not be identified either directly or indirectly.
Study population. In total, 31 lung cancer patients (12 female and 19 male) were enrolled between 2017 and 2018 at the National Koranyi Institute of Pulmonology, Budapest, Hungary and at the County Hospital of Pulmonology, Torokbalint, Hungary (Supplementary Table S1). We included patients with histologically confirmed adenocarcinoma (ADC) (n=16), squamous cell carcinoma (SCC) (n=10), non-small cell lung carcinoma not otherwise specified (NSCLC-NOS) (n=1) and small cell lung carcinoma (SCLC) (n=4). Of the included patients, 58% (n=18) were diagnosed with advanced-stage disease (Stage IIIB/IV). Clinical TNM (Tumor, Node, Metastasis) stage according to the Union for International Cancer Control (8th edition) and age at the time of diagnosis were recorded. Patients were scored A (n=19), B (n=8) or C (n=4) based on the abridged Patient-Generated Subjective Global Assessment (aPG-SGA) [17]. The aPG-SGA scores were based on BMI, weight changes, food intake, eating-related symptoms (appetite), and functional capacity. Clinicopathological data included gender, age, stage, and overall survival (OS). OS was calculated from the time of diagnosis until death or the last available follow-up. The date of the last follow-up included in this analysis was February 2019.
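The OS definition above can be sketched in a few lines of Python. This is only an illustration of the stated rule (diagnosis to death, or to last follow-up for patients still alive); the function name and the example dates are hypothetical and are not taken from the study data.

```python
from datetime import date
from typing import Optional, Tuple


def overall_survival(diagnosis: date,
                     death: Optional[date],
                     last_followup: date) -> Tuple[int, bool]:
    """Return (OS in days, event_observed) for one patient.

    If a death date is known, OS ends at death and the event is observed.
    Otherwise the patient is censored at the last available follow-up.
    """
    if death is not None:
        end, event = death, True
    else:
        end, event = last_followup, False
    if end < diagnosis:
        raise ValueError("end date precedes diagnosis date")
    return (end - diagnosis).days, event


# Hypothetical patients (dates invented for illustration only).
deceased = overall_survival(date(2017, 3, 1), date(2018, 3, 1), date(2019, 2, 1))
censored = overall_survival(date(2018, 1, 1), None, date(2019, 2, 1))
```

The boolean flag distinguishes observed deaths from censored observations, which is what a survival analysis would need alongside the duration.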
Treatments. All treatments across all centers were conducted in accordance with contemporary National Comprehensive Cancer Network guidelines.
Schedule of sample collection procedures. Baseline stool and blood samples were obtained at the same time point, after signed informed consent was obtained and before the initiation of systemic therapy. All samples were placed in a -80 °C freezer on the day of collection.
US validation cohort information. Stool samples were collected from a human lung cancer cohort of 7 individuals (Supplementary Table S2) at Western Regional Medical Center, Goodyear, Arizona, USA, after signed informed consent under a protocol approved by the Western Institutional Review Board (WIRB protocol number 20140271, Puyallup, Washington, USA). Bacterial DNA was subjected to Illumina PE 150-bp whole metagenome sequencing. The sequenced reads were processed using the same approach as for the Hungarian (EU) cohort.
Plasma metabolomic analysis. Untargeted metabolomics profiling of patient plasma samples was performed by Afekta (Kuopio, Finland), as detailed below.
Sample preparation. The plasma samples were prepared as follows: a 100 μL aliquot of the sample was combined with 400 μL of acetonitrile and mixed by pipetting. The samples were placed on a 96-well filter plate, which was centrifuged at 700 × g for 5 min at 4 °C. Small aliquots were taken from each sample, pooled in a single tube, prepared in an identical way to the other samples, and used as the quality control (QC) sample in the analysis. The fecal samples were prepared as follows: 300 μL of cold 80% aqueous methanol was added per 100 mg of sample into homogenizer tubes. The sample preparation procedures were performed on dry ice with cooled instruments. The samples were homogenized with a Bead Ruptor 24 Elite (OMNI International) using the Heart program (6 m/s, 30 s). Next, the samples were vortexed for 10 s and centrifuged at 13,000 rpm and 4 °C for 10 min. The supernatant was collected on a 96-well filter plate, which was centrifuged at 700 × g for 5 min at 4 °C. The QC sample was prepared in the same way as for the plasma samples.
LC-MS analysis. The samples were analyzed by liquid chromatography-mass spectrometry using a 1290 Infinity Binary UPLC coupled with a 6540 UHD Accurate-Mass Q-TOF (Agilent Technologies), as described previously [40]. In brief, a Zorbax Eclipse XDB-C18 column (2.1 × 100 mm, 1.8 μm; Agilent Technologies) was used for the reversed-phase (RP) separation and an Acquity UPLC BEH amide column (Waters) for the HILIC separation. For each chromatographic separation, ionization was carried out using jet stream electrospray ionization (ESI) in both the positive and negative modes, yielding four data files per sample. The collision energies for the MS/MS analysis were set to 10, 20 and 40 V for compatibility with spectral databases.
Data analysis. The data analysis was performed separately for each combination of the four analytical modes and the two sample types, resulting in a total of 8 preprocessing runs. The analysis was conducted in R version 3.5.0 using in-house scripts. Signals with too many missing values were removed by requiring a measured value in at least 60% of the samples in at least one of the study groups. The signals were corrected for the drift pattern caused by the LC-MS procedures. A regularized cubic spline regression was fitted separately for each signal on the QC samples. The smoothing parameter was chosen from an interval between 0.5 and 1.5 using leave-one-out cross-validation to prevent overfitting. The performance of the drift correction was assessed using non-parametric, robust estimates of the relative standard deviation of QC samples (RSD*) and the D-ratio (D-ratio*) as quality metrics. Drift correction was only applied if the values of both quality metrics decreased, indicating enhanced quality. Otherwise, the original signal was retained. After the drift correction, low-quality signals were removed. Signals were kept if their RSD* was below 20% and their D-ratio below 40%. In addition, signals with classic RSD, RSD* and basic D-ratio all below 10% were kept. This additional condition prevents the removal of signals with very low values in all but a few samples. These signals tend to have a very high D-ratio*, since the median absolute deviation of the biological samples is not affected by the large concentration in a handful of samples, causing the D-ratio* to overestimate the significance of random errors in measurements of QC samples. Thus, the other quality metrics were applied with a conservative limit of 0.1 to ensure that only good-quality signals were kept this way. Missing values were imputed using random forest imputation. Signals were then normalized using inverse-rank normalization to approximate a normal distribution.
QC samples were removed prior to imputation and normalization, to prevent them from biasing the procedures.
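The signal quality filter described above can be illustrated with a minimal Python sketch. It is not the in-house R code used in the study. It assumes the common definitions RSD* = 1.4826 × MAD(QC) / |median(QC)| and D-ratio* = MAD(QC) / MAD(biological samples), with the classic versions based on standard deviations, and it assumes that the 40% limit of the first condition applies to the robust D-ratio*. All function names and intensity values are hypothetical.

```python
import statistics as st

MAD_SCALE = 1.4826  # makes the MAD comparable to the SD under normality


def mad(values):
    """Scaled median absolute deviation."""
    med = st.median(values)
    return MAD_SCALE * st.median([abs(v - med) for v in values])


def quality_metrics(qc, bio):
    """Classic and robust RSD / D-ratio for one signal.

    qc  : intensities in the pooled QC injections (assumed positive)
    bio : intensities in the biological samples
    """
    mad_bio = mad(bio)
    return {
        "rsd": st.stdev(qc) / abs(st.mean(qc)),
        "rsd_r": mad(qc) / abs(st.median(qc)),
        "d_ratio": st.stdev(qc) / st.stdev(bio),
        # A zero MAD in the biological samples gives an infinite D-ratio*.
        "d_ratio_r": mad(qc) / mad_bio if mad_bio > 0 else float("inf"),
    }


def keep_signal(m):
    """Apply the two alternative retention conditions."""
    # Assumption: the robust D-ratio* is used in the main condition.
    main = m["rsd_r"] < 0.2 and m["d_ratio_r"] < 0.4
    # Rescue rule for signals near zero in all but a few samples.
    rescue = m["rsd"] < 0.1 and m["rsd_r"] < 0.1 and m["d_ratio"] < 0.1
    return main or rescue


# Hypothetical signal present in only two biological samples.
sparse = quality_metrics(qc=[130, 131, 129, 130, 130],
                         bio=[0, 0, 0, 0, 0, 0, 0, 0, 500, 800])
kept = keep_signal(sparse)
```

In the sparse example the MAD of the biological samples is zero, so the D-ratio* is infinite and the main condition fails; the signal is retained only through the second, stricter condition, which is the situation the text describes.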
Compound identification. The chromatographic and mass spectrometric characteristics (retention time, exact mass, and MS/MS spectra) of the significantly differential molecular features were compared with entries in an in-house standard library and publicly available databases, such as METLIN and HMDB, as well as with published literature. The annotation of each metabolite and the level of identification were given based on the recommendations published by the Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI) (Sumner et al. 2007).
Metagenomic sequencing and read quality control. To examine the gut microbiome of our lung cancer cohort, fecal samples were collected from 31 lung cancer patients at diagnosis, before the initiation of oncotherapy (baseline). Bacterial DNA was isolated from the fecal samples for shotgun metagenomic sequencing. Sequencing was performed on an Illumina HiSeq 4000 with 150-bp paired-end reads (PE150) at an average depth of 6 Gb. The sequenced reads underwent quality control to remove adapter regions, low-quality reads, and human DNA contamination (bwa (version 0.7.4-r385) mem against the human reference genome ucsc.hg19), following the previously described steps [41]. Approximately 95% of the reads remained after the quality control.
The 471 metagenomic samples from the 500FG project were used as European healthy controls in the taxa comparison [22]. The taxonomic profiles of these 500FG samples were obtained using the R package curatedMetagenomicData (R 3.5.1, curatedMetagenomicData 1.13.3) [42].
Microbial taxonomic profiling and community diversity analysis. The high-quality reads were taxonomically profiled using MetaPhlAn2 [43] with default settings. The differentially abundant taxa were identified using the Wald test implemented in the R package DESeq2 [44] v1.22.2 on the unrarefied relative abundance data, and taxa with FDR-corrected p < 0.05 were considered statistically significant unless otherwise stated. The alpha-diversity (Shannon index) of each sample and the beta-diversities (Bray-Curtis dissimilarities) among samples were calculated with vegan (v2.5.3) [45] based on rarefied data. Rarefaction was applied to the abundance table of estimated mapped reads, down to the depth of the least abundant sample, in order to equalize the depth among the samples. To test the difference in microbial composition between two or more groups, ANOSIM (analysis of similarities) was employed based on the Bray-Curtis dissimilarity. For the Faecalibacterium prausnitzii strain abundance comparison, the high-quality reads were further taxonomically classified using Kaiju [46], a protein-level classification tool, with the microbial subset of the NCBI BLAST non-redundant protein database (nr).
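As an illustration of the diversity measures named above, the following Python sketch computes rarefaction by subsampling without replacement, the Shannon index with natural logarithms, and the Bray-Curtis dissimilarity. It is a simplified stand-in for the vegan functions used in the study, not a reimplementation of them; the function names and the toy count tables are hypothetical.

```python
import math
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a count table."""
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count tables."""
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    return diff / (sum(a.values()) + sum(b.values()))


# Toy samples; rarefy both to the depth of the shallower one.
s1 = {"taxon_A": 60, "taxon_B": 30, "taxon_C": 10}
s2 = {"taxon_A": 20, "taxon_B": 20}
depth = min(sum(s1.values()), sum(s2.values()))
r1, r2 = rarefy(s1, depth), rarefy(s2, depth)
alpha = shannon(r1)
beta = bray_curtis(r1, r2)
```

Expanding each count into a pool of reads is only practical for small toy tables; real abundance tables would call for a sampling routine that works directly on the counts.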
Assembly-free functional annotation. The high-quality reads after the quality control were processed using HUMAnN2 (Franzosa et al. Nat Methods. 2018). In the pipeline, the reads were mapped to the database of UniRef90 gene families, and the gene families were then regrouped to MetaCyc reactions and KEGG Orthologs (KOs) for pathway annotation. The quantified pathway abundances in units of RPK (reads per kilobase) were normalized to copies per million (CPM) using the script provided with HUMAnN2 for further analyses. KEGG pathway enrichment analysis was performed using GAGE [30].
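The RPK-to-CPM step amounts to a sum normalization scaled to one million. The sketch below shows the arithmetic only; it is not the HUMAnN2 script itself and ignores details such as stratified or special rows. The function name, pathway labels and values are hypothetical.

```python
def rpk_to_cpm(rpk):
    """Rescale one sample's RPK abundances so that they sum to 1,000,000."""
    total = sum(rpk.values())
    if total <= 0:
        raise ValueError("sample has no abundance to normalize")
    return {feature: value / total * 1_000_000
            for feature, value in rpk.items()}


# Hypothetical pathway abundances for one sample.
sample_rpk = {"pathway_1": 30.0, "pathway_2": 10.0}
sample_cpm = rpk_to_cpm(sample_rpk)
```

Because every sample is scaled to the same total, the resulting values are comparable between samples of different sequencing depth.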
De novo assembly and CAZy annotation. The high-quality reads after the quality control were further assembled using IDBA-UD [47] with k-mer sizes ranging from 20 to 150 bp. The coding DNA sequence (CDS) regions were predicted using MetaGeneMark [48] with the default parameters. The predicted peptide sequences were mapped to the dbCAN database [49] using DIAMOND [50] with the default parameters for CAZy annotation. Gene abundance was quantified as RPKM (Reads Per Kilobase of transcript per Million mapped reads).
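For reference, the standard RPKM formula is reads × 10^9 / (gene length in bp × total mapped reads). A minimal Python sketch follows; the function name and the numbers are hypothetical, and how reads were counted per gene in the study is not specified here.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 100 reads on a 1,000-bp CDS, 1,000,000 mapped reads.
example = rpkm(gene_reads=100, gene_length_bp=1000, total_mapped_reads=1_000_000)
```

The factor of 10^9 combines the per-kilobase (10^3) and per-million (10^6) scalings.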
Classifier model. A random forest model was built and trained with five-fold cross-validation using the R package caret (R 3.3.0, caret 6.0.81). The predictors were the differentially abundant bacterial species (p<0.05) and MetaCyc pathways (p<0.05) identified by comparing the cachexia and non-cachexia patient groups. The model performance was evaluated using the area under the ROC curve (AUC). For external validation of the classifier, 7 additional stool samples were obtained from US lung cancer patients (cachexia n=2, non-cachexia n=5) and were processed for metagenomic sequencing following the same protocol as for the training cohort.
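The AUC used to evaluate the classifier can be computed directly from predicted scores as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as one half. The Python sketch below illustrates the metric only, not the random forest or the caret workflow. The function name is hypothetical, and the scores are invented for a group layout matching the validation cohort (2 cachexia, 5 non-cachexia); they are not results from the study.

```python
def roc_auc(labels, scores):
    """AUC from pairwise comparisons of positive and negative scores.

    labels : 1 for the positive class (e.g. cachexia), 0 otherwise
    scores : predicted probability or score for the positive class
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores for 2 positive and 5 negative samples.
labels = [1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.35]
auc = roc_auc(labels, scores)
```

With only two positive samples there are just ten positive-negative pairs, so the AUC can take only a coarse set of values, which is worth keeping in mind when interpreting results on a cohort of this size.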