Study design and subjects. Within the NAS-IARC cross-sectional study, participants were residents aged 18-60 years living in the vicinity (within 20 km) of the NAS, Republic of Korea, between March and October 2018. We excluded volunteers who, prior to recruitment, 1) were underweight (body mass index (BMI) < 18.5 kg/m2) or obese (BMI ≥ 30 kg/m2), 2) reported any chronic disease such as inflammatory bowel disease, hypertension, diabetes, hyperlipidemia, or cancer, 3) had taken medication including antibiotics within the past 2 weeks, 4) had taken hormone replacement therapy or used oral contraceptives within the past 2 weeks, or 5) had been pregnant or breastfeeding within the past 6 months. Volunteers who had taken any dietary supplements within the past 3 months were not excluded, but this information was collected using the lifestyle questionnaire. The study participants were initially invited to an information meeting approximately one week before the start of the study, where trained research assistants measured anthropometric data, including height and weight, and ascertained the exclusion criteria. Those eligible for the study were provided with a lifestyle questionnaire (physical activity, alcohol intake, smoking, and socioeconomic status) and a food frequency questionnaire (FFQ) with instructions, and were asked to complete both and return them on the study day. On the study day, fecal samples were collected on-site, and the FFQ and lifestyle data of the participants were reviewed by trained research assistants following standardized protocols. Of 229 eligible participants, seven failed to provide fecal samples, leaving a sample of 222 healthy Korean adults (49% males) for this study.
All procedures and protocols of the study were approved by the Public Institutional Review Board of the Ministry of Health and Welfare, Korea (Approval no: P01-201801-11-003), and the study was registered at the Clinical Research Information Service (CRIS) of the Centers for Disease Control and Prevention of Korea (KCT0002831). All study participants provided written informed consent.
Dietary data collection. Long-term dietary intake data were collected with a semi-quantitative FFQ, which was developed and validated for the Korean diet by the Korea National Institute of Health (KNIH) [38]. The FFQ comprised 106 food/dish items: 9 Korean staple dishes (rice and noodles), 25 soups and stews, 54 side dishes, 9 non-alcoholic beverages, and 9 fruits. Subjects were asked to report the consumption frequency and average portion size of each item during the previous year. During the participants’ visit, trained research assistants reviewed the questionnaires together with the participants for completeness. Based on their recipes, the 106 food/dish items were classified into 22 food groups: potatoes, vegetables, fermented vegetables, legumes, fermented legumes, fruit/fruit juice, nuts/seeds, dairy, refined grains, multi/whole grains, other cereal products, meats, fish/seashells, eggs, vegetable oils, other fats, sugar/confectionery, cakes/sweets, coffee, tea, non-alcoholic beverages, and salty snacks. In particular, the vegetable and legume groups were each divided into non-fermented and fermented sub-groups to take into account fermentation, which could affect gut microbial composition and diversity. Intakes of macronutrients including protein, fat, carbohydrates (CHO), and dietary fiber were also estimated from the FFQ data. Protein and fat intakes were each classified as either plant-based or animal-based. Additionally, saturated fatty acids (SFA), monounsaturated fatty acids (MUFA), and polyunsaturated fatty acids (PUFA) were estimated separately. The intakes of food groups and macronutrients were calculated in grams per day (g/day) from the consumption frequency and average portion size, using a food composition database established for the FFQ [38]. Alcohol intake over the previous year was collected with the lifestyle questionnaire and converted into g/day.
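As an illustration of the g/day calculation described above, the following minimal sketch multiplies a consumption frequency by an average portion size and sums the result within food groups. The frequency categories, food items, portion sizes, and function names are hypothetical and are not taken from the KNIH FFQ or its food composition database.

```python
# Hypothetical frequency categories mapped to servings per day
# (assumed for illustration; not the categories of the KNIH FFQ).
FREQ_PER_DAY = {
    "never": 0.0,
    "1/month": 1 / 30,
    "1/week": 1 / 7,
    "1/day": 1.0,
    "3/day": 3.0,
}


def daily_intake_g(frequency, portion_g):
    """Average intake in g/day = servings per day x grams per serving."""
    return FREQ_PER_DAY[frequency] * portion_g


def food_group_intake(responses, item_to_group):
    """Sum item-level g/day into totals per food group."""
    totals = {}
    for item, (frequency, portion_g) in responses.items():
        group = item_to_group[item]
        totals[group] = totals.get(group, 0.0) + daily_intake_g(frequency, portion_g)
    return totals


# Made-up responses: item -> (reported frequency, average portion in grams).
responses = {
    "kimchi": ("3/day", 40.0),
    "tofu stew": ("1/week", 210.0),
    "spinach": ("1/day", 50.0),
}
item_to_group = {
    "kimchi": "fermented vegetables",
    "tofu stew": "legumes",
    "spinach": "vegetables",
}
intakes = food_group_intake(responses, item_to_group)
```

In practice each dish would first be decomposed into ingredients through its recipe before assignment to a food group; the sketch skips that step.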
Fecal sample collection. Fecal specimens were collected on-site at the NAS on the study day. Each participant was provided with a collection tube (SARSTEDT AG & Co., Germany) for the fecal sample. Following collection, the samples were immediately delivered to the laboratory for processing. Each fecal specimen was mixed manually using a spatula, and approximately 1-2 g of feces per participant, representing a full scoop, was aliquoted into stool nucleic acid collection tubes (Norgen Biotek Co., Canada). Samples were then frozen and stored at 4 °C until further processing (average time between sample collection and storage: approx. 12 min).
16S rRNA gene sequencing and taxonomic assignment. All procedures, from extracting bacterial DNA from the collected fecal samples to generating the gut microbial composition and diversity data, were performed by a biotechnology company (Macrogen Inc.) in Seoul, Republic of Korea. The fecal samples collected during each one-week period were transferred to Macrogen Inc. on a weekly basis, and bacterial DNA from each sample was extracted using the PowerSoil® DNA Isolation Kit (Cat. No. 12888, MO BIO) according to the manufacturer’s protocol and stored at -80 °C until all samples had been collected for further analysis. DNA quantity and quality were measured by PicoGreen and Nanodrop (ThermoFisher Sci. Inc., Waltham, MA, USA). The 16S rRNA amplicons covering variable regions V3-V4 were generated using primers (forward: 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3'; reverse: 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3') incorporating multiplexing indices and Illumina sequencing adapters. The final products were normalized and pooled using PicoGreen, the size of the libraries was verified using TapeStation DNA ScreenTape D1000 (Agilent Tech., Santa Clara, USA), and the amplicons were sequenced on the MiSeq™ platform (Illumina, San Diego, USA). To achieve high-quality data on Illumina sequencing platforms, optimal cluster densities were created across every lane of every flow cell. The Rapid library standard Quantification solution and calculator (Roche, Basel, Switzerland) were used to generate a standard curve of fluorescence readings and calculate the library sample concentration. Using the QIIME 1.8.0 pipeline, the sequences were binned into operational taxonomic units (OTUs) from phylum to species levels with 97% identity [39].
In total, 5.7 million sequence reads were obtained from 222 subjects, with an average of 25,852 (5th-95th percentiles: 11,129-47,363) reads per subject. These reads were clustered into OTUs and subsequently assigned taxonomy at levels from phylum to genus. The gut microbial taxonomic composition and diversity data generated by this procedure included individual-level information on 1) relative abundance (proportion (%) of OTU) at different bacterial taxonomic levels; 2) within-sample (α-) diversity, which describes the number (richness) and distribution (evenness) of species within a single subject, estimated with a widely used α-diversity index [40], the Shannon index [41]; and 3) between-sample (β-) diversity, which describes differences in gut microbial composition between one subject and another [40], estimated with the weighted phylogenetic UniFrac distance matrix, which measures the phylogenetic distance between the microbial communities of two subjects weighted by the relative abundance of species [42].
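The relative abundance, F/B ratio, and Shannon index referred to in this section can be sketched as follows. This is an illustrative calculation only: the read counts and the two phyla other than Firmicutes and Bacteroidetes are made up, the index is computed on phylum-level counts purely for brevity, and the natural logarithm is an assumption (some pipelines report the Shannon index in base 2).

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per taxon into percentages of the sample total."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts.

    The natural logarithm is an assumption of this sketch.
    """
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


# Made-up read counts for a single subject (illustration only).
phylum_counts = {
    "Firmicutes": 600,
    "Bacteroidetes": 300,
    "Proteobacteria": 60,
    "Actinobacteria": 40,
}
abundance = relative_abundance(phylum_counts)
fb_ratio = abundance["Firmicutes"] / abundance["Bacteroidetes"]
h = shannon_index(list(phylum_counts.values()))
```

The index is zero when a single taxon holds all reads and reaches its maximum, ln(S), when S taxa are equally abundant, which is why it reflects both richness and evenness.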
Statistical analysis. Dietary intake data were log-transformed to render the distributions symmetrical and approximate normality, and were adjusted for total energy intake using the residual method. The Shannon α-diversity index was also log-transformed. Differences in the relative abundance (% OTU) of the four major phyla and in the Firmicutes-to-Bacteroidetes (F/B) ratio were examined by Wilcoxon-Mann-Whitney tests across basic characteristics and lifestyle factors (sex; age group: <40 years vs. ≥ 40 years; BMI group: <23 kg/m2 vs. ≥ 23 kg/m2; dietary supplement intake within 3 months prior to enrolment: yes vs. no; regular physical activity: yes vs. no; smoking status: ever vs. never; education: < university graduation vs. ≥ university graduation; household income: <4,000 USD/month vs. ≥ 4,000 USD/month). Firmicutes and Bacteroidetes are the two major phyla in human gut microbiota and are known to be modulated by diet [9, 11]. Associations of within-sample (α-) diversity and between-sample (β-) diversity of gut microbiota with basic characteristics and lifestyle factors of the study population were examined by general linear models (GLMs) and permutational multivariate analysis of variance (PERMANOVA), respectively.
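The energy adjustment by the residual method can be sketched as below: the log-transformed intake is regressed on total energy intake, and the residual is added back to the value predicted at the mean energy intake. Whether energy was itself log-transformed and how zero intakes were handled are not stated in the text; the log-transformed energy and the `offset` parameter are assumptions of this sketch, and the data are made up.

```python
import math
from statistics import mean


def energy_adjust(intakes, energies, offset=1.0):
    """Residual-method energy adjustment on the log scale.

    `offset` is added before taking logs so that zero intakes are defined;
    both the offset and the use of log energy are assumptions of this sketch.
    """
    x = [math.log(e) for e in energies]
    y = [math.log(v + offset) for v in intakes]
    x_bar, y_bar = mean(x), mean(y)
    # Ordinary least squares fit of y on x.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # The value predicted at the mean of x equals y_bar, so add it back.
    return [r + y_bar for r in residuals]


# Made-up data: food-group intake (g/day) and total energy (kcal/day).
intakes = [180, 200, 260, 240, 330, 350]
energies = [1500, 1800, 2000, 2200, 2600, 3000]
adjusted = energy_adjust(intakes, energies)
```

By construction the adjusted values are uncorrelated with the energy term, so any remaining association with the microbiota cannot be attributed to overall energy intake.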
In order to examine the gut microbial composition in relation to dietary intake, partial Spearman’s correlation coefficients of relative abundance (% OTU) of the four major phyla, the F/B ratio, and genera within the major phyla of human gut microbiota with the intakes of food groups and macronutrients were estimated. Adjustment for sex, age, BMI, dietary supplement intake, smoking status, and sample batch was performed. Correlation values were displayed using heatmaps after false discovery rate corrections.
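The text does not state which false discovery rate procedure was applied to the correlation p-values. As one common choice, the Benjamini-Hochberg step-up adjustment is sketched here; its use is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, e.g. one per food group for a given phylum.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Cells of a heatmap would then be flagged only where the adjusted value falls below the chosen threshold.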
Partial Spearman’s correlation coefficients of the Shannon index with the intakes of food groups and macronutrients were estimated after adjustment for sex, age, BMI, dietary supplement intake, smoking status, and sample batch. To identify dietary patterns associated with high within-sample (α-) diversity, reduced rank regression (RRR) was used to derive patterns of the 22 food groups (predictor variables) that maximize the explained variability of gut microbiota diversity (Shannon index as the response variable). We then examined partial Spearman’s correlation coefficients between the score of the high α-diversity dietary pattern (HiαDP score) and the relative abundance (% OTU) of the major phyla, the F/B ratio, and genera within the major phyla of human gut microbiota, with sex, age, BMI, dietary supplement intake, smoking status, and sample batch as covariates.
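A partial Spearman correlation can be illustrated for the simplest case of a single covariate: all variables are rank-transformed and the first-order partial correlation formula is applied to the rank correlations. The analyses described here adjust for several covariates at once, which requires regression on ranks or a matrix formulation; the single-covariate restriction and the function names are assumptions of this sketch.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, adjusted for one covariate z."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up values: an intake, a diversity index, and one covariate (e.g. age).
intake = [12.0, 30.5, 18.2, 44.1, 25.0, 39.9]
diversity = [2.1, 2.9, 2.4, 3.3, 2.5, 3.0]
age = [25, 41, 33, 52, 29, 47]
r_partial = partial_spearman(intake, diversity, age)
```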
Enterotypes of gut microbiota in healthy Korean adults were explored using a modified version of the enterotype discovery method [9], which combined principal coordinate analysis (PCoA) based on the weighted UniFrac distance matrix as a between-sample (β-) diversity index with subsequent k-means cluster analysis based on the scores of the first two principal coordinates (PCos). The optimal number of clusters was determined by visual inspection of clusters derived by three different methods, the Elbow [43], Silhouette [44], and Gap statistic [45] methods (Figure S2 in Additional file 1), and by a priori knowledge [8]. Differences in general characteristics and lifestyle factors by enterotype were examined by GLMs for continuous variables and chi-square tests for categorical variables, and differences in dietary intake (the HiαDP score and intakes of food groups and macronutrients) by enterotype were examined by GLMs with sex, age, BMI, dietary supplement intake, smoking status, and sample batch as covariates.
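The clustering step can be sketched as a plain k-means on two-dimensional scores, followed by the mean silhouette width as one of the criteria named above. This is a Python illustration, not the code used in the study: the PCoA step is omitted, the input points merely stand in for scores on the first two principal coordinates, and all names and data are hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's k-means on a list of coordinate tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


def mean_silhouette(points, labels):
    """Mean silhouette width; returns 0.0 if fewer than two clusters are present."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Made-up scores on two coordinates forming two well-separated groups.
group_a = [(-0.30 + 0.01 * i, 0.01 * (i % 3)) for i in range(10)]
group_b = [(0.30 + 0.01 * i, 0.05 + 0.01 * (i % 3)) for i in range(10)]
scores = group_a + group_b
labels, centers = kmeans(scores, k=2)
silhouette_k2 = mean_silhouette(scores, labels)
```

Repeating the last two lines for several values of k and comparing the silhouette widths mirrors, in simplified form, the visual inspection across cluster numbers described above.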
All analyses were performed using R statistical software (version 3.6.1, R Development Core Team, 2019) for the PCoA and k-means cluster analyses (using the cmdscale, kmeans, and fviz_nbclust functions) and for generating heatmaps and boxplots, and SAS (version 9.4, SAS Institute, Cary, NC) for the remaining analyses.