4.1 Participants and study design
This study consisted of two distinct sample collection periods, one in the office environment and one in a home environment. The office study consisted of 22 participants who each occupied a single-occupant office of approximately the same size located in the same building. On a weekly basis for three consecutive weeks the participants’ dominant palm, entire computer mouse, entire keyboard, and a one square foot area of desk space were swabbed by an investigator. Each participant was the sole occupant for that desk although visitors to the participant were allowed during the study period. Approximately one month after the office study sampling was completed, the home samples began for a subset of participants from the office study (n = 11). Participants were asked to self-sample their home environment for three consecutive weeks. The participants collected samples from their dominant palm, bedroom nightstand, bathroom counter, bedroom floor, and living room floor. In addition, when present, adult partners of the participant self-sampled their dominant palm (n = 6). All sampling was done pre-COVID pandemic with participants working in the office approximately 40 hours in a five-day work week.
Participants were faculty and staff at the Air Force Academy and resided within close proximity to Colorado Springs, Colorado, USA. This study was conducted according to the guidelines in the Declaration of Helsinki, and all procedures involving human participants were approved by the Air Force Academy Institutional Review Board (IRB Protocol: FAC20160003H). Written informed consent was obtained before individuals participated in any study procedures. Additionally, the study was given exempt status from the University of Colorado Boulder Institutional Review Board as data were de-identified and no further contact with participants was necessary in the analysis.
4.2 Sample collection and preparation
BD BBL™ CultureSwab™ EZ II polyurethane sponge swabs (Cat. No. 220145, Becton, Dickinson and Company, Sparks, MD, USA) dipped in sterile (autoclaved) phosphate buffer solution with DNA-free water, were used to swab all surfaces except for the floor samples. Floor samples were obtained with a thimble filter (Cat. No. 8715600 EMSL, Cinnaminson, New Jersey, USA) attached to a vacuum hose. Collection of microbial biomass was obtained by vacuuming a one-square foot section of the floor. All samples from the office study were stored at −80°C within an hour of sampling. All samples from the home study were stored at −20°C for up to 24 hours immediately following sampling, then moved to −80°C. All samples were transported by car to University of Colorado Boulder (~ 90 miles) on dry ice for further processing.
4.3 Molecular processing
DNA was extracted using the PowerSoil DNA extraction kit (Cat No. 12888-100 & 12955-4, MoBio Laboratories, Carlsbad, CA, USA) according to the manufacturer’s instructions. Marker genes in isolated DNA were PCR-amplified using HotStarTaq Master Mix (Cat No. 203433, Qiagen, Valencia, CA, USA) and 515 F (5’-GTGCCAGCMGCCGCGGTAA-3’)/806 R (5’-GGACTACHVGGGTWTCTAAT-3’) primer pair (Integrated DNA Technologies, Coralville, IA, USA) targeting the V4 hypervariable region of the 16S rRNA gene modified with a unique 12-base sequence identifier for each sample and the Illumina adapter, as previously described28. The thermal cycling program consisted of an initial step at 94 °C for 3 min followed by 35 cycles (94 °C for 45 sec, 55 °C for 1 min, and 72 °C for 1.5 min), and a final extension at 72 °C for 10 min. PCR reactions were run in duplicate and the products from the duplicate reactions were pooled and visualized on an agarose gel to ensure successful amplification. PCR products were cleaned and normalized using a SequalPrep Normalization Kit (Cat. No. A1051001, ThermoFisher, Waltham, MA, USA) following manufacturer’s instructions. The normalized amplicon pool was sequenced on an Illumina MiSeq run using V3 chemistry, 600 cycles, and 2 x 300-bp paired-end sequencing. All sequencing and library preparation were conducted at the University of Colorado Boulder BioFrontiers Next-Gen Sequencing core facility.
4.4 Computational Analysis
Raw sequences were trimmed, demultiplexed, merged, quality filtered (maxee value of 1 and singletons removed), and clustered into greater than or equal to 97% similar phylotypes using UPARSE 829. Quality reports from the fastq_eestats2 command were used to determine the fixed length at which the raw sequences were trimmed prior to merging (forward read: 200 nucleotides; reverse read: 150 nucleotides). Taxonomy was assigned using the Greengenes (version13_8) 16S rRNA gene database30. For downstream analysis, samples were rarefied to a depth of 1,500 and 4,130 sequences for the office study and home study, respectively. In analyses where both experiments were examined together, samples were rarefied to a depth of 1,500. Quantitative Insights Into Microbial Ecology (QIIME v 1.9.1) was used to generate weighted UniFrac distances to examine the microbial community structure31,32. Further computational analysis and figure generation were performed in R (v 3.4.1). Analysis of alpha and beta diversity was performed with the R packages phyloseq (v 1.20.0)33 and mctoolsr (0.1.1.1) https://github.com/leffj/mctoolsr. Bacterial community composition differences between: 1) office and home environments, 2) sample types, and 3) participants were tested with permutational analysis of variance (PERMANOVA), performed with the vegan (v 2.4-4) package in R using weighted UniFrac distance matrices34. All microbial community measures in the present study utilized weighted UniFrac, accounting for the phylogenetic variance and relative abundance of operational taxonomic units. For each sample type, we used linear mixed modeling as a function of time, accounting for the correlation by participant, to model outcomes of median weighted UniFrac distance and Shannon diversity. Comparisons of Shannon diversity metric by sample type were determined through Wilcoxon Signed Rank-Sum test. Linear mixed model analysis was performed with the nlme (v 3.1–159) package in R. Pearson correlation was performed to assess the association between longitudinal stability and prediction accuracy. False Discovery Rate was used to adjust for multiple comparisons35. All statistics were performed in R (v 3.4.1).
The QIIME command shared_phylotypes.py was used to gather the shared OTUs between different sample types that were collapsed using the collapse_samples.py (normalized and in sum mode). In the home study, a correction factor was applied to the collapsed samples of the partner’s hand because there were only 6 participants that had partner samples of the 11 total participants.
Random forest analyses were done with the caret (v 6.0–77) and randomForest (4.6–12) packages in R. Random Forest models were generated using the repeated 10-fold cross-validation method repeated 3 times with default parameters. Genus level taxa were used to predict participant, excluding low prevalence taxa (taxa that were present in less than 10 samples). Random forests were run on all data for each environment (office and home, Fig. 3a and 4a respectively). Subsequent random forests were run using each sample type to train a model to test on the remaining sample types within each environment (Figs. 3 and Supplemental Table 3 for office, Figs. 4 and Supplemental Table 4 for home). The iterative modeling approach allowed for the predictive capability of individual sample types to be assessed and to alleviate concerns of “overfitting” that arose with the high accuracy of the random forests that included all sample types. The practice of separating the training and testing data ensured that the models were never exposed to the testing data prior to making predictions.
The datasets for this study can be found in the QIITA under study ID 14787 https://qiita.ucsd.edu/study/description/14787 and EBI-ENA under the accession number PRJEB56766. This manuscript complies with the STORMS reporting checklist, v1.0336.