Citizen science dust sampling campaign
To increase the number of study houses and cover a broad geographical area, citizen scientists were recruited through scientific networks and diverse actions in social/public media: Facebook website, outreach articles in the Titan newspaper [49] from the University of Oslo (UiO) and the Agarica magazine from the Norwegian Mycological Society, as well as a radio interview at the Norwegian public broadcasting. A total of 359 volunteers signed up in this study and provided relevant information about their houses by filling out an online questionnaire (details in the section Environmental data). Sampling kits (Additional file 1: Fig. S1), including instructions, return envelope, three sterile FLOQSwabs in tubes (Code 552C, Copan Italia spa, Brescia, Italy) and two adhesive tapes (Mycotape2, Mycoteam AS, Oslo, Norway), were sent to volunteers by post.
For DNA metabarcoding analyses, three dry dust samples were swabbed from different compartments in each house: outside (entrance door), living room (main room) and bathroom. The samples were preferentially collected from the upper surface of doorframes, but in cases where this was not possible, similar areas on shelves or windowsills were sampled. As stated in previous studies [18, 19], these selected areas, with little contact from the house occupants, act as passive collectors of dust that was deposited during an unknown amount of time. In addition, one adhesive tape was collected from shelves or windowsills in the living room to calculate the percentage of dust coverage, which was later included as an environmental variable in the study. Samples were sent back to UiO by post, where they were registered and the swabs were stored at -80 °C until DNA extractions. Whereas the adhesive tapes were immediately scanned using a scanner Epson Perfection V850 Pro (Seiko Epson Corporation, Nagano, Japan), and the percentage of dust coverage was calculated on a surface area of 45 × 18 mm by image analysis using the Olympus Stream v 1.9 software (threshold at the maximum value 61100).
This large-scale sampling campaign was mostly conducted in May 2018 (from 27th April to 5th June). Overall, we received 869 swabs from 290 houses. However, 57 samples failed during the DNA laboratory works (extractions, PCR and library preparation). Thus, the HTS was performed on 812 dust samples from 271 houses, including two houses from Svalbard.
Environmental data
Meta data about the study houses and their occupants were provided by the volunteers in the online questionnaire at UiO website. In addition to the location of houses, with the complete addresses and their corresponding geographic coordinates (latitude and longitude), the following variables (with categories for factor variables) were extracted from the questionnaire: building type (detached house/semi-detached house/block), area (urban/rural), construction year, building material (wood/brick and concrete), ventilation type (natural/mechanical/balanced), number of people, number of children, number of females, pets (no/dog/cat), allergies (no/pollen/food/skin), asthma (yes/no), moisture problem (yes/no), water damage (yes/no), odour problem (yes/no) and pests (no/mousses/rats/grey silverfish). Data about the location of dust samples in the house were included as two factor variables: house compartment (outside/living room/bathroom) and indoor vs. outdoor (indoor/outdoor).
Based on the geographic coordinates of study houses, data for six relevant WorldClim 2 bioclimatic variables (annual mean temperature BIO1, temperature seasonality BIO4, mean temperature of driest quarter BIO9, mean temperature of warmest quarter BIO10, mean temperature of coldest quarter BIO11, and annual precipitation BIO12), at 30 seconds resolution (~ 1 km2), were extracted using the R package dismo following the authors’ instructions [50]. Moreover, data for 116 environmental variables, related to geology, topography, climate and hydrology, were also explored. They were the explanatory variables analyzed in a recent study modelling the vegetation types in Norway [51], and were kindly provided by their authors. The contribution of the numerical variables, 46 from this dataset plus the 6 previously extracted from WorldClim, were evaluated by principal component analysis (PCA; Additional file 1: Fig. S7). Based on PCA results, 10 numerical variables were finally selected for the statistical analyses: the six detailed WorldClim bioclimatic variables, growing season length (The Norwegian Metereological Institute, MET), snow covered area in February (sca-2, MET), snow water equivalent in April (swe-4, MET), and potential incoming solar radiation (Geodata AS). Two additional factor variables: land cover AR50 (developed area/agricultural area/forest/barren land/bog and fen/fresh water; Norwegian Institute of Bioeconomy Research, NIBIO) and bedrock nutrient (poor/average/rich; Norwegian Geological Survey, NGU), were included in the final selection (Fig. 1a).
Fungal DNA metabarcoding: DNA extraction, amplification and sequencing
DNA was extracted from the swabs using chloroform and the EZNA Soil DNA Kit (Omega Bio-tek, Norcross, GA, USA). Swab tips were transferred to the kit Disruptor tubes that contain glass pearls and 800 µl SLX-Mlus buffer. After a first bead-beating cycle (1 min at 4.5 m‧s-1) using the FastPrep-24 homogenizer (MP Biomedicals. Irvine, Ca, USA), the samples were frozen at -20ºC for at least 30 min. Afterwards, samples were incubated at 70ºC for 15 min, and again shaken at FastPrep-24 (2 cycles of 30 s at 4.5 m‧s-1). These successive thermal-shocking and bead-beating steps were carried out to get a proper homogenization of dust samples and lysis of fungal conidia and spores. After adding 600 µl chloroform, samples were vortexed for 30 s and centrifuged at 13,000 rpm for 5 min. DNA from the aqueous top phase was further purified using the HiBind DNA Mini Column following the kit’s instructions. Final DNA extracts were eluted in 30 µl EB buffer and quantified using the fluorometric Qubit dsDNA HS Assay Kit (Invitrogen, Thermo Fisher Scientific, Waltham, MA, USA). Low DNA yields, ranged from 0.05 to 1 ng‧µl-1, were recovered from the swabs, which were expected considering the small amount of dust collected with dry swabs. Nine blank controls (unused sterile swabs) from different extractions batches were included through the complete DNA metabarcoding protocol.
The internal transcribed spacer 2 (ITS2) region of the nuclear rDNA were amplified using the primers gITS7 5′-GTGARTCATCGARTCTTTG-3′ [52] and ITS4 5′-TCCTCCGCTTATTGATATGC-3′ [53]. The selected marker and primers have shown a good species resolution in previous mycobiome studies of diverse environments. Both forward and reverse primers were designed with 96 unique tags (barcodes) of 7–9 bp at the 5′-end, which differed in at least three positions from each other. To avoid tag switching errors [54], samples were combined in pools of 96 samples, each with a unique tag combination (Additional file 1: Table S3). Nine pools (96 samples each) were analyzed in this study, each of them included an extraction blank, a PCR negative and a mock community that was used as positive control. Positive controls contained 1 ng of an equimolar mixture of DNA from three fungal species that are not expected in the Norwegian built environment: Mycena belliarum, Pycnoporellus fulgens and Inonotus dryadeus. They were included to evaluate the efficiency of the DNA metabarcoding workflow, and more specifically, to assess potential tag switching errors. In total, 17 dust samples were duplicates, as technical replicates across different PCR libraries.
PCR reactions in 25 µl contained 1 U of AmpliTaq Gold DNA polymerase (Applied Biosystems, Thermo Fisher Scientific), 0.4 µM of each primer, 0.8 mg/ml bovine serum albumin (BSA; Thermo Scientific, Thermo Fisher Scientific), 1X Buffer II, 2.5 mM of MgCl2, 0.2 mM each of dNTPs and 4 µl of DNA extract (~ 0.2-4 ng of DNA depending on dust sample). Amplifications were carried out using the following cycling parameters: an initial denaturing step at 95 °C for 5 min followed by 35 cycles consisting of 30 s at 95 °C, 30 s at 55 °C, and 1 min at 72 °C, and a final elongation step at 72 °C. PCR products of each library were initially purified and normalized using a SequalPrep Normalization Plate Kit (Applied Biosystems, Thermo Fisher Scientific), and subsequently pooled. After an additional purification using 0.8 volume of Agencourt AMPure XP magnetic beads (Beckman Coulter, Inc., CA, USA), DNA concentration and length of the final pooled amplicons were checked using Qubit dsDNA HS Assay Kit and Bioanalyzer High Sensitivity DNA chip (Agilent Technologies, Inc., Santa Clara, CA, USA), respectively. Sequencing was carried out at Fasteris SA (Plan-les-Ouates, Switzerland) using the Metafast protocol, which incorporated Illumina adapters using a PCR-free ligation procedure to minimize errors such as chimera formation and tag switching. Three full Illumina 250 bp paired-end MiSeq v3 runs (Illumina Inc., San Diego, CA, USA) were used. Each run included three pooled libraries labelled with specific indexes in their Illumina barcodes. The complete resulting dataset contained 55,568,124 paired reads and are deposited in Dryad Digital Repository [55].
Bioinformatics pipeline
After an initial quality checking of sequencing results, using FastQC [56], samples were demultiplexed independently (R1 and R2) with CUTADAPT v 1.8 [57] allowing zero miss matches in tags and primers; these were simultaneously removed along with sequences shorter than 100 bp. The demultiplexed R1 and R2 reads were kept separate for the next steps using DADA2 v 1.12 [58]: (i) quality filtering and trimming [filterAndTrim with options maxN = 0, maxEE = 2.5, trimLeft = c(10,10), truncLen = c(200,200), rm.phix = TRUE, compress = TRUE, matchIDs = TRUE, multithread = TRUE], (ii) dereplication [derepFastq with option verbose = TRUE], (iii) generating error models and denoising [dada with options err = NULL, pool = "pseudo", selfConsist = TRUE], (iv) merging in contigs [mergePairs with options verbose = TRUE, minOverlap = 5], (v) creating the table of amplicon sequence variants (ASVs) [makeSequenceTable], and removal of chimeras [removeBimeraDenovo with options method = "pooled", verbose = TRUE]. Additional clustering of sequences in OTUs was done using VSEARCH v 2.11.1 [59] at 98% similarity. This clustering level is similar to the 98.5% level used to define the species hypotheses (SHs) in the UNITE database [60]. OTUs containing only one read (singletons) were removed after clustering. To correct for potential over-splitting of OTUs due to remaining sequencing errors, the OTU table was curated using LULU [61] with default settings. An initial matchlist (sequence similarity) was created with blastn [62] [options: -qcov_hsp_perc 80 -perc_identity 84], for the subsequent LULU run [options: minimum_match = 84, minimum_relative_cooccurence = 0.95]. Taxonomic assignment of the OTUs was carried out using VSEARCH against the eukaryotic ITS dataset from UNITE v 8.0 [63].
In the resulting OTU table, we initially kept and identified all dust biodiversity captured by this DNA metabarcoding approach, including mostly fungi but also members of the clade Viridiplantae (green plants). Previous studies have reported that gITS7/ITS4 primers can also amplify plant DNA [52]. Two filters were subsequently applied on the table to select the OTUs that contained at least 10 reads, and showed at least 70% of identity in the taxonomic assignment. Finally, we selected the OTUs assigned to the kingdom Fungi on the quality filtered table. To refine the taxonomic annotation of the top-100 most abundant fungi, a double-checking was done on those OTUs that initially failed at the species level. This was performed using BLAST + v 2.8 against both UNITE and NCBI [64] databases. The key steps of this bioinformatics pipeline, as well as the resulting of numbers of reads and OTUs throughout the pipeline, are summarized in the Table S4 (Additional file 1).
Assessment of control and replicates samples
Prior to filtering the fungal OTUs, the quality of controls and replicates were assessed on the matrix that contained 8,033 OTUs, 88.5% attributed to Fungi, 11.2% to Viridiplantae (green plants mostly belonging to the phyla Streptophyta and Anthophyta), and the remaining 0.2% (19 OTUs) corresponded to other kingdoms. The number, identity and abundance of OTUs in the controls (extraction blanks, PCR negatives and positives) were checked and corrected considering their frequency in the study samples. All positive controls (mock community of three fungal species), included in the nine sequencing libraries, showed an identical pattern composed of the same four OTUs. The three major OTUs corresponded to the mock-community members, identified as Mycena belliarium, Pycnoporellus fulgens and Inonotus hispidus, which represented ~ 99.96% of reads present in positive controls. The additional minor OTU (~ 0.04% of reads) detected in the positives corresponds to Saccharomyces sp. (OTU3), one of the most abundant and widely distributed OTU in the whole dataset. Remarkably, reads from mock species were exclusively detected in the positive controls, with the exception of a few reads (< 23) present in two dust samples, suggesting that the tag switching rate was insignificant in this study.
Regarding the negative controls, six extraction blanks (unused sterile swabs) and three PCR negatives contained a relatively low number of reads, representing an average of 4.1 ± 2.6 OTUs per negative control. After checking the abundance and frequency of these OTUs in the study samples, two of them (< 10 reads in two samples) were deleted. The remaining 22 OTUs were kept because they were widely distributed in the dataset and correspond to ubiquitous fungi in the built environment.
The similarity of the community profiles for 17 technical replicates (duplicates in different PCR pools and sequencing libraries) was confirmed by NMDS (Additional file 1: Fig. S8), and the replicate with lower number of reads were discarded. Hence, confirming the reproducibility of the DNA metabarcoding workflow.
Statistical analyses
Statistical analyses were conducted in R v 3.5.2 [65] through RStudio v 1.2.1335. Tidyverse v 1.2.1 [66] and the vegan v 2.5-6 [67] R packages were used for data manipulation and plotting, and ecological analyses, respectively. The most relevant R functions used are detailed below, remarking when they belong to R packages different from vegan. Initially, the OTU table was rarefied (× 10 times resampling with the median value taken per OTU) to 2,000 reads per sample using the function rrarefy, and further adapted for the three datasets: all samples (full dataset), indoor samples, and outdoor samples.
Abundance of fungi was estimated using two different ways: (i) relative abundance of OTUs as percentages of rarefied reads on the total count in the dataset, and (ii) relative abundance of a certain taxa (at different levels, e.g. phylum, order and genus) as percentage of rarefied number of reads per sample. Prevalence of OTUs was also calculated as percentages of samples or houses in which each OTU was detected, based of rarefied tables.
Alpha-diversity of samples was assessed calculating species richness (number of observed OTUs) and evenness (equitability between OTUs), as well as Shannon, Simpson and Chao1 indices, on the rarefied OTU table. Significant differences in the variance of these parameters were evaluated with the analysis of variance (ANOVA) test. Beta-diversity was assessed with NMDS ordination of both dust samples and OTUs using metaMDS, Bray-Curtis dissimilarity index and 200 random starts in search of stable solution. NMDS analyses were done on the Hellinger-transformed rarefied OTU tables, after testing three transformations: Hellinger and log using the function deconstand, and Cumulative Sum Scaling (CSS) using cumNorm of the metagenomeSeq package [68]. Continuous environmental variables and alpha-diversity indices were regressed against NMDS ordination and added as vectors on the ordination plots using the function gg_envfit of the package ggordiplot v 0.3.0 [69] to visualize their association with the dust mycobiomes. In addition, beta-diversity was also assessed using the function betadisper to test the homogeneity of variance in different groups of samples.
To evaluate the correlation between each environmental variable and the observed variance in fungal community composition, permutational multivariate analysis of variance (PERMANOVA; 999 permutations) with the function adonis2 was used. The effects of four groups of variables (building, occupants, climate and house compartment), were assessed by variation partitioning analysis (VPA) based on the Bray-Curtis dissimilarities using the function varpart and vegdist.
To evaluate the overlap between outdoor and indoor mycobiomes, we compared the OTUs detected in the three house compartments using two different estimates: percentages of OTUs on the total counts in the dataset, and percentages of OTUs per house. Indicator species analysis was performed to reveal the significant associations (p < 0.05) between OTUs and some relevant environmental variables related to the house compartments, building features (e.g. construction materials, water damages and moisture problems) and occupants (e.g. presence of allergies and pets). These analyses were performed using the multipatt function of the indicspecies package [70] in R. Pearson's correlation coefficients and their corresponding p values were calculated to explore the associations between OTUs and the continuous variables related to climate and number of occupants.
Finally, to unravel the most relevant variables predicting (i) the species richness per sample, and (ii) the percentage of shared OTUs between indoor and outdoor, we conducted GLM analyses using the glm function. A forward selection was performed using AIC in order to assess model improvement in comparison with the null model.