Taxonomic sampling and data compilation
Our sampling included representatives of the nine formally recognized species within the R. melanophthalma species complex [31] and two outgroup taxa – R. subdiscrepans (Nyl.) R. Sant. and Protoparmeliopsis peltata (Ramond) Arup, Zhao Xin & Lumbsch (Additional file 1). For this study, we analyzed short-read metagenomic data from a total of 33 specimens, representing ten Rhizoplaca s. lat. species (Leavitt et al., 2016), including: R. arbuscula Rosentreter, St. Clair & Leavitt (n=2), R. melanophthalma (DC.) Leuckert (n=7), R. novomexicana (H. Magn.) S.D. Leav., Zhao Xin & Lumbsch (n=1), R. parilis S.D. Leav. (n=4), Fern.-Mend., Lumbsch, Sohrabi & St. Clair, R. occulta S.D. Leav. (n=2), Fern.-Mend., Lumbsch, Sohrabi & St. Clair, R. polymorpha S.D. Leav., Fern.-Mend., Lumbsch, Sohrabi & St. Clair (n=6), R. porteri S.D. Leav., Fern.-Mend., Lumbsch, Sohrabi & St. Clair (n=5), R. shushanii S.D. Leav., Fern.-Mend., Lumbsch, Sohrabi & St. Clair (n=5), and single representatives of two outgroup taxa – R. subdiscrepans and P. peltata. Initial identifications were made based on morphology and DNA barcode identification using the ITS region [31]. All specimens used in this study are housed at the Herbarium of Non-Vascular Cryptogams (BRY-C) at Brigham Young University.
In order to more fully characterize the range of ITS diversity in the Rhizoplaca melanophthalma species complex, amplicon-based sequence data was generated for a total of 496 specimens from the R. melanophthalma species complex collected from sites throughout western North America, the center of species diversity for this group [30]. For all new specimens, DNA was extracted using the Wizard Genomic DNA Purification Kit (Promega), and for amplification and sequencing of the ITS marker followed previously described methods were used [29]. Newly generated sequences were combined with previously available nrDNA sequence data from the Rhizoplaca melanophthalma group (https://treebase.org/; study No. 19048).
Short-read data, genome assembly, and identification of the nuclear ribosomal operon
Short reads from 33 Rhizoplaca specimens reported in a previous study [32] were used for genome assembly to identify contigs containing the nuclear ribosomal operon. Full details of specimen preparation and sequencing are described in [32]. In short, DNA was extracted from a small fragment of each lichen thallus (comprised of multiple lichen symbionts, including the targeted mycobiont), libraries were prepared using the Illumina Nextera XT DNA library prep kit (product discontinued), then pooled and sequenced using a single lane on the Illumina HiSeq2000 platform, generating 100-bp paired-end reads with a 350-bp insert size. Four specimens were sequenced individually on the MiSeq platform (Illumina) generating 250-bp paired-end (PE) reads with a 550-bp insert size. While reads from the axenic reference culture (‘mela_REF’) were exclusively derived from the targeted R. melanophthalma fungal genome, genomic libraries prepared from all field-collected specimens were comprised not only of DNA from the targeted mycobiont, but also DNA from the complete holobiont, e.g., associated Trebouxia photobiont, secondary fungi, bacteria, etc. [46-50].
All paired-end (PE) reads were filtered using Trimmomatic v0.33 [51] before assembly to remove low-quality reads and/or included contamination from Illumina adaptors using the following parameters: ILLUMINACLIP; LEADING:3; TRAILING:3; SLIDINGWINDOW:4:15; and MINLEN:36. De novo genome assemblies were constructed using SPAdes v3.5.0 [52] while running a single read error correction iteration prior to the genome assembly using kmer values of 55, 77, 99, with the mismatch careful mode (--careful) enabled. From each assembly, contigs containing the nrDNA were identified using a custom BLAST [53] search implemented in the program Geneious R11 [54] against available regions of the nrDNA, e.g., 28S, IGS, and ITS, generated from Rhizoplaca melanophthalma s. lat. specimens.
Some regions of the nrDNA operon are highly conserved across divergent lineages (e.g. 18S, 5.8S, and portions of 28S subunit), and reads from non-target genomes (e.g., the photobiont, accessory fungi) may potentially bias the interpretation of intragenomic variation within Rhizoplaca species. Therefore, we used a de novo assembly approach for all nrDNA reads in order to separate nrDNA cluster of the targeted Rhizoplaca mycobiont from reads of other symbionts that co-occur within lichens. For each specimen, PE reads were mapped back to the respective contigs containing nrDNA from the SPAdes assembly using the Geneious R11 Read Mapper, with the “medium-low sensitivity/ fast” settings, iterated 5 times. Successfully mapped reads were then assembled de novo using the native Geneious R11 Assembler using ‘medium-low sensitivity’ parameter. Resulting contigs were searched against NCBI’s GenBank database [55] using BLAST [56] to identify non-target contigs, which were then excluded from further analysis.
Assessing intragenomic variation of the ITS, inferring copy number of the ribosomal operon, and intron identification
Our assessment of potential intragenomic variation focused on the ITS region (ITS1, 5.8S, and ITS2) [9]. To identify potentially polymorphic sites in the nrDNA, PE reads from each specimen were mapped back to their corresponding ITS region extracted from the Geneious assembly, with a 600 bp buffer on either end, using the BWA [57]. The 600 bp buffer on either end was used to ensure that all reads containing portions of the ITS region were indeed mapped back to the reference rather than being discarded because part of the read mapped to a region of the ribosomal operon before or after the ITS region. The Samtools v1.6 genomics utilities package [58] was used to process alignment output, filtering out unmapped reads so that only reads corresponding to the ITS and bordering regions remained. A samtools pileup file was then generated to identify the bases aligned with each position of the reference sequence which was visually confirmed using Geneious v11. Custom R scripts were used to identify mismatches and calculated percent variance at each position in the pileup file (available on GitHub: https://github.com/MSBradshaw/LichenBarcoding). To calculate percent variance, the number of reads that varied from the consensus at each nucleotide position character was divided by coverage at that location. When calculating percent variance, there was no effort was made to identify bases within a read that were sequencing error versus true variation, e.g., sequencing error and true variation would be distinguishable based on the percent variance. Sequencing error would be comparable to known error rates of the sequencing technology, ≤0.1% is achieved for ≥ 75-85% of base [59], while true intragenomic variation would exceed the error rate for Illumina sequencing.
To estimate the total copy number of the nrDNA operon, we compared average read depth coverage of nrDNA relative to coverage of putative single-copy regions of the nuclear genome [7]. For each specimen, reads were mapped back to their respective Geneious assembly of the nrDNA operon, the three largest contigs (ca. 363 kb, 307kb, and 227 kb, respectively) from the draft genome assembly from the axenic culture [32], and three known single-copy genes, MCM7, RPB1 and RPB2. The average coverage depth of the nrDNA operon was divided by the average coverage depth of the nuclear single copy genes and nuclear genomic regions. The difference in coverage was interpreted as an approximation of the copy number of the nrDNA operon.
We used Rfam models to demark boundaries of 18S, 5.8S, and 28S regions and identify introns [60]. We used an annotated nrDNA sequence from Saccharomyces paradoxus (GenBank accession No. BR000309) to corroborate boundaries between the 18S, ITS1, 5.8S, ITS, 28S, and IGS regions and introns inferred using Rfam. In contrast to most other eukaryotic genomes, yeast genomes have few introns [61] and the highly conserved 18S & 28S regions in S. paradoxux can help demark boundaries among nrDNA regions. A multiple sequence alignment (MSA) of the nearly complete nrDNA operon assemblies from the Rhizoplaca specimens and the S. paradoxux sequence was performed using the program MAFFT v7 [62, 63], implementing the FFT-NS-i iterative refinement method. No attempt was made to distinguish different intron types – e.g., group I, group II, and spliceosomal introns.
A group I intron at the 3’ end of the SSU has previously been shown to be present in all species within the R. melanophthalma group, except R. porteri [29]; and the absence of this intron serves as a diagnostic character in the description of this taxon [31]. However, PCR amplifications may not provide an accurate assessment of repetitive genomic regions due to PCR bias or an overwhelming signal from the most commonly amplified variant. Therefore, to verify the absence of this group I intron, we attempted to map reads from R. porteri specimens to a consensus sequence representing this intron with the Geneious v11 Read Mapper, using the “medium-low sensitivity/ fast” settings, iterated 5 times. To test if this group I intron might be absent in some copies of nrDNA found in other species in the R. melanophthalma group, we searched PE reads from all R. arbuscula, R. melanophthalma, R. parilis, R. polymorpha, R. porteri, R. occulta, and R. shushanii specimens for the conserved motif lacking the intron using a custom script.
Multiple sequence alignments and phylogenetic inferences
An initial multiple sequence alignment (MSA) of the nearly complete nrDNA operon assemblies from the Rhizoplaca specimens (n=33) was performed using the program MAFFT v7 [62, 63], implementing the FFT-NS-i iterative refinement method. To improve alignment accuracy for specific phylogenetic comparisons of different regions within the ribosomal operon, individual alignments were constructed independently for the 18S, ITS (ITS1, 5.8S, ITS2), 28S, the IGS, and each intron present in the 28S and 18S regions. After excluding introns, MSAs of the 18S, 28S nrDNA, and ITS region were aligned in MAFFT using the G-INS-i algorithm. The IGS and intronic regions were aligned individually using the E-INS-i algorithm for sequences with conserved domains and long gaps.
Previous studies have indicated that species in the R. melanophthalma species complex can be distinguished using phylogenetic inferences based on the standard barcoding marker for fungi (Leavitt et al., 2011; Leavitt et al., 2013a). Here we also investigated the question as to whether or not species within this complex can be recovered as monophyletic using other regions of nrDNA for phylogenetic inference. We inferred phylogenies from different regions of the ribosomal operon: (i) 18S nrDNA, excluding introns; (ii) introns within the 18S region; (iii) 28S nrDNA, excluding introns; (iv) introns within the 28S region; (v) concatenated 18S and 28S nrDNA, excluding introns; (vi) concatenated introns from both 28S and 18S regions, (vii) the IGS region; and (viii) a complete matrix comprised of 18S and 28S nrDNA, and associated introns, and the IGS region. Only introns that were present in all of the ingroup samples – the R. melanophthalma group – were included in phylogenetic analyses to minimize bias from highly mobile introns that may have been incorporated or lost more recently than the most recent common ancestor of the R. melanophthalma group.
Maximum likelihood (ML) topologies were inferred individually from each of these regions individually using the program RAxML v8.2.2 [64]. All ML analyses were performed using the CIPRES Science Gateway server (http://www.phylo.org/portal2/), using the ‘GTRGAMMA’ model and evaluating nodal support using 1000 bootstrap pseudo-replicates. A ML topology was also inferred from the complete nrDNA matrix using RAxML, treating each region (IGS, SSU [including associated introns], ITS, and LSU [including associated introns]) as separate partitions; otherwise analyses were performed as described above.
To compare the nrDNA sequence variation of the 33 specimens sampled from the R. melanophthalma group within the context of a broader sampling of specimens, we compiled a nrDNA matrix comprised of IGS and ITS sequences from a previous study [65] with our newly sampled specimens, resulting in a total of 496 specimens (https://treebase.org/; study No. 24225). For comparison, a number of specimens were represented by multiple ITS sequences, including those assembled from PE reads for this study and sequences generated using Sanger sequencing from the initial DNA extractions used for high-throughput sequencing library preparation. A ML topology was inferred from this larger dataset using IQ-tree v1.6.3 [66], treating each region (IGS and ITS), with 1,000 ultra‐fast bootstrap replicates [67] to assess nodal support, and the best‐fit substitution model as predicted by ModelFinder [68]