Sample collection and processing
Collection and outgroup choice: We collected New Zealand cicadas in 95% ETOH from 1995 to 2018 and stored them at -80 Celsius until processing (See Table S1 for specimen details). Based on previous work 39, we included several Caledopsalta (Cicadettini, Cicadettinae) cicada specimens from New Caledonia as an outgroup comparison to the New Zealand cicada clade comprising Kikihia, Maoricicada, and Rhodopsalta. We also included several Magicicada (Lamotialnini, Cicadettinae) from North America. Although we focus on the aforementioned groups, our dataset also includes, for comparative purposes, Neotibicen (North American, Cryptotympanini, Cicadinae) and Platypedia (North America, Tibicininae) specimens but these are not discussed as part of our main findings. In summary, we caught cicadas by net or lured them using manually produced mating clicks, leveraging the attraction that male cicadas have for wing flicking sounds and movements associated with conspecific females 40,41. We identified each specimen to species in the field using a combination of song, morphology, and knowledge of their evolutionary history and distribution 42–44. One hybrid-descended lineage, K. “muta x tuta”, was sampled. This lineage possesses K. muta nuclear DNA, song, and morphology, but K. “tuta” mtDNA as a result of ancient hybridization and introgression of the mitochondrion 44. The identity of K. “muta x tuta'' samples were inferred based on the above criteria and the geographic distribution ofthe K. muta and K. “muta x tuta'' lineages. For specimens sampled from localities known to have both K. muta and K. “muta x tuta” lineages, we sequenced the mitochondrial COI gene and treated 100% matches to K. “tuta” sequences in GenBank as confirmation of K. “muta x tuta” identity.
Dissection: We sterilized specimens by submerging them in 2% household bleach, letting them sit for 1-2 minutes and then washing them in both 50-70% alcohol and sterile water. We then dissected specimens using small scissors, forceps and pins to access gut tissue ventrally. The complex structure of the cicada gut required prolonged (15-30 min) and relatively tedious dissection compared to insects with relatively simpler guts (Fig. S1). We either extractedboth gut and reproductive tissue or only gut tissue depending on the specific dataset produced (dataset specifications provided below). Tissue samples were either directly placed into Powersoil bead tubes or stored in sterile cryotubes and kept frozen until DNA extraction. All dissection equipment was sterilized with 10% bleach and then treated with UV light in a crosslinker for at least one minute prior to dissection. We carried out dissections over the course of several months, with 2-15 dissections on any given day.
We binned processed samples into three sample batches based on the timing and methodological differences in processing, the workers who processed them, and sampling design: B1, B2, B3 (Table S1). Dataset-specific variations are described as follows:
B1 Dataset: Combined gut and male reproductive tissue from Kikihia muta and Kikihia “tuta” representing nine parental populations and six previously identified hybrid populations extracted using a Powersoil DNA Isolation kit (MO BIO Laboratories) under the standard protocol. The entire purification process was performed using the Powersoil DNA Isolation protocol (not DNeasy).
B2 Dataset: Gut tissue from a broad sampling of New Zealand cicada species mostly within Kikihia extracted by mechanical lysing within Powersoil bead tubes containing Powersoil lysis buffer and subsequent DNeasy 96-well plate extraction under standard protocols (beginning after Proteinase-K treatment and incubation).
B3 Dataset: Gut tissue from a broad sampling of New Zealand cicada species including outgroup species from New Caledonia and various North American cicadas extracted by mechanical lysing within Powersoil bead tubes containing Powersoil lysis buffer and subsequent DNeasy 96-well plate extraction under standard protocols (beginning after Proteinase-K treatment and incubation).
Amplicon sequencing of 16s V4 rRNA
We amplified the V4 region of bacterial 16S rRNA using universal barcoded primers 515F (5’-GTGCCAGCMGCCGCGGTAA-3’) and 806R (5’-GGACTACHVGGGTWTCTAAT-3’) with attached Illumina-compatible adapters and indices (Microbial Analysis, Resources, and Services Facility, University of Connecticut) under the following PCR conditions: 95°C for 2 min, 35 cycles of 95°C for 30 sec, 55°C for 1 min, 72°C for 1 min, and then a final extension at 72°C for 5 min. All libraries were quantified with QIAxcel and manually inspected for proper marker alignment. We normalized libraries by pooling to the lowest concentration for each of the sample sets. Samples that could not be quantified due to low concentration were pooled using the maximum product available. Pooled samples were cleaned and size-selected using a bead-based approach. Final 16S amplicon libraries were sequenced paired end on Illumina MiSeq. Dataset-specific variations are described as follows:
B1 Dataset: Amplified with V4 16S rRNA primers as above using EmeraldAmp GT PCR Master Mix (TAKARA BIO).
B2 Dataset: Amplified in a separate facility (Microbial Analysis, Resources, and Services Facility, University of Connecticut) with V4 16S rRNA primers as above using GoTaq DNA Polymerase (PROMEGA).
B3 Dataset: First amplified with primers 27F (5’-AGAGTTTGATCMTGGCTCAG-3’) and 1492R (5’-GGTTACCTTGTTACGACTT-3’) using EmeraldAmp GT PCR Master Mix (TAKARA BIO) to minimize the amplification of non-bacterial taxa under the following cycling conditions: 94°C for 5 min, then 5 cycles of 94°C for 45 sec, 56°C for 45 sec, 72°C for 1.5 min, and then a final extension at 72°C for 10 min. The product was used as a template for amplification with V4 16S rRNA primers as in dataset B1.
Negative Controls
We took various controls when processing the B2 and B3 datasets, most of which were taken as part of the B3 dataset. Six PCR controls across the two datasets were taken (two controls in the B2 dataset and four controls in B3 dataset) during library amplification using V4 16s rRNA primers (Fig. S2, pcr and control). The remaining controls are associated with the B3 dataset and are as follows: Six dissection reagent controls of fluid used prior to dissection (Fig. S2, dissection), 10 controls of surface contents of forceps after transferring tissue from dissection plates to extraction tubes and sterilizing (Fig. S2, transfer), five surface sterilization controls of the fluid used to wash specimens prior to dissection (Fig. S2, wash), and six extraction kit controls from both the DNeasy and powersoil kits (Fig. S2, dneasy and powersoil).
16S rRNA Amplicon Data Processing
We denoised and merged reverse and forward reads in QIIME 2 45 using the DADA2 pipeline 46 separately for each dataset, with the exception of the B3 dataset in which we only considered forward reads due to high error rates in the reverse reads. We aligned the denoised sequences in MAFFT 47, filtered the alignments, and constructed a midpoint-rooted phylogeny using the “align-to-tree-mafft-fasttree” pipeline in QIIME 2. We classified amplicon sequence variants (ASVs) with the QIIME 2 “classify-sklearn” plug-in after training a classifier with the “fit-classifier-naive-bayes” plug-in on sequences of the V4 region of 16S rRNA extracted from SILVA 99% OTUs (Release 132) database. The resulting feature tables, phylogenies, classifications, and sample metadata were primarily processed using the R package phyloseq 48. We removed ASVs classified to mitochondria, chloroplast, cicada bacterial endosymbiont (Hodgkinia and Sulcia), eukaryote, and archaea. To minimize noise introduced by sequencing errors, unclassified taxa, and low sequencing coverage, we removed ASVs that could not be classified at the Kingdom, Phylum, or Class taxonomic levels and ASVs with a total abundance across all samples within a dataset of less than three. We included only samples with a total abundance greater than 100. The remaining ASVs per dataset were further collapsed using the “tip_glom” function in phyloseq to cluster ASVs based on cophenetic distances with a tree height of 0.03. We then used the R package decontam 49 to identify putative contaminants. However, we were unsuccessful in identifying contaminants with our post-PCR DNA concentration data using the frequency-based filters under reasonable thresholds due to large inter-sample variability in the presence and abundance of different ASVs. Instead, we relied on control samples in the B2 and B3 datasets. We used the prevalence-based filter with a threshold of 0.3 to remove ASVs that were enriched in the controls and then we subsequently removed all ASVs in all datasets that were classified to these putative contaminant genera (Fig. S2).
Quantitative PCR (qPCR) of 16S rRNA amplicons
To estimate how initial copy numbers of target molecules may differ across our samples, we submitted a subset of samples for qPCR using universal V4 region primers and a CFX Opus 384 Real-Time PCR System. We sought to assess whether quantification of samples after PCR but before pooling would be predictive of the number of target molecules initially present. Thus we analyzed DNA extracts of 24 samples including those of six negative controls and other samples with a range of expected amplicon copy numbers based upon initial quantification as well as sequencing results (Fig. S3 & Table S4). Samples were quantified in triplicate and average initial amplicon copy number estimated based upon correlation to curve calculated with standard.
Cicada Mitochondrial Genome Phylogeny
We assembled host mitochondrial genomes using off-target capture data from an anchored hybrid enrichment dataset of worldwide cicada lineages 50,51. We first deduplicated merged and unmerged reads using the function clumpify in BBMap 52, trimmed adapters and reads with Q < 20 using Trimmomatic 53, and then assembled reads using both merged and unmerged data in SPAdes v. 3.12.0 54,55. We extracted mitochondrial contigs from the resulting assembly using tblastn with a published partial K. muta reference mitochondrial genome – Genbank MG737737 56 – used BWA v. 0.7.5a 57 in a second processing step, and then reassembled the resulting reads with SPAdes 3.12.0. The final sets of mitochondrial contigs for each sample were aligned to various published and nearly complete mitochondrial genomes 56 using the MAFFT v. 7 E-INS-i algorithm 58, combined into single mitochondrial genome sequences, and manually edited to exclude misassembled regions in Geneious v. 10.1.3 59. We used these mitochondrial sequences as sample-specific baits in MITObim v. 1.9.1 60 to assemble improved, higher-quality mitochondrial genomes that were aligned with the MAFFT v. 7 E-INS-i algorithm and manually edited in Geneious v. 10.1.3 for a final mitochondrial genome alignment. We designed a partitioning scheme that included combined 1st and 2nd codon positions and the 3rd codon position for each protein-coding gene, partitions for each of the 12S rRNA and 16S rRNA loci, and a single partition for all tRNAs. We ran a maximum likelihood analysis with this partitioning scheme in RAxML v.8 61 on the CIPRES web server 62 to produce a final phylogeny.