Our primary goal was to develop Orthoptera-specific target hybrid enrichment probes that could capture hundreds of phylogenetically informative loci from any orthopteran species and resolve relationships across broad phylogenetic scales. We generated such a probe set using the myBaits technology to target 1,828 loci, including both fast-evolving and slow-evolving genes, designed from 80 orthopteran transcriptomes. We named our new probe set the OR-TE (ORthoptera Target Enrichment) probe set. We have shown that our OR-TE probe set can reliably capture an average of 1,009 loci from diverse orthopteran lineages and resolve expected phylogenetic relationships across broad timescales. Particularly, the probe set successfully captured loci from the 13 families that were not included in the probe design, which demonstrates the robustness of our design.
Some unique features of the OR-TE probe set deserve further discussion. While the probe set can capture hundreds of loci across all lineages within Orthoptera, it is more efficient in capturing target loci from Caelifera than Ensifera. The transcriptome data that we used for designing the probe set were biased toward Caelifera (58 species vs. 22 ensiferan species) mainly because our available data included many more grasshoppers belonging to two particular families (28 Acrididae and 15 Pyrgomorphidae). This bias in the initial design stage may have contributed to the differences in capture efficiency between the suborders. Nevertheless, the fact that we captured an average of 835 loci from Ensifera demonstrates the utility of our OR-TE probe set.
Another attractive feature of this probe set is that a single capture reaction can multiplex up to 24 libraries. Although we did find that the 24-plex capture reaction yielded slightly fewer loci than the 12-plex capture reaction (Fig. 2b, d), the difference was not statistically significant. This finding is relevant for reducing the cost of data generation because it demonstrates that one can likely use half the number of capture reactions to generate a comparable amount of data. One caveat is that the quality and quantity of DNA should be sufficiently high enough (1 µg high molecular weight genomic DNA) to multiplex up to 24 samples per capture reaction reliably. If degraded DNA (i.e., from dried museum specimens) is used, it is recommended to lower the number of samples (e.g. 12 samples per capture reaction) for multiplexing.
To achieve the desired capture efficiency, it is essential to consider the amount of raw sequence data generated per sample. The average amount of data generated per sample was 1.94 GB, and we show a positive correlation between the amount of data and the number of captured loci, especially for Caelifera. Orthopterans are known to have the largest genomes among all insects16, so it is essential to sequence deeply enough to recover captured loci. This is a unique problem for Orthoptera, as it has been shown that other insect orders with smaller genome sizes may not need nearly 2 GB of data per sample to capture targeted loci. We can reliably generate sufficient data for downstream analyses by sequencing two capture reactions, each pooled with 24 libraries in a single lane of HiSeq4000 (PE150), yielding approximately 90 GB of data.
Our OR-TE probe set will enable users to resolve higher-level relationships (family, superfamily) and lower-level relationships (genus, species). This study demonstrated the probe set’s ability to resolve higher-level relationships. We found a small number of taxa whose recovered phylogenetic positions were unexpected and that the NT and the AA datasets recovered phylogenetic trees with some topological differences. However, this is not necessarily due to an issue with our probe design. Previously, we successfully used the OR-TE probe set to generate phylogenomic data to resolve species-level relationships in the Jerusalem cricket genus Stenopelmatus (Weissman et al.34) and the lentulid grasshopper genus Eremidium (Song, unpublished). We also confirmed the probe set’s ability to capture targeted loci from dried museum specimens of various cricket species (Song, unpublished). Furthermore, the data generated using the OR-TE probe set could be combined in the future because the bioinformatics pipeline relies on a pre-defined reference gene set. This means we can continue adding taxa to the existing datasets to achieve greater resolution. The OR-TE probe set is thus a highly versatile tool useful at multiple taxonomic levels.
We designed the OR-TE probe set by identifying loci from 80 orthopteran transcriptomes across the phylogeny. Transcriptome data inherently include only the mRNA transcripts after splicing. Genomic DNA, a starting point for target enrichment, consists of both exons, which the baits can hybridise with, and introns, which the baits will not match. Orthoptera are known to have very large genome sizes16 and long intronic regions between the coding regions35. Because the OR-TE probe set was designed from post-splicing mRNA transcripts, some of these baits might target exons separated by long introns in genomic DNA. In such cases, the baits will not fully hybridise with the target loci using genomic DNA, and therefore be less functional in pulling down molecules. This is a potential limitation of using the OR-TE probe set although our baits are tiled, which should increase the chances of successful capture. Still, given that we could capture an average of 1,009 loci per taxon across the phylogeny, our baits must target many exons uninterrupted by intronic regions.
The number of loci captured per taxon differed widely. Even for the same locus, the lengths of captured regions often differed across taxa. This unequal recovery of loci across samples is a general feature of most hybrid enrichment techniques used for phylogenetically diverse taxa36. These gene and taxon sampling differences would naturally lead to a large amount of missing data in the final concatenated dataset. The negative effect of missing data in phylogenomic analyses is an important issue to consider, which has been investigated for analyses using RADseq data37, UCE data36,38, and AHE data39. In general, these studies have concluded that uneven missing data can potentially lead to spurious phylogenetic inference, and we agree that the effect of missing data should be explored in depth, especially for large-scale phylogenomic studies.
Interestingly, the unequal number of captured loci seemed to have little impact on phylogeny estimation in our taxon sampling. For instance, we could only capture 254 loci for the pygmy grasshopper Metrodora reticulata, but the placement of this taxon within Tetrigidae was consistent in all our datasets. The total number of recovered nucleotides for this species was 40,905 bp. As a comparison, we recovered 1,695 loci for the grasshopper Eremidium sambaba, which collectively included 414,858 bp and the phylogenetic placement of this species was also very consistent. This observation suggests that the relatively low number of captured loci, which still comprise tens of thousands of nucleotides, likely contained sufficient phylogenetic information to correctly place the species within the phylogeny.
Although our taxon sample was intentionally designed to test the phylogenetic utility of the OR-TE probe set rather than to test previous phylogenetic hypotheses, the resulting topology nonetheless revealed a few novel insights regarding the diversification of Orthoptera.
The phylogenetic position of Rhaphidophoridae, the sole member of the superfamily Rhaphidophoroidea (cave crickets and camel crickets), may need a critical re-evaluation. Traditionally, the suborder Ensifera is considered to consist of two infraorders, Gryllidea and Tettigoniidea, based on both morphological and molecular evidence28, and the most recent phylogenomic analysis recovered Rhaphidophoridae as the earliest diverging lineage within Tettigoniidea10. Morphologically, it is uniquely different from other ensiferans in that it is the only completely apterous family without the ability or the structures to produce sound and hear 40. In our dataset, this family’s placement within Ensifera changed depending on whether the data were analysed as nucleotides or amino acids. When the NT dataset was used, it was recovered at the base of Tettigoniidea (Fig. 3b). Still, when the AA dataset was used, it was recovered at the base of Ensifera (Fig. 3a).
Interestingly, when comparing the four transcriptome-based datasets differing in PDs used for exploring phylogenetic signals of different loci (Fig. 1a–d, Supplementary Figure S1), all of which were coded as amino acids, we recovered Rhaphidophoridae at the base of Ensifera in three datasets with the slow-evolving loci (1–9%, 10–19%, 20–29% mean PD), while in the expected position at the base of Tettigoniidea in one dataset with the fastest-evolving loci (30–45% mean PD). These observations collectively suggest that Rhaphidophoridae as a lineage could have experienced different rates of molecular evolution compared to other ensiferan lineages, which could have affected phylogenetic estimation. However, because these observations were based on a minimal sampling of the family, it is difficult to make a definitive statement about the cause of this discrepancy. How the data will behave when a much more extensive taxon sampling is included in the future remains to be seen, as the phylogenetic position of Rhaphidophoridae within Ensifera is essential for inferring the evolution of sound production and hearing.
Secondly, the monophyly of the families belonging to the superfamily Stenopelmatoidea needs to be critically tested. Stenopelmatoidea currently includes three families: Stenopelmatidae (Jerusalem crickets), Gryllacrididae (raspy crickets), and Anostostomatidae (king crickets, wetas, and Cooloola monsters). This superfamily includes about 1,200 described species that are morphologically diverse and ecologically interesting41 but remains poorly studied compared to other ensiferan groups, such as crickets and katydids. Previously, Vandergast et al.41 conducted a large-scale molecular phylogenetic study of the superfamily based on three loci and found Gryllacrididae to be monophyletic and the remaining three families paraphyletic. In our 46-taxa matrices, we included nine taxa belonging to Stenopelmatoidea and found the superfamily to be monophyletic but Anostostomatidae and Stenopelmatidae to be paraphyletic (Fig. 3). Gryllacrididae was recovered as monophyletic as expected, but this clade was recovered as a sister to a Central American anostostomatid Glaphyrosoma beretka, which did not group with other anostostomatids. Anostostomatidae shows a classic Gondwanan distribution, and most species within the family, except the New Zealand endemic wetas, are remarkably similar in terms of morphology41,42. This morphological convergence could have contributed to the current state of classification. We included just two representatives of Stenopelmatidae, Stenopelmatus piceiventris from Mexico and Sia sp. from South Africa. Still, they did not form a monophyletic group, which was also the pattern found in Vandergast et al.41. Our taxon sampling is too small to suggest a reclassification of the superfamily—still, our results point to the need to evaluate the current classification of Stenopelmatoidea further.
Finally, the superfamily-level relationships within Caelifera recovered using the OR-TE probe set largely agree with the current phylogenetic understanding. We recovered Tridactyloidea as the earliest diverging lineage, followed by Tetrigoidea, congruent with all previously published molecular phylogenies10,28. The phylogenetic relationships within the superfamily group Acridomorpha were largely consistent with all previous studies. The superfamily Acridoidea was not recovered as monophyletic in several analyses, including those using the most data (Fig. 4), because Pyrgomorphoidea was placed within Acridoidea. There is overwhelming morphological evidence that Acridoidea is monophyletic, especially based on male internal genitalia43, and thus, our results were not congruent with the current classification. In our 126-taxa analysis, all of the phylogenomic data for Pyrgomorphoidea and a large number of the remaining Acridoidea (including all of Acrididae and Romaleidae) came from transcriptomes, which included thousands of loci, much more than the loci generated using the OR-TE probe set. Perhaps because of the large number of shared loci, Pyrgomorphoidea could have grouped strongly within Acridoidea. A separate analysis, including more Pamphagidae, may correct this issue. Our analyses found Eumastacoidea to be paraphyletic because the Dominican Republic episactid Antillacris inflaticercus was nested among other members of Acridoidea. This pattern is difficult to explain because this species undoubtedly belongs to Eumastacoidea based on several morphological traits, so we suspect potential sequencing errors or contamination. The placement of Ixalidium sp., a wingless grasshopper from East Africa currently classified as an acridid, is also unexpected. Still, based on our preliminary examination of male genitalia and other traits of this species, there is a strong possibility that it does not belong to Acrididae but to an undescribed lineage more closely related to Tristiridae or Pyrgacrididae.
Target enrichment was introduced as a revolutionary new technique for molecular phylogenetics a decade ago12,13, and numerous insect phylogenomic studies have been published using this technique. However, it has remained challenging for researchers familiar with traditional molecular data generation (PCR and Sanger sequencing) to incorporate this new technique into their tool kits. The reasons for this challenge vary but are mainly due to cost and resource availability. Thus, below, we describe our experience developing this tool and the specific costs associated with each step to paint a realistic picture.
The first major cost for developing the OR-TE probe set was the expenses associated with generating transcriptomic data to use as a genomic resource for identifying targeted loci. Of the 80 orthopteran transcriptomes, the 1KITE project had previously generated 41, and nine were otherwise previously published10, and thus freely available. However, because the available transcriptome data did not cover the phylogenetic diversity within the order, we set out to include additional taxa to achieve diverse taxon sampling, especially by adding different subfamilies of Acrididae and Tettigoniidae, as well as previously unsampled families. Because RNA-grade samples had to be freshly collected from the field and directly preserved in RNAlater, we conducted domestic and international expeditions to collect live specimens. The costs associated with collecting expeditions are often not incorporated into the calculation of data generation, but it is a significant expense that cannot be overlooked. The cost of RNA extraction was estimated to be about $10 USD per sample, and at the time of data generation, we decided to outsource library preparation to a sequencing core. Library preparation cost was ~$180 per sample, and three lanes of HiSeq4000 cost $7,200. Thus, the upfront cost of building genomic resources by generating 30 new transcriptomes, excluding the costs associated with sample acquisition, was ~$12,900. This was a significant but necessary investment because there were not many orthopteran transcriptomes available from public databases at the time of this project. Now, these data are freely available for anyone to use.
The next significant costs were associated with manufacturing the baits and testing the capture efficiency of our probe set. There was no cost for designing the custom baits because we performed the bioinformatics to identify the targeted loci ourselves, and the bait design was performed collaboratively with Arbor. The smallest unit of myBaits custom target capture kit that we could purchase was the 16-capture-reaction kit, which cost $3,240 at the time of purchase. We estimated the cost of our high molecular weight DNA extraction to be about $5 per sample. We outsourced library preparation, target capture reaction, and Illumina sequencing to Arbor, and the total cost for generating data from the 36 taxa was ~$6,985, not including the cost of the target capture kit.
The specific dollar amount described here is possibly prohibitive for a typical research lab to generate phylogenomic data just for 36 taxa. We intended to pay the upfront costs of developing a new tool so that the researchers interested in the phylogenomics of Orthoptera do not have to be burdened and duplicate the efforts. Considering the amount of data generated per dollar, the OR-TE probe set is potentially the most cost-effective approach for generating phylogenomic data for Orthoptera. If one outsources all the steps of data generation beyond DNA extraction, we estimate the cost per sample to be around $150 - $200 to generate about 2 GB of data per sample. If one can perform library preparation and target capture in-house, the cost would drop below $100 per sample. This means the cost per gene would be $0.15 to $0.20 if all the steps are outsourced. Target enrichment offers an exceptional value because the cost of data generation by PCR and Sanger sequencing would be about $5 to $10 per gene. We anticipate that the cost of sequencing will come down more in the future, making this approach more affordable.
One important resource required for data processing and analyses of target enrichment data is appropriate computational infrastructure with bioinformatics expertise. We were fortunate to have institutional access to high-performance research computing (HPRC) clusters, which allowed us to handle a large amount of molecular data effectively and run pipelines in parallel. The target captured data are raw Illumina reads, which must go through initial quality control, filtering, de novo assembly, and orthology search before undertaking phylogenomic analyses. Because of the sheer data size, all of the bioinformatics pipelines need to be run using the HPRC clusters, and it is nearly impossible to process data on a single desktop computer. The computational time necessary for all downstream analyses, including alignment, data partitioning, and phylogenetic analyses, depends on the project’s scope. Still, these analyses do require significant computational resources as well. It is difficult to put monetary values to these computational usages. Still, our data processing and analyses would not have been possible if we did not have access to these resources. Thus, access to HPRC is a potential limiting factor for widely using the OR-TE probe set, especially for those researchers without these computational resources. However, this can be overcome by collaboration with those who have access or by using freely available computational resources, such as the U.S. National Science Foundation funded ACCESS (https://access-ci.org/about/).
In conclusion, we have developed a new phylogenomic tool using target hybrid enrichment specifically for Orthoptera that can resolve relationships over broad phylogenetic scales, thereby advancing the systematics of this important order of insects. With the capacity to reliably capture over 1,000 loci from any orthopteran taxa, our approach is the most cost-effective method for generating phylogenomic data within Orthoptera. While we have delved into its utilities and limitations, the analytical challenges associated with uneven missing data across taxa require more rigorous exploration. We envision widespread adoption of this new tool for the future of orthopteran phylogenetic studies, ushering in a new era of discovery and knowledge in this field.