Orthoptera-specific target enrichment (OR-TE) probes resolve relationships over broad phylogenetic scales

doi:10.21203/rs.3.rs-3918796/v1


Article


This work is licensed under a CC BY 4.0 License


Phylogenomic data are revolutionizing the field of insect phylogenetics. One of the most tenable and cost-effective methods of generating phylogenomic data is target enrichment, which has resulted in novel phylogenetic hypotheses and revealed new insights into insect evolution. Orthoptera is the most diverse insect order within Polyneoptera and includes many evolutionarily and ecologically interesting species. Still, the order as a whole has lagged behind other major insect orders in terms of transitioning to phylogenomics. In this study, we developed an Orthoptera-specific target enrichment (OR-TE) probe set from 80 transcriptomes across Orthoptera. The probe set targets 1,828 loci from genes exhibiting a wide range of evolutionary rates. The utility of this new probe set was validated by generating phylogenomic data from 36 orthopteran species that had not previously been subjected to phylogenomic studies. The OR-TE probe set captured an average of 1,009 loci across the tested taxa, resolving relationships across broad phylogenetic scales. Our detailed documentation of the probe design and bioinformatics process is intended to facilitate the widespread adoption of this tool.

Biological sciences/Zoology/Entomology

Biological sciences/Evolution/Phylogenetics

Biological sciences/Evolution/Taxonomy

phylogenomics

hybrid capture

transcriptome

systematics

With advances in high-throughput sequencing and bioinformatics techniques, we are witnessing a revolution in insect phylogenetics^1–4. Novel phylogenetic hypotheses for major insect orders have routinely been proposed based on phylogenomic data^5–11, confirming and challenging previous hypotheses based on a small number of genes and/or morphology. One of the most promising methods for phylogenomic data generation is a technique known as target enrichment, which uses hybrid capture probes (or baits) to collect specific genes of interest before sequencing¹². Generally, the targeted orthologs are single-copy and exhibit an appropriate amount of variation for phylogenetic analysis. The targeted genes are enriched after hybridization, greatly increasing the coverage of genes of interest for final sequencing¹³. Due to the enrichment process, this method does not require live specimens or freshly collected samples, and it can even be applied to dried museum samples, thus facilitating a broad sampling of taxa¹⁴. Ultra-Conserved Elements (UCE)¹⁵ and Anchored Hybrid Enrichment (AHE)¹³ are the two most frequently used approaches for insect phylogenomics using target hybrid enrichment. Though they have different strategies for target selection, both systems capture highly conserved loci in sequencing libraries and use sequence variation within and flanking those targets to infer phylogenetic relationships at various scales.

Genomic resources for identifying target loci are prerequisites for developing target enrichment probes¹⁴. High-quality reference genomes are the gold standard for identifying orthologs that can be used for phylogenomics, but such genomes are not available for many insect orders. While hundreds of insect genomes have been sequenced so far, many are either model organisms or agriculturally or medically important species and are not necessarily representative of the phylogenetic diversity of the group(s) in question. Also, most available insect genomes belong to holometabolous insects, and only a small number of hemimetabolous insect orders (except Hemiptera) have been sequenced. Moreover, many polyneopteran insects have larger genome sizes than holometabolous insect orders¹⁶, contributing to the general lack of genome sequencing projects in these groups. Molecular systematists studying Coleoptera, Diptera, Hymenoptera, and Lepidoptera were early adopters of this technique into their toolkits, resulting in ground-breaking phylogenomic studies with extensive taxon sampling^7,9,17–25. Now, pre-designed probes are commercially available for these insect orders. For other insect orders, however, the application of phylogenomics has been slower. As such, there has been a discrepancy in applying phylogenomics across different insect orders.

Orthoptera is one of the insect orders that has lagged in transitioning to phylogenomics. It is the most species-rich order among the polyneopteran insect lineages, with more than 29,400 described species worldwide²⁶, and includes some of the most recognisable and familiar insects, such as grasshoppers, locusts, crickets, and katydids, which are economically and culturally significant and evolutionarily fascinating^10,27,28. Nevertheless, the number of researchers who study the phylogeny of Orthoptera is small compared to those who study other major insect orders, and most of the molecular phylogenetic studies still rely on Sanger sequencing data or mitochondrial genome sequences. Genomic resources for developing phylogenomic tools have been largely lacking because orthopteran genomes are known to be the largest among insects^29,30, and thus, very challenging to sequence and annotate. Although recent efforts to sequence orthopteran genomes have made great strides³¹, initiating a genome sequencing project for any orthopteran species remains challenging. Recently, Song et al.¹⁰ partnered with the 1KITE (1000 Insect Transcriptome Evolution) project to generate nearly 5,000 single-copy orthologs from transcriptomes of 50 orthopteran and ten polyneopteran species to resolve the higher-level relationships within the order, which represents one of the first attempts to apply phylogenomics in Orthoptera. However, there has not been an effort to produce target enrichment tools that can be used broadly for Orthoptera.

In this study, we aimed to develop an Orthoptera-specific Target Enrichment (OR-TE) probe set as a new phylogenomic toolkit. Specifically, we compiled transcriptomes from 80 orthopteran species from across the phylogeny, 30 of which were newly generated, as a new genomic resource to identify phylogenetically informative orthologs. From this initial set of orthologs, we identified both slow-evolving and fast-evolving loci that could resolve relationships at different taxonomic scales to narrow the number of target loci to 1,828, which were used to develop target enrichment probes. We designed and manufactured a custom probe set with 39,809 baits and validated the effectiveness of this probe set by generating target enrichment data from 36 orthopteran species across the phylogeny, which were then used to infer the phylogeny of Orthoptera. We carefully documented the probe design and bioinformatics process so that the orthopteran systematics community can widely adopt and use this newly developed tool.

Designing the OR-TE probe set

The development of genomic resources to support identifying orthologs suitable for target enrichment proceeded by sampling whole RNA from 80 orthopteran species, of which 41 were previously collected within the 1KITE project, nine were previously generated by HS, and the remaining 30 were newly generated for this study. We also included ten polyneopteran outgroups previously generated for the 1KITE project. The 80 orthopteran species sampled for this study included representatives of 24 orthopteran families belonging to both suborders, Caelifera (58 species) and Ensifera (22 species), covering the majority of higher-level taxonomic diversity within the order (Supplementary Data 1). After assembling transcriptomes, we first explored the phylogenetic information content of the orthologs identified from the transcriptomes to identify target genes to be included in the OR-TE probe set. From the initially identified orthologs, we reduced the set to 2,378 genes as initial candidates for the probe design, from which we calculated a mean pairwise distance (PD) for each gene. We assumed that a gene with a low mean PD would be slow-evolving while those with a high mean PD would be fast-evolving. To explore phylogenetic signals in these genes, we created the following four datasets consisting of genes with different PDs: (i) 517 genes with 1–9% mean PD; (ii) 990 genes with 10–19% mean PD; (iii) 609 genes with 20–29% mean PD; and (iv) 262 genes with 30–45% mean PD. After phylogenetic analyses of these datasets, we compared the results across the four resulting trees in terms of topology, nodal support, and branch lengths. We found that all four analyses recovered the monophyly of Orthoptera and each of the two suborders but differed in resulting branch lengths and nodal support values (Fig. 1a–d, Supplementary Figure S1). The analysis based on 517 genes with 1–9% mean PD (Fig. 1a) resulted in longer internodes for ensiferan taxa but very short internodes with low support values for Caelifera, while the remaining three analyses (Fig. 1b–d) resulted in better resolution for Caelifera. Compared to the analysis based on slow-evolving genes (Fig. 1a), the placements of Grylloidea and Hagloidea shifted in the analyses with fast-evolving genes (Fig. 1b–d). The placements of Rhaphidophoroidea and Pamphagidae were congruent with the accepted classification only in the analysis with the fastest-evolving genes (Fig. 1d). We found that the genes with mean PD ranging between 10% and 45% could unambiguously resolve relationships across all major lineages within Orthoptera. Therefore, we selected 1,853 orthologs with 10–45% PD as input data for our probe design to maximise the potential for phylogenetic resolution.
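The binning of genes by mean pairwise distance can be illustrated with a minimal Python sketch. It assumes an uncorrected p-distance computed over sites where neither sequence has a gap or unknown residue, and half-open percentage bins (for example, 10% up to but not including 20%). The function names, the toy alignment, and the gap handling are illustrative assumptions, not the pipeline used in this study.

```python
from itertools import combinations

# Characters treated as missing data (an assumption for this sketch).
MISSING = set("-?X")

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    compared = [(x, y) for x, y in zip(a, b)
                if x not in MISSING and y not in MISSING]
    if not compared:
        return None  # no overlapping sites to compare
    return sum(x != y for x, y in compared) / len(compared)

def mean_pairwise_distance(alignment):
    """Mean p-distance over all pairs of sequences in one gene alignment."""
    dists = [p_distance(a, b) for a, b in combinations(alignment, 2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists)

def pd_bin(mean_pd):
    """Assign a gene to one of the four PD datasets, or None if outside 1-45%."""
    pct = 100 * mean_pd
    if 1 <= pct < 10:
        return "1-9%"
    if 10 <= pct < 20:
        return "10-19%"
    if 20 <= pct < 30:
        return "20-29%"
    if 30 <= pct <= 45:
        return "30-45%"
    return None

# Toy amino acid alignment of three taxa (hypothetical sequences).
toy_alignment = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYVAKHR"]
gene_pd = mean_pairwise_distance(toy_alignment)
print(f"mean PD = {gene_pd:.3f}, dataset = {pd_bin(gene_pd)}")
```

Under these assumptions, a gene would enter the probe design only if its bin is one of the three faster categories (10–45% mean PD).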

We collaborated with Daicel Arbor Biosciences (“Arbor”, Ann Arbor, MI, USA) to design and manufacture the OR-TE probe set. We first generated individual nucleotide alignments of the 1,853 orthologs to identify regions that could be used as baits. We aimed to find baits that could capture only the regions present in each alignment but remain diverse enough to capture all potential species with the region in each alignment. We achieved this with the following clustering logic and two-stage design. For the first stage, we tiled 120-nucleotide baits every 20 nucleotides across each entry for each ortholog, generating hundreds of thousands of starting bait candidates. To reduce this complexity, we employed a ‘greedy’ clustering technique based on USEARCH³² to generate centroids that represent several bait candidates within a given pairwise distance. To increase the chance that centroids were drawn from random selections of individual ortholog entry members, we first shuffled the starting bait sequences. Then we clustered the baits into centroids tolerating up to 85% alignment divergence within a minimum 111 bp alignment overlap. The 25 centroids to which the most bait candidates collapsed were kept for each locus. Each centroid candidate was then searched against the reference transcriptomes using BLAST³³, scored for specificity using Arbor’s proprietary method, and removed from the probe set if strong potential off-target hits were predicted for the centroid. This filtration resulted in 28,563 bait sequences. To further expand the probe set’s capabilities to capture even more divergent sequences, we added back 11,246 bait sequences that diverged 20–30% from the original centroids. This final OR-TE probe set consisted of 39,809 120-nucleotide bait sequences that were designed to capture a total of 1,828 loci. These were then manufactured as part of a myBaits® target capture kit by Arbor.
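The first-stage tiling described above (120-nucleotide baits every 20 nucleotides) can be sketched in a few lines of Python. This illustrates the tiling step only; the clustering, BLAST screening, and specificity scoring are not reproduced. The function name, the removal of alignment gaps before tiling, and the toy sequence are assumptions made for this example.

```python
def tile_baits(seq, bait_len=120, step=20):
    """Return (start, bait) tuples for every full-length window along seq.

    Gaps are stripped first (an assumption), so start coordinates refer to
    the ungapped sequence. Sequences shorter than bait_len yield no baits.
    """
    seq = seq.replace("-", "")
    return [(start, seq[start:start + bait_len])
            for start in range(0, len(seq) - bait_len + 1, step)]

# A hypothetical 300-nt ortholog entry yields windows starting at 0, 20, ..., 180.
entry = "ACGT" * 75
candidates = tile_baits(entry)
print(len(candidates), candidates[0][0], candidates[-1][0])

# Bait counts reported in the text: filtered centroids plus divergent baits.
final_bait_count = 28_563 + 11_246
print(final_bait_count)
```

Because consecutive windows overlap by 100 nucleotides, each position in a long entry is covered by several candidate baits, which is why the candidate pool reaches hundreds of thousands before clustering.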

Capture efficiency of OR-TE probe set

To test the capture efficiency of the OR-TE probe set, we performed target enrichment and sequencing for 36 orthopteran taxa (18 caeliferan and 18 ensiferan species) representing 27 families and 11 superfamilies, selected to span the order. We included representatives from the following 13 families, which had not been included in any previous phylogenomic study: Mogoplistidae, Trigonidiidae, Cooloolidae, Ripipterygidae, Cylindrachetidae, Chorotypidae, Morabidae, Eumastacidae, Pamphagodidae, Lithidiidae, Pyrgacrididae, Tristiridae, and Ommexechidae. These taxa are rare and not frequently collected, and transcriptomes were not available for any of them at the time of probe design. Therefore, our taxon sampling represented a robust design to test the phylogenetic utility of the OR-TE probe set.

We recovered a total of 69.95 gigabytes (GB) of data from Illumina sequencing. The amount of raw data recovered per taxon ranged from 61 megabytes (MB) to 10 GB, with an average of 1.94 GB (Supplementary Data File 2, Fig. 2a). Samples were divided between two pools to test the efficacy of multiplexing hybrid enrichment with different numbers of samples, 12-plex and 24-plex. Both pools resulted in statistically similar amounts of data (two-tailed t-test, p = 0.5924), with the 12-plex producing an average of 1.773 (Std Dev [±] 0.835) GB per sample and the 24-plex producing an average of 2.027 (± 1.979) GB per sample (Supplementary Data File 2, Fig. 2b). The amount of data generated did not significantly differ between Caelifera (1.59 ± 0.987 GB) and Ensifera (2.292 ± 2.132 GB) (two-tailed t-test, p = 0.2192) (Fig. 2c).
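The two-sample comparisons above can be approximated from the reported summary statistics alone. The sketch below assumes Welch's unequal-variance form of the two-tailed t-test, which the text does not specify, and obtains the p-value by numerically integrating the Student t density with Simpson's rule. Function names are illustrative, and results from summary statistics rounded to three decimals will only approximate those computed from the raw per-sample data.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

# 12-plex vs 24-plex pools: mean GB, standard deviation, sample count.
t_pool, df_pool = welch_t(1.773, 0.835, 12, 2.027, 1.979, 24)
print(f"pools: t = {t_pool:.3f}, df = {df_pool:.1f}, p = {two_tailed_p(t_pool, df_pool):.3f}")

# Caelifera vs Ensifera: 18 taxa each.
t_sub, df_sub = welch_t(1.59, 0.987, 18, 2.292, 2.132, 18)
print(f"suborders: t = {t_sub:.3f}, df = {df_sub:.1f}, p = {two_tailed_p(t_sub, df_sub):.3f}")
```

Welch's form is used here because the two pools have very different standard deviations; a pooled-variance test would weight the groups differently and give a different p-value.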

The OR-TE probe set was designed to capture 1,828 orthologs identified from the orthopteran transcriptomes. After filtering, assembling, and conducting an orthology search, the mean number of captured loci was 1,009.44 ± 412.04 per taxon across the 36 taxa sampled. The capture efficiency varied from 13.9% (254 genes from a tetrigid Metrodora reticulata) to 92.7% (1,695 genes from a lentulid Eremidium sambaba). The average number of captured loci was significantly higher for Caelifera (1,183.67 ± 488.10 genes) than for Ensifera (835.22 ± 216.84 genes) (two-tailed t-test, p = 0.0108) (Fig. 2e). When comparing the 12-plex capture reaction and the 24-plex capture reaction, we found that the number of loci captured in the former (1,179.33 ± 363.29 genes) was higher than, but not significantly different from, the latter (924.5 ± 415.51 genes) (two-tailed t-test, p = 0.0705) (Fig. 2d). In terms of correlation between the amount of sequencing data and the capture efficiency, we found a lineage-specific pattern in that they were positively correlated in Caelifera (Pearson’s correlation coefficient = 0.7986, p < 0.0001) but not strongly correlated in Ensifera (Pearson’s correlation coefficient = 0.3698, p = 0.1309).
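Capture efficiency as used here is the number of recovered loci expressed as a percentage of the 1,828 targets. A minimal sketch of that conversion follows; the function name is an assumption, and the inputs are the locus counts reported above.

```python
TARGET_LOCI = 1828  # loci targeted by the OR-TE probe set

def capture_efficiency(n_captured, n_targets=TARGET_LOCI):
    """Percentage of targeted loci recovered for one taxon."""
    return 100 * n_captured / n_targets

# Lowest recovery reported (Metrodora reticulata) and the mean across 36 taxa.
print(f"{capture_efficiency(254):.1f}%")
print(f"{capture_efficiency(1009.44):.1f}%")
```

Expressed this way, the mean recovery corresponds to slightly more than half of the targeted loci per taxon.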

Phylogenetic utility of OR-TE probe set

We created four matrices to test the performance of the OR-TE probe set. The first two matrices (NT and AA) comprised 46 taxa, with ten polyneopteran outgroups and 36 orthopteran ingroup taxa. The sequence data for the outgroups were obtained from transcriptome data, while the sequence data for the ingroup taxa were generated using the probe set. The 46-taxon NT dataset comprised 1,670,196 aligned nucleotide positions, and the 46-taxon AA dataset comprised 556,732 aligned amino acid positions. These matrices were constructed to test how well the target capture data would independently resolve higher-level relationships. The third and fourth matrices (NT and AA) comprised 126 taxa, with the same 46 taxa as the first two matrices, combined with the 80 orthopteran taxa used to develop the OR-TE probe set. The 126-taxon NT dataset comprised 1,897,806 aligned nucleotide positions, and the 126-taxon AA dataset comprised 632,602 aligned amino acid positions. These matrices were constructed to test the effect of combining target capture data with transcriptome data on the resulting topology.

All four phylogenetic analyses recovered Orthoptera and its two suborders, Caelifera and Ensifera, as monophyletic with maximal nodal support (bootstrap value of 100 or posterior probability value of 1.00). For the 46-taxon NT matrix, the maximum likelihood (ML) and Bayesian analyses recovered identical topologies (Supplementary Figure S2). ML and Bayesian inference also recovered identical topologies in analyses of the 46-taxon AA matrix (Supplementary Figure S2). However, some differences existed between the relationships recovered from analyses of the NT and AA datasets. Within Ensifera, Troglophilus neglectus was placed at the base of the infraorder Tettigoniidea in the NT dataset (Fig. 3b). In contrast, it was placed at the base of the suborder Ensifera in the AA dataset (Fig. 3a). The position of Glaphyrosoma beretka also differed between the two datasets.

For the 126-taxon matrices, several topological incongruences existed between the NT and AA datasets (Supplementary Figure S2). In the NT dataset (Fig. 4), the superfamily Rhaphidophoroidea was recovered as the earliest diverging lineage within the infraorder Tettigoniidea, but in the AA dataset, it was recovered as the earliest diverging lineage within Ensifera. The placement of the superfamily Schizodactyloidea also differed between the two datasets. Within Caelifera, the placements of Pyrgacris descampsi (Pyrgacrididae) and Rhicnoderma humilis (Romaleidae) differed between trees generated from analyses of the two datasets. In analyses of both the NT and the AA datasets, the superfamily Acridoidea was recovered as paraphyletic because of the placement of the clade consisting of Pamphagidae and Pamphagodidae.

In terms of the topology, the NT datasets of both the 46-taxon (Fig. 3b) and the 126-taxon matrices (Fig. 4) resulted in higher-level relationships that were largely congruent with the most recent phylogenomic study (Song et al.¹⁰), with some exceptions. The NT datasets recovered two monophyletic ensiferan infraorders, Gryllidea and Tettigoniidea, and two monophyletic caeliferan infraorders, Tridactylidea and Acrididea, and superfamily-level relationships were consistent with the previous findings. Within Caelifera, Ixalidium sp. (Acrididae) and Antillacris inflaticercus (Episactidae) were placed in unexpected positions. The former was expected to be recovered as sister to Ommexecha virens, and the latter within the clade that included other Eumastacoidea. The 46-taxon NT dataset did not include any member of the family Pyrgomorphidae, but the 126-taxon NT dataset included many members of this family whose data came from transcriptomes. Pyrgomorphidae is the sole member of the superfamily Pyrgomorphoidea, which is currently hypothesised to be sister to Acridoidea. Still, in our analysis, this family is nested within Acridoidea, rendering the latter superfamily paraphyletic. Within Ensifera, Anostostomatidae and Stenopelmatidae were paraphyletic. Analyses of the AA datasets of both matrices recovered the infraorder Tettigoniidea as paraphyletic because Rhaphidophoroidea was at the base of the Ensifera, which conflicts with the currently accepted classification adopted by the Orthoptera Species File²⁶.

Our primary goal was to develop Orthoptera-specific target hybrid enrichment probes that could capture hundreds of phylogenetically informative loci from any orthopteran species and resolve relationships across broad phylogenetic scales. We generated such a probe set using the myBaits technology to target 1,828 loci, including both fast-evolving and slow-evolving genes, designed from 80 orthopteran transcriptomes. We named our new probe set the OR-TE (ORthoptera Target Enrichment) probe set. We have shown that our OR-TE probe set can reliably capture an average of 1,009 loci from diverse orthopteran lineages and resolve expected phylogenetic relationships across broad timescales. Particularly, the probe set successfully captured loci from the 13 families that were not included in the probe design, which demonstrates the robustness of our design.

Some unique features of the OR-TE probe set deserve further discussion. While the probe set can capture hundreds of loci across all lineages within Orthoptera, it is more efficient in capturing target loci from Caelifera than Ensifera. The transcriptome data that we used for designing the probe set were biased toward Caelifera (58 species vs. 22 ensiferan species) mainly because our available data included many more grasshoppers belonging to two particular families (28 Acrididae and 15 Pyrgomorphidae). This bias in the initial design stage may have contributed to the differences in capture efficiency between the suborders. Nevertheless, the fact that we captured an average of 835 loci from Ensifera demonstrates the utility of our OR-TE probe set.

Another attractive feature of this probe set is that a single capture reaction can multiplex up to 24 libraries. Although we did find that the 24-plex capture reaction yielded slightly fewer loci than the 12-plex capture reaction (Fig. 2b, d), the difference was not statistically significant. This finding is relevant for reducing the cost of data generation because it demonstrates that one can likely use half the number of capture reactions to generate a comparable amount of data. One caveat is that the quality and quantity of DNA should be sufficiently high (1 µg high molecular weight genomic DNA) to multiplex up to 24 samples per capture reaction reliably. If degraded DNA (e.g., from dried museum specimens) is used, it is recommended to lower the number of samples (e.g., 12 samples per capture reaction) for multiplexing.

To achieve the desired capture efficiency, it is essential to consider the amount of raw sequence data generated per sample. The average amount of data generated per sample was 1.94 GB, and we show a positive correlation between the amount of data and the number of captured loci, especially for Caelifera. Orthopterans are known to have the largest genomes among all insects¹⁶, so it is essential to sequence deeply enough to recover captured loci. This is a unique problem for Orthoptera, as it has been shown that other insect orders with smaller genome sizes may not need nearly 2 GB of data per sample to capture targeted loci. We can reliably generate sufficient data for downstream analyses by sequencing two capture reactions, each pooled with 24 libraries in a single lane of HiSeq4000 (PE150), yielding approximately 90 GB of data.
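The sequencing budget in this paragraph reduces to simple arithmetic, sketched below in Python. The calculation assumes that reads are divided evenly among libraries, which real pooled runs only approximate. The function name is hypothetical, and the lane yield is the approximate figure given in the text.

```python
def data_per_library(lane_yield_gb, capture_reactions, libraries_per_reaction):
    """Expected raw data per library, assuming perfectly even pooling."""
    return lane_yield_gb / (capture_reactions * libraries_per_reaction)

# One HiSeq4000 lane (~90 GB) shared by two 24-plex capture reactions.
expected_gb = data_per_library(90, capture_reactions=2, libraries_per_reaction=24)
print(f"{expected_gb:.3f} GB per library")
```

Under the even-pooling assumption, this layout yields just under 2 GB per library, close to the 1.94 GB average observed in this study.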

Our OR-TE probe set will enable users to resolve higher-level relationships (family, superfamily) and lower-level relationships (genus, species). This study demonstrated the probe set’s ability to resolve higher-level relationships. We found a small number of taxa whose recovered phylogenetic positions were unexpected and that the NT and the AA datasets recovered phylogenetic trees with some topological differences. However, this is not necessarily due to an issue with our probe design. Previously, we successfully used the OR-TE probe set to generate phylogenomic data to resolve species-level relationships in the Jerusalem cricket genus Stenopelmatus (Weissman et al.³⁴) and the lentulid grasshopper genus Eremidium (Song, unpublished). We also confirmed the probe set’s ability to capture targeted loci from dried museum specimens of various cricket species (Song, unpublished). Furthermore, the data generated using the OR-TE probe set could be combined in the future because the bioinformatics pipeline relies on a pre-defined reference gene set. This means we can continue adding taxa to the existing datasets to achieve greater resolution. The OR-TE probe set is thus a highly versatile tool useful at multiple taxonomic levels.

We designed the OR-TE probe set by identifying loci from 80 orthopteran transcriptomes across the phylogeny. Transcriptome data inherently include only the mRNA transcripts after splicing. Genomic DNA, the starting point for target enrichment, consists of both exons, which the baits can hybridise with, and introns, which the baits will not match. Orthoptera are known to have very large genome sizes¹⁶ and long intronic regions between the coding regions³⁵. Because the OR-TE probe set was designed from post-splicing mRNA transcripts, some of these baits might span exons that are separated by long introns in genomic DNA. In such cases, the baits will not fully hybridise with the target loci in genomic DNA and will therefore be less effective at pulling down target molecules. This is a potential limitation of the OR-TE probe set, although our baits are tiled, which should increase the chances of successful capture. Still, given that we could capture an average of 1,009 loci per taxon across the phylogeny, our baits must target many exons uninterrupted by intronic regions.

The number of loci captured per taxon differed widely. Even for the same locus, the lengths of captured regions often differed across taxa. This unequal recovery of loci across samples is a general feature of most hybrid enrichment techniques used for phylogenetically diverse taxa³⁶. These gene and taxon sampling differences would naturally lead to a large amount of missing data in the final concatenated dataset. The negative effect of missing data in phylogenomic analyses is an important issue to consider, which has been investigated for analyses using RADseq data³⁷, UCE data^36,38, and AHE data³⁹. In general, these studies have concluded that uneven missing data can potentially lead to spurious phylogenetic inference, and we agree that the effect of missing data should be explored in depth, especially for large-scale phylogenomic studies.

Interestingly, the unequal number of captured loci seemed to have little impact on phylogeny estimation in our taxon sampling. For instance, we could only capture 254 loci for the pygmy grasshopper Metrodora reticulata, but the placement of this taxon within Tetrigidae was consistent in all our datasets. The total number of recovered nucleotides for this species was 40,905 bp. As a comparison, we recovered 1,695 loci for the grasshopper Eremidium sambaba, which collectively included 414,858 bp and the phylogenetic placement of this species was also very consistent. This observation suggests that the relatively low number of captured loci, which still comprise tens of thousands of nucleotides, likely contained sufficient phylogenetic information to correctly place the species within the phylogeny.

Although our taxon sample was intentionally designed to test the phylogenetic utility of the OR-TE probe set rather than to test previous phylogenetic hypotheses, the resulting topology nonetheless revealed a few novel insights regarding the diversification of Orthoptera.

First, the phylogenetic position of Rhaphidophoridae, the sole member of the superfamily Rhaphidophoroidea (cave crickets and camel crickets), may need a critical re-evaluation. Traditionally, the suborder Ensifera is considered to consist of two infraorders, Gryllidea and Tettigoniidea, based on both morphological and molecular evidence²⁸, and the most recent phylogenomic analysis recovered Rhaphidophoridae as the earliest diverging lineage within Tettigoniidea¹⁰. Morphologically, it is uniquely different from other ensiferans in that it is the only completely apterous family without the ability or the structures to produce sound and hear⁴⁰. In our dataset, this family’s placement within Ensifera changed depending on whether the data were analysed as nucleotides or amino acids. When the NT dataset was used, it was recovered at the base of Tettigoniidea (Fig. 3b), but when the AA dataset was used, it was recovered at the base of Ensifera (Fig. 3a).

Interestingly, when comparing the four transcriptome-based datasets differing in PDs used for exploring phylogenetic signals of different loci (Fig. 1a–d, Supplementary Figure S1), all of which were coded as amino acids, we recovered Rhaphidophoridae at the base of Ensifera in the three datasets with slower-evolving loci (1–9%, 10–19%, 20–29% mean PD), but in the expected position at the base of Tettigoniidea in the one dataset with the fastest-evolving loci (30–45% mean PD). These observations collectively suggest that Rhaphidophoridae as a lineage could have experienced different rates of molecular evolution compared to other ensiferan lineages, which could have affected phylogenetic estimation. However, because these observations were based on a minimal sampling of the family, it is difficult to make a definitive statement about the cause of this discrepancy. How the data will behave when a much more extensive taxon sampling is included in the future remains to be seen, and this matters because the phylogenetic position of Rhaphidophoridae within Ensifera is essential for inferring the evolution of sound production and hearing.

Secondly, the monophyly of the families belonging to the superfamily Stenopelmatoidea needs to be critically tested. Stenopelmatoidea currently includes three families: Stenopelmatidae (Jerusalem crickets), Gryllacrididae (raspy crickets), and Anostostomatidae (king crickets, wetas, and Cooloola monsters). This superfamily includes about 1,200 described species that are morphologically diverse and ecologically interesting⁴¹ but remains poorly studied compared to other ensiferan groups, such as crickets and katydids. Previously, Vandergast et al.⁴¹ conducted a large-scale molecular phylogenetic study of the superfamily based on three loci and found Gryllacrididae to be monophyletic and the remaining families paraphyletic. In our 46-taxon matrices, we included nine taxa belonging to Stenopelmatoidea and found the superfamily to be monophyletic but Anostostomatidae and Stenopelmatidae to be paraphyletic (Fig. 3). Gryllacrididae was recovered as monophyletic as expected, but this clade was recovered as sister to a Central American anostostomatid, Glaphyrosoma beretka, which did not group with other anostostomatids. Anostostomatidae shows a classic Gondwanan distribution, and most species within the family, except the New Zealand endemic wetas, are remarkably similar in terms of morphology^41,42. This morphological convergence could have contributed to the current state of classification. We included just two representatives of Stenopelmatidae, Stenopelmatus piceiventris from Mexico and Sia sp. from South Africa. Still, they did not form a monophyletic group, which was also the pattern found in Vandergast et al.⁴¹. Our taxon sampling is too small to suggest a reclassification of the superfamily, but our results point to the need to evaluate the current classification of Stenopelmatoidea further.

Finally, the superfamily-level relationships within Caelifera recovered using the OR-TE probe set largely agree with the current phylogenetic understanding. We recovered Tridactyloidea as the earliest diverging lineage, followed by Tetrigoidea, congruent with all previously published molecular phylogenies^10,28. The phylogenetic relationships within the superfamily group Acridomorpha were largely consistent with all previous studies. The superfamily Acridoidea was not recovered as monophyletic in several analyses, including those using the most data (Fig. 4), because Pyrgomorphoidea was placed within Acridoidea. There is overwhelming morphological evidence that Acridoidea is monophyletic, especially based on male internal genitalia⁴³, and thus, our results were not congruent with the current classification. In our 126-taxon analysis, all of the phylogenomic data for Pyrgomorphoidea and a large number of the remaining Acridoidea (including all of Acrididae and Romaleidae) came from transcriptomes, which included thousands of loci, much more than the loci generated using the OR-TE probe set. Perhaps because of the large number of shared loci, Pyrgomorphoidea could have grouped strongly within Acridoidea. A separate analysis, including more Pamphagidae, may correct this issue. Our analyses found Eumastacoidea to be paraphyletic because the Dominican Republic episactid Antillacris inflaticercus was nested among other members of Acridoidea. This pattern is difficult to explain because this species undoubtedly belongs to Eumastacoidea based on several morphological traits, so we suspect potential sequencing errors or contamination. The placement of Ixalidium sp., a wingless grasshopper from East Africa currently classified as an acridid, is also unexpected. Still, based on our preliminary examination of male genitalia and other traits of this species, there is a strong possibility that it does not belong to Acrididae but to an undescribed lineage more closely related to Tristiridae or Pyrgacrididae.

Target enrichment was introduced as a revolutionary new technique for molecular phylogenetics a decade ago^12,13, and numerous insect phylogenomic studies have been published using this technique. However, it has remained challenging for researchers familiar with traditional molecular data generation (PCR and Sanger sequencing) to incorporate this new technique into their tool kits. The reasons for this challenge vary but are mainly due to cost and resource availability. Thus, below, we describe our experience developing this tool and the specific costs associated with each step to paint a realistic picture.

The first major cost of developing the OR-TE probe set was the expense of generating transcriptomic data to use as a genomic resource for identifying targeted loci. Of the 80 orthopteran transcriptomes, the 1KITE project had previously generated 41, and nine were otherwise previously published¹⁰ and thus freely available. However, because the available transcriptome data did not cover the phylogenetic diversity within the order, we set out to include additional taxa to achieve diverse taxon sampling, especially by adding different subfamilies of Acrididae and Tettigoniidae, as well as previously unsampled families. Because RNA-grade samples had to be freshly collected from the field and directly preserved in RNAlater, we conducted domestic and international expeditions to collect live specimens. The costs associated with collecting expeditions are often not incorporated into the calculation of data generation, but they are a significant expense that cannot be overlooked. The cost of RNA extraction was estimated to be about $10 (USD) per sample, and at the time of data generation, we decided to outsource library preparation to a sequencing core. Library preparation cost ~$180 per sample, and three lanes of HiSeq4000 cost $7,200. Thus, the upfront cost of building genomic resources by generating 30 new transcriptomes, excluding the costs associated with sample acquisition, was ~$12,900. This was a significant but necessary investment because not many orthopteran transcriptomes were available from public databases at the time of this project. Now, these data are freely available for anyone to use.
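The ~$12,900 figure can be reproduced from the per-item costs quoted above. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and not part of our pipeline, and sample acquisition costs are excluded, as in the text.

```python
# Figures quoted in the paragraph above (USD); names are illustrative only.
N_NEW_TRANSCRIPTOMES = 30
RNA_EXTRACTION_PER_SAMPLE = 10
LIBRARY_PREP_PER_SAMPLE = 180
SEQUENCING_THREE_LANES = 7_200  # three lanes of HiSeq4000 in total


def transcriptome_cost(n_samples, extraction, library_prep, sequencing):
    """Total cost of RNA extraction, library preparation and sequencing."""
    return n_samples * (extraction + library_prep) + sequencing


total = transcriptome_cost(
    N_NEW_TRANSCRIPTOMES,
    RNA_EXTRACTION_PER_SAMPLE,
    LIBRARY_PREP_PER_SAMPLE,
    SEQUENCING_THREE_LANES,
)
print(f"Upfront transcriptome cost: ${total:,}")
```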

The next significant costs were associated with manufacturing the baits and testing the capture efficiency of our probe set. There was no cost for designing the custom baits because we performed the bioinformatics to identify the targeted loci ourselves, and the bait design was performed collaboratively with Arbor. The smallest unit of myBaits custom target capture kit that we could purchase was the 16-capture-reaction kit, which cost $3,240 at the time of purchase. We estimated the cost of our high molecular weight DNA extraction to be about $5 per sample. We outsourced library preparation, target capture reaction, and Illumina sequencing to Arbor, and the total cost for generating data from the 36 taxa was ~$6,985, not including the cost of the target capture kit.

The specific dollar amounts described here may be prohibitive for a typical research lab generating phylogenomic data for just 36 taxa. We intended to pay the upfront costs of developing a new tool so that researchers interested in the phylogenomics of Orthoptera do not have to bear them or duplicate our efforts. Considering the amount of data generated per dollar, the OR-TE probe set is potentially the most cost-effective approach for generating phylogenomic data for Orthoptera. If one outsources all the steps of data generation beyond DNA extraction, we estimate the cost per sample to be around $150 to $200 to generate about 2 GB of data per sample. If one can perform library preparation and target capture in-house, the cost would drop below $100 per sample. Assuming about 1,000 loci captured per sample, this means the cost per gene would be $0.15 to $0.20 if all the steps are outsourced. Target enrichment offers exceptional value because the cost of data generation by PCR and Sanger sequencing would be about $5 to $10 per gene. We anticipate that the cost of sequencing will come down further in the future, making this approach more affordable.
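The per-gene estimate follows from dividing the per-sample cost by the number of loci recovered. The sketch below assumes a round figure of 1,000 loci per sample (the probe set captured 1,009 on average across the tested taxa); the names are illustrative.

```python
def cost_per_locus(cost_per_sample, loci_captured):
    """Cost of one captured locus, in the same currency as cost_per_sample."""
    return cost_per_sample / loci_captured


LOCI_PER_SAMPLE = 1_000  # assumption: rounded from the reported average of 1,009

# Fully outsourced target enrichment: $150 to $200 per sample.
te_low = cost_per_locus(150, LOCI_PER_SAMPLE)
te_high = cost_per_locus(200, LOCI_PER_SAMPLE)

# PCR and Sanger sequencing: about $5 to $10 per gene (from the text).
sanger_low, sanger_high = 5, 10

# Most and least favourable fold differences between the two approaches.
fold_min = sanger_low / te_high
fold_max = sanger_high / te_low

print(f"Target enrichment: ${te_low:.2f} to ${te_high:.2f} per locus")
print(f"Sanger is roughly {fold_min:.0f}x to {fold_max:.0f}x more expensive per gene")
```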

One important resource required for processing and analysing target enrichment data is appropriate computational infrastructure, together with bioinformatics expertise. We were fortunate to have institutional access to high-performance research computing (HPRC) clusters, which allowed us to handle a large amount of molecular data effectively and run pipelines in parallel. The target capture data are raw Illumina reads, which must go through initial quality control, filtering, de novo assembly, and orthology search before phylogenomic analyses can be undertaken. Because of the sheer data size, all of the bioinformatics pipelines need to be run on HPRC clusters, and it is nearly impossible to process the data on a single desktop computer. The computational time necessary for all downstream analyses, including alignment, data partitioning, and phylogenetic analyses, depends on the project’s scope, but these analyses also require significant computational resources. It is difficult to put a monetary value on this computational usage, but our data processing and analyses would not have been possible without access to these resources. Thus, access to HPRC is a potential limiting factor for the wide use of the OR-TE probe set, especially for researchers without such computational resources. However, this can be overcome by collaborating with those who have access or by using freely available computational resources, such as the U.S. National Science Foundation funded ACCESS (https://access-ci.org/about/).

In conclusion, we have developed a new phylogenomic tool using target hybrid enrichment specifically for Orthoptera that can resolve relationships over broad phylogenetic scales, thereby advancing the systematics of this important order of insects. With the capacity to capture an average of about 1,000 loci across the orthopteran taxa tested, our approach is potentially the most cost-effective method for generating phylogenomic data within Orthoptera. While we have delved into its utilities and limitations, the analytical challenges associated with uneven missing data across taxa require more rigorous exploration. We envision widespread adoption of this new tool for the future of orthopteran phylogenetic studies, ushering in a new era of discovery and knowledge in this field.

Transcriptome sequencing and assembly

The 30 newly sequenced samples were originally collected in RNAlater in the field and kept at -20 °C until RNA extraction. We followed the same procedures for RNA extraction and de novo transcriptome assembly described in Song et al.¹⁰. Briefly, RNA was extracted using a Trizol-chloroform extraction, followed by a clean-up with an RNeasy mini kit (Qiagen, Valencia, CA). RNA concentrations were measured with a spectrophotometer (DS-11, DeNovix, Wilmington, DE), and RNA integrity was analysed with a Fragment Analyzer (Agilent Technologies, Ankeny, IA). Library preparation, sequencing, and pre-processing were all performed at Texas A&M AgriLife Research Genomics and Bioinformatics Service. Illumina’s TruSeq Stranded Total RNA Library Prep Kit was used for library preparation, and paired-end sequencing (150 bp) was performed using three lanes on an Illumina HiSeq4000 (San Diego, CA). Raw reads were imported into a personalised Galaxy environment on a supercomputing cluster of the High-Performance Research Computing group of Texas A&M University (Ada, https://hprc.tamu.edu) for trimming and quality check. We transformed reads to Sanger format with FastQ Groomer⁴⁴ and filtered them using Trimmomatic⁴⁵. In Trimmomatic, bases were trimmed at both ends if their quality score was lower than 30, whole reads were trimmed with a sliding window of 3 bases and a minimum average quality score of 30, and finally all reads of less than 30 bp were discarded. Subsequently, FastQ Screen⁴⁶ was used to filter out reads from bacterial and other contaminating sources (UniVec core (June 6, 2015), PhiX (NC_001422.1), Illumina adapters, Gregarina niphandrodes genome (GNI3), Encephalitozoon romaleae genome (ASM28003v2), Escherichia coli genome (K12), Methylobacterium sp., Bosea sp., Bradyrhizobium sp., Klebsiella pneumoniae, Sphingomonas sp., Rhodopseudomonas sp. and Propionibacterium acnes). The filtered reads were used for de novo transcriptome assembly using Trinity v2.2.07⁴⁷. Details on all samples used for this study are provided in Supplementary Table S1 and can be found at the National Center for Biotechnology Information (NCBI) under the respective BioSample numbers.
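To make the three read-trimming criteria described above concrete, the following is a simplified sketch of their logic applied to a single read. It is not Trimmomatic's implementation, and its handling of window boundaries may differ from the real tool; the function names and the order in which the steps are applied are assumptions made for illustration.

```python
def trim_ends(quals, threshold=30):
    """Return (start, end) after dropping bases below `threshold` at both ends."""
    start, end = 0, len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return start, end


def sliding_window_cut(quals, window=3, min_mean=30):
    """Return how many bases to keep: cut where a window's mean drops below `min_mean`."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return i
    return len(quals)


def trim_read(seq, quals, threshold=30, window=3, min_mean=30, min_len=30):
    """Apply end trimming, sliding-window trimming and the minimum-length filter.

    Returns (sequence, qualities), or None if the read is discarded.
    """
    start, end = trim_ends(quals, threshold)
    seq, quals = seq[start:end], quals[start:end]
    keep = sliding_window_cut(quals, window, min_mean)
    seq, quals = seq[:keep], quals[:keep]
    if len(seq) < min_len:
        return None  # shorter than 30 bp after trimming
    return seq, quals


# Hypothetical 40-bp read with one low-quality base at each end.
example = trim_read("A" * 40, [10] + [35] * 38 + [5])
print(len(example[0]))
```

In this example only the two terminal bases fall below the threshold, so the read is kept at 38 bp.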

Selection of target genes

Using the newly assembled transcriptomes as well as the previously generated data¹⁰, we built the reference orthologous gene groups originally identified by Song et al.¹⁰ using the following reference genomes on OrthoDB v7^48–50: Acyrthosiphon pisum (Hemiptera)⁵¹, Nasonia vitripennis (Hymenoptera)⁵², Pediculus humanus (Psocodea)⁵³, Rhodnius prolixus (Hemiptera)⁵⁴, and Zootermopsis nevadensis (Blattodea)⁵⁵. These taxa were chosen based on annotation and sequence quality and because of the lack of availability of reference genomes in Orthoptera or other closely related polyneopteran orders at the beginning of this project. Orthologous gene IDs were clustered for node Insecta on OrthoDB, and single-copy genes across the five reference taxa were listed. We used the Orthograph pipeline described in Song et al.¹⁰ to identify orthologs from the 90 transcriptomes (80 orthopteran and 10 outgroups). In total, 5,414 single-copy protein-coding genes were identified as a reference gene set and used to recover orthologs from the transcriptome data. From the initially identified orthologs, we filtered further to select specific genes that satisfied the following conditions: (1) GC content > 40%, (2) length of nucleotide sequences 500–1,500 bp (166–500 amino acids), (3) taxon coverage > 60%, and (4) proportion of parsimony informative sites > 30%. From the 2,378 genes that met these conditions, we calculated a mean pairwise distance (PD) for each gene using MEGA X⁵⁶ to determine its evolutionary rate. To explore phylogenetic signals in these genes, we created four datasets consisting of genes with different PDs. For each dataset, individual genes were aligned as amino acids using MAFFT⁵⁷, concatenated into a single unpartitioned dataset, and subjected to a maximum likelihood (ML) analysis using LG + G as a model of amino acid substitution in RAxML with 100 standard bootstrap replications. After the phylogenetic analyses, we compared the results across the four resulting trees regarding topology, nodal support, and branch lengths.
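The four gene-selection conditions listed above can be expressed as a simple filter. The sketch below is illustrative only: the `GeneStats` record, its field names, and the example genes are hypothetical, proportions are stored as fractions between 0 and 1, and treating the length bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GeneStats:
    """Per-gene summary statistics (hypothetical record for illustration)."""
    gene_id: str
    gc_content: float       # fraction of G + C
    length_bp: int          # nucleotide sequence length
    taxon_coverage: float   # fraction of taxa in which the gene was recovered
    pi_sites: float         # fraction of parsimony informative sites


def passes_filters(gene):
    """Apply the four selection conditions from the text."""
    return (
        gene.gc_content > 0.40
        and 500 <= gene.length_bp <= 1500
        and gene.taxon_coverage > 0.60
        and gene.pi_sites > 0.30
    )


# Hypothetical genes; only gene_A satisfies all four conditions.
genes = [
    GeneStats("gene_A", 0.45, 900, 0.80, 0.35),
    GeneStats("gene_B", 0.38, 900, 0.80, 0.35),   # GC content too low
    GeneStats("gene_C", 0.45, 1800, 0.80, 0.35),  # too long
    GeneStats("gene_D", 0.45, 900, 0.60, 0.35),   # coverage not above 60%
]
selected = [g for g in genes if passes_filters(g)]
print([g.gene_id for g in selected])
```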

Target enrichment

To extract the high molecular weight genomic DNA required for target enrichment and Illumina sequencing (at least 1 µg of DNA), we used the Gentra Puregene Tissue Kit (Qiagen) following the manufacturer’s guidelines. The quality and concentration of DNA extracts were initially measured using a DeNovix spectrophotometer. The genomic DNA extracts were sent to Arbor to be processed as part of their myReads® targeted sequencing services. They performed initial quality control, library preparation, and target enrichment following their manual for myBaits (https://arborbiosci.com/wp-content/uploads/2018/04/myBaits-Manual-v4.pdf). After the target enrichment, the samples were sequenced on a single lane of Illumina HiSeq2500 using 125 bp paired-end (PE) sequencing by Novogene (Sacramento, CA).

Bioinformatics pipelines

We trimmed and assembled the raw reads using Trimmomatic⁵⁸ and SOAPdenovo2⁵⁹, respectively, after which contamination was checked for the entire assembly by VecScreen (http://www.ncbi.nlm.nih.gov/tools/vecscreen/) and the UniVec database build 7.1 (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) following Misof et al.² and Peters et al.⁶⁰. Cross-contamination was checked for all taxa by BLAST. Orthograph⁶¹ was used for orthology assessment using non-strict reciprocal searches and default parameters, and problematic genes were filtered with ‘outlier’ custom scripts, which removed outlier sequences for whole gene sets instead of individual outlier sequences to prevent possible phylogenetic rogue regions^2,62 after MAFFT (v7.130b) alignment using the L-INS-I option⁵⁷. As described above, all 5,414 reference genes from the five insect genomes used by Song et al.¹⁰ were used to predict orthology. The filtered sequences were masked with Aliscore v1.2^63–66 using the default sliding window size and following the options used by Peters et al.²²; Aliscore identifies putative alignment ambiguities or randomised multiple sequence alignment (MSA) sections in the alignments for each gene. The problematic sequences and positions were removed individually using Alicut (https://github.com/PatrickKueck/AliCUT). After generating filtered amino acid (AA) alignments, the nucleotide (NT) data were filtered with the same procedure as the processed AA dataset. The NT alignment was generated using Pal2Nal v14⁶⁷, corresponding to the AA alignments. These alignments were concatenated into supermatrices for downstream analyses.
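The final concatenation step can be illustrated with a minimal sketch. This is not the script we used; the function name, the input layout, and the use of "-" to pad taxa missing from a gene are assumptions made for illustration. It shows how uneven locus recovery across taxa turns into blocks of missing data in a supermatrix.

```python
def concatenate(alignments, taxa, missing="-"):
    """Concatenate per-gene alignments into a supermatrix.

    alignments: list of (gene_name, {taxon: aligned_sequence}) pairs.
    taxa: ordered list of all taxa to include.
    Returns (supermatrix, partitions); partitions hold 1-based, inclusive
    (gene_name, start, end) coordinates.
    """
    rows = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for gene_name, alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene_name}: sequences are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene receive a block of missing-data symbols.
            rows[taxon].append(alignment.get(taxon, missing * length))
        partitions.append((gene_name, position + 1, position + length))
        position += length
    supermatrix = {taxon: "".join(blocks) for taxon, blocks in rows.items()}
    return supermatrix, partitions


# Hypothetical two-gene example in which taxon_B lacks the second gene.
example_genes = [
    ("gene1", {"taxon_A": "ACGT", "taxon_B": "AC-T"}),
    ("gene2", {"taxon_A": "GG"}),
]
matrix, parts = concatenate(example_genes, ["taxon_A", "taxon_B"])
print(matrix)
print(parts)
```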

Phylogenomic analysis

PartitionFinder 2.1.1⁶⁸ was used to identify the model-based combination of blocks from each of the four matrices. Parameters were: model_selection = AICC; branch lengths = linked; search = rcluster, and default for all others; weight options as rate = 1.0, base = 1.0, model = 0.0, alpha = 1.0 with rcluster-percent = 0.1. The models calculated by PartitionFinder were based on the RAxML option, which only selected the best-fitting models RAxML could use. We inferred the phylogenetic relationships in a maximum likelihood framework by using RAxML. Because RAxML could not handle mixed model partitioned analyses, we used GTR + I + G for NT and LG + G for AA datasets. We ran 1,000 standard bootstraps with ten individual ML searches to find the best tree with RAxML. For the 46-taxa matrices, we also inferred relationships in a Bayesian framework, using the best-fitting model for each partition as suggested by PartitionFinder for both NT and AA datasets. We ran 10,000,000 generations, sampling every 1,000 using four chains with MrBayes. Tracer⁶⁹ was used to check the effective sample size (ESS) for each parameter. We removed 25% as burn-in, which resulted in 7,500 trees to summarise. The resulting trees were examined in FigTree⁷⁰.
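The number of post-burn-in trees follows directly from the chain length, the sampling frequency, and the burn-in fraction. The short sketch below reproduces that count under two simplifying assumptions: it counts samples from a single run and ignores the initial state of the chain. The function name is illustrative.

```python
def retained_samples(generations, sample_every, burnin_fraction):
    """Number of sampled trees kept after discarding the burn-in."""
    n_samples = generations // sample_every
    n_burnin = int(n_samples * burnin_fraction)
    return n_samples - n_burnin


kept = retained_samples(10_000_000, 1_000, 0.25)
print(kept)  # 7500
```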

Acknowledgments

We thank numerous collaborators who provided valuable specimens used in this study: the late Christiane Amédégnato, Corinna Bazelet, Sven Bradler, Maria Marta Cigliano, Antoine Foucart, David Gray, Claudia Hemp, Paul Lenhart, Kelly Miller, Joey Mugleston, Daniel Otte, Nik Tatarnic, Precious Tshililo, and Michael Whiting. We also thank several colleagues who provided logistic support and expertise during our field expeditions to Australia, Costa Rica, Dominican Republic, Mexico, Mozambique, South Africa, and the U.S.: Adrian Armstrong, Greg Cowper, Paolo Fontana, Eugenio Gonzalez, Brigido Hierro, Piotr Naskrecki, Kurt Nguyen, Ricardo Mariño-Pérez, Joey Mugleston, Oscar Salomon Sanabria-Urban, Ryan Selking, Nik Tatarnic, and Derek Woller. We thank Charlie Johnson at Texas A&M AgriLife Research Genomics and Bioinformatics Service for the NGS data generation and data processing. We thank the Texas A&M High-Performance Research Computing facility for enabling data analyses. We also thank Brandon Woo for generously providing macro photographs of various orthopterans used in Figure 4. Fieldwork in the Dominican Republic was conducted under the Ministerio de Medio Ambiente y Recursos Naturales, authorization number 1424, and export permit number 692. Fieldwork in Western Australia was conducted under license number SF007010. Fieldwork in Mozambique was conducted under permit number PNG/DSC/C34/2016. Fieldwork in Costa Rica was conducted under Comision Nacional para la Gestion de la Biodiversidad (CONAGEBIO) permit R-050-2018-OT-CONAGEBIO. Fieldwork in South Africa was conducted under ordinary permit number OP 4344/2018. This work was supported by the National Science Foundation (grant numbers DEB-1064082, IOS-1253493, DEB-1655202, and DEB-1937815 to H.S.) and the United States Department of Agriculture (Hatch Grant TEX0-2-6584 to H.S.).

Author contributions

S.S. and H.S. conceived the study and designed experiments. B.F. generated new transcriptomic data. A.G.V. and D.B.W. provided critical samples and contributed to the design of taxon sampling. S.S., J.E., and H.S. designed the OR-TE probes. D.D.M. provided computational expertise and resources for data analysis. S.S., A.J.B., and H.S. analysed the data. S.S., A.J.B., and H.S. produced the original draft. All authors contributed to the writing and revision of the manuscript.

Data availability statement

NCBI BioProject and BioSample accession numbers are provided in the online supplementary data. All raw and processed data for developing the Orthoptera-specific target enrichment probe set, as well as the phylogenetic datasets, have been deposited in Dryad (https://doi.org/10.5061/dryad.fttdz091b). The OR-TE probe set is assigned the Arbor Design ID D10583GRHP2.

Competing interests

J.E. is an employee of Daicel Arbor Biosciences but declares no competing interests. All other authors declare no competing interests.

References

Johnson, K. P. Putting the genome in insect phylogenomics. Curr. Opin. Insect Sci. 36, 111-117, doi:10.1016/j.cois.2019.08.002 (2019).
Misof, B. et al. Phylogenomics resolves the timing and pattern of insect evolution. Science 346, 763-767 (2014).
Yeates, D. K., Meusemann, K., Trautwein, M., Wiegmann, B. & Zwick, A. Power, resolution and bias: recent advances in insect phylogeny driven by the genomic revolution. Curr. Opin. Insect Sci. 13, 16-23, doi:10.1016/j.cois.2015.10.007 (2016).
Chesters, D. The phylogeny of insects in the data-driven era. Syst. Entomol. 45, 540-551 (2020).
Blaimer, B. B. et al. Key innovations and the diversification of Hymenoptera. Nat. Commun. 14, 1212, doi:10.1038/s41467-023-36868-4 (2023).
Johnson, K. P. et al. Phylogenomics and the evolution of hemipteroid insects. Proc. Natl. Acad. Sci. USA 115, 12775-12780, doi:10.1073/pnas.1815820115 (2018).
Kawahara, A. Y. et al. Phylogenomics reveals the evolutionary timing and pattern of butterflies and moths. Proc. Natl. Acad. Sci. USA 116, 22657-22663 (2019).
Kutty, S. N., Wong, W. H., Meusemann, K., Meier, R. & Cranston, P. S. A phylogenomic analysis of Culicomorpha (Diptera) resolves the relationships among the eight constituent families. Syst. Entomol. 43, 434-446 (2018).
McKenna, D. D. et al. The evolution and genomic basis of beetle diversity. Proc. Natl. Acad. Sci. USA 116, 24729–24737, doi:10.1073/pnas.1909655116 (2019).
Song, H. et al. Phylogenomic analysis sheds light on the evolutionary pathways towards acoustic communication in Orthoptera. Nat. Commun. 11, 4939, doi:10.1038/s41467-020-18739-4 (2020).
Bybee, S. M. et al. Phylogeny and classification of Odonata using targeted genomics. Mol. Phylogenet. Evol. 160, 107115, doi:10.1016/j.ympev.2021.107115 (2021).
Lemmon, E. M. & Lemmon, A. R. High-throughput genomic data in systematics and phylogenetics. Annu. Rev. Ecol., Evol. Syst. 44, 99-121 (2013).
Lemmon, A. R., Emme, S. A. & Lemmon, E. M. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst. Biol. 61, 727-744 (2012).
Young, A. D. & Gillung, J. P. Phylogenomics – principles, opportunities and pitfalls of big-data phylogenetics. Syst. Entomol. 45, 225-247 (2020).
Faircloth, B. C. et al. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol. 61, 717-726, doi:10.1093/sysbio/sys004 (2012).
Hanrahan, S. J. & Johnston, J. S. New genome size estimates of 134 species of arthropods. Chromosome Res. 19, 809-823, doi:10.1007/s10577-011-9231-6 (2011).
Baker, A. J. et al. Inverse dispersal patterns in a group of ant parasitoids (Hymenoptera: Eucharitidae: Oraseminae) and their ant hosts. Syst. Entomol. 45, 1–19, doi:10.1111/syen.12371 (2020).
Breinholt, J. W. et al. Resolving relationships among the megadiverse butterflies and moths with a novel pipeline for anchored phylogenomics. Syst. Biol. 67, 78-93 (2018).
Faircloth, B. C., Branstetter, M. G., White, N. D. & Brady, S. G. Target enrichment of ultraconserved elements from arthropods provides a genomic perspective on relationships among Hymenoptera. Mol. Ecol. Resour. 15, 489-501, doi:10.1111/1755-0998.12328 (2015).
Gillung, J. P. et al. Anchored phylogenomics unravels the evolution of spider flies (Diptera, Acroceridae) and reveals discordance between nucleotides and amino acids. Mol. Phylogenet. Evol. 128, 233-245, doi:10.1016/j.ympev.2018.08.007 (2018).
Haddad, S. et al. Anchored hybrid enrichment provides new insights into the phylogeny and evolution of longhorned beetles (Cerambycidae). Syst. Entomol. 43, 68–89, doi:10.1111/syen.12257 (2018).
Peters, R. S. et al. Evolutionary history of the Hymenoptera. Curr. Biol. 27, 1013-1018 (2017).
Shin, S. et al. Phylogenomic data yield new and robust insights into the phylogeny and evolution of weevils. Mol. Biol. Evol. 35, 823-836 (2018).
Young, A. D. et al. Anchored enrichment dataset for true flies (order Diptera) reveals insights into the phylogeny of flower flies (family Syrphidae). BMC Evol. Biol. 16, 143, doi:10.1186/s12862-016-0714-0 (2016).
Cruaud, A. et al. The Chalcidoidea bush of life: evolutionary history of a massive radiation of minute wasps. Cladistics, doi:10.1111/cla.12561 (2023).
Cigliano, M. M., Braun, H., Eades, D. C. & Otte, D. Orthoptera Species File. Version 5.0/5.0. [1/12/2024]. [http://Orthoptera.SpeciesFile.org]. (2019).
Song, H. in Insect Biodiversity: Science and Society, Volume II, 1st edition (eds R.G. Foottit & P.H. Adler) (John Wiley & Sons Ltd., 2018).
Song, H. et al. 300 million years of diversification: elucidating the patterns of orthopteran evolution based on comprehensive taxon and gene sampling. Cladistics 31, 621–651 (2015).
Hawlitschek, O. et al. New estimates of genome size in Orthoptera and their evolutionary implications. PLoS One 18, e0275551, doi:10.1371/journal.pone.0275551 (2023).
Yuan, H. et al. The evolutionary patterns of genome size in Ensifera (Insecta: Orthoptera). Front. Genet. 12, 693541, doi:10.3389/fgene.2021.693541 (2021).
Nakamura, T., Ylla, G. & Extavour, C. G. Genomics and genome editing techniques of crickets, an emerging model insect for biology and food science. Curr. Opin. Insect Sci. 50, 100881, doi:10.1016/j.cois.2022.100881 (2022).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460-2461, doi:10.1093/bioinformatics/btq461 (2010).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421, doi:10.1186/1471-2105-10-421 (2009).
Weissman, D. B. et al. Generic relationships of New World Jerusalem crickets (Orthoptera: Stenopelmatoidea:Stenopelmatinae), including all known species of Stenopelmatus. Zootaxa 4917, 1-122 (2021).
Wang, X. et al. The locust genome provides insight into swarm formation and long-distance flight. Nat. Commun. 5, 2957 (2014).
Smith, B. T., Mauck, W. M., Benz, B. W. & Andersen, M. J. Uneven missing data skew phylogenomic relationships within the lories and lorikeets. Genome Biol. Evol. 12, 1131-1147, doi:10.1093/gbe/evaa113 (2020).
Huang, H. & Knowles, L. L. Unforeseen consequences of excluding missing data from next-generation sequences: Simulation study of RAD sequences. Syst. Biol. 65, 357-365, doi:10.1093/sysbio/syu046 (2016).
Hosner, P. A., Faircloth, B. C., Glenn, T. C., Braun, E. L. & Kimball, R. T. Avoiding missing data biases in phylogenomic inference: An empirical study in the landfowl (Aves: Galliformes). Mol. Biol. Evol. 33, 1110-1125, doi:10.1093/molbev/msv347 (2016).
Roure, B., Baurain, D. & Philippe, H. Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol. Biol. Evol. 30, 197-214, doi:10.1093/molbev/mss208 (2013).
Strauß, J., Stritih, N. & Lakes-Harlan, R. The subgenual organ complex in the cave cricket Troglophilus neglectus (Orthoptera: Rhaphidophoridae): comparative innervation and sensory evolution. Royal Soc. Open Sci. 1, 140240 (2014).
Vandergast, A. G. et al. Tackling an intractable problem: Can greater taxon sampling help resolve relationships within the Stenopelmatoidea (Orthoptera: Ensifera)? Zootaxa 4291, 1-33 (2017).
Field, L. H. The biology of wetas, king crickets and their allies. (CABI Publishing, 2001).
Song, H. & Mariño-Pérez, R. Re-evaluation of taxonomic utility of male phallic complex in higher-level classification of Acridomorpha (Orthoptera: Caelifera). Insect Syst. Evol. 44, 241-260 (2013).
Blankenberg, D. et al. Manipulation of FASTQ data with Galaxy. Bioinformatics 26, 1783-1785 (2010).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120 (2014).
Wingett, S. W. & Andrews, S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. 7, 1338 (2018).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644-652, doi:10.1038/nbt.1883 (2011).
Kriventseva, E. V., Rahman, N., Espinosa, O. & Zdobnov, E. M. OrthoDB: the hierarchical catalog of eukaryotic orthologs. Nucleic Acids Res. 36, D271-275, doi:10.1093/nar/gkm845 (2008).
Waterhouse, R. M., Tegenfeldt, F., Li, J., Zdobnov, E. M. & Kriventseva, E. V. OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 41, D358-365, doi:10.1093/nar/gks1116 (2013).
Waterhouse, R. M., Zdobnov, E. M., Tegenfeldt, F., Li, J. & Kriventseva, E. V. OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res. 39, D283-288, doi:10.1093/nar/gkq930 (2011).
International Aphid Genomics Consortium. Genome sequence of the pea aphid Acyrthosiphon pisum. PLoS Biol. 8, e1000313 (2010).
Werren, J. H. et al. Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science 327, 343-348 (2010).
Kirkness, E. F. et al. Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. Proc. Natl. Acad. Sci. USA 107, 12168-12173 (2010).
Mesquita, R. D. et al. Genome of Rhodnius prolixus, an insect vector of Chagas disease, reveals unique adaptations to hematophagy and parasite infection. Proc. Natl. Acad. Sci. USA 112, 14936-14941 (2015).
Terrapon, N. et al. Molecular traces of alternative social organization in a termite genome. Nat. Commun. 5, 3636 (2014).
Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol. Biol. Evol. 35, 1547-1549, doi:10.1093/molbev/msy096 (2018).
Katoh, K. & Standley, D. M. MAFFT Multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30, 772-780 (2013).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120, doi:10.1093/bioinformatics/btu170 (2014).
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1, 18, doi:10.1186/2047-217X-1-18 (2012).
Peters, R. S. et al. Transcriptome sequence-based phylogeny of chalcidoid wasps (Hymenoptera: Chalcidoidea) reveals a history of rapid radiations, convergence, and evolutionary success. Mol. Phylogenet. Evol. 120, 286-296, doi:10.1016/j.ympev.2017.12.005 (2018).
Petersen, M. et al. Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinform. 18, 111, doi:10.1186/s12859-017-1529-8 (2017).
Peters, R. S. et al. Evolutionary history of the Hymenoptera. Curr. Biol. 27, 1013-1018, doi:10.1016/j.cub.2017.01.027 (2017).
Meusemann, K. et al. A phylogenomic approach to resolve the arthropod tree of life. Mol. Biol. Evol. 27, 2451-2464, doi:10.1093/molbev/msq130 (2010).
Li, B., Lopes, J. S., Foster, P. G., Embley, T. M. & Cox, C. J. Compositional biases among synonymous substitutions cause conflict between gene and protein trees for plastid origins. Mol. Biol. Evol. 31, 1697-1709, doi:10.1093/molbev/msu105 (2014).
Kuck, P. et al. Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front. Zool. 7, 10, doi:10.1186/1742-9994-7-10 (2010).
Misof, B. & Misof, K. A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: a more objective means of data exclusion. Syst. Biol. 58, 21-34, doi:10.1093/sysbio/syp006 (2009).
Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609-W612, doi:10.1093/nar/gkl315 (2006).
Lanfear, R., Frandsen, P. B., Wright, A. M., Senfeld, T. & Calcott, B. PartitionFinder 2: New methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol. Biol. Evol. 34, 772-773, doi:10.1093/molbev/msw260 (2017).
Tracer: MCMC Trace Analysis Tool Version v1.5.0 (2003-2009).
FigTree: Tree Figure Drawing Tool Version 1.3.1 (2006-2009).

Editorial decision: Revision requested
11 Jun, 2024
Reviews received at journal
10 Jun, 2024
Reviewers agreed at journal
17 May, 2024
Reviews received at journal
22 Mar, 2024
Reviewers agreed at journal
21 Feb, 2024
Reviewers agreed at journal
19 Feb, 2024
Reviewers invited by journal
19 Feb, 2024
Editor assigned by journal
19 Feb, 2024
Editor invited by journal
09 Feb, 2024
Submission checks completed at journal
09 Feb, 2024
First submitted to journal
01 Feb, 2024
