Discovering Novel Genes in Non-Model Fly Accessory Glands Using De Novo Nanopore Transcriptomics

doi:10.21203/rs.3.rs-1080954/v1


Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Oxford Nanopore Technologies (ONT) long-read transcriptomes offer many advantages, including long reads (> 10 kbp), end-to-end transcripts, detection of structural variants, and isoform-level resolution of genes and their expression. However, uptake of ONT transcriptomics is still low, largely due to high error rates (2 to 13%) and reliance on reference databases that are unavailable for many non-model species. Additionally, bioinformatics tools and pipelines for de novo ONT transcriptomics are still in early stages of development.

Results

Here, we use de novo ONT GridION transcriptomics to discover novel genes from the male accessory glands (AG) of a widespread, non-model dung fly, Sepsis punctum. Insect AGs are of particular interest because they are hotspots for rapid evolution of novel reproductive genes, and they synthesize seminal fluid proteins that lack homology to any other known proteins. We implement a completely de novo ONT GridION transcriptome pipeline, incorporating quality-filtering and rigorous error-correction procedures, to characterize this novel gene set and to quantify their expression. Specifically, we compare these ONT genes and their expression against de novo Illumina HiSeq transcriptome data. We find 40 high-quality and high-confidence ONT genes that cross-verify against Illumina genes, twenty-six of which are novel and specific to S. punctum. Read count based expression quantification in ONT samples is highly congruent with Illumina’s Transcripts per Million (TPM), both in overall pattern and within functional categories. Novel genes account for an average of 81% of total gene expression, underscoring their functional importance in S. punctum AGs. Eighty percent of these genes are secretory in nature, responsible for 74% of total gene expression. Notably, median sequence similarities of ONT nucleotide and protein sequences match within-Illumina sequence similarities, indicating that our de novo ONT transcriptome pipeline successfully mitigated sequencing errors.

Conclusions

This is the first study to adapt ONT transcriptomics for completely de novo characterization of novel genes in animals. Our study demonstrates that ONT long-reads, constituting a quarter of the number of bases sequenced at less than a third the cost of Illumina reads, can be a resource-friendly and cost-effective solution for end-to-end sequencing of unknown genes even in the absence of a reference database.

Epigenetics & Genomics

gene expression

GridION

Illumina

novel gene

Oxford Nanopore

reproduction

Sepsis punctum

sexual selection

Advances in sequencing technologies have facilitated the incorporation of ‘omics data into investigations of both microevolutionary processes and macroevolutionary patterns of diversification in biological systems. For instance, our ability to generate high-throughput data from both model and non-model organisms has dramatically enriched our knowledge of how populations diverge, using genome-wide sequencing (Carneiro et al., 2014; Jansson et al., 2020; Samuk et al., 2020), and how new genes are born and genomes evolve, using RNAseq based transcriptomics (Au et al., 2013; Klasberg et al., 2016; Mrinalini et al., 2021; Wu et al., 2011). Specifically, several short-read technologies have been developed for RNAseq transcriptomics, and among these Illumina is a market leader due to high base calling accuracy (> 99.9%), high data yield, and the availability of well-established bioinformatics tools and best practices for data analysis (Conesa et al., 2016; Corchete et al., 2020; Hölzer & Marz, 2019). However, Illumina short-reads span ≤ 600 bp in length, which presents significant challenges for resolving full-length, protein-coding transcripts, especially in non-model species that lack genic and genomic reference databases. Instead of yielding complete coding DNA sequences (CDS) directly, short-reads are stitched together in silico using a reference database or via de novo assembly to create contiguous sequences (contigs), and from these, full-length, protein-coding transcripts can be derived. Well-annotated and high-quality reference databases can be very useful for assembling CDS and quantifying gene expression; however, they are often only available for well-studied organisms such as humans (Garg et al., 2020; Venter et al., 2001), fruit flies (Adams et al., 2000), and mice (Bult et al., 2019). Partially or spuriously assembled CDS and chimeric transcripts are common pitfalls of de novo short-read assembly (Conesa et al., 2016; Freedman et al., 2021). Moreover, in evolutionarily closely-related genes and in gene families consisting of multiple isoforms, it is difficult to resolve CDS and quantify gene expression to the isoform level using short-reads (Conesa et al., 2016; Steijger et al., 2013).

Recent advances in Third Generation Sequencing technologies that provide high-throughput, long-read data have allowed end-to-end transcript sequencing, thereby eliminating the need for assembling contigs. Two long-read sequencing technologies dominate the market at present: Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) Single-Molecule Real Time (SMRT) Sequencing. Of the two, ONT is arguably the leader in long-read transcriptomics and offers reads of DNA/RNA molecules ranging from short to ultra-long (longer than 10 kbp). Sequencing is done in real-time, based on the principle of passing long, unfragmented complementary DNA (cDNA) or RNA molecules through a protein nanopore, recording minute changes to electric current, and translating these changes into sequence data (Jain et al., 2016; Rang et al., 2018). Single-use cartridges with pre-loaded reagents can be easily used with portable, bench-top instruments, which makes ONT a convenient platform for ecological studies even in the field (Chang et al., 2020). Further, ONT offers cDNA sequencing, in both polymerase chain reaction (PCR) based and PCR-free formats, and direct RNA sequencing that bypasses the need for converting RNA into cDNA. Despite all these advantages, a major obstacle to the adoption of ONT long-read transcriptomics is the high error rate in both cDNA and direct RNA sequencing (Workman et al., 2019). One way to mitigate sequencing errors is by adopting a reference-based approach, and as mentioned earlier, well-characterized, species-specific reference databases can go a long way in resolving transcript sequences and their expression levels, even to the level of isoforms (Dong et al., 2021; Sessegolo et al., n.d.; Workman et al., 2019). This approach with ONT has been documented in species with good quality reference databases such as humans (e.g., Soneson et al., 2019; Weirather et al., 2017; Workman et al., 2019), mice (Sessegolo et al., n.d.), cattle (e.g., Halstead et al., 2021), fruit flies (Bayega et al., 2018), viruses (e.g., Boldogkői et al., 2018), and well-studied plants (e.g., Cui et al., 2020; Wang et al., 2021).

However, in non-model species without reference databases, the uptake of ONT long-read transcriptomics has been negligible and there are no studies comparing ONT transcriptomics with other sequencing platforms (e.g., Illumina, PacBio). In fact, a search for ONT transcriptomics of lesser-studied species (conducted on 8th September 2021) led us to just one publication, on the Muscovy duck, where a database from an alternate duck species was used as a reference (Lin et al., 2021). The combination of a lack of reference databases and high sequencing error rates makes it especially challenging to undertake de novo ONT transcriptomics (Sahlin et al., 2021; Sahlin & Medvedev, 2019). Further, although over 555 tools are available for long-read analysis (https://long-read-tools.org/, accessed 9th September 2021) (Amarasinghe et al., 2020, 2021), no clear bioinformatics pipelines have been established for de novo ONT transcriptomics and we are still in very early stages of reference-free ONT transcriptomics. However, there is promise for wider application of de novo ONT transcriptomics if sequencing data quality can be significantly improved by implementing appropriate post-sequencing error correction procedures, such as de novo clustering, consensus sequence calling, and polishing (Amarasinghe et al., 2020; Rang et al., 2018; Sahlin et al., 2021; Sahlin & Medvedev, 2019; Zhang et al., 2019). Specifically, two alternate approaches are currently in use to address long-read error correction: a non-hybrid approach using only long-reads, and a hybrid approach that capitalizes on higher quality short-reads (Amarasinghe et al., 2020; Sahlin et al., 2021; Sahlin & Medvedev, 2019; Zhang et al., 2019).

In this study, we explore the use of reference-free, de novo ONT long-read transcriptomics for novel gene discovery and gene expression quantification in the accessory glands of a dung fly species, Sepsis punctum (Diptera; Sepsidae). This is an ecologically relevant insect, often found on decaying organic material such as cattle dung, and is widespread across North America and Europe. Sepsid flies are emerging models in a range of disciplines such as eco-toxicology (Blanckenhorn et al., 2013a; Blanckenhorn et al., 2013b), biogeography (Blanckenhorn et al., 2021; Giesen et al., 2017, 2019), evo-devo (Herath et al., 2015) and sexual selection (Puniamoorthy et al., 2009; Puniamoorthy et al., 2008). In particular, S. punctum populations in North America and Europe differ in mating behaviour as well as male reproductive investments and female remating frequencies, making it an interesting model for sexual selection studies (Puniamoorthy et al., 2012a; Puniamoorthy et al., 2012b; Rohner et al., 2016). However, with the exception of species from the genus Themira, which is distantly related to S. punctum, there is generally a lack of genic, genomic, or transcriptomic data for sepsid species.

Insect reproductive genes and proteins are known to diverge rapidly at the species level and even at the population level (Abry et al., 2017; Bayram et al., 2017, 2019; Goenaga et al., 2015; Mrinalini et al., 2021; Swanson et al., 2001). The male reproductive tissues, i.e., the testes, are responsible for sperm production and are often closely associated with paired accessory glands (AG) that synthesize seminal fluid proteins (Figure 1). These proteins play a crucial role in post-copulatory sexual selection, and AGs, in particular, can be hotspots for the evolution of completely novel genes (Mrinalini et al., 2021). Rapid evolution of insect genomes, through the birth of novel genes and subsequent recruitment for high expression in the AGs, underpins starkly divergent reproductive gene repertoires and protein compositions even in closely-related species (Mrinalini et al., 2021). Furthermore, proteins synthesized by newly-evolved AG genes often do not show any similarity to proteins from other organisms, and their functions are largely unknown (Bayram et al., 2017, 2019; Mrinalini et al., 2021; Parthasarathy et al., 2009). Given the lack of genomic/transcriptomic reference databases, the likely presence of novel, species-specific genes, and the lack of protein homology to any other species, it is challenging to resolve full-length accessory gland genes of insects, especially using a de novo transcriptomics approach with an emerging technology such as ONT.

Given this backdrop, we aim to: (i) perform ONT GridION long-read transcriptomics of S. punctum male accessory glands; (ii) implement a de novo ONT GridION transcriptome pipeline with error correction procedures; (iii) characterize novel accessory gland genes; (iv) quantify gene expression; and (v) evaluate the usefulness of de novo ONT transcriptomics by comparing results with Illumina HiSeq data.

De novo Transcriptome Statistics

After implementing two separate de novo transcriptome pipelines for ONT GridION and Illumina HiSeq read data (Figure 2), we generated summary statistics for the transcriptomes. Table 1 provides the statistics for experimental details, cDNA library preparation, RNAseq reads, and filtered de novo transcriptomes.

RNAseq Read Statistics: ONT long-read sequencing generated 4.48 M and 5.28 M reads for samples ONT-1 and ONT-2 respectively, whereas Illumina sequencing generated 29.64 M and 32.96 M reads for ILL-1 and ILL-2 respectively. Long-read lengths ranged from 50 bp to a maximum of 11,837 bp for ONT-1 and from 50 bp to 9,370 bp for ONT-2. Average read quality, represented by Phred scores, was 11.7 and 11.6 for ONT-1 and ONT-2 respectively. For Illumina, read quality was much higher, at 35.5 on average across the two samples. Filtered RNAseq read GC content was 39.5% and 39.2% for ONT-1 and ONT-2, whereas GC content for both Illumina samples was 46%.
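To put the Phred scores above in perspective, a Phred score Q corresponds to an estimated per-base error probability of 10^(-Q/10). The short Python sketch below applies this standard conversion to the average read qualities reported here. The function name is illustrative and is not part of the pipeline used in this study.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to an estimated per-base error probability."""
    return 10 ** (-q / 10)


# Average read qualities reported above for the ONT and Illumina samples.
for label, q in [("ONT-1", 11.7), ("ONT-2", 11.6), ("Illumina (mean)", 35.5)]:
    p = phred_to_error_probability(q)
    print(f"{label}: Q{q} -> {p:.4%} expected error per base")
```

On this scale, an average quality near Q11.7 corresponds to roughly 7% expected error per base, whereas Q35.5 corresponds to well under 0.1%.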

After quality filtering, a majority of long-reads were found to be distributed within the 4000 bp range, with an average base quality ≤ 19 (minimum base quality 7) for both ONT-1 and ONT-2 (Figure 3a & b). A histogram of read length distribution shows that the maximum number of long-reads occur in the range of 250 to 450 bp for both samples and read N50 were 448 bp and 407 bp for ONT-1 and ONT-2 respectively (Figure 3c & d; Table 1). Finally, the cost of sequencing was USD 285/sample for ONT GridION and USD 882/sample for Illumina HiSeq.
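For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows one common way to compute it: sort lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The function name and the toy lengths are illustrative assumptions, not values from this study, which reports statistics generated with NanoPlot.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all sequenced bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy read lengths in bp (illustrative only, not data from this study).
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Because N50 is weighted by bases rather than by read count, a read N50 of around 400 to 450 bp is consistent with most reads falling in the 250 to 450 bp range even when a few reads are several kbp long.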

Filtered De Novo Transcriptome Statistics: After de novo gene clustering, gene consensus calling, gene polishing and subsequent filtering (Figure 2a), de novo transcriptomes contained 44,958 and 33,564 transcripts in ONT-1 and ONT-2 respectively (Table 1). For Illumina, de novo transcriptome assembly followed by filtering steps showed 44,523 and 45,681 assembled contigs for ILL-1 and ILL-2 respectively (Table 1). ONT long-read transcript N50 were 363 and 345 bp, with the largest transcripts at 4,422 and 2,964 bp, for ONT-1 and ONT-2 respectively (Table 1). For Illumina short-read transcripts, both values were much higher, with N50 values at 702 and 645 bp and longest transcript lengths at 13,932 and 12,372 bp, for ILL-1 and ILL-2 respectively. Although GC content of filtered RNAseq reads varied between ONT and Illumina, the GC content of the final de novo transcriptomes from both technologies were comparable at c. 54% (Table 1).

Characterisation of Gene Expression and Gene Function

The functional aspects of S. punctum AG genes were characterized using gene expression analysis, BLASTP against nr for functional annotation, and signal peptide analysis of the translated protein sequences (Figure 4). Ranking of accessory gland genes from high expression (1) to low expression (40) showed a high level of congruence in overall gene expression pattern across all four samples despite using two different methods of quantification for the two technologies, i.e., read count for ONT samples and TPM for Illumina samples (Figure 4a). The highest expressed transcript showed read counts of 72,731 and 68,120 for ONT-1 and ONT-2 respectively and TPM of 62,424 and 76,279 for ILL-1 and ILL-2 respectively. In both ONT and Illumina, a steep drop in gene expression occurred; expression levels tapered off from the transcripts ranked 12 and 13 and reached negligible levels by transcript 40.

Functional annotation showed that S. punctum AG genes fell into four main categories: novel genes; protease inhibitors; C-type lectins; and cystatins. Of the 40 genes analysed, 65% (26) had no BLASTP hits in nr, indicating that these genes were novel and evolved specifically in the genome of S. punctum (Figure 4b). The functions of these novel S. punctum genes are unknown due to complete lack of BLASTP homology to any other species. However, these 26 novel genes accounted for a majority of gene expression in the accessory glands, i.e., 87% and 90% of total expression in ONT-1 and ONT-2 respectively, and 78% and 71% of total expression in ILL-1 and ILL-2 respectively (Figure 4c). Of the remaining 14 genes, 10 were protease inhibitors, the second highest expressed gene category, with 10% and 8% in ONT-1 and ONT-2, and 17% and 23% in ILL-1 and ILL-2 respectively (Figure 4b and c). Two genes each of cystatins and C-type lectins were found, and both gene families were the least expressed, with 1–3% of total expression in both ONT and Illumina (Figure 4b and c). Sequence analysis for the presence of secretory signals showed that 80% (32) of the 40 AG genes synthesize proteins that are secretory in nature (Figure 4c). Moreover, secretory genes also accounted for the majority of AG gene expression, with 90% and 83% of total expression in samples ONT-1 and ONT-2 and 87% and 75% in samples ILL-1 and ILL-2 respectively.
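The per-category percentages above are shares of total expression within a sample. As a minimal sketch of that arithmetic, the Python snippet below sums expression values by functional category and converts each sum to a percentage of the sample total. The gene identifiers and expression values are hypothetical placeholders, and the same calculation applies whether the input is ONT read counts or Illumina TPM.

```python
from collections import defaultdict


def category_shares(expression, category):
    """Return each category's percentage share of total expression.

    expression: dict mapping gene id -> expression value (read count or TPM)
    category:   dict mapping gene id -> functional category label
    """
    totals = defaultdict(float)
    for gene, value in expression.items():
        totals[category[gene]] += value
    grand_total = sum(totals.values())
    return {label: 100.0 * value / grand_total for label, value in totals.items()}


# Hypothetical toy sample: four genes, one per functional category.
toy_expression = {"g1": 800, "g2": 150, "g3": 30, "g4": 20}
toy_category = {
    "g1": "novel",
    "g2": "protease inhibitor",
    "g3": "C-type lectin",
    "g4": "cystatin",
}
toy_shares = category_shares(toy_expression, toy_category)
for label, share in toy_shares.items():
    print(f"{label}: {share:.1f}% of total expression")
```

Expressing each category as a within-sample share is what allows read counts and TPM, which are on different scales, to be compared side by side.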

Gene expression quantification of individual genes in all four samples is represented as a heatmap, with genes grouped by functional categories (Figure 5a). Novel genes of unknown function were among the highest expressed genes in S. punctum accessory glands in both ONT and Illumina samples (Figure 5a). Expression levels of several individual genes were largely comparable across samples and across sequencing technologies, with some variation between and even within technologies (Figure 5a).

Evaluation of ONT Transcriptomes

Summary statistics of sequence similarity generated for three test samples, i.e., ONT-1, ONT-2, and ILL-2, by comparing against a representative Illumina control sample, ILL-1, are shown in Table 2. Median nucleotide sequence similarities of 40 genes in test samples ranged from 99.53% to 99.65% and were highly congruent between ONT samples and the single Illumina short-read sample. At the protein level, median similarities of ONT-1 and ONT-2 were 98.99% and 98.93% respectively, whereas for ILL-2, it was slightly higher at 99.31%. Further, 18–19% of protein sequences from test samples were 100% similar to ILL-1 control protein sequences. At the nucleotide level, however, 53% and 50% of gene sequences from ONT-1 and ONT-2 respectively were 100% similar to ILL-1, whereas 65% of ILL-2 sequences were 100% similar to ILL-1. Sequence similarities of individual genes are represented as heatmaps in Figure 5b and c. No particular trend is observed within or among the four functional categories, i.e., novel genes, protease inhibitors, C-type lectins, and cystatins.
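The two kinds of summary reported above, a median similarity and the proportion of sequences that are 100% similar to the control, can be derived from a list of per-gene percent identities. The Python sketch below illustrates this under the assumption that per-gene identities against the control have already been computed by an alignment tool. The function name and the toy values are hypothetical and do not come from Table 2.

```python
from statistics import median


def summarise_identity(identities):
    """Summarise per-gene percent identities against a control sample."""
    n_genes = len(identities)
    n_identical = sum(1 for value in identities if value == 100.0)
    return {
        "median": median(identities),
        "minimum": min(identities),
        "pct_identical": 100.0 * n_identical / n_genes,
    }


# Hypothetical per-gene identities (%) for one test sample versus the control.
toy_identities = [100.0, 100.0, 99.5, 98.0]
toy_summary = summarise_identity(toy_identities)
print(toy_summary)
```

Reporting both statistics is informative because a sample can have a very high median identity while still containing relatively few sequences that match the control exactly.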

To date, this is the first study to adapt ONT long-read technology in a reference-free, completely de novo transcriptomics approach, and it is also the first study to use ONT long-read RNAseq to characterize novel, species-specific genes in any animal species. So far, ONT long-read transcriptomics has been used with one other species that did not have a species-specific reference database available (the Muscovy duck). However, in that study, the authors used a reference database from a different genus, and characterized well-established and conserved genes (Lin et al., 2021). In contrast, our genes of interest are rapidly evolving, often species-specific genes from the AG of male insects, and we successfully discover more than two dozen novel genes that are found only in S. punctum (Figure 4b). Our de novo transcriptome pipeline incorporated read quality filtering and rigorous post-sequencing error correction procedures that successfully mitigated high error rates in ONT long-reads (Figure 2a) (Sahlin et al., 2021; Sahlin & Medvedev, 2019). Gene sequences from our assembly-free and reference-free ONT long-read pipeline were of high quality, with sequence similarity levels comparable to genes derived from Illumina short-read assembly (Table 2; Figure 5b and c).

Although two different methods of gene expression quantification were used for the ONT and Illumina pipelines (Figure 2a), there was a high degree of congruence in gene expression patterns between the two technologies, both in terms of overall expression (Figure 4a) and in terms of expression in each functional category (Figure 4d and e). At the level of individual S. punctum genes, expression patterns were less congruent, which could be attributed to variation between biological replicates (Figure 5a). This study achieved results comparable to Illumina with a quarter the number of sequenced bases and at less than a third the cost of Illumina sequencing (Table 1), demonstrating that ONT long-reads can be a resource-friendly, computationally less-demanding, and highly cost-effective solution for end-to-end transcript sequencing even in the absence of a reference database.

Novel Gene Discovery Using ONT GridION

A total of 26 novel, species-specific genes in S. punctum were discovered using ONT GridION long-reads (Figure 4b). These novel accessory gland genes constituted 65% of the 40-gene, high-quality and high-confidence set that was also cross-verified against evidence from Illumina HiSeq transcriptomes (Figure 4b). Thus far, novel reproductive genes and gene products have been discovered in flies, beetles, and honey bees using microarrays and Illumina based RNA sequencing (Bayram et al., 2017; Mrinalini et al., 2021; Sayadi et al., 2016; Vedelek et al., 2018; Vibranovski et al., 2009), as well as using traditional protein or proteomics analyses (Goenaga et al., 2015; Gorshkov et al., 2015; Parthasarathy et al., 2009; Peferoen & De loof, 1984; Wei et al., 2015; Yamane et al., 2015). Our study suggests that error-prone Third Generation Sequencing, when combined with effective read quality filtering and rigorous error correction methods, can be a reliable new technology for end-to-end novel gene discovery and gene expression quantification without the need for transcriptome assemblies or reference databases. The functions of novel S. punctum AG genes are unclear as we find no homology to any other proteins in nr. However, these genes are among the highest expressed in AG tissues, both as a gene category (81% of total expression) (Figure 4c) and at the level of individual genes (Figure 5a). This suggests that novel genes are likely important for AG function and that they likely play a crucial role in S. punctum reproduction. Novel reproductive genes, particularly AG genes, are known to evolve through rapid genomic evolution and recruitment for high expression in the AG of other dung-dwelling insects like the dung beetle (Mrinalini et al., 2021). It remains to be seen whether novel S. punctum genes evolve through a similar or a different mechanism.

Role of Protease Inhibitors, C-type Lectins, and Cystatins

In addition to novel genes, S. punctum AG transcriptomes revealed ten protease inhibitors and two genes each of C-type lectins and cystatins (Figure 4c). Protease inhibitors were the largest group and the second highest expressed functional category of genes in S. punctum (Figure 4b and 4c). Consisting of a large and diverse group of genes that synthesize many classes of proteins, protease inhibitors have been found in seminal fluids of D. melanogaster (Mueller et al., 2008; Swanson et al., 2001). They play a role in male sperm competitiveness and protect seminal fluids and sperm from proteolysis in the sperm storage organs of female flies post-mating (Mueller et al., 2008; Park and Wolfner, 1995). C-type lectins are immune-related genes that have been found in many insects and could be involved in protecting seminal fluids from microbial infections (Tian et al., 2017). Cystatins are also a large and diverse group of genes that regulate the activity of cysteine and serine proteases. Although little is known about their specific functions in male reproductive tissues, cystatins have been found in seminal fluids and reproductive tissues of flies, flatworms and ticks (Garlovsky et al., 2020; Geadkaew et al., 2014; Sonenshine et al., 2011). They are thought to be involved in spermatogenesis and fertilization (Geadkaew et al., 2014), and have been found to play a critical role in regulating programmed cell death during embryogenesis in plants (P. Zhao et al., 2014).

Secretory Genes

Eighty percent (32) of AG genes analysed in S. punctum were secretory in nature, i.e., they synthesize secretory proteins (Figure 4d). All 10 protease inhibitors and both C-type lectins and cystatins contain secretory signals, whereas among the 26 novel S. punctum genes, 69% (18) were secretory. Secretory genes also accounted for the majority of AG gene expression, with an average of 84% of total expression in the four samples (Figure 4e). These patterns observed in S. punctum were similar to those found in dung beetles, where 73% of AG genes analysed were found to be secretory in nature, and they accounted for over 80% of total gene expression (Mrinalini et al., 2021). This supports the view that the primary function of the male AG in most insects is the synthesis of secretory proteins.

Quality of ONT GridION Transcriptome

Among several sequencing platforms offered by ONT, GridION is a scaled-up version of the earlier MinION. ONT GridION is able to analyse data of up to five MinION flowcells or Flongles and therefore returns higher data yields, but with the added advantage of a compact benchtop device with onboard data processing capacity. Further, multiple Flongles and flowcells can be run independently of each other, giving greater flexibility for managing sequencing runs and data analyses. Currently, the GridION raw base calling error rate has been estimated to be ~ 4%, and it is generally prone to deletion errors. In our analysis of sequence similarity, median sequence similarities of ONT GridION genes and proteins (to their respective orthologs in the designated Illumina HiSeq control sample) are comparable to within-Illumina median sequence similarities (Table 2). At the nucleotide level, 64% of Illumina test genes were 100% similar to Illumina control orthologs, whereas ONT GridION genes show lower proportions of 53% and 50% in ONT-1 and ONT-2 respectively (Table 2). However, both ONT GridION and Illumina HiSeq transcriptomes show equal proportions of proteins (18–19%) that are 100% similar to Illumina control orthologs (Table 2). Moreover, due to the small size of S. punctum and the minute size of our tissue, our samples were collected by pooling tissues dissected from multiple flies collected at different time points. Therefore, the higher proportion of similar protein sequences suggests that natural genetic variation, together with differential representation and incorporation of transcripts during de novo transcriptome construction, could be the likely source of sequence variation rather than sequencing errors.

Our study demonstrates that de novo ONT long-read transcriptomics is a reliable and cost-effective approach for novel gene discovery and gene expression analysis in the absence of reference databases. We find that by implementing rigorous post-sequencing error correction procedures, error-prone ONT long-reads can produce gene sequence and gene expression data that are comparable to Illumina HiSeq. We discovered 26 novel reproductive genes that are recruited for high expression in the accessory glands of male S. punctum. Novel gene discovery has important implications for understanding fundamental evolutionary processes such as phenotypic trait evolution, adaptation, and speciation. In particular, male reproductive genes of insects are known to synthesize seminal fluid proteins that interact with the female reproductive environment and thereby direct post-copulatory sexual selection. Hence, understanding rapid specialization and diversification of male reproductive genes in a species sheds light on mechanisms of divergence of populations and the processes of speciation.

Dissection of Accessory Glands From S. punctum

The sampling and maintenance of S. punctum cultures followed previously published work (Puniamoorthy et al., 2012a; Rohner et al., 2016). For this study, a North American population from Ottawa (45.43 °N, -75.67 °E) was used. Adult flies were housed in plastic containers measuring 11 cm by 9 cm by 9 cm and reared at a temperature of 26 ºC with cattle dung, sugar, and water given ad libitum. Males of mixed ages (four to ten days old) were aspirated from culture containers into plastic vials, cooled at -20 ºC for 10 min, and placed on ice until dissection. Each fly was transferred to a glass slide and the abdomen was dissected in 1X PBS. Paired accessory glands were separated from the testes, collected into a 1.5 ml microcentrifuge tube, and snap frozen on dry ice. For ONT GridION, tissues from 80 flies were pooled to allow for protocol optimisation, and for Illumina HiSeq, tissues from 63 flies were pooled. For each sequencing technology, two biological replicates of pooled tissues were collected and samples were stored at -80 ºC until RNA extraction.

RNA Extraction

Total RNA was extracted using the Aurum Total RNA Mini Kit (BIO-RAD Cat # 732-6820). Samples stored at -80 ºC were centrifuged at 13,000 rpm for 20 minutes at 4 ºC and placed on ice. 700 µl of lysis solution was added to each sample, which was then homogenized using PTFE pestles. The lysate was centrifuged for 3 minutes at 4 ºC and the supernatant was transferred to a new tube. 700 µl of 60% ethanol was added and thoroughly mixed by vortexing for 2-3 minutes to make sure there was no visible bilayer. 700 µl of homogenized lysate was transferred into an RNA binding column inserted into a wash tube and the set up was centrifuged for 1 minute. The filtrate was discarded and the same step was repeated a second time. 700 µl of low stringency wash was added to the column and centrifuged for 1 minute, and the filtrate was discarded. 80 µl of DNase (5 µl of DNase I solution + 75 µl of DNase solution) was added to each column and incubated at room temperature for 25 minutes. The samples were washed two more times, first with 700 µl of high stringency wash solution and second with 700 µl of low stringency wash, with centrifuging for 1 minute and discarding of filtrate after each wash. The samples were spun for 3 minutes to remove residual wash solution and the RNA binding column was transferred to 1.5 ml microcentrifuge tubes. 40 µl of elution solution was added to the membrane of the binding column and, after one minute of membrane saturation, the sample was centrifuged for 2 minutes to elute total RNA.

cDNA library preparation and RNAseq

Oxford Nanopore Technologies (ONT) GridION

ONT offers Direct RNA or Direct cDNA library preparation and sequencing options; however, these technologies require high amounts of starting RNA input. The RNA quantities of our samples were inherently low given the small size of our study species S. punctum (2-7 mm in length) and the even smaller size of its reproductive tissues (Figure 1). Therefore, the PCR-cDNA (PCB109) protocol was used, which allows for lower RNA input. Total RNA samples were submitted to the Genome Institute of Singapore, Singapore, for ONT GridION long-read RNAseq. 100 ng total RNA was used for cDNA synthesis and strand switching of full-length polyadenylated (poly(A)) RNA. cDNA was amplified with 5’ barcoded primers, followed by sequencing adapter annealing. Barcoded libraries were multiplexed by pooling at 100 fmol based on average Agilent DNA 12000 size, and sequencing was performed on one GridION flowcell.

Illumina HiSeq

For Illumina short-read sequencing, total RNA was shipped to the Genomics Research Center at the University of Rochester, New York, for cDNA library preparation and sequencing. Total RNA concentration was determined with a NanoDrop 1000 spectrophotometer (NanoDrop, Wilmington, DE) and RNA quality was assessed with the Agilent Bioanalyzer (Agilent, Santa Clara, CA). The TruSeq RNA Sample Preparation Kit V2 was used for library construction as per the manufacturer’s protocols. Briefly, mRNA was purified from 100 ng total RNA with oligo-dT magnetic beads and then fragmented. First-strand cDNA was synthesized with random hexamer priming, followed by second-strand cDNA synthesis. End repair and 3’ adenylation were performed on the double stranded cDNA. Illumina adaptors were ligated to both ends of the cDNA, and the ligated cDNA was purified by gel electrophoresis and amplified with Polymerase Chain Reaction (PCR) primers specific to the adaptor sequences to generate amplicons of approximately 200 - 500 bp in size. Libraries were amplified at a concentration of 8 pM per lane and Paired End reads of length 125 bp were sequenced on the HiSeq 2500 v4 platform.

De Novo Transcriptome Pipelines and Gene Expression Analysis

Two different bioinformatics pipelines, sharing some common steps, were employed for de novo transcriptome construction and analysis of ONT long-reads and Illumina short-reads (Figure 2a & b). For ONT, de novo gene clustering, consensus sequence calling, and gene polishing were used to derive error-corrected gene sequences (Sahlin et al., 2021; Sahlin & Medvedev, 2019). For Illumina, a de novo transcriptome assembly approach was used to reconstruct contigs from which full-length CDS could be derived. The two pipelines also differed in gene expression quantification: read count was used as a proxy for gene expression for ONT, whereas Transcript per Million (TPM) was used for Illumina (Figure 2).

De Novo ONT GridION Pipeline

A suite of bioinformatics tools, including standalone tools and those developed by ONT, was used for de novo transcriptome analysis of ONT long-reads. Read quality filtering, orientation, and trimming were performed using Pychopper v2 (https://github.com/nanoporetech/pychopper) with default parameters, and read statistics were analysed using NanoPlot 1.32.1 (De Coster et al., 2018). Because our ONT and Illumina read data were derived from separate biological samples, a non-hybrid approach to error correction was employed: each ONT transcriptome was error-corrected using only ONT long-reads from the same biological sample. De novo clustering of ONT long-reads was performed with isONclust2, implemented in pipeline-nanopore-denovo-isoforms (https://github.com/nanoporetech/isONclust2), and one sequence cluster was generated for each gene (Sahlin et al., 2021; Sahlin & Medvedev, 2019). A consensus sequence was called for each sequence cluster, yielding one consensus sequence per gene. The consensus gene sequences were further polished with raw reads using medaka 1.2.5 (https://github.com/nanoporetech/medaka). Open Reading Frames (ORFs) were derived by translating polished sequences in all six frames using getorf provided with EMBOSS:6.6.0 (https://www.bioinformatics.nl/cgi-bin/emboss/getorf). ORF sequences ≥ 200 bp were translated into protein sequences using transeq provided with EMBOSS:6.6.0 (https://www.bioinformatics.nl/cgi-bin/emboss/transeq), and a final dereplication was performed at 100% protein sequence identity using CD-HIT v4.7 (Fu et al., 2012; Li & Godzik, 2006). For gene expression quantification, long-reads were mapped back to the filtered de novo ONT transcriptomes using minimap2, excluding any secondary alignments (Li, 2018). Samtools 1.7 (Li et al., 2009) was used to further filter aligned reads, with any supplementary and secondary alignments discarded. Only reads aligned over ≥ 80% of their length were counted towards gene expression quantification.
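The read-count rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the study filtered alignments with Samtools, whereas this sketch assumes minimap2 output in PAF format, and the function name `count_reads_per_gene` and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter

def count_reads_per_gene(paf_lines, min_fraction=0.8):
    """Count reads per target gene from PAF lines (assumed input format).

    A read is counted once, for the first non-secondary alignment that
    covers at least `min_fraction` of the read length.
    """
    counts = Counter()
    counted_reads = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        read_name = fields[0]
        read_len = int(fields[1])
        q_start, q_end = int(fields[2]), int(fields[3])
        gene = fields[5]
        # minimap2 marks secondary alignments with the optional tag tp:A:S
        if "tp:A:S" in fields[12:]:
            continue
        if read_name in counted_reads or read_len == 0:
            continue
        aligned_fraction = (q_end - q_start) / read_len
        if aligned_fraction >= min_fraction:
            counts[gene] += 1
            counted_reads.add(read_name)
    return counts
```

The resulting per-gene counts correspond to the read-count proxy for expression; the 0.8 threshold mirrors the ≥ 80% criterion stated in the text.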

De Novo Illumina HiSeq Pipeline

Raw reads were processed in Trimmomatic-0.36 (Bolger et al., 2014) for adapter removal and quality trimming. A sliding window quality score cut-off of Q30 was applied, and reads of at least 100 bp in length were retained. For each sample, cleaned reads were de novo assembled into contigs using Trinity v2.8.6 (Grabherr et al., 2011), and the resulting contigs were dereplicated at 100% nucleotide identity using CD-HIT v4.7 (Fu et al., 2012; Li & Godzik, 2006). The remaining contigs were translated in all six frames to predict ORFs using getorf (EMBOSS:6.6.0) (https://www.bioinformatics.nl/cgi-bin/emboss/getorf), and all sequences containing ORFs of ≥ 200 bp were retained. The contigs with ORFs were translated into protein sequences using transeq (EMBOSS:6.6.0) (https://www.bioinformatics.nl/cgi-bin/emboss/transeq), and a final dereplication was performed at 100% protein sequence identity. Reads were mapped back to the filtered transcriptome assembly using the alignment-free method in salmon v1.0.0 (Patro et al., 2017) to generate TPM values that represent gene expression.
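For readers unfamiliar with TPM, the normalization can be sketched in a few lines of Python. This is only an illustration of the standard TPM definition, not of salmon's internal model, which estimates read counts and effective lengths itself; the function name `tpm` and the input dictionaries are hypothetical.

```python
def tpm(read_counts, effective_lengths):
    """Convert per-transcript read counts to Transcripts per Million.

    TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6, where c is the read
    count and l the effective transcript length (both assumed given).
    """
    rates = {t: read_counts[t] / effective_lengths[t] for t in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in rates}
    return {t: rate / total * 1e6 for t, rate in rates.items()}
```

Because each count is divided by transcript length before scaling, two transcripts with equal read counts receive different TPM values if their lengths differ, and the values always sum to one million within a sample.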

Sequence Curation and Gene Orthology

Sepsis punctum lacks species-specific reference databases against which our de novo transcriptome constructions could be compared. While genomic/transcriptomic data are available for Themira sp., it is not an appropriate reference point for curating AG transcripts from S. punctum because it is from a basal, distantly related genus, and insect AG gene and protein compositions vary at species and even population levels (Abry et al., 2017; Bayram et al., 2019; Goenaga et al., 2015; Mrinalini et al., 2021; Swanson et al., 2001). Therefore, building on our de novo transcriptomics approach, a reference-free approach was taken for transcript curation, and an extensive manual curation of S. punctum AG genes was performed.

A gene expression cut-off was applied, and the 100 most highly expressed transcripts were selected from each of the four de novo transcriptomes (ONT-1, ONT-2, ILL-1, ILL-2), since transcriptome-based quantification of gene expression generally shows a steep drop after the first few transcripts. The resulting subset of 400 sequences was further examined to filter out chimeric and contaminant sequences with BLASTX against nr using DIAMOND v0.8 (Buchfink et al., 2014, 2021). Using the cleaned sequence set from each sample, putative gene orthologs were established in the de novo transcriptomes of the remaining three samples by reciprocal BLASTP with an e-value cut-off of 1e-5. These putative orthologs were further curated by manually examining end-to-end alignments. Finally, a set of 40 high-confidence and high-quality accessory gland genes, with orthologs established in all four samples, was derived and used for the downstream analyses.
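A reciprocal BLASTP search of this kind can be sketched as follows. The sketch assumes BLAST tabular output (outfmt 6 column order) and assumes that orthologs are called as reciprocal best hits by bit score, a criterion the text does not specify; the function names are hypothetical, and the manual curation step is not represented.

```python
def best_hits(rows, max_evalue=1e-5):
    """Return {query: best subject} from BLAST tabular rows (outfmt 6).

    Columns assumed: qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit fails the e-value cut-off
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Keep pairs where A's best hit in B has A as its own best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}
```

Applying `reciprocal_best_hits` between one sample and each of the other three would give candidate ortholog pairs, which would still need the end-to-end alignment inspection described above.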

Evaluation of ONT Transcriptome

Because ONT long-reads are prone to high error rates, the usefulness of our de novo ONT GridION transcriptomics pipeline and error correction procedures in mitigating the effects of sequencing errors was evaluated (Figure 2a). Given that our study is reference-free and Illumina short-reads are of high quality, a sequence similarity analysis was performed by comparing ONT gene sequences to Illumina gene sequences. The Illumina sample ILL-1 was designated as the control sample, and the remaining samples (ONT-1, ONT-2, and ILL-2) were the test samples. Comparing ILL-2 to the ILL-1 control allowed for a within-Illumina assessment that can uncover effects of tissue pooling and natural genetic variation in the population. Sequence similarity values were generated by performing BLASTN of nucleotide sequences and BLASTP of translated protein sequences for the 40 genes of each of the three test samples against the 40 genes from the ILL-1 control. Percent sequence similarities were summarized, including median sequence similarity and the percentage of sequences with a 100% match to the ILL-1 control. Similarities of all 40 genes were plotted as a heatmap for visualization at the individual gene level.
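The two summary statistics named above can be computed as in the following minimal Python sketch. It assumes one percent-identity value per gene (for example, the percent identity reported by BLAST for the match against ILL-1); the function name `summarize_identity` is hypothetical.

```python
from statistics import median

def summarize_identity(percent_identities):
    """Summarize per-gene percent identities against a control sample.

    Returns the median identity and the percentage of genes that match
    the control at 100% identity.
    """
    values = list(percent_identities)
    if not values:
        raise ValueError("no identity values supplied")
    n_perfect = sum(1 for value in values if value >= 100.0)
    return {
        "median": median(values),
        "percent_perfect": 100.0 * n_perfect / len(values),
    }
```

Running this once per test sample (ONT-1, ONT-2, ILL-2), separately for nucleotide and protein comparisons, would give the kind of summary reported in Table 2.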

Abbreviations

AG: Accessory Gland

cDNA: Complementary DNA

ONT: Oxford Nanopore Technologies

PacBio: Pacific Biosciences

PCR: Polymerase Chain Reaction

SMRT: Single-Molecule Real Time

TPM: Transcript per Million

Declarations

Ethics Approval and Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Availability of Data and Materials

The datasets generated and analysed during the current study are available in GenBank at https://www.ncbi.nlm.nih.gov/ and can be accessed with Project ID: PRJNA765219.

Competing Interests

The authors declare that they have no competing interests.

Funding

This work was supported by Ministry of Education (Singapore) Tier-1 research grants awarded to N.P. (R154-000-A75-114; R154-000-C16-114).

Author Contributions

M. and N.P. conceptualized and designed the study. Both authors collected the samples, and M. generated transcriptome data and conducted all bioinformatics and data analyses. Both authors were involved in preparation of the manuscript.

Acknowledgements

We thank members of the Reprolab at the National University of Singapore as well as Centre for Reproductive Evolution at Syracuse University for their support.

References

Abry, M. F., Kimenyi, K. M., Masiga, D., & Kulohoma, B. W. (2017). Comparative genomics identifies male accessory gland proteins in five Glossina species. Wellcome Open Res, 2, 73. https://doi.org/10.12688/wellcomeopenres.12445.2
Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., George, R. A., Lewis, S. E., Richards, S., Ashburner, M., Henderson, S. N., Sutton, G. G., Wortman, J. R., Yandell, M. D., Zhang, Q., … Venter, J. C. (2000). The genome sequence of Drosophila melanogaster. Science, 287(5461), 2185–2195. https://doi.org/10.1126/science.287.5461.2185
Amarasinghe, S. L., Ritchie, M. E., & Gouil, Q. (2021). long-read-tools.org: an interactive catalogue of analysis methods for long-read sequencing data. GigaScience, 10(2), 1–7. https://doi.org/10.1093/GIGASCIENCE/GIAB003
Amarasinghe, S. L., Su, S., Dong, X., Zappia, L., Ritchie, M. E., & Gouil, Q. (2020). Opportunities and challenges in long-read sequencing data analysis. Genome Biology, 21(1), 1–16. https://doi.org/10.1186/S13059-020-1935-5
Au, K. F., Sebastiano, V., Afshar, P. T., Durruthy, J. D., Lee, L., Williams, B. A., Bakel, H. van, Schadt, E. E., Reijo-Pera, R. A., Underwood, J. G., & Wong, W. H. (2013). Characterization of the human ESC transcriptome by hybrid sequencing. Proceedings of the National Academy of Sciences, 110(50), E4821–E4830. https://doi.org/10.1073/PNAS.1320101110
Bayega, A., Oikonomopoulos, S., Zorbas, E., Wang, Y. C., Gregoriou, M.-E., Tsoumani, K. T., Mathiopoulos, K. D., & Ragoussis, J. (2018). Transcriptome landscape of the developing olive fruit fly embryo delineated by Oxford Nanopore long-read RNA-Seq. BioRxiv, 478172. https://doi.org/10.1101/478172
Bayram, H., Sayadi, A., Goenaga, J., Immonen, E., & Arnqvist, G. (2017). Novel seminal fluid proteins in the seed beetle Callosobruchus maculatus identified by a proteomic and transcriptomic approach. Insect Mol Biol, 26(1), 58–73. https://doi.org/10.1111/imb.12271
Bayram, H., Sayadi, A., Immonen, E., & Arnqvist, G. (2019). Identification of novel ejaculate proteins in a seed beetle and division of labour across male accessory reproductive glands. Insect Biochem Mol Biol, 104, 50–57. https://doi.org/10.1016/j.ibmb.2018.12.002
Blanckenhorn, W. U., Berger, D., Rohner, P. T., Schäfer, M. A., Akashi, H., & Walters, R. J. (2021). Comprehensive thermal performance curves for yellow dung fly life history traits and the temperature-size-rule. Journal of Thermal Biology, 100, 103069. https://doi.org/10.1016/J.JTHERBIO.2021.103069
Blanckenhorn, W. U., Puniamoorthy, N., Schäfer, M. A., Scheffczyk, A., & Römbke, J. (2013a). Standardized laboratory tests with 21 species of temperate and tropical sepsid flies confirm their suitability as bioassays of pharmaceutical residues (ivermectin) in cattle dung. Ecotoxicology and Environmental Safety, 89, 21–28. https://doi.org/10.1016/J.ECOENV.2012.10.020
Blanckenhorn, W. U., Puniamoorthy, N., Scheffczyk, A., & Römbke, J. (2013b). Evaluation of eco-toxicological effects of the parasiticide moxidectin in comparison to ivermectin in 11 species of dung flies. Ecotoxicology and Environmental Safety, 89, 15–20.
Boldogkői, Z., Moldován, N., Szűcs, A., & Tombácz, D. (2018). Transcriptome-wide analysis of a baculovirus using nanopore sequencing. Scientific Data, 5(1), 1–10. https://doi.org/10.1038/sdata.2018.276
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114–2120. https://doi.org/10.1093/BIOINFORMATICS/BTU170
Buchfink, B., Reuter, K., & Drost, H.-G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), 366–368. https://doi.org/10.1038/s41592-021-01101-x
Buchfink, B., Xie, C., & Huson, D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176
Bult, C. J., Blake, J. A., Smith, C. L., Kadin, J. A., Richardson, J. E., the Mouse Genome Database Group, Anagnostopoulos, A., Asabor, R., Baldarelli, R. M., Beal, J. S., Bello, S. M., Blodgett, O., Butler, N. E., Christie, K. R., Corbani, L. E., Creelman, J., Dolan, M. E., Drabkin, H. J., Giannatto, S. L., … Zhu, Y. (2019). Mouse Genome Database (MGD) 2019. Nucleic Acids Research, 47(D1), D801–D806. https://doi.org/10.1093/NAR/GKY1056
Carneiro, M., Albert, F. W., Afonso, S., Pereira, R. J., Burbano, H., Campos, R., Melo-Ferreira, J., Blanco-Aguiar, J. A., Villafuerte, R., Nachman, M. W., Good, J. M., & Ferrand, N. (2014). The Genomic Architecture of Population Divergence between Subspecies of the European Rabbit. PLOS Genetics, 10(8), e1003519. https://doi.org/10.1371/JOURNAL.PGEN.1003519
Chang, J. J. M., Ip, Y. C. A., Ng, C. S. L., & Huang, D. (2020). Takeaways from mobile DNA barcoding with BentoLab and MinION. Genes, 11(10), 1–18. https://doi.org/10.3390/genes11101121
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X., & Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 1–19. https://doi.org/10.1186/S13059-016-0881-8
Corchete, L. A., Rojas, E. A., Alonso-López, D., De Las Rivas, J., Gutiérrez, N. C., & Burguillo, F. J. (2020). Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Scientific Reports, 10(1), 1–15. https://doi.org/10.1038/s41598-020-76881-x
Cui, J., Shen, N., Lu, Z., Xu, G., Wang, Y., & Jin, B. (2020). Analysis and comprehensive comparison of PacBio and nanopore-based RNA sequencing of the Arabidopsis transcriptome. Plant Methods, 16(1), 1–13. https://doi.org/10.1186/S13007-020-00629-X
De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M., & Van Broeckhoven, C. (2018). NanoPack: visualizing and processing long-read sequencing data. Bioinformatics, 34(15), 2666–2669. https://doi.org/10.1093/BIOINFORMATICS/BTY149
Dong, X., Tian, L., Gouil, Q., Kariyawasam, H., Su, S., De Paoli-Iseppi, R., Prawer, Y. D. J., Clark, M. B., Breslin, K., Iminitoff, M., Blewitt, M. E., Law, C. W., & Ritchie, M. E. (2021). The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools. NAR Genomics and Bioinformatics, 3(2). https://doi.org/10.1093/NARGAB/LQAB028
Freedman, A. H., Clamp, M., & Sackton, T. B. (2021). Error, noise and bias in de novo transcriptome assemblies. Molecular Ecology Resources, 21(1), 18–29. https://doi.org/10.1111/1755-0998.13156
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150–3152. https://doi.org/10.1093/BIOINFORMATICS/BTS565
Garg, S., Fungtammasan, A., Carroll, A., Chou, M., Schmitt, A., Zhou, X., Mac, S., Peluso, P., Hatas, E., Ghurye, J., Maguire, J., Mahmoud, M., Cheng, H., Heller, D., Zook, J. M., Moemke, T., Marschall, T., Sedlazeck, F. J., Aach, J., … Li, H. (2020). Chromosome-scale, haplotype-resolved assembly of human genomes. Nature Biotechnology, 39(3), 309–312. https://doi.org/10.1038/s41587-020-0711-0
Garlovsky, M. D., Evans, C., Rosenow, M. A., Karr, T. L., & Snook, R. R. (2020). Seminal fluid protein divergence among populations exhibiting postmating prezygotic reproductive isolation. Molecular Ecology, 29(22), 4428–4441. https://doi.org/10.1111/MEC.15636
Geadkaew, A., Kosa, N., Siricoon, S., Grams, S. V., & Grams, R. (2014). A 170 kDa multi-domain cystatin of Fasciola gigantica is active in the male reproductive system. Molecular and Biochemical Parasitology, 196(2), 100–107. https://doi.org/10.1016/J.MOLBIOPARA.2014.08.004
Giesen, A., Blanckenhorn, W. U., & Schäfer, M. A. (2017). Behavioural mechanisms of reproductive isolation between two hybridizing dung fly species. Animal Behaviour, 132, 155–166. https://doi.org/10.1016/J.ANBEHAV.2017.08.008
Giesen, A., Schäfer, M. A., & Blanckenhorn, W. U. (2019). Geographic patterns of postzygotic isolation between two closely related widespread dung fly species (Sepsis cynipsea and Sepsis neocynipsea; Diptera: Sepsidae). Journal of Zoological Systematics and Evolutionary Research, 57(1), 80–90. https://doi.org/10.1111/JZS.12239
Goenaga, J., Yamane, T., Rönn, J., & Arnqvist, G. (2015). Within-species divergence in the seminal fluid proteome and its effect on male and female reproduction in a beetle. BMC Evolutionary Biology. https://doi.org/10.1186/s12862-015-0547-2
Gorshkov, V., Blenau, W., Koeniger, G., Rompp, A., Vilcinskas, A., & Spengler, B. (2015). Protein and peptide composition of male accessory glands of Apis mellifera drones investigated by mass spectrometry. PLoS One, 10(5), e0125068. https://doi.org/10.1371/journal.pone.0125068
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K., … Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644–652. https://doi.org/10.1038/nbt.1883
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. https://doi.org/10.1093/BIOINFORMATICS/BTY191
Halstead, M. M., Islas-Trejo, A., Goszczynski, D. E., Medrano, J. F., Zhou, H., & Ross, P. J. (2021). Large-Scale Multiplexing Permits Full-Length Transcriptome Annotation of 32 Bovine Tissues From a Single Nanopore Flow Cell. Frontiers in Genetics, 12, 664260. https://doi.org/10.3389/FGENE.2021.664260
Herath, B., Dochtermann, N. A., Johnson, J. I., Leonard, Z., & Bowsher, J. H. (2015). Selection on bristle length has the ability to drive the evolution of male abdominal appendages in the sepsid fly Themira biloba. Journal of Evolutionary Biology, 28, 2308–2317. https://doi.org/10.1111/jeb.12755
Hölzer, M., & Marz, M. (2019). De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. GigaScience, 8(5), 1–16. https://doi.org/10.1093/GIGASCIENCE/GIZ039
Jain, M., Olsen, H. E., Paten, B., & Akeson, M. (2016). The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biology, 17(1), 1–11. https://doi.org/10.1186/S13059-016-1103-0
Jansson, E., Besnier, F., Malde, K., André, C., Dahle, G., & Glover, K. A. (2020). Genome wide analysis reveals genetic divergence between Goldsinny wrasse populations. BMC Genetics, 21(1), 1–15. https://doi.org/10.1186/S12863-020-00921-8
Klasberg, S., Bitard-Feildel, T., & Mallet, L. (2016). Computational Identification of Novel Genes: Current and Future Perspectives. Bioinformatics and Biology Insights, 10, 121. https://doi.org/10.4137/BBI.S39950
Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/BIOINFORMATICS/BTP352
Lin, J., Guan, L., Ge, L., Liu, G., Bai, Y., & Liu, X. (2021). Nanopore-based full-length transcriptome sequencing of Muscovy duck (Cairina moschata) ovary. Poultry Science, 100(8), 101246. https://doi.org/10.1016/J.PSJ.2021.101246
Mrinalini, Koh, C. Y., & Puniamoorthy, N. (2021). Rapid Genomic Evolution Drives the Diversification of Male Reproductive Genes in Dung Beetles. Genome Biology and Evolution, 13(8). https://doi.org/10.1093/gbe/evab172
Mueller, J. L., Linklater, J. R., Ram, K. R., Chapman, T., & Wolfner, M. F. (2008). Targeted Gene Deletion and Phenotypic Analysis of the Drosophila melanogaster Seminal Fluid Protease Inhibitor Acp62F. Genetics, 178(3), 1605. https://doi.org/10.1534/GENETICS.107.083766
Park, M., & Wolfner, M. F. (1995). Male and female cooperate in the prohormone-like processing of a Drosophila melanogaster seminal fluid protein.
Parthasarathy, R., Tan, A., Sun, Z., Chen, Z., Rankin, M., & Palli, S. R. (2009). Juvenile hormone regulation of male accessory gland activity in the red flour beetle, Tribolium castaneum. Mech Dev, 126(7), 563–579. https://doi.org/10.1016/j.mod.2009.03.005
Peferoen, M., & De loof, A. (1984). Intraglandular and extraglandular synthesis of proteins secreted by the accessory reproductive glands of the Colorado potato beetle, Leptinotarsa decemlineata. Insect Biochem, 14(4), 407–416.
Puniamoorthy, N., Blanckenhorn, W. U., & Schäfer, M. A. (2012a). Differential investment in pre- vs. post-copulatory sexual selection reinforces a cross-continental reversal of sexual size dimorphism in Sepsis punctum (Diptera: Sepsidae). Journal of Evolutionary Biology. https://doi.org/10.1111/j.1420-9101.2012.02605.x
Puniamoorthy, N., Ismail, M. R. B., Tan, D. S. H., & Meier, R. (2009). From kissing to belly stridulation: Comparative analysis reveals surprising diversity, rapid evolution, and much homoplasy in the mating behaviour of 27 species of sepsid flies (Diptera: Sepsidae). Journal of Evolutionary Biology. https://doi.org/10.1111/j.1420-9101.2009.01826.x
Puniamoorthy, N., Schäfer, M. A., & Blanckenhorn, W. U. (2012b). Sexual selection accounts for the geographic reversal of sexual size dimorphism in the dung fly, Sepsis punctum (Diptera: Sepsidae). Evolution. https://doi.org/10.1111/j.1558-5646.2012.01599.x
Puniamoorthy, N., Su, K. F. Y., & Meier, R. (2008). Bending for love: Losses and gains of sexual dimorphisms are strictly correlated with changes in the mounting position of sepsid flies (Sepsidae: Diptera). BMC Evolutionary Biology, 8(1), 1–11. https://doi.org/10.1186/1471-2148-8-155
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/NMETH.4197
Rang, F. J., Kloosterman, W. P., & de Ridder, J. (2018). From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biology, 19(1), 1–11. https://doi.org/10.1186/S13059-018-1462-9
Rohner, P. T., Blanckenhorn, W. U., & Puniamoorthy, N. (2016). Sexual selection on male size drives the evolution of male-biased sexual size dimorphism via the prolongation of male development. Evolution; International Journal of Organic Evolution. https://doi.org/10.1111/evo.12944
Sahlin, K., & Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11467 LNBI, 227–242. https://doi.org/10.1007/978-3-030-17083-7_14
Sahlin, K., Sipos, B., James, P. L., & Medvedev, P. (2021). Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nature Communications, 12(1). https://doi.org/10.1038/S41467-020-20340-8
Samuk, K., Manzano-Winkler, B., Ritz, K. R., & Noor, M. A. F. (2020). Natural Selection Shapes Variation in Genome-wide Recombination Rate in Drosophila pseudoobscura. Current Biology, 30(8), 1517-1528.e6. https://doi.org/10.1016/J.CUB.2020.03.053
Sayadi, A., Immonen, E., Bayram, H., & Arnqvist, G. (2016). The de novo transcriptome and its functional annotation in the seed beetle Callosobruchus maculatus. PLoS One, 11(7), e0158565. https://doi.org/10.1371/journal.pone.0158565
Sessegolo, C., Cruaud, C., Da Silva, C., Cologne, A., Dubarry, M., Derrien, T., Lacroix, V., & Aury, J.-M. (2019). Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Scientific Reports, 9, 14908. https://doi.org/10.1038/s41598-019-51470-9
Sonenshine, D. E., Bissinger, B. W., Egekwu, N., Donohue, K. V., Khalil, S. M., & Roe, R. M. (2011). First Transcriptome of the Testis-Vas Deferens-Male Accessory Gland and Proteome of the Spermatophore from Dermacentor variabilis (Acari: Ixodidae). PLOS ONE, 6(9), e24711. https://doi.org/10.1371/JOURNAL.PONE.0024711
Soneson, C., Yao, Y., Bratus-Neuenschwander, A., Patrignani, A., Robinson, M. D., & Hussain, S. (2019). A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nature Communications, 10(1), 1–14. https://doi.org/10.1038/s41467-019-11272-z
Steijger, T., Abril, J. F., Engström, P. G., Kokocinski, F., Hubbard, T. J., Guigó, R., Harrow, J., & Bertone, P. (2013). Assessment of transcript reconstruction methods for RNA-seq. Nature Methods, 10(12), 1177–1184. https://doi.org/10.1038/nmeth.2714
Swanson, W. J., Clark, A. G., Waldrip-Dail, H. M., Wolfner, M. F., & Aquadro, C. F. (2001). Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila. Proc Natl Acad Sci U S A, 98(13), 7375–7379. https://doi.org/10.1073/pnas.131568198
Vedelek, V., Bodai, L., Grezal, G., Kovacs, B., Boros, I. M., Laurinyecz, B., & Sinka, R. (2018). Analysis of Drosophila melanogaster testis transcriptome. BMC Genomics, 19(1), 697. https://doi.org/10.1186/s12864-018-5085-z
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., … Zhu, X. (2001). The sequence of the human genome. Science, 291(5507), 1304–1351. https://doi.org/10.1126/science.1058040
Vibranovski, M. D., Lopes, H. F., Karr, T. L., & Long, M. (2009). Stage-specific expression profiling of Drosophila spermatogenesis suggests that meiotic sex chromosome inactivation drives genomic relocation of testis-expressed genes. PLoS Genet, 5(11), e1000731. https://doi.org/10.1371/journal.pgen.1000731
Wang, F., Chen, Z., Pei, H., Guo, Z., Wen, D., Liu, R., & Song, B. (2021). Transcriptome profiling analysis of tea plant (Camellia sinensis) using Oxford Nanopore long-read RNA-Seq technology. Gene, 769, 145247. https://doi.org/10.1016/J.GENE.2020.145247
Wei, D., Li, H. M., Tian, C. B., Smagghe, G., Jia, F. X., Jiang, H. B., Dou, W., & Wang, J. J. (2015). Proteome analysis of male accessory gland secretions in oriental fruit flies reveals juvenile hormone-binding protein, suggesting impact on female reproduction. Scientific Reports. https://doi.org/10.1038/srep16845
Weirather, J. L., Cesare, M. de, Wang, Y., Piazza, P., Sebastiano, V., Wang, X.-J., Buck, D., & Au, K. F. (2017). Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research, 6. https://doi.org/10.12688/F1000RESEARCH.10571.2
Workman, R. E., Tang, A. D., Tang, P. S., Jain, M., Tyson, J. R., Razaghi, R., & Zuzarte, P. C. (2019). Nanopore native RNA sequencing of a human poly(A) transcriptome. Nature Methods, 16(12), 1297–1306. https://doi.org/10.1038/S41592-019-0617-2
Wu, D.-D., Irwin, D. M., & Zhang, Y.-P. (2011). De Novo Origin of Human Protein-Coding Genes. PLOS Genetics, 7(11), e1002379. https://doi.org/10.1371/JOURNAL.PGEN.1002379
Yamane, T., Goenaga, J., Ronn, J. L., & Arnqvist, G. (2015). Male seminal fluid substances affect sperm competition success and female reproductive behavior in a seed beetle. PLoS One, 10(4), e0123770. https://doi.org/10.1371/journal.pone.0123770
Zhang, H., Jain, C., & Aluru, S. (2019). A comprehensive evaluation of long read error correction methods. BioRxiv, 519330. https://doi.org/10.1101/519330
Zhao, L., Zhang, H., Kohnen, M. V., Prasad, K. V. S. K., Gu, L., & Reddy, A. S. N. (2019). Analysis of Transcriptome and Epitranscriptome in Plants Using PacBio Iso-Seq and Nanopore-Based Direct RNA Sequencing. Frontiers in Genetics, 10, 253. https://doi.org/10.3389/FGENE.2019.00253
Zhao, P., Zhou, X., Zou, J., Wang, W., Wang, L., Peng, X., & Sun, M. (2014). Comprehensive analysis of cystatin family genes suggests their putative functions in sexual reproduction, embryogenesis, and seed formation. Journal of Experimental Botany, 65(17), 5093–5107. https://doi.org/10.1093/JXB/ERU274

Table 1. Statistics from ONT GridION long-read and Illumina HiSeq short-read cDNA library preparation, RNAseq reads filtering, and final de novo transcriptomes.

Table 2. Summary statistics of sequence similarity at nucleotide and protein level. Sequence similarities were generated by comparing test samples ONT-1 and ONT-2, as well as the second Illumina sample ILL-2 against a representative Illumina control sample, ILL-1.
