Advances in sequencing technologies have facilitated the incorporation of ’omics data into investigations of both microevolutionary processes and macroevolutionary patterns of diversification in biological systems. For instance, our ability to generate high-throughput data from both model and non-model organisms has dramatically enriched our knowledge of how populations diverge, using genome-wide sequencing (Carneiro et al., 2014; Jansson et al., 2020; Samuk et al., 2020), and how new genes are born and genomes evolve, using RNAseq-based transcriptomics (Au et al., 2013; Klasberg et al., 2016; Mrinalini et al., 2021; Wu et al., 2011). Specifically, several short-read technologies have been developed for RNAseq transcriptomics, and among these Illumina is a market leader due to high base calling accuracy (> 99.9%), high data yield, and the availability of well-established bioinformatics tools and best practices for data analysis (Conesa et al., 2016; Corchete et al., 2020; Hölzer & Marz, 2019). However, Illumina short-reads span ≤ 600 bp in length, which presents significant challenges for resolving full-length, protein-coding transcripts, especially in non-model species that lack genic and genomic reference databases. Because short-reads do not capture complete coding DNA sequences (CDS), they are stitched together in silico, using a reference database or via de novo assembly, to create contiguous sequences (contigs), and full-length, protein-coding transcripts are then derived from these contigs. Well-annotated and high-quality reference databases can be very useful for assembling CDS and quantifying gene expression; however, they are often only available for well-studied organisms such as humans (Garg et al., 2020; Venter et al., 2001), fruit flies (Adams et al., 2000), and mice (Bult et al., 2019). Partially or spuriously assembled CDS and chimeric transcripts are common pitfalls of de novo short-read assembly (Conesa et al., 2016; Freedman et al., 2021).
Moreover, for evolutionarily closely related genes and gene families consisting of multiple isoforms, it is difficult to resolve CDS and quantify gene expression at the isoform level using short-reads (Conesa et al., 2016; Steijger et al., 2013).
Recent advances in Third Generation Sequencing technologies that provide high-throughput, long-read data have allowed end-to-end transcript sequencing, thereby eliminating the need for assembling contigs. Two long-read sequencing technologies dominate the market at present: Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) Single-Molecule Real Time (SMRT) Sequencing. Of the two, ONT is arguably the leader in long-read transcriptomics and can sequence short to ultra-long DNA/RNA molecules, including molecules longer than 10 kbp. Sequencing is done in real time, based on the principle of passing long, unfragmented complementary DNA (cDNA) or RNA molecules through a protein nanopore, recording minute changes in electric current, and translating these changes into sequence data (Jain et al., 2016; Rang et al., 2018). Single-use cartridges with pre-loaded reagents can be easily used with portable, bench-top instruments, which makes ONT a convenient platform for ecological studies even in the field (Chang et al., 2020). Further, ONT offers cDNA sequencing, in both polymerase chain reaction (PCR) based and PCR-free formats, and direct RNA sequencing that bypasses the need for converting RNA into cDNA. Despite all these advantages, a major obstacle to the adoption of ONT long-read transcriptomics is the high error rate in both cDNA and direct RNA sequencing (Workman et al., 2019). One way to mitigate sequencing errors is to adopt a reference-based approach; as mentioned earlier, well-characterized, species-specific reference databases can go a long way in resolving transcript sequences and their expression levels, even to the level of isoforms (Dong et al., 2021; Sessegolo et al., n.d.; Workman et al., 2019).
This approach with ONT has been documented in species with good quality reference databases such as humans (e.g., Soneson et al., 2019; Weirather et al., 2017; Workman et al., 2019), mice (Sessegolo et al., n.d.), cattle (e.g., Halstead et al., 2021), fruit flies (Bayega et al., 2018), viruses (e.g., Boldogkői et al., 2018), and well-studied plants (e.g., Cui et al., 2020; Wang et al., 2021).
However, in non-model species without reference databases, the uptake of ONT long-read transcriptomics has been negligible and there are no studies comparing ONT transcriptomics with other sequencing platforms (e.g., Illumina, PacBio). In fact, a search for ONT transcriptomics of lesser-studied species led us to just one publication, on the Muscovy duck, in which a database from another duck species was used as a reference (Lin et al., 2021) (search conducted on 8th September 2021). The combination of a lack of reference databases and high sequencing error rates makes it especially challenging to undertake de novo ONT transcriptomics (Sahlin et al., 2021; Sahlin & Medvedev, 2019). Further, although over 555 tools are available for long-read analysis (https://long-read-tools.org/, accessed 9th September 2021) (Amarasinghe et al., 2020, 2021), no clear bioinformatics pipelines have been established for de novo ONT transcriptomics, and reference-free ONT transcriptomics is still in its very early stages. Nevertheless, there is promise for wider application of de novo ONT transcriptomics if sequencing data quality can be significantly improved by implementing appropriate post-sequencing error correction procedures, such as de novo clustering, consensus sequence calling, and polishing (Amarasinghe et al., 2020; Rang et al., 2018; Sahlin et al., 2021; Sahlin & Medvedev, 2019; Zhang et al., 2019). Specifically, two alternative approaches to long-read error correction are currently in use: a non-hybrid approach that uses only long-reads, and a hybrid approach that capitalizes on higher quality short-reads (Amarasinghe et al., 2020; Sahlin et al., 2021; Sahlin & Medvedev, 2019; Zhang et al., 2019).
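To illustrate why clustering followed by consensus calling can reduce long-read error, the following is a minimal Python sketch of column-wise majority voting. It is a toy example and not the pipeline used in this study: it assumes that reads from a single cluster have already been aligned to a common length (with '-' marking gaps), and the function name `column_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned reads of equal length.

    Assumption: reads are already clustered and aligned; '-' marks a gap.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this alignment column
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap means the position is dropped
            consensus.append(base)
    return "".join(consensus)


# Hypothetical cluster of five noisy reads from the same transcript
reads = [
    "ATGGCGTACGTTAG",
    "ATGGCGTACCTTAG",  # substitution error
    "ATGG-GTACGTTAG",  # deletion error
    "ATGGCGTACGTTAG",
    "TTGGCGTACGTTAG",  # substitution error
]
print(column_consensus(reads))
```

Because the errors in this example fall at different positions in different reads, each is outvoted at its column. Real tools additionally have to perform the clustering and alignment steps, and errors that recur at the same position across reads would not be removed by a simple vote of this kind.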
In this study, we explore the use of reference-free, de novo ONT long-read transcriptomics for novel gene discovery and gene expression quantification in the accessory glands of a dung fly species, Sepsis punctum (Diptera; Sepsidae). This ecologically relevant insect is often found on decaying organic material such as cattle dung, and is widespread across North America and Europe. Sepsid flies are emerging models in a range of disciplines such as eco-toxicology (Blanckenhorn et al., 2013a; Blanckenhorn et al., 2013b), biogeography (Blanckenhorn et al., 2021; Giesen et al., 2017, 2019), evo-devo (Herath et al., 2015) and sexual selection (Puniamoorthy et al., 2009; Puniamoorthy et al., 2008). In particular, S. punctum populations in North America and Europe differ in mating behaviour as well as in male reproductive investment and female remating frequency, making it an interesting model for sexual selection studies (Puniamoorthy et al., 2012a; Puniamoorthy et al., 2012b; Rohner et al., 2016). However, with the exception of species from the genus Themira, which is distantly related to S. punctum, there is generally a lack of genic, genomic, or transcriptomic data for sepsid species.
Insect reproductive genes and proteins are known to diverge rapidly at the species level and even at the population level (Abry et al., 2017; Bayram et al., 2017, 2019; Goenaga et al., 2015; Mrinalini et al., 2021; Swanson et al., 2001). The male reproductive tissues, i.e., the testes, are responsible for sperm production and are often closely associated with paired accessory glands (AGs) that synthesize seminal fluid proteins (Figure 1). These tissues play a crucial role in post-copulatory sexual selection, and AGs, in particular, can be hotspots for the evolution of completely novel genes (Mrinalini et al., 2021). Rapid evolution of insect genomes, through the birth of novel genes and their subsequent recruitment for high expression in the AGs, underpins starkly divergent reproductive gene repertoires and protein compositions even in closely related species (Mrinalini et al., 2021). Furthermore, proteins synthesized by newly evolved AG genes often do not show any similarity to proteins from other organisms, and their functions are largely unknown (Bayram et al., 2017, 2019; Mrinalini et al., 2021; Parthasarathy et al., 2009). Given the lack of genomic/transcriptomic reference databases, the likely presence of novel, species-specific genes, and the lack of protein homology to any other species, it is challenging to resolve full-length accessory gland genes of insects, especially using a de novo transcriptomics approach with an emerging technology such as ONT.
Given this backdrop, we aim to: (i) perform ONT GridION long-read transcriptomics of S. punctum male accessory glands; (ii) implement a de novo ONT GridION transcriptome pipeline with error correction procedures; (iii) characterize novel accessory gland genes; (iv) quantify gene expression; and (v) evaluate the usefulness of de novo ONT transcriptomics by comparing results with Illumina HiSeq data.