PacBio sequencing and error correction
The full-length transcriptome of mature and immature G. luofuense seeds comprised a total of 12,869,707 subreads (19.81 Gb) with an average length of 1,540 bp (Table 1, Fig. 1B). After self-correction with an accuracy value of ROIs > 0.8, 384,042 circular consensus sequences (CCSs) with an average length of 1,919 bp were generated, of which full-length, non-chimeric (FLNC) reads accounted for 81% (312,444, Fig. 1C). The FLNC reads were clustered using the ICE algorithm, and non- FLNC reads were polished. The FLNC reads and polished non- FLNC reads were merged, yielding 165,883 polished consensus isoforms ranging from 167 to 13,816 bp in length (Fig. 1D). The 165,883 polished consensus reads were further corrected using Illumina sequencing data with LoRDEC software. The mean length and N50 and N95 values changed slightly after correction (Table 2).
Table 1
Detail information in the processing of PacBio sequencing data
Terms | Numbers or ratio |
Subreads base (G) | 19.81 |
Number of subread | 12,869,707 |
Average length of subread | 1,540 |
N50 of subread | 2,013 |
Number of CCS | 384,042 |
Number of sequences with 5 'terminal primer | 362,429 |
Number of sequences with 3 'terminal primer | 362,580 |
Number of sequences with poly(A) tail | 335,541 |
Number of full length sequence | 317,094 |
Number of full-length non-chimeric reads (Flnc) | 312,444 |
Average length of Flnc read | 1,919 |
Percentage of Flnc (%) | 0.81 |
Number of polished consensus read | 165,883 |
Minimum length of consensus read | 167 |
Maximum length of consensus read | 13,816 |
Average length of consensus read | 1,847 |
N50 of consensus read | 2,245 |
Table 2
Detail information in PacBio sequencing data corrected by Illumina sequencing data
Type | Before correction | After correction |
Total nucleotide | 306,359,206 | 307,446,027 |
Total sequence | 165,883 | 165,883 |
Mean length | 1,847 | 1,854 |
Minimum length | 167 | 155 |
Maximum length | 13,816 | 14,509 |
N50 | 2,245 | 2,254 |
N90 | 1,175 | 1,179 |
Genome mapping and novel gene detection
The corrected polished consensus reads were mapped to the G. luofuense reference genome using GMAP. 162,887 (98.19%) reads were mapped to the reference (Fig. 2A); of these, 63,049 uniquely mapped reads (38.01% of total mapped reads) were mapped to the positive strand of the reference genome, 60,292 uniquely mapped (36.35%) reads were mapped to the negative strand, 39,546 (23.84%) were multiply mapped reads, and 2996 (1.81%) reads were unmapped. Over 98% of the mapped reads showed similarity to the reference genome, and coverage values of the mapped reads were all above 80% (Fig. 2B). A saturation curve revealed that there were sufficient mapped reads to identify most genes on the G. luofuense reference genome (Fig. 2C). The PacBio reads had higher percentages of exon numbers considering exon number > 5 (Fig. 2D). After deleting the unmapped and redundant reads, 41,151 reads remained, of which 7,899 were novel isoforms of known genes and 5,726 reads were from novel genes.
Annotation and classification of novel genes
The 5,726 novel genes were annotated by searching against six databases—NCBI NR, KEGG, GO, SwissProt, KOG, and Pfam. A total of 4,099 novel genes were annotated, of which 2,588 were annotated in the NR database (Table 3). Five species—Picea sitchensis (649 genes), Amborella trichopoda (116), Vitis vinifera (88), Elaeis guineensis (80), and Nelumbo nucifera (61)—produced the largest numbers of hits to the G. luofuense novel genes (Fig. 3A). 2,487 novel genes were annotated with KEGG pathways (Table 3), and the most enriched pathways were “signal transduction” (123 genes), “carbohydrate metabolism” (83 genes), and “translation” (69 genes, Fig. 3B). GO analysis classified 2,069 genes into three categories: “biological process”, “cellular components” and “molecular functions” (Fig. 3C). Novel genes classified in the biological process category were mainly annotated with the terms “metabolic process” (1,052), “cellular process” (1,037), and “single-organism process” (581). Novel genes classified in the cellular component category were mainly annotated with the terms “cell” (519), “cell part” (519), and “membrane” (367). Novel genes classified in the molecular function category were mainly annotated with the terms “binding” (1,192), “catalytic activity” (942), and “transporter activity” (132). 1,930 genes, 1,315 genes and 2,069 genes were annotated with the Swiss Prot, KOG and Pfam databases, respectively (Table 3).
Table 3
Summary of annotated numbers of novel genes by the six databases
Databases | Novel gene numbers |
NR | 2,588 |
SwissProt | 1,930 |
KEGG | 2,487 |
KOG | 1,315 |
GO | 2,069 |
Pfam | 2,069 |
All in databases | 297 |
One in databases | 2,942 |
Total annotated genes | 4,099 |
AS, APA analysis and fusion genes
After mapping reads to the reference genome of G. luofuense, a total of 9,061 AS events were detected (Fig. 4A). These could be classified into seven types: retained intron (2,713, 29.94%), alternative 3′ splice site (2,468, 27.24%), alternative 5′ splice site (1,769, 19.52%), skipped exon (1305, 14.40%), alternative first exon (542, 5.98%), alternative last exon (217, 2.39%), and mutually exclusive exon (47, 0.52%) (Fig. 4B). The gene TnS000292955g01 had the largest number of alternative splicing events: 27. A total of 8,512 genes from G. luofuense seeds had at least one supported poly(A) site. Of these, 3,654 (42.93%) had a single poly(A) site, and 640 (7.52%) had at least five poly(A) sites (Fig. 4A and 4C). The largest number of poly(A) sites—21—was found in the gene TnS000670009g01. In addition, 2,008 fusion genes were identified in the full-length transcriptome, of which 359 (17.9%) were mapped to scaffold149603 (Fig. 4A).
Identification of TFs and lncRNAs
A total of 2,160 transcription factors (TFs) from 86 gene families were detected using iTAK. The largest fraction of identified TFs came from the C3H (5.6%), bHLH (4.53%), and MYB-related (4.26%) families (Fig. 5A). In addition, 11,885, 5,958, 11,294 and 11,037 lncRNAs were identified using the CNCI, CPC, PFAM and PLEK methods, respectively. A total of 3,551 lncRNAs were identified by all four methods (Fig. 5B), with lengths ranging from 200 to 7,840 bp. The lncRNAs were further classified into four types: 1,422 (40.05%) sense intronic lncRNA, 1,149 (32.36%) long intergenic non-coding RNA, 547 (15.40%) antisense lncRNA, and 433 sense overlapping lncRNA (12.19%) (Fig. 5C). The length distribution of the identified lncRNAs was considerably narrower than that of mRNAs predicted from the G. luofuense genome (Fig. 4A and Fig. 5D). Most identified lncRNAs had five or fewer exons, whereas mRNAs predicted from the reference genome tended to have larger numbers of exons (Fig. 5E).
Illumina sequencing of seed samples at two developmental stages
To explore gene expression patterns during seed development of G. luofuense, 306,900,384 clean Illumina reads (46.04 Gb of raw data) with Q30 values from 93.54–94.07% were generated from three immature seed samples (IS) and three mature seed samples (MS) (Table 4). After the deletion of adaptors and low-quality reads, the average GC content of the six samples was 47.08%. PCA analysis showed that gene expression was highly correlated among the replicate samples of immature and mature seeds (correlation efficiency value = 0.95, cumulative proportion of variation explained by PC1 and PC2 = 78.7%) (Fig. 6A). After mapping to the G. luofuense genome, the mapping ratios of IS (average 89.44%, Table 5) were found to be significantly larger than those of MS (average 84.46%, Student’s t-test p-value = 0.003). RNA-seq analysis of the two developmental stages yielded a total of 23,977 genes (19,010 in IS and 20,737 in MS), of which 2,970 were identified as novel genes.
Table 4
Detail information of Illumina sequenced data from the six G. luofuense seed samples
Sample name | Raw reads | Clean reads | Q20 (%) | Q30 (%) | GC content (%) | Genome mapping (%) |
IS01 | 55,631,598 | 54,243,040 | 97.77 | 93.55 | 47.12 | 89.75 |
IS02 | 49,049,534 | 47,624,272 | 97.99 | 94.07 | 46.93 | 88.74 |
IS03 | 59,174,882 | 57,466,054 | 97.76 | 93.54 | 47.56 | 89.82 |
MS01 | 54,240,794 | 53,255,114 | 97.86 | 93.88 | 46.98 | 85.32 |
MS02 | 52,496,962 | 50,708,568 | 97.85 | 93.78 | 46.71 | 84.57 |
MS03 | 44,723,634 | 43,603,336 | 97.82 | 93.64 | 47.18 | 83.49 |
Total | 315,317,404 | 306,900,384 | 97.84 | 93.74 | 47.08 | 86.95 |
Table 5
Detail information of genome mapping of the transcriptome from the six G. luofuense seed samples
Sample name | IS01 | IS02 | IS03 | MS01 | MS02 | MS03 |
Total reads | 54,243,040 | 47,624,272 | 57,466,054 | 53,255,114 | 50,708,568 | 43,603,336 |
Total mapped | 48,642,362 (89.67%) | 42,217,664 (88.65%) | 51,570,801 (89.74%) | 45,412,254 (85.27%) | 42,897,372 (84.6%) | 36,405,206 (83.49%) |
Multiple mapped reads | 1,210,358 (2.23%) | 1,063,787 (2.23%) | 1,296,609 (2.26%) | 1,589,381 (2.98%) | 1,416,262 (2.79%) | 1,337,971 (3.07%) |
Uniquely mapped reads | 47,432,004 (87.44%) | 41,153,877 (86.41%) | 50,274,192 (87.49%) | 43,822,873 (82.29%) | 41,481,110 (81.8%) | 35,067,235 (80.42%) |
Reads map to '+' strands | 23,746,949 (43.78%) | 20,608,975 (43.27%) | 25,166,379 (43.79%) | 21,941,014 (41.2%) | 20,771,109 (40.96%) | 17,543,656 (40.23%) |
Reads map to '-' strands | 23,685,055 (43.66%) | 20,544,902 (43.14%) | 25,107,813 (43.69%) | 21,881,859 (41.09%) | 20,710,001 (40.84%) | 17,523,579 (40.19%) |
Enrichment analysis of DEGs and qRT-PCR validation
A total of 14,323 differentially expressed genes (DEGs) were identified between IS (control group) and MS: we found 7,891 upregulated genes and 6,432 genes downregulated (Fig. 6B) from IS to MS. The DEGs were also annotated with the three categories of GO terms, and multiple GO terms in the “biological process” category were significantly enriched with regard to Z-scores and adjusted p-values (Fig. 6C). The top five enriched GO terms were “single-organism cellular process” (GO:0044763), “single-organism process” (GO:0044699), “metabolic process” (GO:0008152), “cellular metabolic process” (GO:0044237), and “Organic substance metabolic process” (GO:0071704). The DEGs were also enriched in multiple KEGG pathways with reference to Arabidopsis thaliana. The top five enriched KEGG pathways were “metabolic pathways” (KEGG ID: ath01100, 1329 genes), “biosynthesis of secondary metabolites” (ath01110, 844 genes), “carbon metabolism” (ath01200, 179 genes), “ribosome” (ath03010, 164 genes), and “starch and sucrose metabolism” (ath00500, 154 genes) (Fig. 6D). qRT-PCR was used to validate the relative expression of 14 genes of interest: four MADS-box genes, four Aux/IAA genes, four bHLH genes, and two MYB genes. The relative expression of the 14 genes at the two seed developmental stages is presented in Fig. 6E.