Global view of the PacBio sequencing data
Using PacBio SMRT, we obtained 611,876 polymerase reads from 21 pooled samples (Additional file 9: Table S1). We then extracted 10,437,029 subreads with an average length of 2,177 bp from the polymerase reads (Additional file 9: Table S1). After processing them further, we obtained 498,059 circular consensus sequences (CCSs), each of which contained a poly(A) tail and 5′ and 3′ adaptors, with an average length of 2,755 bp (Additional file 9: Table S1). Among the CCSs, we detected 430,554 full-length non-chimeric (FLNC) reads, with an average length of 2,568 bp (Additional file 9: Table S1). After polishing the reads with both iterative clustering for error correction (ICE) and Arrow, we obtained 227,276 polished consensus reads whose average length was 2,697 bp (Additional file 9: Table S1). By mapping the short reads generated from Illumina sequencing via LoRDEC, we corrected up to 100% of the sequencing errors [44]. We ultimately obtained 227,276 high-quality polished consensus reads with an average length of 2,696 bp for further analysis (Additional file 10: Table S2).
Isoform detection and characterization
Using TAPIS software, we classified and characterized the full-length isoforms [23]. In total, we obtained 89,640 isoforms, including 4,210 (4.70%) isoforms of known genes, 65,448 (73.01%) novel isoforms of known genes, and 19,982 (22.29%) isoforms of novel genes (Fig. 1a). We compared the lengths of the transcripts between the PacBio data and reference genome annotation, and found that the transcripts described in the reference genome annotation were shorter than those detected by PacBio Iso-Seq (Fig. 1b), which helps to reveal the complexity of the L. chinense transcriptome. In addition, we compared the numbers of exons in the transcripts between the PacBio data and reference genome annotation. We found that single-exon genes represented a large proportion (30.37%) of the PacBio data, while in the reference genome, single-exon genes accounted for only 7.29% (Fig. 1c). The proportion of multiple-exon genes (≥ 2 exons) in the reference genome annotation was greater than that in the PacBio data (Fig. 1c). Moreover, single-isoform genes constituted a large proportion of the PacBio data (49.22%), followed by two-isoform genes, which accounted for 15.98% (Fig. 1d).
Read mapping and identification of novel genes and TFs
Using GMAP software, we aligned the high-quality polished consensus reads to the reference genome and found that 1.13% (2,572/227,276) of the reads were unmapped [45]. Among the remaining 224,704 alignments (98.87%), 9.96% (22,628/227,276) of the reads were mapped to multiple sequences, 88.91% (202,076/227,276) of the reads were uniquely mapped, 52.73% (119,840/227,276) of the reads mapped to the sense sequences in the genome (‘+’), and 36.18% (82,236/227,276) of the reads mapped to the antisense sequences in the genome (‘-’) (Fig. 1e, Additional file 11: Table S3).
In the PacBio sequencing data set, 2,572 unique transcript clusters did not align to the reference genome. The function of the unmapped transcripts was annotated on the basis of the information within the non-redundant (NR) database, nucleotide (NT) database, Protein Family (Pfam) database, EuKaryotic Orthologous Groups (KOG) database, Swiss-Prot database, Kyoto Encylcopedia of Genes and Genomes (KEGG) database and Gene Ontology (GO) database [46-51]. In these databases, 2,022 (NR), 1,584 (NT), 1,066 (Pfam), 1,260 (KOG), 1,621 (Swiss-Prot), 1,995 (KEGG) and 1,066 (GO) unmapped transcripts were annotated (Fig. 1f). Among these unmapped transcripts, 619 were found in all 7 databases, and 2,163 unmapped transcripts were annotated in only one database (Fig. 1f).
We defined reads that mapped to unannotated regions of the reference genome as reads from novel genes, and a total of 13,139 novel genes were detected. To better understand the novel genes, they were functionally annotated. In total, 9,343 (NR), 5,833 (NT), 6,494 (Pfam), 5,478 (KOG), 5,883 (Swiss-Prot), 9,177 (KEGG) and 6,494 (GO) novel genes were annotated in the 7 databases (Fig. 1g). In addition, 1,945 novel genes were annotated across all seven databases, and 10,905 novel genes were annotated in only one database (Fig. 1g). Furthermore, 2,234 novel genes were unannotated in all 7 databases, which may indicate that these genes have little coding ability and thus may represent lncRNAs [52].
We then used iTAK software to predict TFs [53]. We obtained a total of 5,532 TFs from 95 families (Additional file 12: Table S4). The top 30 families detected are shown in Fig. 1h. The top ten TF families identified included FAR1 (303), C3H (286), bHLH (266), SNF2 (250), others (209), bZIP (209), MYB-related (179), PHD (176), NAC (174), and SET (167) (Additional file 12: Table S4).
Identification of lncRNAs and fusion transcripts.
LncRNAs are RNA molecules that are longer than 200 nt and that do not encode proteins. We used four tools, PLEK, CNCI, CPC, and the Pfam database, to identify lncRNAs from PacBio Iso-Seq data [47, 54-56]. The PLEK, CNCI, CPC and the Pfam database tools predicted 44,108 lncRNAs, 15,500 lncRNAs, 21,549 lncRNAs, and 43,102 lncRNAs, respectively (Fig. 2a). To improve the accuracy of the lncRNA prediction, each lncRNA was predicted by four tools, and a total of 7,527 lncRNAs were detected (Fig. 2a). According to the position of each lncRNA when mapped to the reference genome, we classified the lncRNAs into four types, lincRNAs (36.44%), antisense lncRNAs (16.99%), sense intronic lncRNAs (18.37%), and sense overlapping lncRNAs (28.19%) (Fig. 2b) [57]. Overall, compared with mRNAs, the lncRNAs had fewer exons and were shorter (Fig. 2c and d). The length of most lncRNAs was less than 1 kb, and single-exon lncRNAs accounted for the majority of the lncRNAs (72.34%) (Fig. 2c and d).
A fusion gene consists of multiple coding regions connected end to end and controlled by a set of regulatory sequences (including promoters, enhancers, and ribosome-binding sequences). A total of 887 fusion transcripts were identified in this study (Additional file 13: Table S5). By analysing the distribution of fusion transcripts on the scaffolds (because the reference genome has not yet been assembled at the chromosomal level), we found that these fusion events tended to occur between scaffolds (866) rather than within a scaffold (21) (Additional file 13: Table S5).
AS and APA analyses
AS is an important mechanism for regulating gene expression and proteome diversity, and it is also an important cause of the large differences between the number of genes and the number of proteins in eukaryotes [58, 59]. By using the SUPPA tool, we identified 8,503 alternatively spliced genes, including 3,748 genes with skipped exons (SEs), 4,269 genes with retained introns (RIs), 404 genes with mutually exclusive exons (MXs), 3,625 genes with alternative 5′ splice sites (A5s), 1,521 genes with alternative first exons (AFs), 4,195 genes with alternative 3′ splice sites (A3s), and 632 genes with alternative last exons (ALs) (Fig. 3a). Among the 7 basic alternatively spliced genes, RI genes were the most common, while MX genes were the least frequent (Fig. 3a).
Most genes in eukaryotes can generate a variety mRNA 3′ ends through APA, which greatly increases the complexity of the transcriptome [60, 61]. By using the TAPIS pipeline [23], we found that 11,108 genes contained at least one poly(A) site, among which 4,387 genes contained a single poly(A) site and among which 6,271 genes contained more than one poly(A) sites (Fig. 3b). To identify the potential cis-element involved in polyadenylation, a motif enrichment analysis was performed to analyse the 50 nucleotides upstream from the poly(A) sites [23]. We discovered a conserved motif (AUAAA) upstream of the poly(A) cleavage site (Fig. 3c), and this motif also was present in maize, red clover (Trifolium pratense L.) and sorghum (Sorghum bicolor) [22-24]. Moreover, to investigate the preferential nucleotides at the poly(A) cleavage sites, we analysed the nucleotide composition of 50 downstream and 50 upstream nucleotides at all APA cleavage sites. We found that uracil (U) was enriched upstream of the APA cleavage sites while adenine (A) was enriched downstream of the cleavage sites (Fig. 3d). The same findings were reported in red clover and sorghum [23, 24].
Identification of DEGs and tissue-specific genes in L. chinense
To investigate gene expression patterns in the seven evaluated tissues of L. chinense, we used fragments per kilobase of transcript sequence per million mapped reads (FPKM) values to normalize the reads from Illumina sequencing. In total, we identified 3,032 DEGs between floral tissues (bracts, sepals, petals, stamens, and pistils) and vegetative tissues (leaves and the shoot apex), of which the expression of 878 DEGs was upregulated, whereas that of the other 2,154 was downregulated (Fig. 4a). We performed GO enrichment analysis to relate these 3,032 DEGs to their products, and the results showed that the top three terms containing categorized genes were ‘catalytic activity’, ‘single-organism metabolic process’, and ‘transferase activity’ (Additional file 1: Fig. S1a). Furthermore, KEGG pathway enrichment analysis revealed that the DEGs participated in 107 pathways, and the top 3 significantly enriched pathways were associated with biosynthesis of flavonoids, glycosphingolipids, stilbenoids, diarylheptanoids and gingerol (Additional file 1: Fig. S1b).
In the different tissues, the euclidean distance method was used to perform a clustering analysis of all the genes to identify their clustering patterns [62]. The results showed that these genes exhibited different expression patterns; for example, a number of genes tended to be highly expressed in vegetative tissues, and many genes tended to be highly expressed in floral tissues (Fig. 4c). The 21,944 genes (with different FPKM values in different tissues) were then grouped into six subclusters via the H-cluster algorithm. The results showed that these genes presented different expression patterns (Additional file 2: Fig. S2). Interestingly, the expression pattern of the genes in subcluster 1 (10,124 genes), whose expression was upregulated in the stamens, pistils, shoot apex, and flowers (the bracts, sepals, petals, stamens, and pistils combined) but downregulated in the leaves, was opposite that in subcluster 2 (2,289) (Additional file 2: Fig. S2). Interestingly, genes in subcluster 6 (296 genes) showed a strong tissue preference, exhibiting patterns of which their expression highly upregulated in the stamens and flowers (Additional file 2: Fig. S2).
To better understand tissue-specific genes in L. chinense, we investigated the distribution of tissue-specific genes in L. chinense and the pathways with which these genes were associated. Genes with less than 1 FPKM in any tissue, were considered not expressed in that tissue [63]. In our study, we detected a total of 2,126 tissue-specific genes in 7 tissues. Stamen tissue contained the highest number of tissue-specific genes (819, 38.71%), followed by the leaf (372, 17.58%), shoot apex (343, 16.21%), bract (263, 12.43%), pistil (225, 10.63%) and petal (52, 2.46%) tissues, while the sepal tissue contained the minimum number of tissue-specific genes (42, 1.98%) (Fig. 4b). The expression patterns of the 2,126 genes were tissue specific (Fig. 4d), and KEGG pathway enrichment analysis also revealed that these tissue-specific genes may play roles in different pathways. For example, bract-specific genes were significantly enriched in plant-pathogen interactions (Additional file 3: Fig. S3a). Interestingly, the tissue-specific genes in the shoot apex were enriched not only in plant-pathogen interactions, but also in monoterpenoid biosynthesis (Additional file 3: Fig. S3b). The petal-specific genes were associated with thiamine metabolism (Additional file 3: Fig. S3c), while the stamen-specific genes were significantly enriched in glycosphingolipid biosynthesis and starch and sucrose metabolism (Additional file 3: Fig. S3d). The pistil-specific genes were enriched in flavonoid biosynthesis (Additional file 3: Fig. S3e). Interestingly, the sepal-specific genes were also enriched in monoterpenoid biosynthesis (Additional file 3: Fig. S3f).
Gene co-expression analysis
Weighted correlation network analysis (WGCNA) was used for gene co-expression analysis [64]. In total, 13 co-expression modules were obtained, with the number of genes involved ranging from 173 (tan module) to 4,503 (turquoise module) (Fig. 5a, Additional file 14: Table S6). In the shoot apex, genes from the black module were highly expressed, and KEGG enrichment analysis revealed that these genes were associated mainly with plant-pathogen interactions and plant hormone signal transduction (Fig. 5b, Additional file 4: Fig. S4). Genes from the green-yellow module tended to be expressed in the bracts and shoot apex, and these genes also participated in plant-pathogen interactions (Additional file 4: Fig. S4, Fig. 5c). Genes from the magenta module, which were associated with glycolysis and the tricarboxylic acid (TCA) cycle, tended to be expressed in the petals and sepals (Fig. 5d, Additional file 4: Fig. S4), while genes from the pink module were highly expressed in the leaves, petals, and sepals and were involved in porphyrin and chlorophyll metabolism and photosynthesis (Fig. 5e, Additional file 4: Fig. S4). Genes belonging to the purple module were preferentially expressed in the petals and stamens (Additional file 4: Fig. S4). By using KEGG pathway enrichment analysis, we found that these genes played a role in oxidative phosphorylation and the TCA cycle (Fig. 5f). In the pistils, the expression of tan module genes was very high, and these genes were related to indole alkaloid biosynthesis (Fig. 5g, Additional file 4: Fig. S4).
To determine the hub genes in the black, green-yellow, magenta, pink, purple, and tan modules, six algorithms, degree, edge percolated component (EPC), maximal clique centrality (MCC), closeness, radiality, and stress, in the CytoHubba package of Cytoscape software were used [65]. To improve reliability, only the genes ranked in the top 20 by all seven algorithms were considered hub genes. We ultimately identified 9, 8, 9, 5, 9, and 12 hub genes in the black, green-yellow, magenta, pink, purple, and tan modules, respectively (Additional file 5: Fig. S5, Additional file 15: Table S7). These hub genes play important roles in different pathways. For example, Lchi02285 (which encodes the MADS-box protein JOINTLESS, black module) played a role in regulating transcription (Additional file 15: Table S7) [66]. Novelgene6677 (which encodes the disease resistance protein RPM1, green-yellow module) had a substantial role in plant resistance (Additional file 15: Table S7) [67]. Lchi18683 (which encodes plastocyanin, pink module) was indispensable for photosynthesis (Additional file 15: Table S7).
MIKC-type MADS-box genes in L. chinense
Using HMMER software and the BLASTP tool, we identified 10 MIKC-type MADS-box genes (7 known genes and 3 novel genes) in the L. chinense genome [68]. By comparing MIKC-type MADS-box genes in A. thaliana and P. trichocarpa with those in L. chinense, we determined the phylogenetic relationship among these 105 genes and conducted a neighbour-joining phylogenetic tree (Fig. 6a). Among these genes, Lchi23168, Lchi01744 and Novelgene12125 were classified as member of the AP3/PI subfamily, Lchi02285 was classified as member of the SVP subfamily, Lchi20361 were classified as member of the SEP subfamily, Novelgene10568 were classified as member of the AGL12 subfamily, Lchi01587 and Lchi04024 were classified as member of the AG/SHP/STK subfamily, Lchi17876 was classified as member of the SOC1 subfamily, and Novelgene0718 was classified as member of the TM8 subfamily (Fig. 6a). We found that the number of MIKC-type MADS-box genes in L. chinense (red circle) was less than that in A. thaliana (yellow circle) and P. trichocarpa (blue circle), and no L. chinense MIKC-type MADS-box gene was classified as member of the AGL6, AGL17, AGL15, FLC or AP1/FUL subfamily (Fig. 6a). We believed that the poor quality of the genome annotation may limit identification of additional MIKC-type MADS-box genes in L. chinense, as three of the ten MIKC-type MADS-box genes were identified via hybrid sequencing, which indicated that the information of L. chinense genome annotation was incomplete.
By using hybrid sequencing, we determined the expression profiles of ten MIKC-type MADS-box L. chinense genes. Lchi23168 and Lchi01744 were B-type genes that were related to the determination of stamen and petal identity and were highly expressed in the stamens and petals (Fig. 6b). As an E-type gene involved in the determination of the identify of all floral organs, Lchi20361 was highly expressed in the pistils, petals and stamens (Fig. 6b). Furthermore, Lchi04024 was highly expressed in the pistils and stamens, Lchi01587 was highly expressed in the pistils, and these two genes were C/D-type genes related to the determination of carpel and stamen identity (Fig. 6b). Interestingly, we found that the expression levels of four MIKC-type MADS-box genes, Lchi17876 (SOC1 subfamily), Lchi02285 (SVP subfamily), Novelgene0718 (TM8 subfamily) and Novelgene10568 (AGL12 subfamily), were greater in vegetative tissues than in floral tissues (Fig. 6b). Notably, not all of the MIKC-type MADS-box genes were highly expressed only in floral organs, as some of them were also highly expressed in vegetative organs. For example, AGL12 is highly expressed in the A. thaliana root meristem, and the expression levels of SVP-like genes in Prunus mume and SOC1-like genes in P. bretschneideri are greater in vegetative organs than in floral organs [69-71].
RT-qPCR validation
To validate the accuracy of the gene expression levels detected by RNA sequencing, we performed RT-qPCR of twenty randomly selected DEGs. As shown in Additional file 6: Fig. S6 and Additional file 7: Fig. S7, the expression levels of the ten DEGs measured by Illumina sequencing and RT-qPCR were correlated (R2 = 0.947, p-value < 0.01) despite some differences in transcript abundance. These results indicated that the gene expression levels detected by Illumina sequencing were reliable.