Pulp color change
Obviously, there is larger difference in pulp colour between the two cultivars. There were not significant differences in L*, a*, b*, C*, h° value of ‘Zihonglong’ and ‘Jinghonglong’ pulps on 22 DPA. The L*, a*, b*, C*, h° value of ‘Jinghonglong’ were relatively stable, while those of ‘Zihonglong’ changed significantly during the fruit development stages (Table 1). With the development of ‘Zihonglong’ fruit, L* value was decreased gradually, a*, C*, h° increased prominently on 25DPA and reached the highest values (27.90, 28.53, 343.05 respectively) on 28DPA, and then appeared a slight decline on mature fruit. b* value decreased from 2.11(yellow pulp) on 22 DPA to –6.73 (blue pulp) on 25DPA.
Transcriptome analysis using PacBio Sequel
The full-length transcriptome of pitaya fruit was generated by PacBio Sequel on ‘Zihonglong’ and ‘Jinghoglong’. As shown in Table 2, 9,579,839 subreads from 8.47G bases were obtained from the pulp of ‘Zihonglong’, while 7,245,659 subreads from 7.74G bases from the pulp of ‘Jinghonglong’. After removing adapters and artefacts, 367,001 circular consensus sequence (CCS) (including 314,173 full-length non-chimerics, FLNCs) of ‘Zihonglong’ and 481,602 CCS (including 348,184 FLNCs) of ‘Jinghonglong’ were generated, respectively. The lengths of ‘Zihonglong’ FLNCs ranged from 334 to 14,604 nt with an average length of 950 nt, while ‘Jinghonglong’ FLNCs showed an average length of 1,095 nt and ranged from 374 to 6,988 nt. For ‘Zihonglong’, 184,875 polished consensus sequences transcripts were produced, including 23,669 polished high-quality (HQ) and 161,206 low-quality (LQ) isoform sequences. For ‘Jinghonglong’, 188,215 polished consensus sequences, including 25,299 polished HQ and 162,916 LQ isoform sequences were obtained. After correcting and removing redundant reads, 65,312 and 91,638 genes (non-redundant reads) were obtained from full length transcripts of ‘Zihonglong’ and ‘Jinghonglong’, respectively.
Comparison of SMRT sequencing and next-generation sequencing
The number of SMRT gene obtained from SMRT sequencing were less than that of unigene assembled from NGS reads, whereas, the mean length of SMRT gene is much longer than that of unigene assembled from NGS reads. Additionally, about 65% of the assembled transcripts from NGS reads were <500 bases in ‘Zihonglong’, while about 74% of that in ‘Jinghonglong’. However, only 12% in ‘Zihonglong’ and 6% in ‘Jinghonglong’ of the transcripts from the PacBio Sequel reads were <500 bases. Approximately 80% of the transcripts from the PacBio Sequel reads ranged from 500 bases to 2000 bases (Table 3). Hence, the SMRT sequencing offered significant advantages over NGS in the length of reads.
Clustering analysis
Multiple transcripts corresponded to one gene in the transcriptional group. PacBio long reads clustering analysis demonstrated that 65,317 and 91,638 genes were generated from polished consensus sequences transcripts in ‘Zihonglong’ and ‘Jinghonglong’, respectively. Various isoforms generated by a single gene were widely found among the tested samples. A total of 17.42% genes had more than one isoform in ‘Zihonglong’ pulp, a slightly higher than that (16.97%) of ‘Jinghonglong’. In the former, 11,377 genes showed more than two alternative splice forms (isoforms), of which the majority corresponded to two-to-three isoforms, accounting for 74.78% of the total, and 516 genes contained over 10 isoforms. To the latter, 15,551 genes had more than two isoforms, among which the majority were two-to-three isoforms, accounting for 67.99% of the total, and 767 genes with over 10 isoforms were obtained (Figure 1). Therefore, when alleles and associated homologs were grouped against these results, they typically shared the same alternative splicing patterns [9], indicating that a gene might generated different transcripts via alternative splicing.
Function annotation
Function annotation of pitaya non-redundant FLNC transcripts (genes) was investigated using different databases. As shown in Table 4, a total of 34,601 transcripts were annotated in the Clusters of Orthologous Groups of proteins (COG) database; 54,706 in Gene Ontology (GO); 28,796 in Kyoto Encyclopedia of Genes and Genomes (KEGG); 56,010 in euKaryotic Ortholog Groups (KOG); 88,549 in protein families and domains (Pfam); 72,130 in Swiss-Prot; 95,458 in TrEMBL; 10,5413 in NR; and 63,052 in NT. Moreover, 120,604 transcripts were annotated in all of the nine databases, while 30,875 sequences without hits to reference database were probably novel genes in pitaya.
The homologous species of Hylocereus were predicted by sequence alignment on the basis of the NR database. Of all the genes hits to NR plant proteins from BLASTx, the pitaya genes gave the highest number of hits to the Beta vulgaris (51,879 hits), followed by Spinacia oleraces (21,876 hits) and Vitis vinifere (2,010 hits) (Fig. 2a). Most hits found in Beta vulgaris were probably for the reason that pitaya and Beta vulgaris belong to Caryophyllales, and Beta vulgaris database is better annotated than those of other species. As shown in Figure 2b, molecular function (MF, 62,439 FLNCs) was more abundant than biological process (BP, 142,635 FLNCs) and cellular component (CC, 104,215 FLNCs). Within these functional groups, the highest number of sequences were annotated with the metabolic process (35,263 sequences, 11.40%), cellular process (28,379 sequences, 9.18%) and catalytic activity (27,507 sequences, 8.89%). A total of 117 pathways with 28,796 genes were annotated by KEGG, associated with 23.88% of the whole annotated dataset (120,640 genes). Among these, 237 genes were identified in phenylalanine, tyrosine and tryptophan biosynthesis pathway, however, KEGG path way involved in betalain biosynthesis was not founded.
SSR and lncRNA prediction
A total of 31,169 SSRs were discovered in 24,889 genes (38.10%) from ‘Zihonglong’, the number of genes containing more than one SSR was 11,885, and the number of SSRs present in compound formation was 4,472. A total of 53,024 SSRs were discovered in 39,793 genes (43.42%) from ‘Jinghonglong’, the number of genes containing more than one SSR was 18,725, and the number of SSRs present in compound formation was 8,868 (Figure 3). In both cases, the most abundant motifs detected was mono–nucleotides, accounting for 41.72% and 40.58% of the total SSRs in ‘Zihonglong’ and ‘Jinghonglong’, respectively, and 4,883 (15.67%) and 6,204 (11.70%) di-nucleotides were detected from ‘Zihonglong’ and ‘Jinghonglong’, respectively.
We obtained 11,650 and 11,113 lncRNAs from 65,317 and 91,638 genes in ‘Zihonglong’ and ‘Jinggonglong’, respectively (Figure 4). Four of these lncRNAs were up to 3,000 nt in ‘Zihonglong’, while 18 up to 3,000 nt were investigated from ‘Jinghonglong’, most of which were single-isoform transcripts presenting in both samples. The functions of these lncRNAs need to be further characterized.
Genes involved in betalain biosynthesis
Taken account of the different expression levels of genes between ‘Zihonglong’ and ‘Jinghonglg’, the genes from PacBio sequel were used as the reference dataset. It was shown that 44,109 DEGs were found between ‘Zihonglong’ and ‘Jinghonglong’ during four development stages, among which, the most DEGs were investigated in R2_vs_W2, containing 11,317 up-regulated and 4,788 down-regulated DEGs respectively (Figure 5a). Heat map of all DEGs in both ‘Zihonglong’ and ‘Jinghonglong’ was created, and the four developmental stages of ‘Zihonglong’ and ‘Jinghonglong’ were clustered together, subsequently, both of the two pitaya cultivars showed that the 22 DPA and the 25 DPA got together, accompanying with the 28 DPA and the 30 DPA got together (Figure 5b).
To evaluate the candidate genes involved in betalain biosynthesis, further analysis was performed. According to the known betalain synthesis pathway [3, 21], TYR, MYB1, CYP76AD1, DODA and 5GT are involved in betalain biosynthesis. Totally, 104 genes including 7 TYR, 38 MYB, 20 CYP76AD, 26 DODA and 13 5GT genes were identified in our SMRT data and these genes were illustrated in a heat map (Figure 5c).
5 CYP76AD and 8 DODA genes with an FPKM > 10 or specially expressed in different colored phenotypes of pulp were selected as candidate genes, among which, HpCYP76AD4 (i1_HQ_R_c13003/f5p0/1979) and HpDODA (i1_LQ_R_c96099 /f1p0/1004) expressed highly in the pulp of ‘Zihonglong’ with the highest FPKM value, but there was almost no expression in the pulp of ‘Jinghonglong’. Furthermore, the expression patterns observed here were consistent with their role in betalain production of pitaya pulp. Thereby, we deduced that the low expression of HpCYP76AD4(i1_HQ_R_c13003/f5p0/1979) and HpDODA (i1_LQ_R_c96099 /f1p0/1004) may account for the absence of betalains in pulp of ‘Jinghonglong’, the length of the two genes were 1979 bp and 1004 bp, respectively. Subsequently, 5 CYP76AD and 8 DOD genes were selected as candidate genes, the BLAST search in NCBI and Neighbor-joining tree of with their homologous gene reported in other plants were performed. The results showed that HpCYP76AD4 (i1_HQ_R c13003/f5p0/1979) had higher similarity with Of CYP76AD8 (KM51679.1), followed by PoCYP76AD1-likeprotein (AKI33825.1) (Quvery67%, ident86%) and BaCYP76AD14 (AJD87470.1) (Quvery75%, ident81%) (Figure 6a). HpDODA (i1_LQ_R_c96099/f1p0/1004) has the highest sequence consistency with PgDODA (Q7XA48.1) (Quvery79%, ident88%), the Quvery and ident between HpDODA (i1_LQ_R_c96099/f1p0/1004) and BvDODA (AET43288.1) were 78% and 63%, respectively (Figure 6b). The length of open reading frame (ORF) of HpCYP76AD4 (i1_HQ_R_c13003/f5p0/1979) was 1521nt (from 122 nt to 1642 nt), that of HpDODA (i1_LQ_R_c96099/f1p0/1004) was 816 nt (from 111 nt to 926 nt).