Using long-read sequencing, we obtained an intact chloroplast genome and a well-defined gene structural annotation of Caulerpa lentillifera. The new genome is 7.5 kb larger than the previously reported genome sequence for this species [21], and in addition to SNVs and a few possible structural variations between the two versions, there are several indels relating to copy number variation of tandem repeat sequences (Additional file 1: Table S2). While these differences could be due to intraspecific variation between the isolates, limitations of using only short sequence reads in the previous work may also have contributed, as short reads can fail to assemble repetitive sequences [22]. Since the raw reads for the previously published genome are not available, we cannot determine the exact reason for the differences between the two sequences at present. Long-read sequences also permit identifying heteroplasmy within individuals, as recently shown in the chloroplast genome of a related species by nanopore sequencing [17]. Our PacBio long-read data did not reveal any evidence of such structural variations, and in our opinion the prevalence and nature of heteroplasmy across the siphonous green algae require further work based on long-read methods that deliver highly accurate reads.
Several bioinformatic tools for automated feature annotation of chloroplast genomes have been developed, but relatively little work has been done to compare their predictions to experimentally determined RNA sequences. Our Iso-seq work shows that the majority of genes encoding proteins and rRNA were accurately predicted by MFannot and GeSeq. However, we found that Iso-seq data-guided annotation could greatly improve the annotation of introns and exon-intron boundaries. Taking our exon-intron boundary information as a reference, we were able to greatly improve the structural annotations of atpF and ccsA across the Bryopsidales, and the corrected intron structures facilitated the analysis of the unusual characteristics of these introns. Several common features of the atpF and ccsA introns were identified, such as the conserved domain V motif and other common motifs upstream from it. Domain V, which is one of the six conserved domains radiating from a “central wheel” of group II introns, is the most conserved element and an important component in the catalytic reactions of group II introns [23, 24]. Previous work has shown that the 2-nt bulge (AY) and the catalytic triad (AGC, or CGC for some introns) in the stem of domain V are most important for chemical catalysis of excision [25, 26]. Although the catalytic triad is conserved across all the analyzed group II introns of Bryopsidales, the bulge of domain V in the atpF and ccsA introns is relatively variable, indicating that the splicing mechanisms of these introns might differ from those of typical group II introns. Previous work, mainly based on land plants and Euglena, showed that most group II introns are degenerate in their RNA structures or have lost the intron-encoded proteins [27].
Our results indicate that the introns in atpF and ccsA differ from canonical group II introns in several obvious ways, including the absence of consensus intron boundary sequences, ORFs lacking homology to proteins involved in splicing or mobility, and a deviant overall structure that makes it difficult to accurately determine the secondary domains other than domain V. However, our Iso-seq data showed that the introns in atpF and ccsA were spliced predictably, suggesting that an effective mechanism has evolved to recognize and splice these atypical introns in bryopsidalean chloroplasts.
The fragmentation of several protein-coding genes has been a puzzling feature of green algal chloroplast genomes. In this study, three protein-coding genes were found to be fragmented in the C. lentillifera plastid genome: cemA was shown to be fragmented in addition to the previously reported tilS and rpoB, which are known to be fragmented across Bryopsidales [12] and some other green algal lineages (e.g. [25, 28, 29]). Considering that cemA is not fragmented in other Caulerpa species, its fragmentation likely represents a recent event. This observation, along with reports of some other fragmented genes such as rpoC1 and rpoC2 in Chlamydomonas species [30], suggests that gene fragmentation may be fairly common in green algal chloroplast genomes. The fragmented genes in Caulerpa retained high sequence conservation following the fragmentation, a clear indication that they are not pseudogenes. Our Iso-seq data and RT-PCR results provide clear evidence for transcription of these genes. They also indicate that the two pieces of both cemA and tilS are co-transcribed in transcriptional units, but the presence of shorter transcripts covering either fragment of these genes suggests that the transcripts may be divided into two portions by RNA processing mechanisms. Our results for rpoB contrast with those for the other genes, showing that while the two fragments were occasionally found on a single transcript, they were more commonly transcribed separately. A careful comparison between the chloroplast genome and the aligned Iso-seq reads showed no evidence for RNA editing, so it seems unlikely that the frameshifts in these fragmented genes were modified to restore normal reading frames.
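As an illustration of how a genome-versus-transcript comparison for RNA editing could be set up, the following minimal sketch flags positions where a chosen fraction of aligned reads disagrees with the genomic base. It is not the procedure used in this study: the function name `candidate_edit_sites`, the gapless `(offset, sequence)` read representation, and the `min_fraction` threshold are all assumptions made for the example, and a real analysis would work from proper alignments and account for sequencing errors and indels.

```python
def candidate_edit_sites(genome, aligned_reads, min_fraction=0.5):
    """Return positions where reads consistently differ from the genome.

    genome        -- reference sequence as a string
    aligned_reads -- list of (offset, sequence) pairs; each read is assumed
                     to align without gaps, on the same strand as the
                     genome, starting at 0-based position `offset`
    min_fraction  -- hypothetical cutoff: the minimum fraction of covering
                     reads that must mismatch for a site to be reported
    """
    sites = []
    for pos, ref_base in enumerate(genome):
        # Bases observed at this position across all reads that cover it.
        observed = [
            seq[pos - offset]
            for offset, seq in aligned_reads
            if offset <= pos < offset + len(seq)
        ]
        if not observed:
            continue  # no read coverage, nothing to compare
        mismatches = sum(1 for base in observed if base != ref_base)
        if mismatches / len(observed) >= min_fraction:
            sites.append((pos, ref_base, mismatches, len(observed)))
    return sites


# Toy data: the second read carries a T where the genome has a C.
toy_genome = "ATGCCA"
toy_reads = [(0, "ATGCCA"), (2, "GTCA")]
print(candidate_edit_sites(toy_genome, toy_reads))
```

An empty result from such a scan would be consistent with the absence of RNA editing, whereas recurrent substitutions at the same position would be candidates for closer inspection.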
Ribosomal frameshifting [31] is a hypothetical alternative mechanism that could correct the frameshifts in fragmented genes at the level of translation. However, various types of gene fragmentation exist in bryopsidalean lineages [12], including some with longer inserts between the fragments, which makes this unlikely. It is more likely that the two pieces of these fragmented genes are translated separately and combine after translation. We did not find SD-like sequences (translation initiation signals of bacteria and some chloroplast mRNAs) upstream of the translation initiation sites of the gene fragments, so it remains to be confirmed whether the transcriptional products of fragments a and b are separately translated and perform their normal functions by forming protein complexes of both subunits. Nevertheless, gene fragmentation (or gene fission) as well as gene fusion are important mechanisms that contribute to the evolution of gene architecture and the origination of new genes. Gene fusion/fission was a major contributor to the evolution of multi-domain proteins in bacteria and the creation of new genes in Drosophila [32, 33], and the origin of gene fission has been revealed as a two-step process consisting of duplication and degeneration in Drosophila. Recently, gene fragmentation was found to be very prominent in mitochondrial genomes of diplonemids, where the resulting modules (gene fragments) are transcribed separately, which might contribute to a gradual increase in the complexity of a given cellular machinery [34]. What drives gene fragmentation in chloroplast genomes, as well as the mechanisms and consequences of this process in these organelles, remain open questions.
Our Iso-seq data allowed us to experimentally verify polycistronic mRNAs and post-transcriptional isoforms, which are important for understanding the mechanisms of plastid genome expression. Although transcriptional and post-transcriptional regulation of chloroplast genes has been well studied in higher plants [6, 35, 36], little is known about the situation in algae. In contrast to higher plants, transcript processing has been assumed to be less important in controlling plastid gene expression in algae, because nearly all genes seemed to be transcribed as monocistronic RNAs in the unicellular green alga Chlamydomonas reinhardtii [6]. However, our analysis revealed that more than half of the protein-coding genes are co-transcribed with adjacent genes, forming polycistronic transcripts of up to 9 genes in Caulerpa, so the observations in Chlamydomonas cannot be extrapolated to other green algae. In addition, because we used very strict criteria to consider genes as co-transcribed on PTUs, several genes that flank our PTUs but were not entirely covered by Iso-seq reads were not counted as part of the PTUs, so the extent of gene co-transcription is probably even larger. Our work, along with observations of co-transcribed chloroplast gene clusters in C. reinhardtii based on RNA-seq coverage analysis [37, 38], provides clear evidence for polycistronic transcription in algae. Unsurprisingly, genes on the same PTU in C. lentillifera were often functionally related. This observation extends to the co-transcribed clusters of unknown ORFs, which often shared conserved functional domains of the same class within the PTU (Additional file 4: Table S5). Four of the six conserved gene clusters in Bryopsidales [12] were found to be co-transcribed in PTUs, underlining the strong evolutionary conservation of co-transcription and the importance of the co-occurrence of these genes.
The remaining two conserved gene clusters observed in Bryopsidales (psaM-psb30-psbK-psbN-trnM and psbE-psbF-psbL-psbJ) were not observed as PTUs in this study. This is because our full-length RNA data did not contain any reads of the genes in question, perhaps due to their lower levels of expression or faster degradation.
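The strict criterion described above, under which a gene counts as part of a PTU only when a full-length read covers it entirely, can be illustrated with a minimal sketch. The names `Gene`, `covered_genes` and `co_transcribed_units`, the toy coordinates, and the simplification of ignoring strand are all assumptions made for illustration; this is not the pipeline used in this study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # inclusive genomic start coordinate
    end: int    # inclusive genomic end coordinate


def covered_genes(read_start, read_end, genes):
    """Names of genes lying entirely within the read interval."""
    return [
        gene.name
        for gene in genes
        if read_start <= gene.start and gene.end <= read_end
    ]


def co_transcribed_units(reads, genes):
    """Gene sets (two or more genes) fully covered by at least one read.

    reads -- list of (start, end) alignment intervals, one per read
    genes -- list of Gene records sorted by start coordinate
    """
    units = set()
    for read_start, read_end in reads:
        names = covered_genes(read_start, read_end, genes)
        if len(names) >= 2:
            units.add(tuple(names))
    return units


# Hypothetical genes and reads: the second read only partly covers geneC,
# so geneC is not counted as co-transcribed with geneB.
toy_genes = [Gene("geneA", 100, 400), Gene("geneB", 450, 900), Gene("geneC", 950, 1500)]
toy_reads = [(90, 910), (420, 1200)]
print(co_transcribed_units(toy_reads, toy_genes))
```

In the toy data, the partially covered flanking gene is excluded, which mirrors why a strict full-coverage rule tends to underestimate the true extent of co-transcription.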