Morphological features and genome assembly
Cell self-flocculation of S. obliquus AS-6-11 was observed by SEM analysis. The microalgal cells are round and form aggregates through cell-cell contacts (Fig. 1), unlike the other reported Scenedesmus strains, which are spindle-shaped [12].
The genome of S. obliquus AS-6-11, assembled with MECAT, has an estimated size of 172.3 Mbp in 2,772 contigs, with an N50 contig size of 94.4 kbp (Additional file 1: Table S1; NCBI BioProject ID: PRJNA593662). MECAT produced a better assembly of S. obliquus AS-6-11 than SMRT Portal, with 58.1% fewer contigs and a 1.5-fold higher N50 value (Additional file 1: Table S1). The genome sizes of the released Scenedesmus strains [20-24] range from 23.4 to 208.0 Mbp (Table 1). Among the available assemblies, the N50 contig sizes of S. obliquus AS-6-11 (this study) and S. obliquus strain DOE0152z, which were obtained using PacBio technology, are markedly higher than those of the other Scenedesmus strains sequenced using SGS (Table 1). The N50 contig size of S. obliquus AS-6-11 is 1.2-fold and 10.7-fold higher than those of Scenedesmus sp. MC-1 and S. quadricauda LWG 002611, respectively. In addition, the GC content of Scenedesmus strains ranges from 52.0% to 63.2%, and S. obliquus AS-6-11 has the lowest GC content (Table 1). Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the assembly of S. obliquus AS-6-11 is 87.1% complete with 2,168 BUSCO groups (Additional file 2).
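As a point of reference for the assembly statistics quoted above, the following minimal Python sketch shows how an N50 contig size and a GC content are conventionally computed from a set of contigs. The function names and the toy inputs are illustrative assumptions; this is not the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    The N50 is the length L of the contig at which, going from the longest
    contig downwards, the cumulative length first reaches at least half of
    the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def gc_content(sequences):
    """Return the GC percentage over all unambiguous bases (A, C, G, T)."""
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / acgt


# Toy inputs (hypothetical, for illustration only).
print(n50([2, 3, 4, 5, 6]))          # 5
print(gc_content(["GGCC", "AATT"]))  # 50.0
```

Ambiguous bases such as N are excluded from the GC denominator here; other tools may count them, which is one reason reported GC values can differ slightly between studies.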
Genome annotations
A total of 31,964 protein-coding genes were predicted in the S. obliquus AS-6-11 genome (Table 2), markedly more than in the other Scenedesmus strains (Table 1). Searches against the Non-redundant protein (NR), SWISS-PROT, and Pfam protein families databases annotated 19,847, 13,099, and 13,612 proteins, respectively (Table 2). The NR database yielded the largest number of annotated proteins, 1.52-fold the number obtained with the SWISS-PROT database. In addition, 65 GO terms and 428 pathways were predicted in S. obliquus AS-6-11 using the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, respectively.
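The annotation counts above can be put on a common scale with a short calculation. The sketch below uses only the numbers reported in this section; the percentages assume that each annotated protein corresponds to exactly one predicted gene, which is an assumption made here for illustration.

```python
TOTAL_GENES = 31_964  # predicted protein-coding genes (Table 2)
ANNOTATED = {"NR": 19_847, "SWISS-PROT": 13_099, "Pfam": 13_612}


def annotated_percent(database):
    """Percentage of predicted genes with a hit in the given database."""
    return 100.0 * ANNOTATED[database] / TOTAL_GENES


def fold(database_a, database_b):
    """Ratio of annotated proteins in database_a to those in database_b."""
    return ANNOTATED[database_a] / ANNOTATED[database_b]


for name in ANNOTATED:
    print(name, round(annotated_percent(name), 1))
print(round(fold("NR", "SWISS-PROT"), 2))  # 1.52
```

Under that assumption, roughly 62.1%, 41.0% and 42.6% of the predicted genes are annotated by NR, SWISS-PROT and Pfam, respectively.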
The top 20 GO terms and KEGG pathways enriched in the gene function annotation of the S. obliquus AS-6-11 genome are illustrated in Fig. 2. The top 20 GO terms fall mainly into biological process (10) and cellular component (8), with cell, cell part, and organelle as the top three GO terms (Fig. 2a). The top 20 KEGG pathways are mainly related to genetic information processing (14), with chromosome and associated proteins, membrane trafficking, and spliceosome as the top three KEGG pathways (Fig. 2b).
Comparative genomic analysis based on KEGG pathways
A total of 428 pathways were annotated in the S. obliquus AS-6-11 genome. For lipid metabolism, S. obliquus AS-6-11 has the fewest annotated genes (171) among the compared species, particularly in glycerolipid metabolism, glycerophospholipid metabolism and arachidonic acid metabolism (Table 3). However, more genes related to fatty acid biosynthesis and elongation were identified in S. obliquus AS-6-11 than in C. reinhardtii and V. carteri (Table 3). Moreover, S. obliquus AS-6-11 has the fewest genes in carotenoid biosynthesis.
Comparative genomic analysis of orthologous gene clusters
Compared with the other four species, S. obliquus AS-6-11 has 15,879 gene clusters, comprising 14,576 orthologous clusters and 1,303 single-copy gene clusters (Fig. 3). The five microalgae share 3,357 orthologous gene clusters. S. obliquus AS-6-11 has the most gene clusters and singletons (defined as genes for which no orthologs could be found in any of the other species [25]), and its number of singletons (8,751) is 1.26-fold, 3.71-fold, 5.34-fold and 1.67-fold higher than those of C. reinhardtii, C. variabilis, M. conductrix and V. carteri, respectively (Fig. 3). Comparative orthologous gene cluster analysis also indicated that S. obliquus AS-6-11 is phylogenetically close to the other four microalgae (Additional file 3: Fig. S1).
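To make the quantities in this comparison concrete, the sketch below shows one way to derive, from a set of precomputed gene clusters, the number of clusters containing a focal species, the number of clusters shared by all species, and the focal singletons. The data layout, the function name and the toy species A, B and C are assumptions introduced here; the sketch does not reproduce the clustering tool used in this study.

```python
def summarize_clusters(clusters, proteomes, focal):
    """Summarize gene clusters from the point of view of one species.

    clusters:  list of dicts mapping species name -> set of gene IDs
               that belong to that cluster (assumed input layout).
    proteomes: dict mapping species name -> set of all gene IDs.
    focal:     name of the species of interest.
    """
    species = set(proteomes)
    # Clusters with at least one gene from the focal species.
    focal_clusters = [c for c in clusters if c.get(focal)]
    # Clusters with at least one gene from every species.
    shared_by_all = [c for c in clusters if all(c.get(s) for s in species)]
    # Singletons: focal genes that were not placed in any cluster.
    clustered_genes = set()
    for cluster in focal_clusters:
        clustered_genes |= cluster[focal]
    singletons = proteomes[focal] - clustered_genes
    return {
        "clusters": len(focal_clusters),
        "shared_by_all": len(shared_by_all),
        "singletons": sorted(singletons),
    }


# Toy data (hypothetical species and gene IDs).
proteomes = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}, "C": {"c1"}}
clusters = [
    {"A": {"a1"}, "B": {"b1"}, "C": {"c1"}},  # shared by all three species
    {"A": {"a2", "a3"}},                      # cluster found only in A
]
print(summarize_clusters(clusters, proteomes, "A"))
# {'clusters': 2, 'shared_by_all': 1, 'singletons': ['a4']}
```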
Comparative genomic analysis based on gene families
A total of 3,608 gene families were identified in S. obliquus AS-6-11, of which 136 are unique (Fig. 4). Both the total and unique gene families in S. obliquus AS-6-11 are more abundant than those in the other four microalgae (Fig. 4). The number of unique gene families in S. obliquus AS-6-11 is 0.86-, 1.19-, 1.31- and 1.39-fold larger than in C. reinhardtii, C. variabilis, M. conductrix and V. carteri, respectively (Fig. 4). In the S. obliquus AS-6-11 genome, the unique gene families include membrane protein (PF10160), red chlorophyll catabolite reductase (RCC reductase, PF06405), D-mannose binding lectin (PF01453), lipase maturation factor (PF06762), lipid-A-disaccharide synthetase (PF02684) and thioesterase-like superfamily (PF13279), among others. In addition, S. obliquus AS-6-11 and M. conductrix share the most gene families (Fig. 4).
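The notion of a unique gene family used above reduces to a set difference over the family accessions found in each species. The following sketch illustrates this under stated assumptions: each species is represented by a set of Pfam accessions, the species labels `focal`, `sp1` and `sp2` are hypothetical, and the family memberships in the toy data are invented for illustration rather than taken from Fig. 4.

```python
def unique_families(families, focal):
    """Families present in the focal species and in no other species."""
    others = set()
    for species, accessions in families.items():
        if species != focal:
            others |= accessions
    return families[focal] - others


def shared_counts(families, focal):
    """Number of families the focal species shares with each other species."""
    return {
        species: len(families[focal] & accessions)
        for species, accessions in families.items()
        if species != focal
    }


# Toy data: the memberships below are hypothetical.
families = {
    "focal": {"PF01453", "PF06762", "PF00069", "PF00012"},
    "sp1": {"PF00069", "PF00012"},
    "sp2": {"PF00069"},
}
print(sorted(unique_families(families, "focal")))  # ['PF01453', 'PF06762']
print(shared_counts(families, "focal"))            # {'sp1': 2, 'sp2': 1}
```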
Analysis of the genome features related to cell self-flocculation
Cell self-flocculation of the budding yeast Saccharomyces cerevisiae has been well studied. The flocculation proteins, for example Flo1p, Flo5p, Flo9p, and Flo10p, are cell wall proteins (CWPs) and are also called lectins [26, 27]. The GPI anchor has been reported as a common element of cell adhesion proteins, and the GPI-anchored adhesins of the yeast species Candida albicans and S. cerevisiae are well-known fungal adhesins [28]. In S. obliquus AS-6-11, a total of 432 GPI-anchored CWPs were identified. Analysis of the top 10 GPI-anchored CWPs indicated that seven of them have a transmembrane region and eight have signal peptides (Table 4). The isoelectric point (pI) and molecular weight (Mw) of the GPI-anchored CWPs range from 4.95 to 9.58 and from 6.10 kDa to 78.84 kDa, respectively (Table 4).
Fasciclin (PF02469) is an ancient extracellular cell adhesion domain (http://pfam.xfam.org/family/PF02469) that is common to plants and animals. So far, fasciclin domain proteins have not been analyzed in microalgae. In the S. obliquus AS-6-11 genome, a total of 33 fasciclin domain proteins were identified, which are divided into three groups (Fig. 5a). Three main motifs are randomly distributed across the fasciclin domain proteins (Fig. 5b). The predicted pI values and Mw differ greatly among the fasciclin domain proteins (Additional file 4: Table S2). Subcellular localization prediction indicated that most of the fasciclin domain proteins have cytoplasmic (cyto) sites, and 15 of them have secreted (extr) sites (Additional file 4: Table S2). Further analysis of these 15 fasciclin domain proteins containing extr sites showed that six proteins are homologous to the reported fasciclin proteins of Monoraphidium neglectum (64.84%), Aquabacterium sp. (61.36%), Scenedesmus sp. Ki4 (48.09%), and Pelomonas puraquae (46.94%) (Table 5). Additionally, two of the predicted proteins are annotated with the GO term extracellular region part according to the GO database.
Combined analysis of the GPI-anchored CWPs and the fasciclin domain proteins showed that four fasciclin domain proteins are among the GPI-anchored CWPs (Fig. 6a; Additional file 5): one has two FAS1 domains (four repeated domains in the fasciclin I family of proteins), two have transmembrane regions, and one has a signal peptide (Fig. 6a). Comparative genomic analysis of S. obliquus AS-6-11 and the other four microalgae species (C. reinhardtii, C. variabilis, M. conductrix and V. carteri) revealed no proteins similar to these four fasciclin domain proteins. We also performed comparative transcriptome analysis of S. obliquus AS-6-11 and the non-flocculating S. obliquus FSP-3. The four fasciclin domain protein-encoding genes (Fig. 6a) were transcribed in S. obliquus AS-6-11, whereas their transcripts could not be detected in S. obliquus FSP-3 (Additional file 6: Table S3).
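The presence/absence comparison between the two strains can be expressed as a simple filter over a table of expression values. The sketch below is illustrative only: the gene IDs and expression values are hypothetical, the expression unit is left unspecified, and treating a value at or below `min_level` as "not detected" is an assumption rather than the criterion used in this study.

```python
def expressed_only_in(expression, present_in, absent_in, min_level=0.0):
    """Return genes detected in one strain but not in the other.

    expression: dict mapping gene ID -> dict of strain name -> expression
                value (assumed layout; missing strains count as 0.0).
    A gene is reported when its value is above min_level in `present_in`
    and at or below min_level in `absent_in`.
    """
    return sorted(
        gene
        for gene, levels in expression.items()
        if levels.get(present_in, 0.0) > min_level
        and levels.get(absent_in, 0.0) <= min_level
    )


# Toy data (hypothetical gene IDs and values).
expression = {
    "gene_a": {"AS-6-11": 12.5, "FSP-3": 0.0},
    "gene_b": {"AS-6-11": 3.0, "FSP-3": 4.1},
    "gene_c": {"AS-6-11": 0.0, "FSP-3": 7.2},
}
print(expressed_only_in(expression, "AS-6-11", "FSP-3"))  # ['gene_a']
```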
The unique gene family D-mannose binding lectin was also analyzed (Additional file 5). One gene belonging to this unique gene family was identified, and the encoded protein has two conserved domains: a CAP (cysteine-rich secretory proteins) domain and a B_lectin (D-mannose binding lectin) domain. The putative D-mannose binding lectin of S. obliquus AS-6-11 is homologous to Pry1p, a secreted glycoprotein of S. cerevisiae YJM693 (SGD ID: S000003615), with 58% identity (Fig. 6b). The similarity between Pry1p and the D-mannose binding lectin is attributable to the shared CAP domain (Fig. 6b).