Composition and distribution of the algal selenoproteome
We predicted more than 1000 selenoprotein genes from the genomic (36 organisms) and/or transcriptomic (including EST) datasets of 137 algal species (detailed information about these organisms is given in Table S1 in Supplementary file 1). The distribution of selenoproteins and their Cys-containing homologs in these organisms is shown in Figure S1 in Supplementary file 1. Details about these selenoprotein genes are available at the SPDB database website (http://www.selenoprotein.com). Algal selenoproteins can be retrieved by a text search for the species or selenoprotein family name, or by sequence using a web BLAST tool[29]. For each selenoprotein gene, the nucleic acid sequence, amino acid sequence, SECIS element, gene splicing structure, and EST alignment were recorded. A detailed description of this database is given in Figures S10–S15. Because the majority of organisms examined here had only transcriptomic data, we cannot exclude the possibility that some selenoprotein genes in these organisms were not sequenced and therefore went undetected. Figure 1 shows the distribution of different selenoproteins and their homologs in the 36 algae with genomic sequences.
According to the taxonomic classification of algae[30-35], we divided these species into Plantae (including green algae, red algae, and glaucophytes), the SAR group (including Stramenopiles, Alveolates, and Rhizaria), Cryptophytes, and Haptophytes. The majority of these algae (34 out of 36) belong to Plantae and the SAR group. The composition of the algal selenoproteome varied dramatically among taxonomic groups, and in six species no selenoprotein gene could be detected (Figure 1 and Figures S1 and S3). However, in certain lineages, the number of selenoproteins appeared to be more stable. For example, all the algal species possessing larger selenoproteomes (more than 20 selenoproteins, shown by the green branches in Figure 1) were found in Mamiellales and diatoms, whereas the algae with smaller selenoproteomes (fewer than 2 selenoproteins, shown by the red branches in Figure 1) were found in red algae and Eustigmatophyceae.
In Plantae, the size of the selenoproteome varied significantly among different organisms. Red algae and glaucophytes had very small selenoproteomes, including two organisms (Chondrus crispus and Cyanidioschyzon merolae) in which no selenoprotein genes could be detected. Among green algae, Mamiellales species had the largest selenoproteomes (>20 selenoproteins), whereas Sphaeropleales, Streptophyta, and Trebouxiophyceae had the smallest selenoproteomes (0-5 selenoproteins). Compared with other algae, Streptophyta is evolutionarily closer to land plants. Although the only organism with sequenced genomic data found in this clade, Klebsormidium flaccidum, contains only two selenoprotein genes, more selenoprotein genes were detected in some other streptophytes using EST data, such as Nitella hyalina and Chaetosphaeridium globosum, which are thought to be closer to higher-level plants than K. flaccidum (Figure S1).
The distribution of known selenoproteins in the SAR group was also highly variable. Stramenopiles are the largest group of SAR and include diatoms, brown algae, yellow-green algae, Phaeophyceae, and Eustigmatophyceae. In Aureococcus anophagefferens, a pelagophyte, 82 selenoprotein genes belonging to 33 families were found; this species was previously reported to have the largest eukaryotic selenoproteome[26]. The number of selenoprotein genes in diatoms varied from 20 to 44, which is similar to the size of the selenoproteomes in the Mamiellales order of green algae. Brown algae and yellow-green algae had much smaller selenoproteomes (5-6 selenoprotein genes), and no selenoprotein gene was detected in Eustigmatophyceae. Alveolates and Rhizaria are the other two lineages of the SAR group; we detected 29 and 23 selenoprotein genes in Symbiodinium minutum (Alveolates) and Bigelowiella natans (Rhizaria), respectively.
Two additional algal species with sequenced genomes are Guillardia theta (Cryptophyte) and Emiliania huxleyi (Haptophyte). Fourteen selenoproteins belonging to 12 families were detected in G. theta. Surprisingly, a total of 96 selenoprotein genes, belonging to 25 different families, were identified in E. huxleyi; this is the largest selenoproteome reported in any organism so far. Such a large number of selenoprotein genes might be related to the highly repetitive genome of E. huxleyi[36].
Forty-two selenoprotein families were predicted in algae. Many algal selenoproteins have homologs that contain no Sec residue; the most common substitution is the replacement of Sec by Cys (hereinafter referred to as Cys-homologs). In addition, there are many other homologs of selenoproteins in which the position corresponding to Sec is occupied by neither Sec nor Cys (hereinafter referred to as Other-homologs). Although these homologous proteins (Cys-homologs and Other-homologs) are probably not related to Se metabolism, their sequence similarity suggests that they may function similarly to the selenoproteins. More importantly, they carry information on the evolution of selenoprotein families. Therefore, we also included Cys-homolog and Other-homolog data when analyzing the evolution and distribution of the selenoprotein families.
Figure S2 shows the distribution of algae containing different selenoproteins and/or their homologs. Considering the distribution of all types of homologous proteins (Sec-containing, Cys-homologs, and Other-homologs), the PDI_a and TXNRD families are present in all 36 algal genomes, and GPX and GRX are also present in 35 species (Figure S2A). Therefore, these protein families may be essential for the majority of algae. However, the proportion of Sec-containing proteins is different, as PDI_a and GRX are present in the Cys-containing form in most algae, while GPX and TXNRD are mainly present in the Sec-containing form. Figure S2B shows the ranked distribution of selenoproteins (Sec-containing) in different algae. Sec-containing forms of four selenoproteins, GPX, SELENOU, SELENOT and TXNRD, could be found in more than half of the 36 genomes and are the most widely distributed selenoproteins in algae.
Figure S2C shows the proportion of Sec-containing members in each protein family. Some selenoprotein families, such as MSP, SELENOK, SELENOS, USGC, AhpC_b, SELENON, and FesRD, are found almost exclusively in the form of Sec-containing proteins. In addition, 80% of DIO, TlpA, SELENOW, and Hypo family members are Sec-containing proteins. These selenoproteins have fewer non-Sec-containing homologous proteins, indicating that their function is more dependent on Se metabolism in algae. In contrast, members of some other selenoprotein families, such as MsrB, PDI_d, AhpC_a, and GST, are found as Cys-homologous or Other-homologous proteins in nearly 90% of algae genomes.
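The three categories used above (Sec-containing, Cys-homolog, Other-homolog) and the per-family proportion of Sec-containing members can be illustrated with a minimal Python sketch. It assumes that the homologs of a family have already been aligned and that the alignment column of the Sec position is known. The function name, the family labels, and the toy aligned fragments are hypothetical and are not taken from the SPDB data.

```python
from collections import Counter, defaultdict

def classify_homolog(aligned_seq, sec_column):
    """Label a sequence by the residue aligned to the Sec position.

    'U' is the one-letter code for selenocysteine (Sec), 'C' for cysteine.
    Any other residue, or an alignment gap, is treated as an Other-homolog.
    """
    residue = aligned_seq[sec_column].upper()
    if residue == "U":
        return "Sec"
    if residue == "C":
        return "Cys-homolog"
    return "Other-homolog"

# Hypothetical (family, aligned fragment) records; column 2 is the Sec position.
SEC_COLUMN = 2
records = [
    ("FAM_A", "CGUGK"),
    ("FAM_A", "CGUGR"),
    ("FAM_A", "CGCGK"),
    ("FAM_B", "AGCGT"),
    ("FAM_B", "AGSGT"),
    ("FAM_B", "AG-GT"),
]

# Count the three categories within each family.
counts = defaultdict(Counter)
for family, seq in records:
    counts[family][classify_homolog(seq, SEC_COLUMN)] += 1

# Proportion of Sec-containing members per family.
sec_fraction = {
    family: c["Sec"] / sum(c.values()) for family, c in counts.items()
}

for family in sorted(sec_fraction):
    print(family, dict(counts[family]), round(sec_fraction[family], 2))
```

In this toy input, FAM_A is mostly Sec-containing and FAM_B has no Sec-containing member, mirroring the contrast drawn between families such as SELENOK and families such as GST.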
Identification of novel selenoproteins
In this study, three novel selenoprotein families were found in different algae (Figure 1 and Figure 2).
PDI_e. We found a large number of PDI-like protein genes in algae. A thioredoxin-like fold domain can be detected in most of these proteins; therefore, their functions may be related to redox regulation. Based on the amino acid sequences surrounding the Sec residue, PDI sequences could be divided into five subfamilies (Figure S3): PDI_a, PDI_b, PDI_c, and PDI_d, which contain only one Sec, and PDI_e (as named in this study), which has three neighboring Sec residues forming a GUGUU motif (Figure 2A). This is the first study to report a selenoprotein with two consecutive Sec residues. Because of this Sec-Sec sequence, we considered PDI_e a novel selenoprotein (the EST sequence alignment and predicted SECIS elements of PDI_e in several organisms are shown in Figure S4). We speculate that the selenoprotein synthesis system of organisms containing PDI_e is efficient enough to decode consecutive TGA codons. Consistent with this, several PDI_e-containing algae also had abundant selenoproteins (Figure 1, Figure S1 and Figure S3). Even in some PDI_e-containing algae without genomic sequences, many selenoproteins could be detected. For example, in Isochrysis galbana, 17 selenoproteins from 14 families were found in 6,432 assembled EST contigs, and in Karenia brevis, 29 selenoproteins from 17 families were found in 29,618 assembled EST contigs.
We found a total of 12 PDI_e genes in 10 different algae, mainly distributed in Haptophyceae and the SAR group. The GUGUU motif has been lost in the homologous proteins of Fistulifera solaris. No Sec-containing PDI_e sequence was found in the Plantae group, where only non-Sec-containing sequences homologous to PDI_e were detected. In addition to the PDI_e proteins found in algae, Figure 2A also shows proteins from the NR database that have sequence similarity to PDI_e. No protein homologous to PDI_e was found in bacteria, fungi, or multicellular eukaryotes, so we conclude that this selenoprotein is found only in single-celled eukaryotic algae.
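As an illustration of how the GUGUU motif and the consecutive Sec-Sec pair described for PDI_e could be located in a predicted protein sequence, the following Python sketch scans a sequence in which Sec is written as U. It is not the prediction pipeline used in this study; the function names and the toy sequences are hypothetical.

```python
import re

def find_motif(protein, motif="GUGUU"):
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return [m.start() for m in pattern.finditer(protein.upper())]

def has_consecutive_sec(protein):
    """True if the sequence contains two adjacent Sec (U) residues."""
    return "UU" in protein.upper()

# Hypothetical sequences: one with the three-Sec motif, one with a single Sec.
pdi_e_like = "MKTAGUGUULLE"
single_sec = "MKTACGUCLLE"

print(find_motif(pdi_e_like), has_consecutive_sec(pdi_e_like), pdi_e_like.count("U"))
print(find_motif(single_sec), has_consecutive_sec(single_sec), single_sec.count("U"))
```

The first toy sequence contains three Sec residues, two of them adjacent, which is the pattern that distinguishes PDI_e from the single-Sec subfamilies PDI_a to PDI_d.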
AhpC_b. Two families of selenoproteins containing AhpC_TSA domains could be found in algae, AhpC_a and AhpC_b. AhpC_a was detected in almost all algae species, but most of the corresponding proteins were Cys-containing homologs. The Sec-containing AhpC_a was present in only three algal species: A. anophagefferens, E. huxleyi, and S. minutum. AhpC_b was found in Thalassiosira oceanica. There is a detectable similarity between AhpC_b and AhpC_a, but the Sec-flanking sequences are significantly different. In the NR database, we found several proteins homologous to AhpC_b. However, interestingly, all of these homologs were found in prokaryotic organisms and in Cys form. Figure 2B shows the phylogenetic tree and multiple alignment of amino acid sequences of AhpC_b, their closest homologs from prokaryotic species, and all Sec-containing AhpC_a in algae. As shown in Figure 2B, the UxxC(CxxC) motif of AhpC_b and other prokaryotic homologs is different from the TGGUT motif of AhpC_a. Because of the difference between the key motif and the whole sequences, we considered AhpC_b as a novel selenoprotein (the SECIS element is shown in Figure S4). We speculate that it potentially originated from a prokaryotic ancestor by horizontal gene transfer because no similar eukaryotic sequence was found. Due to its AhpC_TSA domain, the function of AhpC_b may be related to antioxidation.
SymSEP. We found a selenoprotein family that was present in the genus Symbiodinium only in the Sec-containing form. We named it SymSEP. Four SymSEP selenoproteins were found among the genomic sequences and EST contigs of 2 species, Symbiodinium minutum and Symbiodinium sp. C3. The SECIS elements were detected and are shown in Figure S4 in Supplementary file 1 (in unpublished data, we also found a SymSEP sequence in Symbiodinium microadriaticum).
A phylogenetic tree and multiple sequence alignment of SymSEP-homologous proteins are shown in Figure 2C. The figure includes all proteins similar to SymSEP found in the 137 algal datasets, together with similar proteins detected in the NR database. As shown, the Sec-containing form of the protein is present only in the genus Symbiodinium. Cys-containing homologs, which contain CxxC motifs, are widely distributed in a variety of eukaryotic algae and bacteria. In addition, there are two branches that contain neither Sec nor CxxC motifs. Based on the phylogenetic tree, we speculate that SymSEP originated from prokaryotes as a Cys-containing protein and became a Sec-containing protein only in Symbiodinium after its divergence. A Trx-like domain was also detected in its coding region, suggesting that the function of SymSEP is related to redox regulation.
Substitution of Sec
Sec is within the functional core site of the selenoprotein, and its codon is the termination codon TGA. Mutations in the codon result in the conversion of Sec into other amino acids, such as Cys (TGC, TGT) and Trp (TGG). Compared to that of Sec, their codon is only different at the third base. Among the various amino acids, the properties of Cys and Sec are the most similar, and most of the selenoproteins have homologous proteins in which Sec is substituted by Cys. The substitution of Sec by Cys is an important event in the evolution of selenoproteins.
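The relationship between the Sec codon and the Cys and Trp codons can be checked by enumerating the single-nucleotide neighbors of TGA. The Python sketch below uses a small excerpt of the standard genetic code that covers only the nine codons one substitution away from TGA; the function name is hypothetical, and TGA is labeled as a stop codon that is recoded to Sec in selenoprotein genes.

```python
# Excerpt of the standard genetic code: TGA and its nine single-base neighbors.
CODON_TABLE = {
    "TGA": "Stop/Sec",
    "AGA": "Arg", "CGA": "Arg", "GGA": "Gly",   # first-position changes
    "TAA": "Stop", "TCA": "Ser", "TTA": "Leu",  # second-position changes
    "TGC": "Cys", "TGG": "Trp", "TGT": "Cys",   # third-position changes
}

def single_base_neighbors(codon):
    """Map each one-substitution mutant codon to (changed position, amino acid)."""
    neighbors = {}
    for pos in range(3):
        for base in "ACGT":
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = (pos + 1, CODON_TABLE[mutant])
    return neighbors

neighbors = single_base_neighbors("TGA")

# Keep only the mutants that differ from TGA at the third base.
third_base = {c: aa for c, (pos, aa) in neighbors.items() if pos == 3}
print(third_base)
```

All three third-base substitutions of TGA encode either Cys or Trp, which is why a single point mutation is sufficient to convert a Sec codon into a Cys codon.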
As the correct translation of Sec-TGA requires a complex synthesis system, including the SECIS structure located downstream of the coding region, the change from Cys to Sec is theoretically more difficult than that from Sec to Cys. Traces left by the Sec-to-Cys transformation, in the form of a SECIS structure downstream of a Cys-containing gene, were previously reported. We also found a SECIS in a Cys-containing PRX from S. minutum (see Figure S5ABC in Supplementary file 1). More interestingly, we found a pair of GRX genes in Fragilariopsis cylindrus. Their sequences are highly similar (positives > 80%), but one is Sec-containing, whereas the other is Cys-containing. Analysis of these two GRXs revealed a typical Sec-Cys substitution event. Most algae contain Cys-containing GRX, and Sec-containing GRX is found only in several selenoprotein-rich species from the SAR group and haptophytes. No Sec-containing GRX could be found in the Plantae group. Phylogenetic analysis of algal GRX revealed that the Sec-containing proteins clustered within a subtree, which is partly shown in Figure S6A. It can be inferred from the phylogenetic tree that most of the Sec-containing GRXs have a common ancestor (except 001, 002, and 006). However, this subtree also contains a few Cys-containing homologous genes, which may have undergone Sec-to-Cys changes. The Cys-containing GRX and Sec-containing GRX of Fragilariopsis cylindrus highlighted in Figure S6A share a common parental node; in other words, they diverged only recently. More interestingly, the flanking genomic sequences of the two GRXs are homologous (see Figure S5D). Therefore, we hypothesize that these two GRXs are derived from the same Sec-containing ancestral gene through a genomic duplication event in this species or its ancestors: the original single GRX gene was duplicated into two copies, and in one of the copies, Sec was converted into Cys by mutation. This is the first discovery of a genomic duplication event associated with Sec-Cys substitution.
As discussed above, the specific TGA decoding mechanism and the complex synthesis system of selenoproteins make it very difficult, in evolutionary terms, for Cys to change into a functional and heritable Sec. However, the change can be achieved under specific conditions: when the Cys-to-Sec mutation occurs in a species with a functional selenoprotein synthesis system, and in a coding region upstream of a functional SECIS sequence, the mutation produces a decodable TGA-Sec codon. If the protein with the Cys-to-Sec change retains complete or partial function and allows the species to survive and reproduce, it will be retained as a functional gene. Such events have been previously reported in several selenoproteins, especially those containing multiple Sec residues, such as SELENOP and several SELENOW proteins. In this study, we found several new examples of Cys-to-Sec events. We previously found a SELENOW protein with 2 Sec residues in a UxxU motif in amphioxus, whereas other SELENOW proteins have only one Sec, in a CxxU motif. Interestingly, another UxxU-type SELENOW was found in this work (from Ostreococcus lucimarinus). The multiple sequence alignment of these SELENOW sequences is shown in Figure S7 in Supplementary file 1. Another example of a Cys-to-Sec mutation was found in the SELENOJ family. SELENOJ was first discovered in vertebrates and was thought to exist only in multicellular animals[37]. Interestingly, multiple SELENOJ selenoproteins and Cys-containing homologs were detected in algae, including one sequence containing 2 Sec residues from Alexandrium tamarense (Figure S6B). In this 2-Sec-containing SELENOJ protein, the first Sec is also present in several algae and animals, whereas the second Sec was found only in the EST sequences of A. tamarense. Therefore, this second Sec is potential evidence of a Cys-to-Sec evolutionary event, which could lead to a novel selenium-related function due to the new position of Sec.
In addition to Cys-homologs, we searched for non-Cys-containing homologs of the 42 selenoprotein families in the 137 algal datasets and the NR database. In these Other-homolog protein sequences, the local region corresponding to the Sec motif has changed into other motifs. SELENOF is one of the earliest discovered animal selenoproteins[38]. It is mainly found in the Sec-containing form in multicellular animals and exists in the form of Cys-homologs in only a few invertebrates (Arthropoda, Ecdysozoa, etc.)[39-41]. SELENOF is also widely distributed in algae, and the Sec-containing algal SELENOF protein contains the same CxU motif as the animal SELENOF protein. Interestingly, no Cys-homolog of SELENOF was found in algae. Instead, Other-homologs with different motifs were found in various algae: the CxU motif is replaced by CMR in terrestrial plants and certain algae and by DQW in some green algae (Figure S6C). In addition, the Sec motif has undergone significant changes in some SELENOF proteins, resulting in the loss of local conservation, as in SELENOF of Micromonas commoda. Despite the loss of the Sec-containing motif, these Other-homologs are still preserved in the genomes of algae from many different evolutionary lineages, suggesting that SELENOF has additional functions unrelated to Se. Figure 3 shows, for the 42 algal selenoprotein families, the distribution of Sec-containing proteins, Cys-homologs, and Other-homologs across the evolutionary lineages of eukaryotic algae (including terrestrial plants). In the GPX, GRX, GST, MDP, PDI, and other families, the core Sec motif has also been replaced by a non-Sec motif. The figure also shows the distribution of selenoprotein homologs in terrestrial plants. Although terrestrial plants have no Sec-containing protein, they contain homologs of most of the selenoproteins of unicellular algae. Lineages closely related to terrestrial plants, such as Charophyceae (Nitella hyalina) and Coleochaetophyceae (Chaetosphaeridium globosum), have a greater number of selenoproteins, suggesting that the loss of selenoproteins in terrestrial plants may have occurred in later geological ages.
Selenoprotein gene clusters and fusion genes
Genetic recombination, transposition, or whole-genome duplication can change the genomic location of a DNA fragment. These events may lead to the clustering or fusion of genes. Previously, we reported clusters of selenoprotein genes in several invertebrate genomes, which might suggest a functional correlation between them[42-45]. Here, selenoprotein clusters were also observed in algae. Figure 4A shows the types and occurrence of clusters in different algae. The gene structures and positions of these clusters are shown in Figure S8 in Supplementary file 1. As Figure 4A shows, clustering of selenoprotein genes was found in only 13 species, most frequently in E. huxleyi. The selenoprotein families most frequently found in clusters were MSRA and SELENOU. The SELENOF-PDI_a pair is the only cross-species cluster we detected, which suggests that the function of SELENOF is correlated with that of PDI in Mamiellales. Moreover, genome synteny flanking these SELENOF-PDI_a pairs was also detected in Mamiellales algae (Figure 4B). Not all Mamiellales selenoprotein gene clusters have such a cross-species distribution; AhpC_a-PDI_a, GST-DsbA, and Rhod-MSRA are each found only in specific Mamiellales genomes. Considering this genomic collinearity, we speculate that the genomic fragment in which SELENOF-PDI_a is located may be functionally or structurally conserved in microalgae. In Micromonas commoda, this genomic-level conservation was retained even though the Sec motif was lost. In addition, three clusters were composed of genes from the same selenoprotein family: two SELENOW genes in Chlamydomonas reinhardtii, two SELENOU genes in Emiliania huxleyi, and three PRX genes in Symbiodinium minutum. The adjacency of these genes in the genome indicates that they potentially originated from the duplication and divergence of the same ancestral gene.
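One simple way to flag candidate gene clusters from annotated coordinates is to group selenoprotein genes that lie close together on the same scaffold. The Python sketch below is an illustration only: the text does not state the clustering criterion used for Figure 4A, so the 10 kb maximum gap, the `Gene` record, the function name, and the toy coordinates are all assumptions.

```python
from itertools import groupby
from typing import NamedTuple

class Gene(NamedTuple):
    family: str
    scaffold: str
    start: int
    end: int

def find_clusters(genes, max_gap=10_000):
    """Group genes on the same scaffold whose neighbors lie within max_gap bp.

    max_gap is an assumed threshold, not a value taken from the study.
    Only groups of two or more genes are reported as clusters.
    """
    clusters = []
    ordered = sorted(genes, key=lambda g: (g.scaffold, g.start))
    for _, scaffold_genes in groupby(ordered, key=lambda g: g.scaffold):
        current = []
        for gene in scaffold_genes:
            # Close the running group when the next gene is too far away.
            if current and gene.start - current[-1].end > max_gap:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            current.append(gene)
        if len(current) > 1:
            clusters.append(current)
    return clusters

# Hypothetical coordinates for illustration.
genes = [
    Gene("SELENOF", "chr1", 1_000, 3_000),
    Gene("PDI_a", "chr1", 4_500, 7_000),
    Gene("GPX", "chr1", 500_000, 502_000),
    Gene("SELENOW", "chr2", 100, 900),
]

cluster_families = [[g.family for g in c] for c in find_clusters(genes)]
print(cluster_families)
```

Comparing the family composition of the clusters found in each genome would then show which pairs, such as SELENOF-PDI_a, recur across species and which are specific to a single genome.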
Recombination or transposition events that occur within the coding region of a gene may result in the truncation or fusion of genes. We scanned all algal selenoproteins for conserved domains. Figure 4C shows that a total of 36 domains were detected in 29 algal selenoprotein families; domain alignment diagrams for all selenoprotein families are provided on the family pages of the Selenoprotein Database. The most frequently detected domain in algal selenoproteins was the Trx-like domain, which was present in approximately half (20) of the algal selenoprotein families. Many of these families, such as AhpC, PRX, PDI, DsbA, GPX, GRX, and GST, are functionally related to the thiol/disulfide redox system. Other Trx-like-containing families, such as DIO, SELENOF, SELENOM, SELENOH, SELENOT, SELENOW, SELENOU, SELENOL and TlpA, also have oxidoreduction-related functions. In several selenoproteins, such as PITH, rhodanese, MSRA, and MSRB, no Trx-like domain could be detected; however, some of them have been reported to be functionally related to the oxidation-reduction of sulfur. The PITH selenoprotein contains the proteasome-interacting thioredoxin domain. The rhodanese-like selenoprotein is likely to be a sulfur transferase involved in cyanide detoxification. MSRA and MSRB are widely present in animals and are related to the reduction of methionine sulfoxide[46, 47]. Other important functions are also associated with algal selenoproteins. The hemerythrin metal-binding domain is found in the algal TlpA selenoprotein, which suggests an oxygen-binding function[48]. The domains found in the FeS-oxidoreductase and reductase selenoproteins indicate catalytic activity related to iron-sulfur cluster binding[49]. The methylated-DNA-[protein]-cysteine methyltransferase selenoprotein (MDP) is related to DNA repair[50-52].
As shown in Figures 4C and 4D, novel domain fusions were detected for several selenoprotein families in certain algae, including a SELENOM protein fused with the pVHL domain (von Hippel-Lindau disease tumor suppressor beta domain; Aureococcus anophagefferens), another SELENOM protein fused with the ShKT peptide toxin domain (Emiliania huxleyi), and a fusion protein of two selenoproteins (E. huxleyi). Their coding regions were found in both genomic and EST sequences. The multiple sequence alignments are shown on the family pages of the Selenoprotein Database. As pVHL was previously reported to be the substrate recognition component of an E3 ubiquitin ligase complex[53], it is possible that the SELENOM-pVHL fusion has a function related to tumor suppression. Moreover, considering that the ShKT domain is often found in sea anemone toxins, which are potent inhibitors of K(+) or other ion channels, the SELENOM fusion in Emiliania huxleyi may be related to the toxicity of algal blooms[54].
The fusion of two selenoprotein genes, GST (glutathione S-transferase) and MSRA (methionine sulfoxide reductase A), was found in E. huxleyi. The fusion gene is composed of 4 exons, a structure that is also supported by the EST sequences (Figure 4D). Multiple sequence alignment of this fusion protein with other selenoproteins shows its homology to them (shown on the family pages of the Selenoprotein Database). This is the first study to identify a fusion event involving two selenoprotein genes. GST participates in the detoxification of reactive electrophilic compounds by catalyzing their conjugation to glutathione. MSRA reverses the inactivation of many proteins caused by the oxidation of critical methionine residues by reducing methionine sulfoxide (MetO) to methionine. GST and MSRA are both considered detoxification enzymes because of their antioxidant function. It has been reported that GST and MSRA are coinduced under chemical stress conditions in bacteria[55, 56], suggesting that their functions and biological processes are correlated. The protein fusion in Emiliania huxleyi may strengthen the association between these two related genes. Further efforts are needed to explore the biological pathways involving these two enzymes.