First, I predicted low complexity regions in all protein sequences of the 121 investigated species (supplementary table 1) and determined the fraction of proteins with low complexity regions. This yielded one value for each species and protein type (Fig. 1). I then compared this fraction between cytoplasmic proteins, secreted non-effector proteins, and effector proteins. The fraction of protein sequences with low complexity regions was lowest in secreted non-effector proteins in 119 species (Fig. 1). The exceptions to this general trend were the two obligate biotrophic species Blumeria graminis f.sp. tritici (short name ‘Blugrt’) and Erysiphe necator (short name ‘Erynec’), both belonging to the Ascomycota. In 10,000 random permutations, this pattern occurred in only 29 to 71 species (see Methods). Moreover, in 118 species the fraction of proteins with low complexity regions was higher in cytoplasmic proteins than in secreted proteins (effectors and non-effectors) (Fig. 1). The three exceptions to this general trend were Taphrina deformans (a facultative biotrophic pathogen belonging to the Ascomycota; abbreviation ‘Tapdef’) as well as the two Basidiomycete symbionts Tulasnella calospora (abbreviation ‘Tulcal’) and Piriformospora indica (abbreviation ‘Pirind’). This trend occurred in only 14 to 52 species in 10,000 random permutations. In conclusion, these observations indicate that the fraction of proteins with low complexity regions does not evolve by chance. However, evolutionary mechanisms that could explain systematic differences in the presence of low complexity regions between cytoplasmic and secreted proteins remain to be identified. For example, the high fraction of cytoplasmic proteins with low complexity regions could suggest that low complexity regions are functionally important; hence, their presence could be advantageous and selected. 
Alternatively, the occurrence of low complexity regions could be neutral in cytoplasmic proteins, and therefore low complexity regions accumulate in cytoplasmic proteins. Likewise, it remains to be elucidated if the presence of low complexity regions in secreted proteins is generally disadvantageous, which would then explain the low fraction of secreted proteins with low complexity regions. In particular, the evolutionary and molecular mechanisms that underlie differences between secreted non-effector proteins and effector proteins remain to be elucidated. In summary, the fraction of proteins with low complexity regions differed between cytoplasmic and secreted proteins, but the observed trend was largely consistent between different lifestyles and phyla (Fig. 1A to Fig. 1F).
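The permutation comparison described above can be illustrated with a minimal Python sketch. It is not the code used for the analysis (see Methods for the actual procedure). The sketch assumes that protein-type labels are shuffled within each species and that the number of species in which secreted non-effectors show the lowest fraction is then recounted. All function names, the helper `make_species`, and the toy data are hypothetical.

```python
import random
from collections import defaultdict

TYPES = ("cytoplasmic", "secreted_non_effector", "effector")


def fraction_with_lcr(proteins):
    """Fraction of proteins with at least one LCR, per protein type.

    `proteins` is a list of (protein_type, has_lcr) pairs for one species.
    """
    total = defaultdict(int)
    with_lcr = defaultdict(int)
    for ptype, has_lcr in proteins:
        total[ptype] += 1
        with_lcr[ptype] += int(has_lcr)
    return {ptype: with_lcr[ptype] / total[ptype] for ptype in total}


def count_species_with_lowest(data, target="secreted_non_effector"):
    """Count species in which `target` has the strictly lowest fraction."""
    count = 0
    for proteins in data.values():
        frac = fraction_with_lcr(proteins)
        if any(t not in frac for t in TYPES):
            continue  # skip species that lack one of the protein types
        if all(frac[target] < frac[t] for t in TYPES if t != target):
            count += 1
    return count


def permuted_count_range(data, n_perm=1000, seed=0):
    """Shuffle protein-type labels within each species and recount.

    Returns the minimum and maximum species count over all permutations.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        shuffled = {}
        for species, proteins in data.items():
            labels = [ptype for ptype, _ in proteins]
            rng.shuffle(labels)
            shuffled[species] = [
                (label, has_lcr) for label, (_, has_lcr) in zip(labels, proteins)
            ]
        counts.append(count_species_with_lowest(shuffled))
    return min(counts), max(counts)


def make_species(n_with, n_without):
    """Toy helper: build one species from {type: count} dictionaries."""
    proteins = []
    for ptype in TYPES:
        proteins += [(ptype, True)] * n_with[ptype]
        proteins += [(ptype, False)] * n_without[ptype]
    return proteins


# Hypothetical toy data: two species, ten proteins per type.
toy = {
    "species_A": make_species(
        {"cytoplasmic": 6, "secreted_non_effector": 1, "effector": 3},
        {"cytoplasmic": 4, "secreted_non_effector": 9, "effector": 7},
    ),
    "species_B": make_species(
        {"cytoplasmic": 5, "secreted_non_effector": 4, "effector": 2},
        {"cytoplasmic": 5, "secreted_non_effector": 6, "effector": 8},
    ),
}
observed = count_species_with_lowest(toy)
low, high = permuted_count_range(toy, n_perm=200, seed=1)
```

In this sketch, an observed count that lies outside the range `low` to `high` would be read as a pattern that is not expected from random label assignment, mirroring the comparison of 119 observed species with 29 to 71 species in the permutations.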
Next, I calculated the fraction of each protein sequence that spans a low complexity region, thereby providing one value for each protein in each species (supplementary table 2). I found that the median of these fractions was highest in effector proteins in 95 species (Fig. 2). In 10,000 random permutations, this number ranged from 21 to 58 species, again indicating that this pattern does not evolve neutrally. Moreover, in all investigated protein categories the observed median fraction of protein sequences spanning a low complexity region lay outside the range obtained from random permutations, indicating that it does not evolve by chance (Table 1). Intriguingly, I found the highest median value in the group of effector proteins (Table 1). Together with the analysis of the fraction of proteins with low complexity regions, this finding indicates that low complexity regions are less common in effector proteins than in cytoplasmic proteins, but on average longer when they occur.
Table 1
Minimum and maximum median fractions of protein sequences that span a low complexity region as obtained from 10,000 permutations
protein category | observed median value | minimum median value¹ | maximum median value¹ | observed median value by chance² |
effector | 0.05367232 | 0.03999907 | 0.04338914 | no |
secreted non-effector | 0.03812317 | 0.03915904 | 0.04451039 | no |
cytoplasmic | 0.04198473 | 0.04112554 | 0.04178556 | no |
1) Reported are the minimum and maximum median values that are obtained from 10,000 random permutations
2) An observed value is considered to evolve by chance if it lies between the minimum and maximum median values obtained from 10,000 random permutations
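The quantities in Table 1 can be illustrated with a short Python sketch. It shows how the fraction of a sequence covered by low complexity regions could be computed, how a range of permuted medians could be obtained, and how the criterion from footnote 2 is applied. This is a sketch under assumptions: region coordinates are taken to be 1-based and inclusive, permutations are taken to shuffle protein-type labels across proteins, and all function names are hypothetical.

```python
import random
from statistics import median


def lcr_fraction(length, regions):
    """Fraction of a sequence covered by low complexity regions.

    `regions` holds 1-based inclusive (start, end) pairs; overlapping
    regions are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(regions):
        start = max(start, last_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / length


def permuted_median_range(values, labels, category, n_perm=1000, seed=0):
    """Minimum and maximum median of `category` after shuffling labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    medians = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        medians.append(
            median(v for v, lab in zip(values, shuffled) if lab == category)
        )
    return min(medians), max(medians)


def is_by_chance(observed, minimum, maximum):
    """Footnote 2 of Table 1: inside the permuted range means 'by chance'."""
    return minimum <= observed <= maximum


# Values taken from the effector and cytoplasmic rows of Table 1.
effector_by_chance = is_by_chance(0.05367232, 0.03999907, 0.04338914)
cytoplasmic_by_chance = is_by_chance(0.04198473, 0.04112554, 0.04178556)
```

Both calls return `False`, matching the "no" entries in the last column of Table 1.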
Previous studies reported that low complexity regions differ in their amino acid composition, that is, certain amino acids were found to be overrepresented in low complexity regions [39–41]. My analysis revealed over-representation of certain amino acids as well; however, no lifestyle-specific or phylum-specific enrichments could be identified (supplementary table 3).
To further investigate a putative role of low complexity regions in the emergence of novel effector alleles, I inferred homologous relationships between all 73,484 effector sequences (supplementary table 2). Because low complexity regions can complicate the search for homology [42], I performed two independent analyses: one with the natural effector protein sequences and one with sequences in which I replaced low complexity regions with ‘X’, the symbol for an unknown amino acid. For both analyses, I reconstructed families of homologous sequences with OrthoFinder [43] (supplementary table 2, supplementary table 4, and supplementary table 5). Next, I aimed to identify all families of homologous effector proteins that contain at least one member from each species. Such families likely represent ancestral sequences, as they are conserved in all species; however, no family of homologous proteins contained members from all species (Table 2). Therefore, I used the families of homologous proteins that covered the largest number of species as a proxy for truly ancestral sequences (Table 2). As a complementary approach, I identified all groups of homologous proteins that contain effector proteins from only one species. Since these sequences are species-specific, they likely emerged only recently. I found that low complexity regions span a higher fraction of the protein sequence in recent proteins than in ancestral proteins (Fig. 3A; P-value < 2.2 × 10⁻¹⁶, Wilcoxon Rank-Sum test). I obtained similar results when I used natural protein sequences (that is, sequences in which low complexity regions are not masked) to infer homologous relationships (supplementary Fig. 1A; P-value < 2.2 × 10⁻¹⁶, Wilcoxon Rank-Sum test). Information about families of homologous effector proteins based on natural and masked protein sequences is summarized in supplementary table 2, supplementary table 4, and supplementary table 5. 
This finding is in line with a previous study showing that the fraction of a protein sequence that spans a low complexity region is higher in younger protein sequences when comparing mammalian proteins with other vertebrate and non-vertebrate sequences [41], suggesting that this observation reflects a general trend in eukaryotic proteins.
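The masking step used before the homology search can be sketched in Python as follows. This is a minimal illustration, not the original implementation: it assumes that low complexity regions are given as 1-based inclusive coordinates, and the function name `mask_low_complexity` is hypothetical.

```python
def mask_low_complexity(sequence, regions, mask_char="X"):
    """Replace every residue inside a low complexity region with `mask_char`.

    `regions` holds 1-based inclusive (start, end) pairs. The sequence
    length is preserved, so coordinates stay valid after masking.
    """
    residues = list(sequence)
    for start, end in regions:
        # Convert to 0-based indices and clip to the sequence boundaries.
        for i in range(max(start, 1) - 1, min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


# Hypothetical toy sequence with a serine run at positions 4 to 9.
masked = mask_low_complexity("MKTSSSSSSAL", [(4, 9)])
```

Here `masked` is `"MKTXXXXXXAL"`: the serine run is hidden from the homology search while the flanking residues are kept.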
Table 2
Groups of homologous proteins and number of their members for ancestral and species-specific proteins as identified by OrthoFinder with native and masked protein sequences
| ancestral proteins | species-specific proteins |
native effector protein sequences | one group (OG0000001) with 870 proteins conserved in 119 species | 9026 groups with 9369 species-specific proteins |
masked effector protein sequences | one group (OG0000001) with 864 proteins conserved in 119 species | 19645 groups with 19645 species-specific proteins |
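The selection of ancestral and species-specific groups summarized in Table 2 can be illustrated with a small Python sketch. It assumes that the groups of homologous proteins are available as a mapping from group identifier to (species, protein) pairs. This data layout, the function names, and the toy identifiers are hypothetical and do not reflect the OrthoFinder output format.

```python
def species_coverage(orthogroups):
    """Map each group identifier to the set of species among its members."""
    return {
        group: {species for species, _ in members}
        for group, members in orthogroups.items()
    }


def widest_groups(orthogroups):
    """Groups covering the largest number of species (proxy for ancestral)."""
    coverage = species_coverage(orthogroups)
    best = max(len(species) for species in coverage.values())
    return sorted(g for g, species in coverage.items() if len(species) == best)


def species_specific_groups(orthogroups):
    """Groups whose members all come from a single species."""
    coverage = species_coverage(orthogroups)
    return sorted(g for g, species in coverage.items() if len(species) == 1)


# Hypothetical toy input with three species.
toy_groups = {
    "OG1": [("spA", "p1"), ("spB", "p2"), ("spC", "p3")],
    "OG2": [("spA", "p4"), ("spA", "p5")],
    "OG3": [("spB", "p6")],
    "OG4": [("spA", "p7"), ("spB", "p8")],
}
ancestral = widest_groups(toy_groups)
recent = species_specific_groups(toy_groups)
```

In the toy input, `OG1` covers all three species and is selected as the ancestral proxy, whereas `OG2` and `OG3` are species-specific. Single-member groups such as `OG3` count as species-specific in this sketch.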
To identify lifestyle-specific differences in the fraction of a protein sequence that spans a low complexity region, I identified all families of homologous effector proteins that contain at least one member in a species from each of the six lifestyles. This approach highlighted 156 families (with 10 to 998 members) when I used the results obtained from OrthoFinder with masked sequences as input (supplementary table 2 and supplementary table 4). Next, for each family I calculated the average fraction of a protein sequence that spans a low complexity region across all effector proteins belonging to one lifestyle, yielding one value per family and lifestyle (supplementary table 6). I then used this data set as input for a principal component analysis and found that the first and second principal components explain about 69.1% of the variance in the observed data (Fig. 3B). Interestingly, data from proteins of obligate biotrophic and necrotrophic fungi showed the largest difference in the principal components. Moreover, wood degrading and hemibiotrophic fungi showed similar results to necrotrophic fungi, although their contribution to the principal components was smaller (Fig. 3B). Furthermore, data from symbiotic fungi were similar to those of obligate biotrophs, which might reflect their strong dependence on host plants for survival. I obtained similar results when I used natural protein sequences for the detection of homology (supplementary table 7 and supplementary Fig. 1B). To gain more fine-grained insights into the contribution of fungal lifestyles to the fraction of protein sequences that span a low complexity region, phylogenetic information needs to be taken into account [44]. However, obtaining accurate alignments and phylogenetic trees is challenging in this data set, because the effector protein sequences used here represent hundreds of millions of years of evolution [45].
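The table of one value per family and lifestyle that served as input for the principal component analysis can be sketched in Python as follows. This is a sketch under assumptions: the record layout (family, lifestyle, fraction), the lifestyle labels, and the function name are hypothetical. The principal component analysis itself is not shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labels for the six lifestyles.
LIFESTYLES = (
    "obligate_biotroph",
    "facultative_biotroph",
    "hemibiotroph",
    "necrotroph",
    "symbiont",
    "wood_degrader",
)


def family_lifestyle_matrix(records, lifestyles=LIFESTYLES):
    """Average the LCR fraction per family and lifestyle.

    `records` is an iterable of (family, lifestyle, fraction) triples, one
    per effector protein. Only families with at least one member in every
    lifestyle are kept. Returns {family: [mean per lifestyle, ...]} with
    values ordered as in `lifestyles`.
    """
    values = defaultdict(lambda: defaultdict(list))
    for family, lifestyle, fraction in records:
        values[family][lifestyle].append(fraction)
    matrix = {}
    for family, by_lifestyle in values.items():
        if all(ls in by_lifestyle for ls in lifestyles):
            matrix[family] = [mean(by_lifestyle[ls]) for ls in lifestyles]
    return matrix


# Toy example with two lifestyles only, to keep the input short.
toy_records = [
    ("F1", "necrotroph", 0.25),
    ("F1", "necrotroph", 0.75),
    ("F1", "symbiont", 0.125),
    ("F2", "necrotroph", 0.5),  # F2 lacks a symbiont member and is dropped
]
toy_matrix = family_lifestyle_matrix(
    toy_records, lifestyles=("necrotroph", "symbiont")
)
```

Each row of the resulting matrix corresponds to one family and each column to one lifestyle. In the toy example only `F1` is retained, with the values 0.5 and 0.125.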
To investigate potential layered effects between protein type, lifestyle, and phylogenetic relationships (phylum), I fitted a generalized linear model to the data of all proteins, regardless of their type (supplementary table 2). Specifically, I used the formula “fraction of protein sequences that span a low complexity region” ~ protein type * lifestyle * phylum. I then ranked the models with different fixed-term effects according to the Bayesian information criterion and found that all three parameters together best explain the observed data (Table 3). In summary, the fraction of low complexity regions in a protein sequence is higher in younger protein sequences, indicating that low complexity regions contribute to the formation of novel alleles. Moreover, the results obtained from a principal component analysis and a generalized linear model suggest that lifestyle contributes to the evolution of the fraction of effector protein sequences that span low complexity regions.
Table 3
Bayesian information criteria of generalized linear models that fit the fractions of protein sequences spanning low complexity regions based on different strata
intercept | ls¹ | phy² | pt³ | ls:phy | ls:pt | phy:pt | ls:phy:pt | df⁴ | log-likelihood | BIC⁵ | delta | weight |
0.021787527 | + | + | + | + | + | | | 25 | -2390553.51 | 4780751.949 | 0 | 0.993225959 |
0.021544955 | + | + | + | + | | | | 15 | -2390477.508 | 4780741.974 | 9.975720634 | 0.006774041 |
0.021784575 | + | + | + | + | + | + | | 29 | -2390557.614 | 4780703.345 | 48.60455593 | 2.77E-11 |
0.021538011 | + | + | + | + | | + | | 19 | -2390479.323 | 4780688.793 | 63.15680041 | 1.92E-14 |
0.021687621 | + | + | + | + | + | + | + | 39 | -2390614.684 | 4780675.456 | 76.49366543 | 2.44E-17 |
0.021166705 | + | + | | + | | | | 13 | -2389468.048 | 4778751.459 | 2000.490346 | 0 |
0.025110773 | + | + | + | | + | | | 21 | -2389472.866 | 4778647.473 | 2104.47681 | 0 |
0.024853619 | + | + | + | | | | | 11 | -2389392.821 | 4778629.411 | 2122.538928 | 0 |
0.025113029 | + | + | + | | + | + | | 25 | -2389477.269 | 4778599.466 | 2152.48305 | 0 |
0.024848407 | + | + | + | | | + | | 15 | -2389394.459 | 4778575.875 | 2176.074258 | 0 |
0.025809602 | + | | + | | + | | | 19 | -2389040.414 | 4777810.974 | 2940.975137 | 0 |
0.025550158 | + | | + | | | | | 9 | -2388957.341 | 4777786.856 | 2965.093633 | 0 |
0.024464376 | + | + | | | | | | 9 | -2388392.656 | 4776657.485 | 4094.464077 | 0 |
0.0204476 | | + | + | | | | | 6 | -2388228.708 | 4776372.198 | 4379.751247 | 0 |
0.020441439 | | + | + | | | + | | 10 | -2388230.374 | 4776318.719 | 4433.230397 | 0 |
0.025188246 | + | | | | | | | 7 | -2387955.191 | 4775810.962 | 4940.987682 | 0 |
0.020632524 | | | + | | | | | 4 | -2387877.125 | 4775697.439 | 5054.509974 | 0 |
0.020037031 | | + | | | | | | 4 | -2387216.409 | 4774376.006 | 6375.943801 | 0 |
0.020261673 | | | | | | | | 2 | -2386870.007 | 4773711.608 | 7040.341502 | 0 |
1) ls, lifestyle
2) phy, phylum
3) pt, protein type
4) df, degrees of freedom
5) BIC, value of the Bayesian Information Criterion
A previous study indicated that low complexity regions could play a position-dependent role: proteins in which low complexity regions tended to localize towards the termini had a larger number of interaction partners [46]. To investigate whether low complexity regions show different localization patterns in my set of fungal proteins, I determined the relative position of low complexity regions in all types of proteins, that is, cytoplasmic proteins, secreted non-effector proteins, and effector proteins. Figure 4 shows the result for each low complexity region in each protein and species. In 115 species, the median relative position of low complexity regions in cytoplasmic proteins was closer to the N-terminus than the median relative position of low complexity regions in secreted proteins (effectors and non-effectors). The exceptions to this general finding were Taphrina deformans (an Ascomycete facultative biotroph, short ‘Tapdef’), Ustilago maydis (a Basidiomycete facultative biotroph, short ‘Ustmay’), Zymoseptoria tritici (an Ascomycete hemibiotroph, short ‘Zymtri’), Rhizoctonia solani (a Basidiomycete necrotroph, short ‘Rhisol’), Tuber aestivum var. urcinatum (an Ascomycete symbiont, short ‘Tubaes’), and Wolfiporia cocos (a Basidiomycete wood degrading fungus, short ‘Wolcoc’). This suggests that the position of low complexity regions generally evolves differently in cytoplasmic and secreted proteins. This conclusion is corroborated by results from 10,000 random permutations, in which low complexity regions of cytoplasmic proteins were located closest to the N-terminus in only 14 to 51 species. Following the results reported by Coletta and colleagues [46], this would indicate that cytoplasmic proteins with low complexity regions have more interaction partners than secreted proteins with low complexity regions. 
In 52 species, the median relative position of low complexity regions in secreted non-effectors was closer to the N-terminus than in effectors, and in 69 species, the opposite trend was observed. This is consistent with randomized samples, in which low complexity regions were located closer to the N-terminus in secreted non-effectors than in effectors in 39 to 83 species, suggesting that the relative localization of low complexity regions is similar between different types of secreted proteins (effectors and non-effectors). To further investigate whether the observed median values of relative positions evolved by chance, I randomly assigned each protein to one protein type (cytoplasmic, secreted non-effector, or effector). I found that the median relative position in the different protein types does not evolve by chance (Table 4).
Table 4
Minimum and maximum median relative positions of low complexity regions as obtained from 10,000 permutations
protein category | observed median value | minimum median value¹ | maximum median value¹ | observed median value by chance² |
effector | 0.6216692 | 0.4660705 | 0.4987578 | no |
secreted non-effector | 0.6446508 | 0.03915904 | 0.04451039 | no |
cytoplasmic | 0.4762675 | 0.02892562 | 0.07200726 | no |
1) Reported are the minimum and maximum median values that are obtained from 10,000 random permutations
2) An observed value is considered to evolve by chance if it lies between the minimum and maximum median values obtained from 10,000 random permutations
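The comparison of relative positions can be illustrated with a minimal Python sketch. The exact definition of the relative position is given in the Methods. The sketch assumes that it is the midpoint of a low complexity region divided by the protein length, so that smaller values are closer to the N-terminus. It further assumes that effectors and secreted non-effectors are pooled for the comparison with cytoplasmic proteins. The function names and toy coordinates are hypothetical.

```python
from statistics import median


def relative_position(start, end, length):
    """Relative position of an LCR within a protein.

    Assumption: midpoint of the 1-based inclusive region divided by the
    protein length, so values near 0 are N-terminal and values near 1
    are C-terminal.
    """
    return ((start + end) / 2) / length


def cytoplasmic_closer_to_n_terminus(lcrs):
    """Compare median LCR positions for one species.

    `lcrs` is an iterable of (protein_type, start, end, length) tuples.
    Every type other than "cytoplasmic" is treated as secreted.
    """
    cytoplasmic, secreted = [], []
    for ptype, start, end, length in lcrs:
        position = relative_position(start, end, length)
        if ptype == "cytoplasmic":
            cytoplasmic.append(position)
        else:
            secreted.append(position)
    return median(cytoplasmic) < median(secreted)


# Hypothetical toy species with four low complexity regions.
toy_lcrs = [
    ("cytoplasmic", 10, 30, 200),            # relative position 0.1
    ("cytoplasmic", 90, 110, 200),           # relative position 0.5
    ("effector", 140, 160, 200),             # relative position 0.75
    ("secreted_non_effector", 60, 100, 100),  # relative position 0.8
]
closer = cytoplasmic_closer_to_n_terminus(toy_lcrs)
```

Applying such a test to every species and counting the species for which it returns `True` would give a number comparable to the 115 species reported above.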
I used the data set of homologous effector protein families described above to investigate whether the relative positions of low complexity regions differ between anciently and recently emerged protein sequences (Table 2, supplementary table 2, supplementary table 4). I observed that the relative position is closer to the N-terminus in ancient proteins (Fig. 5A; P-value = 0.01875, Wilcoxon Rank-Sum test). I observed a similar trend when I used natural protein sequences to infer homologous relationships (supplementary Fig. 2A; P-value = 0.02432, Wilcoxon Rank-Sum test). If we assume that the relative position of a low complexity region is indicative of the number of interaction partners, this result suggests that effector proteins with low complexity regions evolve a larger number of interaction partners over time.
To analyze potential lifestyle differences in relative positions of low complexity regions, I calculated a mean value of all homologous proteins belonging to one lifestyle (supplementary table 8). This yielded 46 families of homologous protein sequences with 24 to 998 members. The number of homologous effector protein families is smaller than in the analysis of the fraction of protein sequences that span a low complexity region (supplementary table 6), because proteins without low complexity regions had to be excluded: relative positions cannot be determined in such cases. A principal component analysis based on these data showed that the first two principal components explain around 72.2% of the variance in the observed data (Fig. 5B). Again, I observed a strong contribution from effector proteins of obligate biotrophic species to the observed data. Moreover, hemibiotrophic, necrotrophic, and facultative biotrophic species showed similar contributions, which may reflect that the lifestyle of those species also covers saprotrophic feeding strategies. I found similar trends when I used data based on natural effector protein sequences (supplementary table 9 and supplementary Fig. 2B). Again, I sought to detect layered effects of the categories protein type, lifestyle, and phylum, and I used a generalized linear model to highlight the contributions of these factors as described above. I found again that all three parameters together best explain the observed relative position of low complexity regions (Table 5). In summary, I conclude that the relative position of low complexity regions differs between ancestral and recent protein sequences. In addition, the results obtained from a principal component analysis and a generalized linear model suggest that lifestyle contributes to the evolution of relative positions of low complexity regions in effector proteins.
Table 5
Bayesian information criteria of generalized linear models that fit the relative position of low complexity regions based on different strata
intercept | ls¹ | phy² | pt³ | ls:phy | ls:pt | phy:pt | ls:phy:pt | df⁴ | log-likelihood | BIC⁵ | delta | weight |
0.485809 | + | + | + | + | | | | 15 | -248734.2587 | 497674.6588 | 0 | 1 |
0.4857375 | + | + | + | + | | + | | 19 | -248729.5896 | 497720.2917 | 45.63289662 | 1.23E-10 |
0.4858139 | + | + | + | + | + | | | 25 | -248725.2206 | 497794.0102 | 119.3514804 | 1.21E-26 |
0.485567 | + | + | + | + | + | + | | 29 | -248716.2173 | 497830.9745 | 156.3157266 | 1.14E-34 |
0.4854967 | + | + | + | + | + | + | + | 38 | -248704.2129 | 497930.6505 | 255.9917515 | 2.58E-56 |
0.4735661 | + | + | + | | | | | 11 | -248903.1913 | 497957.553 | 282.8942366 | 3.72E-62 |
0.4734886 | + | + | + | | | + | | 15 | -248898.7062 | 498003.5538 | 328.8950795 | 3.81E-72 |
0.4705451 | + | | + | | | | | 9 | -248941.2639 | 498006.2127 | 331.5539289 | 1.01E-72 |
0.4735189 | + | + | + | | + | | | 21 | -248893.6163 | 498075.8306 | 401.1718379 | 7.70E-88 |
0.4732644 | + | + | + | | + | + | | 25 | -248884.5653 | 498112.6996 | 438.0408635 | 7.60E-96 |
0.4704797 | + | | + | | + | | | 19 | -248931.9806 | 498125.0735 | 450.4147729 | 1.56E-98 |
0.4843521 | | + | + | | | | | 6 | -249320.1771 | 498722.8108 | 1048.15203 | 2.49E-228 |
0.484943 | | | + | | | | | 4 | -249339.2366 | 498733.4442 | 1058.785433 | 1.22E-230 |
0.4842792 | | + | + | | | + | | 10 | -249315.7992 | 498769.026 | 1094.367215 | 2.30E-238 |
0.4890257 | + | + | | + | | | | 13 | -250223.0786 | 500624.813 | 2950.154198 | 0 |
0.4766281 | + | + | | | | | | 9 | -250395.2361 | 500914.1571 | 3239.498331 | 0 |
0.4732229 | + | | | | | | | 7 | -250436.2246 | 500968.6486 | 3293.989831 | 0 |
0.4879425 | | + | | | | | | 4 | -250808.1222 | 501671.2154 | 3996.556604 | 0 |
0.4881567 | | | | | | | | 2 | -250822.7226 | 501672.9308 | 3998.272028 | 0 |
1) ls, lifestyle
2) phy, phylum
3) pt, protein type
4) df, degrees of freedom
5) BIC, value of the Bayesian Information Criterion
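The delta and weight columns of Table 5 can be illustrated with a short Python sketch. It assumes the standard definition of the Bayesian information criterion, BIC = k ln(n) - 2 ln(L), with k parameters, n observations, and maximized likelihood L. It further assumes that delta is the difference to the lowest BIC and that the weight of a model is exp(-delta/2) normalized over all compared models. The function names are hypothetical, and only the first three rows of Table 5 are used, so the weights are normalized over these three models only.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion from a maximized log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def bic_weights(bic_values):
    """Return (deltas, weights) for a list of BIC values.

    delta_i  = BIC_i - min(BIC)
    weight_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)
    """
    best = min(bic_values)
    deltas = [value - best for value in bic_values]
    relative = [math.exp(-0.5 * delta) for delta in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]


# BIC values of the three best-ranked models in Table 5.
table5_bic = [497674.6588, 497720.2917, 497794.0102]
deltas, weights = bic_weights(table5_bic)
```

Under these assumptions, the second model has a delta of about 45.63 and a weight of about 1.23 × 10⁻¹⁰, in agreement with the corresponding row of Table 5.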