Virulence evaluation of Cercospora sojina isolates
Isolates were collected from major soybean producing areas in northeast China, including 29 isolates from Heilongjiang province and three from Jilin province. Virulence testing showed that the disease index of the isolates ranged from 20.31 to 90.25, and among these the Race15 isolate had the highest virulence and the Race1 isolate had the lowest (Table 1).
Genome assembly and general characteristics
In total, 601,794 high quality reads were generated by PacBio sequencing, covering 6,038,283,778 bp in total, and having a mean length of 10,033 bp and an N50 length of 13,900 bp. The genome of the Race15 strain of C. sojina (40.12 Mb) consisted of 12 curated contigs, with an N50 length of 4.9 Mb. 12,607 coding genes were predicted, at a gene density of approximately 314 genes per Mb (Table 2 and Additional file 1: Table S1). Additionally, non-coding genes were predicted, with 200 tRNA, 2 sRNA and 13 snRNA genes being predicted in the genome of Race15 (Additional file 1: Table S2). A total of 2,140,679 bp of repetitive sequences were identified, representing 5.34% of the Race15 genome. These included DNA transposons, LTR retrotransposons, tandem repeat sequences and other unclassified transposons (Additional file 1: Table S3).
Race15 gene annotation and prediction
In order to annotate the function of the predicted genes in the Race15 genome build, seven different databases were used to annotate and predict genes in our Race15 genome build (Additional file 1: Table S4). Previous studies have shown that genes annotated using the PHI and CAZy databases, or predicted as Secretory Protein or Secondary Metabolism, are most likely associated with fungal virulence (Fig. 1D) [24, 25]. A total of 680 genes were annotated and classified using the PHI database (Additional file 1: Table S5). 292 genes were annotated as reduced virulence, and there were 252 genes annotated as unaffected pathogenicity. 72 genes were annotated as pathogenic loss and 34 genes were annotated as lethal factors. 24 genes were annotated as virulence enhanced, 5 genes were annotated as a chemical target: resistant; however, only 1 gene was annotated as effector (plant nontoxic determinants) (Fig. 1B). Successful phytopathogenic fungi can break down and utilize plant cell wall polysaccharides using CAZymes [21]. There were 340 genes annotated in the CAZy database, which could be divided into 5 categories and 1 structural domain. Some genes could be annotated as belonging to two or more classes at the same time. Among these, there were 49 genes annotated as Auxiliary Activities enzymes (AAs), 17 genes annotated as CEs, 195 genes annotated as Glycoside Hydrolases (GHs), 66 genes annotated as Glycosyl Transferases (GTs), 6 genes annotated as Polysaccharide Lyases (PLs), and 27 genes annotated as Carbohydrate-Binding Modules (CBMs) (Fig. 1C).
Pigments are another important group of secondary metabolites used for the successful invasion of pathogens. Previous studies have found that C. sojina can produce some grey pigments, that are significantly induced by both starvation and cAMP treatments, suggesting that these pigments may be related to pathogen virulence [21]. A total of 777 genes were predicted to be related to Secondary Metabolism (Fig. 1A), and among these, 16 clusters of 229 genes were predicted as Type Ⅰ polyketide synthases, 1 cluster of 3 genes was predicted as siderophore and 5 clusters of 36 genes were predicted as Terpene. There were 22 clusters of 347 genes predicted as non-ribosomal peptide synthase (NRPS). There were 2 clusters of 47 genes predicted as t1pks-nrps. In addition, there were 8 clusters of 115 genes predicted as others. Secreted proteins were predicted by Signal P and TMHMM, and the proteins containing signal peptides without obvious transmembrane structure were annotated as secreted proteins. Most phytopathogenic fungi can secrete many proteins and metabolites during the plant–fungi interaction, and these secreted proteins and metabolites play important roles at different infection stages of fungal penetration, colonization, and lesion formation [26, 27]. According to our screening results, 766 genes were predicted as Secretory Proteins.
Synteny analysis between Race15 and Race1
While the Race15 and Race1 genomes largely corresponded, there were various inversion, translocation, and Tran+Inver (translocation and inversion) events that disrupted the otherwise collinear gene order. A comparison of these two strains showed high coverage and synteny. While 93.6% of the regions in the Race15 genome showed synteny with the Race1 strain, only 91.95% of the regions in the Race1 genome showed synteny with Race15 (Additional file 1: Table S6). Although they have good coverage relative to each other, the colinearity was relatively low. Sequence comparisons between the genome assemblies of Race1 and Race15 exhibited colinearity in Contigs 1, 2, 4, 7, 10, and 11. Contigs 3, 5, and 8, however, had large segment translocations. Moreover, a comparison of translocation and inversion in Contig 9 accounted for most of the non-syntenic fragments. Comparison of Contig 6 revealed it was mainly comprised of inversion or Tran+Inver fragments (Fig. 2). Synteny analysis revealed that although the two genomes contained most of the same genes, in the process of evolution they respectively experienced a significant volume of distinct genome structure variation. This resulted in some changes in genomic structures, leading to changes in coding genes, and even changes in functional proteins, especially in non-linear areas.
Core and orphan gene content
Comparing Race15 and Race1 at the genetic level, we found that the functional differences were caused by the genetic differences. We conducted core-pan gene analysis of Race15 and Race1, with the assumption that the specific genes of Race15 may be related to its increased virulence. Race15 and Race1 were highly homologous. 25,258 genes were clustered into 10,843 clusters by cd-hit, including 10,356 core genes and 487 dispensable genes. There were also many paralogous genes, and these paralogous genes encoded similar functions. The high homology, identity, and coverage between paralogous genes generally passed the threshold for similarity using cd-hit. Therefore, paralogous genes were clustered within the same cluster. Among these clusters, 417 clusters contained more than 3 genes. The largest cluster contained 210 genes. After clustering, homologous genes existing in both samples were removed, and the genes that were present in only one sample were identified as specific genes, Race15 and Race1 had 245 and 274 specific genes, respectively, with the Race15 specific genes possibly being associated with high virulence (Fig. 2 and Additional file 1: Table S7). There were 6 genes annotated in the PHI database, 16 genes predicted as Secretory Protein, 4 genes annotated in the CAZymes database, and 23 genes predicted as secondary metabolic processes. We analyzed these new genes carefully and speculated that they were mainly related to virulence, reduced virulence, unaffected pathogenicity, nrps, t1pks, α-l-fucosidase, β-1,3-glucanase, β-xylanase and other virulence related categories. One of the genes in particular deserved attention, Vtc4 (A07645), which has been shown to be associated with virulence increase in Cryptococcus neoformans [28]. It encodes a protein with similarity to the yeast vacuolar transport chaperone for polyphosphate synthesis, and deletion of this gene reduced polyphosphate formation, influenced the transition between yeast and filamentous growth, and attenuated virulence [29]. The genes that were specific to Race15 and not annotated previously could have been missed in our genome annotation. To make up for this shortcoming, we performed a homologous comparison and annotation of Race15 genes with various databases, and annotated Race15 genes as much as possible to facilitate subsequent screening of high-virulence-related genes (Additional file 1: Table S8).
Genome-wide genotyping of 30 different C. sojina isolates
To obtain the sequence variation between different C. sojina isolates, Single nucleotide polymorphisms (SNPs)were identified in 30 C. sojina isolates from different geographical locations by mapping the sequence reads of each isolate independently to our Race15 reference genome. Of the 30 isolate re-sequencing data, each generated between 1.8 and 7.5 Gb of data. The median aligned read coverage was 42-fold and the minimum and maximum coverage was 11-and 55-fold, respectively. For each isolate, between 23.98% and 97.33% of the sequence reads were mapped to the Race15 reference genome, which covered between 99.05% and 99.68% of the reference genome base. Subsequent resequencing analysis relied mainly on the comparison of data to our reference genome (Additional file 1: Table S9). We noticed two isolates, SB (23.98%) and SH (26.62%), had the lowest percentages of reads aligning to the reference Race15 genome. Therefore, we tried to assemble the sequences that were not aligned, and found they aligned to a species of Paenibacillus. However, when we used the SNPs obtained by comparison to the reference genome to construct phylogenetic trees and screen candidate genes, the coverage of the reference genome reached more than 99%, meaning that our results would not be affected by this contaminant species.
Across all the isolates, an average of 12,674 SNPs per isolate was found, covering 0.032% of the reference genome. Except for Race1 (in which we could not identify heterozygous mutations because it was not re-sequenced), out of the 31 isolates, the number of SNPs and homozygous mutations in the KS strain was the largest, at 13,580 and 12,501, respectively. The SNP density of the 31 isolates was between 0.22 and 0.34 relative to the reference genome. The lowest was DH, at 0.22 SNPs per KB, while the SNP density values of A, B and 13 other isolates were all 0.34 SNPs per KB.
Among the 31 isolates, KS had the largest number of non-synonymous (NSY) SNPs at 2998. DH had the lowest number of NSY SNPs, at 1901. In addition, a total of 31,812 SNP sites were identified by merging the SNP sites of the 31 isolates (Additional file 1: Tables S10 and S11).
Phylogenetic analysis of whole-genome SNP data
Most isolates were collected from sites in the main soybean producing area in Northeast China, the region in China were FLS disease is the most serious. DH, JH and WQ were collected from Jilin province. The soybean planting areas here are relatively smaller than that in Heilongjiang province where the other isolates were collected. The phylogenetic tree revealed that WQ and DH had isolated branches, and the remaining 30 isolates all grouped into a single large branch. At the same time, the virulence value of WQ and DH was low, as these two are closely related to the area of soybean resistance cultivars where FLS disease prevalence is lower, and the annual effective accumulated temperature is significantly higher than that of the other 30 isolates collected from Heilongjiang. All of these isolates have close ancestral evolutionary relationships, consistent with the geographical location of the samples collected. Race15 had the highest virulence and was most closely related to strains SB, JS and JY. The virulence values of these 3 isolates were also relatively high (second, third, and fourth, respectively), while the proximal isolates of these 4 isolates were not highly virulent (Fig. 3).
Linkage disequilibrium (LD) decay analysis and the association between SNP and virulence
Although the sexual stage of C. sojina has not been observed in either field or laboratory conditions, evidence suggests that the life cycle of C. sojina includes both a haploid stage and a sexual stage [17-18]. SSR analysis has been used to analyze the mating-type ratio by others, and most of the populations showed a nearly 1:1 ratio [17]. In addition, we generated an LD decay plot and it showed that no significant decay of LD (r2 ≤ 0.1) was observed, but rather r2 decreased quickly to half of its maximum value at 3.7 kb physical distance(Fig. 4). This indicated that parts of the population were undergoing sexual recombination; thus, our GWAS data could be used for subsequent analysis
Although our sample size was small, the filtered candidate genes could be used to help narrow the scope of virulence related genes. Plink analysis was used to identify a total of 1,198 SNP sites associated with virulence values (P < 0.01). For all of the mutant sites that were filtered (31,812 SNP loci, with a minor allele frequency (MAF) less than 0.05 per loci before filtering, and 21,783 SNP loci after filtering, including homozygous mutation loci and heterozygous mutation loci), a total of 1,198 SNP loci were identified to be associated with virulence (P < 0.01) (Additional file 1: Table S12). Among the significant loci with P < 0.01, 527 loci were located in genes, and 236 genes included at least one SNP loci. Of these 236 genes (Additional file 1: Table S13), there were 17 genes annotated in the PHI database, including at least one gene associated with a SNP, 7 genes annotated in the CAZy database, 9 genes predicted to be secretory proteins, and 26 genes predicted to be secondary metabolic processes (Additional file 1: Table S14). A total of 18 genes in the secondary metabolism library were predicted as Nonribosomal peptides (NRPs). NRPs are a class of peptide secondary metabolites usually produced by microorganisms like bacteria and fungi and are a very diverse family of natural products with an extremely broad range of biological activities. They are often toxins, siderophores, or pigments.
Genomic association analysis of SNP and virulence value could be intuitively described using a Manhattan diagram. It can be seen from Fig. 5 that Contig3's SNP sites are relatively concentrated and may be a candidate region for virulence gene association studies. There were 86 genes in the Contig3, including at least one SNP locus, of which FGSG09408 was worthy of attention, as it has been confirmed to play a role in the establishment of polarized growth and was up-regulated at 2 and 8 h, times when cell division and cytokinesis are activated during germination in Fusarium graminearum [30]. We used Qualimap to analyze the copy number of each sample, and our results showed that the coverage of contigs in each sample was consistent, meaning that there was no copy number variation.
Genes correlated with virulence by differential counting of NSY SNPs
NSY SNPs have a more direct effect on the pathogenicity of fungi [8]. In the genomes of the 32 different isolates, a total of 3,441 genes had at least one NSY mutation, and 485 genes had accumulated more than 50 NSY SNPs. There were 78 genes annotated in the PHI database, enzyme database (CAZymes), and predicted as Secretory Protein and secondary metabolic processes, respectively (Additional file 1: Table S15).
Genes with different amounts of NSY SNPs between high and low virulence isolates
We examined genes with significant differences in the number of NSY SNPS between low-virulence and high-virulence samples, as these genes were most likely related to virulence. First, the virulence values of the 32 isolates were sorted from high to low. 11 isolates with virulence values more than 65 were classified as the high virulence group, and 11 isolates with virulence values less than 50 were classified as the low virulence group. A Wilcoxon rank sum test revealed that 69 genes had significant (P < 0.05) differences in the counts of NSY mutations between the low and high virulence groups. Of these mutations, 2 genes (5.8%) were annotated in the PHI database, 3 genes were annotated in CAZy, while 3 genes were predicted to be Secretory Protein candidate effectors. Collectively, 7 genes were speculated to be involved in secondary metabolic processes (Table 3).
A heatmap of the 11 genes with differential NSY SNP counts (P < 0.05) between the low and high virulence groups is shown in Fig. 5. For each gene, the various counts of NSY SNPs across the 32 isolates are indicated by blue, gray, and red colors corresponding to values ranging from low to high. 11 genes had fewer NSY mutation sites in the high-virulence group and more NSY SNP sites in the low-virulence group.
Identification of important candidate genes
The most important candidate genes were identified by the intersection of these 3 kinds of genes, as shown in Fig. 6. Of the 5 identified key virulence related genes, 4 were in Contig 3 and 1 was in Contig 1, which was consistent with the virulence related areas detected in the Manhattan map (Table 4).
For instance, there were 3 SNP sites in the gene A05389, at contig3:1317647, contig3:1317841 and contig3:1319579. It was noted that in contig3:1317647, 4 high-virulent isolates (JS, JY, SB and Race15) shared the CC locus, whereas the other 28 isolates shared the AA locus that was absent in SH. In the contig3:1317841 and contig3:1319579 sites, the JS, JY and Race15 isolates shared the CC locus. SH and SB lacked this locus, while the other isolates shared the TT and AA loci in their respective contigs. After homology comparison, this locus was annotated as GH family 78 protein from Bipolaris oryzae (ATCC 44560), which was speculated as a secretory protein and annotated as α-L-Rhamnosidases in the CAZy database. α-L-Rhamnosidases catalyze the hydrolysis of terminal α-L-rhamnose residues from various carbohydrates. The catalytic domains in most of these enzymes belong to the GH78 and GH106 families of GHs [31].
There were 2 SNP sites in the gene A00784, at contig1:3188987 and contig1:3189373. In addition to JS, SB and Race15 shared the GG genotype, while the other isolates shared the CC locus that was absent in isolate SB. A00784 is located on Contig 1 and was speculated as an oxidoreductase in the CAZy database. By homology comparison, this was speculated as Glucose oxidase in Cercospora beticola. Glucose oxidases were speculated to specifically oxidize β-D-glucose to gluconic acid and hydrogen peroxide [32]. It has been clearly demonstrated that the hydrogen peroxide generated by glucose oxidases can inhibit the growth of pathogens [33]. To date, glucose oxidase genes have been cloned from Penicillium expansum, Penicillium variabile and Talaromyces flavus [34].
A05375 had 3 SNP sites, at contig3:1279311, contig3:1280051 and contig3:1280961. After homology comparison, this was annotated as non-ribosomal peptide synthetase 8 (NRPS8) from Bipolaris maydis. In contig3:1279311, JY, JS, and Race15 shared the GG locus, while the other isolates shared the TT homozygous genotype that was absent in isolate SH. In contig3:1280051, JS, SB, JY, and Race15 shared the TT locus, and the other isolates shared the GG locus. In contig3:1280961, JS, JY, SB, and Race15 shared the AA genotype and the other isolates shared the GG genotype that was absent in isolate SH.
A05382 is located in Contig 3 with 3 SNP sites, at contig3:1304880, contig3:1305120, and contig3:1305142, respectively. In contig3:1304880, in addition to the four isolates (JS, JY, SB and Race15) who shared the CC locus, the other isolates shared the AA locus. In contig3:1305120, JS, JY, SB and Race15 had the TT genotype, while the other isolates were CC. In contig3:1305142, JS, JY and Race15 were GG, and the other isolates were AA, which was absent in isolate SB. After homologous comparison, this locus was annotated in Cercospora beticola as O-acetyl-ADP-ribose deacetylase MACROD1 (CB0940-03553).
A05384 is located in Contig 3. JY, JS, SB and Race15 shared the TT genotype, while the others shared the CC genotype that was absent in isolate SH. After homology comparison, this locus was annotated as hypothetical protein (CB0940-03556) in Cercospora beticola with 1 SNP site.
It can be seen from the above results that the highly virulent isolates Race15, SB, JS, and JY had the same mutation sites in the above five genes, including SNPs or deletions. Considering their closeness in the phylogenetic tree and their putative shared ancestral state, some genes with the same mutation site may be related to species rather than virulence. Our study was designed simply to provide a useful reference for the screening of functional candidate genes, narrowing down the workload for subsequent analyses.