Illumina HiSeq4000 data
Sequencing was performed using the Illumina HiSeq 4000 platform. A total of 56,458,678 reads were produced from D25-1, with an insert size of 500 bp and a read length of 125 bp. The proportion of adapter sequences was 0.25%, the proportion of repetitive sequences was 2.38%, and the low-quality sequences accounted for only 7.04%. After filtering out 9.69% of the sequences from the raw data (7,057 Mb), 6,372 Mb of clean data were obtained (Table 1).
Table 1
Summary statistics of D25-1 sequencing data
Insert size (bp) | Read length (bp) | Raw data (Mb) | Adapter (%) | Duplication (%) | Total reads | Filtered reads (%) | Low-quality filtered reads (%) | Clean data (Mb) |
500 | 125 | 7,057 | 0.25 | 2.38 | 56,458,678 | 9.69 | 7.04 | 6,372 |
Table 2
Summary statistics of the data obtained by the PacBio platform
Polymerase Read Number | Mean Read Length (bp) | Polymerase Read Bases (bp) | Polymerase Read Quality | Sub-read Number | Sub-read Mean Length (bp) | Sub-read Bases (bp) | Subread Quality | Utilization Ratio |
294,876 | 15,189 | 4,957,271,514 | 0.84 | 447,618 | 9,970 | 4,463,016,422 | 0.84 | 0.92 |
In the base distribution analysis, the frequencies of A and T, as well as the frequencies of G and C, were correlated, indicating an equilibrium base composition. Further, base quality values were relatively high, suggesting a low error rate and high-quality reads (Fig. 1).
PacBio data
Given the large quantities of adapter sequences and low-quality or erroneous sequences in the raw sequencing data obtained using the PacBio platform, extensive filtering was performed to generate clean data, which were deposited at DDBJ/ENA/GenBank under accession number QOHM00000000. The lengths and qualities of the reads were well distributed (Fig. 2.). The version described in this paper is version QOHM01000000. Summary statistics are provided in Table. 2.
Genome Evaluation
Prior to assembly, the genome size, heterozygosity, and repetitive sequence information were determined by a K-mer analysis. The D25-1 genome was approximately 44.69 Mb in size, with a coverage depth of 144.34×. The details are shown in Fig. 3.
Assembly statistics
Data assembly was performed by combining the data from the Illumina HiSeq4000 and PacBio sequencing platforms. As shown in Table 3, the D25-1 genome was well spliced. A total of 16 Fusarium equiseti chromosomes with a total length of 40,776,005 bases were assembled. The longest chromosome was 8,344,890 bp, and the shortest chromosome was 2,316 bp. The gap number was 0, and the GC content was 48.01%.
Table 3
Seq Type | Total Number | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%) |
Scaffold | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | 0 | 48.01 |
Contig | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | - | 48.01 |
GC content
Through calculating GC content and average depth, one can analyze whether a GC bias exists. If not seriously biased, the corresponding scatter diagram will assume a shape similar to that of Poisson distribution and have a peak near the GC content estimate of the genome. The more the graph deviates from this peak, the lower the depth is. The GC bias in the genome of D25-1 was analyzed after assembly, and a Poisson distribution was observed, suggesting no substantial bias (Fig. 4).
Gene components
Gene prediction was performed using the assembly to obtain the open reading frames and gene length distribution, as summarized in Table 4. The total number, total length, average length, and percentage of total chromosome lengths were analyzed for the genes, exons, CDS, and introns. The introns were abundant but short, accounting for a relatively low proportion of the genome.
Table 4
Type | Total Number | Total Length (bp) | Average Length (bp) | Length/Genome Length (%) |
Gene | 13,829 | 22,077,720 | 1,596.48 | 54.14 |
Exon | 40,110 | 19,787,286 | 493.33 | 48.53 |
CDS | 13,829 | 19,787,286 | 1,430.85 | 48.53 |
Intron | 26,281 | 2,290,434 | 87.15 | 5.62 |
Table 5
Summary statistics for non-coding RNAs
Type | Copy number | Average Length (bp) | Total Length (bp) | Percentage of Genome (%) |
tRNA | 333 | 86.91 | 28,943 | 0.0710 |
rRNA (de novo prediction) | 71 | 116.01 | 8,237 | 0.0202 |
sRNA | 69 | 66.69 | 4,602 | 0.0113 |
snRNA | 31 | 140.12 | 4,344 | 0.0107 |
miRNA | 108 | 55.77 | 6,024 | 0.0148 |
Genes with lengths between 500–999 bp were most abundant (n = 3,663), followed by those of 1000–1499 bp. Genes shorter than 200 bp and longer than 10,000 bp were infrequent (Fig. 5).
Non-coding RNAs
The non-coding RNAs in the Fusarium equiseti D25-1 genome were analyzed. With a copy number of 333, an average length of 86.91 nt, and a total length of 28,943 nt, tRNAs accounted for 0.0171% of the entire genome. The rRNA copy number was 71; the average length was 116.01 nt, and the total length was 8237 nt, accounting for 0.0202% of the genome. sRNAs represented 0.0113% of the genome, with a copy number of 69, an average length of 66.69 nt, and a total length of 4,602 nt. Small nuclear rRNAs accounted for 0.0107% of the genome, with a copy number of 31, an average length of 140.12 nt, and a total length of 4,334 nt. The microRNA copy number was 108; the average length was 55.77 nt, and the total length was 6,024 nt, accounting for 0.0148% of the genome.
Repetitive sequences
Repetitive sequences, including DNA transposons, tandem repeats (TRs), and transposable elements, have important roles in chromosomal spatial structure, regulation of gene expression, and genetic recombination. Transposable elements are further classified into long terminal repeats (LTRs) and non-LTRs, and the latter category includes long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). Tables 6 and 7 list the repetitive sequences in the Fusarium equiseti D25-1 genome, as determined by various prediction algorithms and databases. For instance, the TRs predicted by the TRF software spanned 224,134 bp, which accounted for only 0.5497% of the genome. The total predicted repetitive sequences spanned 1,713,918 bp, accounting for 4.2033% of the genome.
Table 6
Repetitive sequence statistics
Method | Repeat Size (bp) | Percentage of Genome (%) |
Repbase | 561,215 | 1.3763 |
ProMask | 420,889 | 1.0322 |
de novo | 1,169,760 | 2.8687 |
TRF | 224,134 | 0.5497 |
Total | 1,713,918 | 4.2033 |
Table 7
Transposon classification statistics
| | Repbase TEs | | ProteinMask TEs | | De novo TEs | | Combined TEs |
Type | Length (bp) | % of Genome | Length (bp) | % of Genome | Length (bp) | % of Genome | Length (bp) | % of Genome |
DNA | 220,036 | 0.5396 | 195,998 | 0.4807 | 164,159 | 0.4026 | 311,230 | 0.7633 |
LINE | 63,265 | 0.1552 | 59,402 | 0.1457 | 27,174 | 0.0666 | 98,240 | 0.2409 |
LTR | 279,482 | 0.6854 | 165,768 | 0.4065 | 352,607 | 0.8647 | 527,778 | 1.2943 |
SINE | 2,627 | 0.0064 | 0 | 0.0000 | 9,329 | 0.0229 | 11,283 | 0.0277 |
Other | 0 | 0.0000 | 0 | 0.0000 | 0 | 0.0000 | 0 | 0.0000 |
Unknown | 1,940 | 0.0048 | 0 | 0.0000 | 619,726 | 1.5198 | 621,666 | 1.5246 |
Total | 561,215 | 1.3763 | 420,889 | 1.0322 | 1,169,760 | 2.8687 | 1,530,931 | 3.7545 |
Gene annotation
As shown in Table 8, 13134 genes, accounting for 94.97% of all the genes in D25-1 genome, were annotated after BLAST searches against all databases. Over 70% of annotations were based on the NR, NOG, and IPR databases. Furthermore, 1422 genes (10.28%) were annotated using the PHI database.
Table 8
Summary of overall annotation results
Sample | Total | ARDB | CAZy | COG | GO | IPR | KEGG | KOG |
D25-1 | 13829 | 2 | 397 | 1516 | 7261 | 9995 | 4646 | 2231 |
Proportion | | 0.01% | 2.87% | 10.96% | 52.5% | 72.27% | 33.59% | 16.13% |
Sample | NOG | NR | P450 | PHI | SwissProt | T3SS | VFDB | Overall |
D25-1 | 11332 | 12640 | 1209 | 1422 | 3284 | 3804 | 55 | 13134 |
Proportion | 81.94% | 91.4% | 8.74% | 10.28% | 23.74% | 27.5% | 0.39% | 94.97% |
Genes encoding toxins were mined based on the annotation results. Two genes related to zearalenone were identified (D25-1_GLEAN_10000531 and D25-1_GLEAN_10000533). D25-1_GLEAN_10000531 was related to the non-reductive iterative type I polyketide synthase involved in the synthesis of zearalenone, whereas D25-1_GLEAN_10000533 was related to the highly reductive iterative type I polyketide synthase (Table 9).
Table 9
Genes related to zearalenone
Gene ID | Identity | Database | Database gene ID | Function |
D25-1_GLEAN_10000531 | 54.84 | KEGG | pcs:Pc21g12450 | Zearalenone synthase, nonreducing iterative type I polyketide synthase |
D25-1_GLEAN_10000533 | 57.27 | KEGG | pcs:Pc21g12440 | Zearalenone synthase, highly reducing iterative type I polyketide synthase |
Additionally, 23 genes related to trichothecene mycotoxin were identified, including 19 genes related to the active efflux pump of trichothecene mycotoxin, two related to trioxyacetyltransferase of trichothecene mycotoxin, and two related to the biosynthesis of trichothecene mycotoxin. Many of these genes were obtained by searches against the NOG database (Table 10).
Table 10
Genes related to trichothecene
Gene ID | Identity (%) | Database | Database gene ID | Function |
D25-1_GLEAN_10001590 | 42.83 | sordNOG | JGI67026 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10001811 | 76.04 | ascNOG; euNOG; fuNOG; opiNOG | EFQ35752 | Trichothecene 3-O-acetyltransferase |
D25-1_GLEAN_10002084 | 87.89 | hypNOG; necNOG | FGSG_10823P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10002452 | 76.6 | hypNOG; necNOG; sordNOG | FGSG_12768P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10003513 | 78.59 | hypNOG | NechaP36294 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10003680 | 94.59 | hypNOG; necNOG | FGSG_08749P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10005110 | 91.85 | hypNOG; necNOG | FGSG_05352P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10005621 | 80.1 | hypNOG; necNOG; sordNOG | XP_387212.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006448 | 69.65 | necNOG | NechaP53897 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006579 | 88.71 | ascNOG; hypNOG; necNOG; opiNOG; sordNOG | XP_383902.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006973 | 71.71 | NOG; ascNOG; euNOG; hypNOG; fuNOG; necNOG; opiNOG; sordNOG | FVEG_00056T0 | Trichothecene 3-O-acetyltransferase |
D25-1_GLEAN_10006980 | 85.6 | NOG; ascNOG; euNOG; hypNOG; fuNOG; necNOG; opiNOG; sordNOG | FGSG_03537P0 | Trichothecene biosynthesis |
D25-1_GLEAN_10009924 | 89.23 | ascNOG; fuNOG; opiNOG | FGSG_02343P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10010021 | 85.26 | hypNOG; necNOG | FGSG_12141P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10010553 | 91.59 | necNOG | FVEG_04795T0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011281 | 86 | hypNOG | FOXG_01267P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011497 | 83.28 | hypNOG; sordNOG | FGSG_11815P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011948 | 82.65 | hypNOG; necNOG; sordNOG | XP_381015.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013026 | 92.83 | necNOG; sordNOG | XP_389873.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013028 | 94.49 | ascNOG; fuNOG; opiNOG; hypNOG; sordNOG | FGSG_09701P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013718 | 54.74 | necNOG | FVEG_04795T0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011175 | 53.63 | NR | gi|751354543|gb|KIL92265.1| | tri7-trichothecene biosynthesis gene cluster |
D25-1_GLEAN_10009924 | / | IPR | PF06609 | Fungal trichothecene efflux pump (TRI12) |
Comparative genomic analyses
As summarized in Table 11, the newly assembled F. equiseti genome had the most intact sequence, with 16 contigs and an N50 of 6,178,397 bp, indicating a high-quality assembly. In a comparative analysis, the best-assembled F. oxysporum genome only had 33 scaffolds, with an N50 of 4,490,135 bp, which was superior to the assembly results for other strains. The differences in quality might be explained by the differences in assembly technology and platform. In this study, data were obtained using the third-generation PacBio platform and second-generation Illumina platform for joint assembly; whereas, most previously reported data had been obtained using the second-generation sequencing technology alone.
Table 11
Sample_Name | Seq Type | Total Number | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%) |
Bipolaris sorokiniana | Scaffold | 154 | 34,409,167 | 1,789,485 | 1,003,746 | 3,642,493 | 2,011 | 1,196,549 | 49.84 |
F. avenaceum | Scaffold | 83 | 41,590,745 | 1,436,644 | 424,894 | 4,337,333 | 602 | 29,282 | 48.47 |
F. oxysporum | Scaffold | 33 | 52,908,293 | 4,490,135 | 2,466,030 | 6,470,671 | 4,587 | 0 | 47.67 |
F. pseudograminearum | Scaffold | 281 | 36,973,259 | 8,840,934 | 7,724,594 | 11,688,822 | 502 | 40,400 | 47.75 |
Nectria haematococca | Scaffold | 209 | 51,286,497 | 1,255,602 | 96,667 | 4,937,060 | 865 | 56,137 | 50.79 |
F. equiseti D25-1 | Contig | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | - | 48.01 |
Structural variation
Figure 6 summarizes the genomic collinearity results. The highest collinearity (i.e., greatest conservation) was found for F. pseudograminearum, which had high similarity with F. equiseti D25-1, followed by F. avenaceum and F. oxysporum, whereas B. sorokiniana had the lowest similarity. In addition, the gene number in the reference B. sorokiniana genome was 5,133; however, in a collinearity analysis, only 37.12% of the target genes were covered. The N. haematococca reference genome had 7,123 genes, covering 51.51% of the target genes. The F. pseudograminearum genome had 10,052 genes, covering 72.69% of the target genes. The F. avenaceum genome had 9884 genes, covering 71.47% of the target genes. The F. oxysporum genome had 10,236 genes, covering 74.02% of the target genes. Such findings might be related to the large number of reference genes from F. oxysporum.
Core-pan genome
Based on core-pan genome analysis, the six strains shared 1,805 core genes. As an outgroup, Bipolaris sorokiniana had the most species-specific genes (n = 8,912), followed by Nectria haematococca (n = 5,759), Fusarium oxysporum (n = 4,946), Fusarium equiseti (n = 3,483), Fusarium avenaceum (n = 2,614), and Fusarium pseudograminearum (n = 2,299), respectively (Fig. 7).
COG was used to annotate the core genes (Fig. 8) in four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function). Regarding cellular processes, 3 genes were associated with cell cycle control, cell division, and chromosome partitioning; 14 were involved in cell wall/membrane/envelop biogenesis; 1 gene was related to cell mobility; 5 were related to defense mechanisms; 1 was associated with the cytoskeleton; 1 was associated with intracellular trafficking, secretion, and vesicular transport; 51 were related to posttranslational modification, protein turnover, and chaperones; and 6 were related to signal transduction mechanisms. Five terms in the genetic information storage and transmission module were overrepresented; 1 gene was responsible for chromatin structure and dynamics; 1 was related to RNA processing and modification; 17 were related to replication, recombination, and modification; 14 were related to transcription; and 80 were related to translation, ribosomal structure, and biogenesis. In the metabolism module, 84 genes were associated with amino acid transport and metabolism; 63 were related to carbohydrate transport and metabolism; 38 were associated with coenzyme transport and metabolism; 69 were related to energy production and conversion; 20 were related to inorganic ion transport and metabolism; 58 were linked with lipid transport and metabolism; 35 were related to nucleotide transport and metabolism; and 36 were related to secondary metabolite biosynthesis, transport, and catabolism.
Genes specific to the Fusarium equiseti D25-1 genome were annotated using COG (Fig. 9) based on four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function). In the cellular processes module, 2 genes were associated with cell cycle control, cell division, and chromosome partitioning; 3 were associated with cell wall/membrane/envelop biogenesis; 1 gene was related to defense mechanisms; 5 were related to posttranslational modification, protein turnover, and chaperones; and 5 were related to signal transduction mechanisms. In the genetic information storage and transmission module, 17 genes were associated with replication, recombination, and modification, and 7 were related to translation, ribosomal structure, and biogenesis. In the metabolism module, 12 genes were associated with amino acid transport and metabolism; 13 were related to carbohydrate transport and metabolism; 7 were associated with coenzyme transport and metabolism; and 13 were related to secondary metabolites biosynthesis, transport, and catabolism.
Phylogenetic analysis based on the whole genomes (Fig. 10) indicated that F. equiseti was most closely related to F. pseudograminearum, followed by F. avenaceum and F. oxysporum.
Gene family analysis
As shown in Table 12, the clustered genes in the outgroup taxon Bipolaris sorokiniana accounted for 72% of the total genes, whereas the genes assigned to clusters in Fusarium solani accounted for over 90% of the genes. Typically, Fusarium equiseti D25-1 gene families exceed 60%, with 54 species-specific gene families.
Table 12
Gene family clustering analysis
Sample | Gene Number | Clustered Genes | Unclustered Genes | Families | Unique Families |
Bipolaris sorokiniana | 12,154 | 8,692 | 3,462 | 5,708 | 213 |
Fusarium oxysporum | 16,792 | 15,288 | 1,504 | 8,726 | 139 |
Nectria haematococca | 15,647 | 14,091 | 1,556 | 7,901 | 109 |
Fusarium equiseti D25-1 | 13,829 | 12,654 | 1,175 | 8,493 | 54 |
Fusarium avenaceum | 13,092 | 12,476 | 616 | 8,181 | 9 |
Fusarium pseudograminearum | 12,348 | 11,446 | 902 | 8,117 | 7 |
Single-copy orthologs were detected in the genome of each strain (n > 2,500). Multiple-copy orthologs were also detected in each genome, but the counts differed among the species. For instance, Fusarium oxysporum and Nectria haematococca had high numbers of multi-copy orthologs (Fig. 11).
Gene family-based clustering analysis (Fig. 12) indicated that F. equiseti D25-1 was most similar to F. pseudograminearum, followed by F. avenaceum and F. oxysporum. These results were consistent with the results of the genome-wide sequence analysis, as well as the evolutionary position of F. equiseti in a genome-based phylogenetic analysis.