3.1. Overview of the fungal BGC atlas
To survey the biosynthetic chemical space of fungi more comprehensively, a total of 13,125 fungal genomes or metagenome-assembled genomes (MAGs) were curated and analyzed using multiple bioinformatic tools. This genome dataset spans a wide phylogenetic range, encompassing 12 phyla and 1,102 genera (Dataset S1). It includes extensively studied fungal taxonomic groups such as Aspergillus, Fusarium, Penicillium, and Talaromyces, as well as taxa whose BGCs remain poorly characterized, such as Pyricularia, Trichoderma, and Colletotrichum. The 13,125 fungal genomes were predicted to encode a total of 303,983 BGCs (Dataset S2). Among these, nonribosomal peptide synthetase (NRPS) BGCs were the most prevalent, constituting 38.5% (116,988). Type I polyketide synthase (PKS I) and terpene BGCs ranked as the second and third most abundant classes, accounting for 26.5% (80,518) and 16.4% (49,751) of the total BGCs, respectively. In contrast, no saccharide BGCs were identified in fungi, and ribosomally synthesized and post-translationally modified peptide (RiPP) BGCs represented only 0.4% (1,312). The fungal genomes come primarily from the phyla Ascomycota, Basidiomycota, and Mucoromycota, which account for 81.4% (10,683), 13.9% (1,827), and 2.9% (375) of the dataset, respectively. Each of the remaining phyla accounts for less than 0.7% (94) of the genomes. As expected, the phylum Ascomycota has the highest average BGC count, reaching 26.41, while the less-studied phylum Zoopagomycota surprisingly ranks second with an average BGC count of 17.04. Basidiomycota, which include many mushrooms used medicinally, encode the third-highest average number of BGCs among fungal phyla, in line with their reputation as prolific producers of biologically active natural products.
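The class proportions quoted above follow directly from the reported counts. The short Python sketch below recomputes them; the dictionary keys and the function name are illustrative choices, and only the counts themselves are taken from the text.

```python
# Counts as reported in the text (total BGCs and per-class BGC numbers).
TOTAL_BGCS = 303_983
CLASS_COUNTS = {
    "NRPS": 116_988,
    "PKS I": 80_518,
    "terpene": 49_751,
    "RiPP": 1_312,
}


def class_shares(counts, total):
    """Return each class's share of the total in percent, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = class_shares(CLASS_COUNTS, TOTAL_BGCS)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The four classes listed here do not sum to 100% because the remaining BGC classes are not itemized in this paragraph.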
The fungal BGC atlas reports the average number of BGCs for all 1,105 fungal genera. Among them, 225 genera have an average BGC count exceeding 30, including well-known genera such as Aspergillus (45.3 ± 12.7, n = 1,095), Fusarium (40.7 ± 7.2, n = 1,326), and Penicillium (41.0 ± 9.2, n = 393), as well as less-studied ones such as Diaporthe (95.0 ± 19.3, n = 32), Calonectria (80.7 ± 17.4, n = 70), Macrophomina (62.9 ± 11.7, n = 28), Eutypa (62.8 ± 2.9, n = 41), and Pyricularia (49.2 ± 7.5, n = 383) (Fig. 1 and Dataset S1). The full list of fungal genera with great biosynthetic potential is available in Dataset S1. Notably, the median genome size of these fungi is around 33.8 Mb and the scaffold N50 is around 0.89 Mb (Fig. 1), indicating that the assembly quality of most fungal genomes has reached a moderate level. However, incomplete genome sequences affect the predicted total number of BGCs in a genome, especially for large BGCs such as NRPS and PKS BGCs.
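Per-genus summaries of the form "mean ± SD, n" can be derived from a per-genome table of BGC counts. The sketch below is a minimal illustration under two assumptions that the text does not state: that the input is a list of (genus, BGC count) pairs, and that the reported spread is the sample standard deviation. The toy values are hypothetical and are not taken from Dataset S1.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_genus(records):
    """Group (genus, bgc_count) pairs and return {genus: (mean, sd, n)}."""
    by_genus = defaultdict(list)
    for genus, count in records:
        by_genus[genus].append(count)

    summary = {}
    for genus, counts in by_genus.items():
        # The sample SD needs at least two genomes; report 0.0 for singletons.
        sd = stdev(counts) if len(counts) > 1 else 0.0
        summary[genus] = (mean(counts), sd, len(counts))
    return summary


# Hypothetical toy input, not values from Dataset S1.
toy = [("GenusA", 40), ("GenusA", 50), ("GenusB", 10)]
summary = summarize_by_genus(toy)
for genus, (avg, sd, n) in summary.items():
    print(f"{genus}: {avg:.1f} ± {sd:.1f}, n = {n}")
```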
3.2. Diversity, novelty, and distribution pattern of fungal GCFs
Considering the existence of many homologous BGCs in phylogenetically closely related strains and the significant impact of incomplete BGCs on the novelty and specificity of GCFs, BiG-SLiCE was employed to investigate the diversity and novelty of 212,478 complete fungal BGCs (not on a contig edge). BGCs sharing similar domain architectures were clustered into 43,984 GCFs using cosine-like (via L2-normalization) distances at the default threshold (T = 300), with 27,241 GCFs containing only one BGC. The dendrogram of the 16,743 GCFs consisting of at least two BGCs is presented in Fig. 2 and Dataset S3, revealing that these GCFs were predominantly present in the phyla Ascomycota (15,160, 90.5%) and Basidiomycota (882, 5.3%). The other three phyla encoded 2–70 phylum-specific GCFs, which correlated with their small genome and BGC counts. However, the products encoded by these phylum-specific GCFs may be more easily obtained owing to the clean secondary metabolic background. In addition, 575 GCFs harbor BGCs originating from multiple phyla, indicating that the core genes within these BGCs are relatively conserved across strains from diverse phyla. Interestingly, the genera producing the highest numbers of secondary metabolites were also observed to encode the most genus-specific GCFs. Among the 13,075 genus-specific GCFs identified, Aspergillus encoded the largest number (2,298), followed by Fusarium (1,622), Penicillium (1,023), Colletotrichum (845), and Trichoderma (395). A total of 3,668 GCFs contain BGCs derived from multiple genera, approximately six times the number of phylum-nonspecific GCFs, implying that the core genes in a significant proportion of BGCs are conserved only at the phylum level. All BGCs within the same GCF belong to the same class. NRPS and PKS I remain the two predominant classes in the GCF dataset, accounting for 37.2% (6,230) and 33.8% (5,654), respectively.
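The phrase "cosine-like (via L2-normalization) distances" can be illustrated with a small numerical example. The sketch below does not reproduce the BiG-SLiCE feature extraction or the scale of its threshold T. It only shows, on hypothetical three-element feature vectors, why Euclidean distance between L2-normalized vectors behaves like a cosine distance: for unit vectors, the squared Euclidean distance equals 2(1 − cos θ).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors (e.g. domain-hit scores); not real BGC data.
bgc_a = [3.0, 0.0, 4.0]
bgc_b = [6.0, 0.0, 8.0]  # same direction as bgc_a, twice the magnitude
bgc_c = [0.0, 5.0, 0.0]  # orthogonal to bgc_a

d_ab = euclidean(l2_normalize(bgc_a), l2_normalize(bgc_b))
d_ac = euclidean(l2_normalize(bgc_a), l2_normalize(bgc_c))

print(f"d(a, b) = {d_ab:.3f}")
print(f"d(a, c) = {d_ac:.3f}")
```

After normalization, vectors that differ only in magnitude become identical, so the distance depends on the direction of the feature vector (the domain architecture) rather than on its overall scale.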
Biosynthetic-Pfam and sublevel Pfam features of the GCF models were used to construct the hierarchical relationship between the 16,743 GCFs consisting of at least two BGCs. The clustering results indicated that GCFs of the same class tend to cluster on nearby hierarchical branches, except for the PKS I and PKS-NRPS hybrid classes, which are highly intermixed. GCFs of the "Other" type are distributed across multiple core branches, consistent with the composition of their core biosynthetic gene clusters.
The fungal GCF atlas comprises 27,241 GCFs with a single BGC, 6,512 GCFs with two BGCs, 7,486 GCFs with three to ten BGCs, and 2,745 GCFs with more than ten BGCs. Only 254 GCFs harbor more than 100 BGCs, indicating that most GCFs are highly species-specific. Among PKS I-type GCFs, those with only two BGCs constitute 44.3% (2,505 of 5,654), a significantly higher percentage than in other categories. Conversely, among terpene-type GCFs, the largest proportion consists of GCFs with more than ten BGCs, reaching 29.5% (385 of 1,307). This indicates that the species-specificity of PKS I-type BGCs is significantly greater than that of terpene-type BGCs. The average cumulative BLAST score of the BGCs within a GCF, calculated using KnownClusterBlast in antiSMASH, was used to assess the novelty of the GCFs. The results indicate that 69.7% (11,675 of 16,743) of GCFs are entirely novel when compared to BGCs from the MIBiG database. Additionally, 8.0% (1,346 of 16,743) of GCFs exhibit an average cumulative BLAST score below 1000, and 15.3% (2,564 of 16,743) display an average cumulative BLAST score below 5000. It is worth noting that 48.3% (949 of 1,965) of PKS-NRPS hybrid-type GCFs show an average cumulative BLAST score above 1000 and only 23 of them contain BGCs from MIBiG, indicating that the novelty of these BGCs is relatively low but that the majority of them remain unidentified. A large number of GCFs of other types also show low novelty, and the secondary metabolites encoded by these BGCs are highly worthy of targeted exploration. In addition, among the 43,984 GCFs obtained, only 165 contain BGCs from MIBiG, further highlighting the great biosynthetic potential of fungi.
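The size categories used above can be expressed as a simple binning rule. The sketch below applies it to a hypothetical list of GCF sizes and also checks that the four reported bin counts are consistent with the reported totals (43,984 GCFs overall, of which 16,743 are non-singleton). The bin labels and function names are illustrative.

```python
from collections import Counter


def size_bin(n_bgcs):
    """Assign a GCF to one of the four size categories used in the text."""
    if n_bgcs < 1:
        raise ValueError("a GCF must contain at least one BGC")
    if n_bgcs == 1:
        return "singleton"
    if n_bgcs == 2:
        return "two"
    if n_bgcs <= 10:
        return "3-10"
    return ">10"


def bin_gcfs(sizes):
    """Count how many GCFs fall into each size category."""
    return Counter(size_bin(n) for n in sizes)


# Bin counts as reported in the text.
REPORTED = {"singleton": 27_241, "two": 6_512, "3-10": 7_486, ">10": 2_745}
total_gcfs = sum(REPORTED.values())
multi_bgc_gcfs = total_gcfs - REPORTED["singleton"]
print(total_gcfs, multi_bgc_gcfs)

# Hypothetical GCF sizes, for illustration only.
toy_bins = bin_gcfs([1, 1, 2, 5, 10, 11, 150])
print(dict(toy_bins))
```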
3.3. Diversity and novelty of fungal cyclodipeptide synthases
The reported cyclodipeptide synthases (CDPSs) are divided into three types: eight bacterial cyclodipeptide synthases (Type one CDPS, T1CDPS), seven eukaryotic arginine-containing cyclopeptide synthases (Type two CDPS, T2CDPS), and four other eukaryotic cyclopeptide synthases (Type three CDPS, T3CDPS). Using these characterized CDPS proteins as seed sequences, a total of 304 T1CDPS, 40 T2CDPS, and 15 T3CDPS candidates were identified by HMMER (Dataset S4), and their phylogenetic tree is presented in Fig. 3A. Interestingly, all T1CDPS homologs are exclusively distributed in Fusarium, while the T2CDPS candidates were identified from BGCs of Neofusicoccum (16), Aspergillus (13), and Trichophyton (11). The putative CDPSs were organized into 13 clusters, 6 of which consist of a single protein. All T1CDPS candidates fall into one cluster (cluster a). The T2CDPS candidates are divided into 5 clusters, with three large clusters containing 16 (cluster b), 12 (cluster c), and 8 (cluster e) proteins, respectively, while cluster g consists of 2 proteins and 2 seed sequences (Fig. 3B). The T3CDPS candidates are divided into two clusters of 6 (cluster e) and 5 (cluster f) proteins, respectively. Furthermore, potential CDPSs from the genera Fusarium, Trichophyton, and Neofusicoccum are each distributed within a single cluster, suggesting that the genus Fusarium harbors the potential to synthesize type one diketopiperazine scaffolds, while Trichophyton and Neofusicoccum harbor the potential to synthesize type two diketopiperazine scaffolds.
The structure of the gene cluster similarity network for BGCs containing these CDPS homologs closely resembled the CDPS sequence similarity network (Fig. 3C). The T1CDPS homologs were mainly observed in BGCs of CDPS and PKS I types, the putative T2CDPS are mainly located in BGCs of CDPS-PKS I and CDPS-NRPS hybrid types, and the T3CDPS homologs are mainly located in BGCs of NRPS, NRPS-like and NRPS + T1PKS (Fig. 3C). The results indicate a high correlation between the CDPS sequences and the core biosynthetic genes within the same gene clusters, providing a promising opportunity for the targeted exploration of specific types of natural products based on CDPS sequences. The finding also highlights their potential to biosynthesize diketopiperazine-containing polyketides or peptides. For instance, BGCs containing CDPS from cluster 1 also harbor the essential genes for the biosynthesis of curvularin, indicating their potential to synthesize diketopiperazine-containing curvularin derivatives (SI Appendix Fig. S1). Most seed sequences of T1CDPS appeared as singletons under the given thresholds, indicating that the diversity of newly identified CDPS candidates is low, and the functionality of CDPSs within the same cluster is likely to be very similar.
3.4. Diversity and novelty of fungal diketopiperazine-forming NRPSs
Fungal diketopiperazine-forming NRPSs were identified using an HMM-based screening approach built on four experimentally characterized NRPSs responsible for the biosynthesis of the diketopiperazine scaffold. A total of 175,418 hits were produced under the default settings, resulting in the discovery of 24,808 unique protein sequences containing two AMP-binding domains. As shown in SI Appendix Fig. S2, these 24,808 hits were plotted along the horizontal axis according to their HMM score and annotation results. A clear drop in the HMM similarity score was observed around sequence 9,500. The first protein annotated as not being an NRPS is sequence 9,473 (HMM score 796.6, BLAST e-value 2E-170). Consequently, all 9,472 proteins with HMM scores greater than 796.6, together with the ten NRPSs showing comparable HMM scores but BLAST e-values of 0, were regarded as candidates for diketopiperazine-forming NRPSs (Dataset S5). They are distributed across multiple fungal phyla and are present in significant numbers in genera such as Fusarium, Aspergillus, Penicillium, and Colletotrichum (Fig. 4A), which may explain the ubiquity of diketopiperazine natural products in fungi. The phylogenetic tree of the protein sequences indicates that diketopiperazine-forming NRPSs from the same fungal genus are mostly distributed across multiple branches, demonstrating a certain degree of diversity and genus-specificity. The majority of diketopiperazine-forming NRPSs with HMM scores > 1800 originate from Aspergillus and are classified into cluster e, which includes the characterized nonribosomal peptide synthetase (hasD) in the NRPS network. This underscores the promising potential of Aspergillus for synthesizing diketopiperazine natural products. The NRPS proteins in cluster b mostly have moderate HMM scores (1200–1800) and the longest sequences (> 10 k nt), and are widely distributed across multiple fungal genera, but none of the proteins in this cluster has been functionally validated.
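The score cutoff described above amounts to ranking the hits by HMM score and keeping everything ranked above the first hit that is not annotated as an NRPS. The sketch below illustrates that rule on hypothetical records; the field names (`id`, `hmm_score`, `is_nrps`) are assumptions made for the example and do not come from the original pipeline. The additional rescue of ten NRPSs with comparable scores and BLAST e-values of 0 is not modelled here, because the text does not define "comparable".

```python
def select_candidates(hits):
    """Rank hits by HMM score and keep those above the first non-NRPS hit.

    Returns (selected_hits, cutoff_score). cutoff_score is None when every
    hit is annotated as an NRPS, in which case all hits are kept.
    """
    ranked = sorted(hits, key=lambda h: h["hmm_score"], reverse=True)
    for index, hit in enumerate(ranked):
        if not hit["is_nrps"]:
            # Everything ranked strictly above this hit is retained.
            return ranked[:index], hit["hmm_score"]
    return ranked, None


# Hypothetical toy hits, for illustration only.
toy_hits = [
    {"id": "p3", "hmm_score": 796.6, "is_nrps": False},
    {"id": "p1", "hmm_score": 1900.0, "is_nrps": True},
    {"id": "p4", "hmm_score": 790.0, "is_nrps": True},
    {"id": "p2", "hmm_score": 1200.0, "is_nrps": True},
]

selected, cutoff = select_candidates(toy_hits)
print([h["id"] for h in selected], cutoff)
```

In this toy case the hit `p4` is a true NRPS that falls below the cutoff, which is the kind of sequence a secondary criterion such as the BLAST e-value would be needed to recover.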
The diketopiperazine-forming NRPSs in cluster a represent the most abundant and diverse group, with the widest range of HMM scores and sequence lengths. Moreover, they are extensively distributed across various fungal genera, and the two identified diketopiperazine-forming NRPSs (ftmA and notE) are classified within this group (Fig. 4B). The NRPS proteins in clusters c and d each form a tight, separate group in the phylogenetic tree. Their sequence lengths are all below 8000 nt, and they are mainly derived from Fusarium. CriC, the first fungal diketopiperazine-forming NRPS shown to catalyze the formation of a cyclic dipeptide from L-tryptophan and L-alanine, falls within cluster g together with eight NRPS proteins from Aspergillus (Fig. 4), suggesting that these proteins possess significant catalytic potential for the cyclization of L-tryptophan and L-alanine.
3.5. Discovery of diketopiperazines and the diketopiperazine-forming enzyme from Aspergillus sp. WHUF0304
The genus Aspergillus is one of the major sources of diketopiperazine natural products and diketopiperazine-forming enzymes. Mass spectrometry-guided molecular network analysis revealed that the strain Aspergillus sp. WHUF0304 may produce a series of novel diketopiperazine natural products (Fig. 5A). Therefore, we explored the chemical diversity and biosynthetic mechanisms of diketopiperazines in this strain to illustrate the biosynthetic characteristics of diketopiperazine scaffolds in fungi. A total of eighteen indole diketopiperazine alkaloids (1–18), including three new ones, were characterized from the fermentation culture of the marine-derived fungus Aspergillus sp. WHUF0304 (Fig. 5A). Aspergillan A (1) was similar to cryptoechinulin D [38], except that a furan moiety was formed in 1 by the connection of C-24 and C-27 via oxygen. Notably, 1 has a zero specific rotation and no Cotton effect in its ECD spectrum, which indicates that Aspergillan A is a racemate. The enantiomers of Aspergillan A, 1a and 1b, were separated by chiral HPLC using a Chiralpak IB column. ECD calculations with time-dependent density functional theory (TD-DFT) were performed, and the Boltzmann-averaged ECD spectra of (12S,28R,31R)-1 and (12R,28S,31S)-1 matched well with the experimental ECD spectra of 1a and 1b, respectively. Aspergillan B (2) was the dehydro analog of aspergilline D [39]. Compound 2 was also assumed to be a racemate owing to its zero specific rotation and baseline ECD curve, and the absolute configurations of 2a and 2b were assigned as 12S, 21R, 29S and 12R, 21S, 29R, respectively. Aspergillan C (3) was identified as an analog of aspergilline B [39].
The structures of compounds 4–18 were identified as Cryptoechinulin D (4), Eurotinoid B (5), Variecolortide B (6), Variecolortide C (7), Isoechinulin A (8), Variecolorin G (9), Neoechinulin D (10), Variecolorin J (11), Cryptoechinulin C (12), Variecolorin O (13), Variecolorin H (14), Aspergilline B (15), Neoechinulin A (16), Neoechinulin B (17), and Dihydroneoechinulin B (18) by comparison of their HRESIMS, 1H NMR, and 13C NMR data with previously reported literature [38–45]. All of these indole diketopiperazine alkaloids are speculated to be synthesized from L-tryptophan and L-alanine. Gene-centric analysis revealed that Aspergillus sp. WHUF0304 does not encode a CDPS gene in its genome. However, it does contain a diketopiperazine-forming NRPS protein, which exhibits 94% sequence similarity to CriC identified from Eurotium cristatum NWAFU-1 (Fig. 5B). In addition, homologs of the other five post-modification genes in the cri gene cluster are also located near the target NRPS gene, and their sequences exhibit high similarity (97–99%). This case highlights the value of the present study for the efficient exploration of novel diketopiperazines and their biosynthetic enzymes.