The biosynthetic potential of the cattle GIT microbiome based on high-quality MAGs
We expanded our previous database, BGMGM (Bovine Gastro Microbial Genome Map), by integrating additional 2,114 non-redundant large intestine MAGs from a published study22, as well as 109 MAGs newly sequenced and assembled from third-generation sequencing of omasum, jejunum, and rectum content samples. After dereplication, the number of high-quality MAGs increased from 13,572 to 14,093 (44.88% of total MAGs, contamination > 90% and completeness < 10%; Fig. 1A, Table S1). The comprehensive MAGs dataset from 10 segments of the GIT provides extensive coverage (Fig. 1B, Table S1) and allows for a comparison of microbial components across different GIT locations. Approximately, a quarter of MAGs (3,759) span across different segments of GIT, with 3.0% and 0.60% of MAGs were concurrently found in the stomach and small intestine, small and large intestine regions (Fig. 1B, Table S2). To improve the quality of genes in these MAGs, pre-assembled contigs were scaffolded using long reads as a backbone. Then, a high-quality MAG database with more complete gene sets was obtained, with the average number of contigs per genome decreased from 154.8 to 86.6 (P < 0.001), the length of the maximum scaffold increased from 144,318.9 to 276,725.2 bp (P < 0.001), the mean N50 elevated from 62,223.5 to 138,958.2 bp (P < 0.001, Fig. 1C, Table S1). After being mapped to the Genome Taxonomy Database Toolkit (GTDB-Tk), a total of 11,141 (21.9%) MAGs were annotated (Fig. 1A, Table S1).
Next, the antiSMASH framework was applied to identify BGCs in MAGs. From all 14,093 MAGs, totally 26,503 BGCs (clustered into 328 gene cluster clans (GCCs)) were identified (Fig. 2A, Table S2) with length ranging from 1 to 117 kb (46.5% were 15 to 25 kb) (Fig. 2B, Table S2). After comparing with BIG-FAM/RefSeq23, totally 8,672 (32.7%) BGCs were assessed as novel BGCs (distance > 0.6), and 1,031 (3.9%) were completely novel (distance = 1.0),which represent the unique BGCs in the cattle GIT (Figs. 2A & C). A large number of BGCs were found to encoded natural products classified into NRPS (Non-Ribosomal Peptide Synthetases, 5,096 BGCs) and RiPPs (Ribosomally-synthesized and Post-translationally modified Peptides, 13,689 BGCs) (Figure S1). Moreover, within 167 types of natural products derived from 26,503 BGCs, a series of Betalactone (3,027 BGC products) and Arylpolyene (2,637 BGC products) were identified. Betalactone and Arylpolyene are recognized as substances involved in food fermentation and plant defense mechanisms24, with the potential of antioxidant, antibacterial, and anticancer activities. For the 1,031 completely novel BGCs, 15 distinct products categories, including 20% NRPS, 15.1% RRE-containing, 10.7% cyclic-lactone-autoinducer were encoded (Fig. 2C, Table S3), indicating a profound database of BGCs were harbored by cattle GIT.
In 167 types of natural products, 95 were originated from specific microbial phylum (Figure S2, Table S3). Actinobacteria accounted for 5.3% of the 1,031 complete novel BGCs and encompassed 7 distinct types. It’s noticed that Methanobacteriota, as one of the most abundant archaea phyla in the rumen and BGMGM25,26, specifically encodes novel subtypes like NRPS-like.NRPS.thioamitides, NRPS-like.thioamitides, thioamitides.LAP. CowSGB_9022, an unannotated species (92.54% completeness and 2.92% contamination for its genome) from the Pseudomonadaceae family and the Saccharopolyspora genus, encodes the highest number of BGCs and produces the most variety of products (Fig. 1D, Table S4). Totally 7 novel BGCs were identified in CowSGB_9022 including c00003, c00046, c00102, c00137, c00245, c00350, and c00368. The c00046 BGC encodes an amino acid analogue which contains a substitution of the R group of glycine in an NRPs-like BGC, whereas the c000368 BGC specifically encodes a Lanthipeptide-class-i peptide (Fig. 1E). Interestingly, the genes in the CowSGB_9022 found in the small intestine were enriched in the acetate oxidation, ethanol oxidation, hydrogen oxidation and fermentation (Figure S3). Its diverse encoded BGCs and abundance of oxidative-reduction related genes suggest its significant role in the anaerobic fermentation process within the small intestine and strong adaptability within the intestine microbiome. In sum, we characterized the distribution of secondary metabolite classes in cattle GIT, discovered novel cattle-specific BGCs as well as their encoded products and belonged MAGs.
The evaluation of antioxidative properties of metabolites encoded by microbial BGCs
To explore the antioxidative properties of the BGCs products, we trained an interpretable ensemble model for identifying potential antioxidants using graph neural network27,28. A total of 60,621 antioxidant assays with the corresponding Simplified Molecular Input Line Entry System (SMILES) information were collected from PubChem, CHEMBL, and AODB databases29. These metabolites were further classified into five categories of binary classification datasets: electron transfer (ET), hydrogen atom transfer (HAT), inhibition of lipid peroxidation-based assay (Lipid), NRF2-antioxiant response element signaling pathway (NRF2), and others (Others) based on the method29 of “collection of small molecule antioxidant data” (Fig. 3A, Table S4). Ensembles of Message Passing Neural Network (MPNN) on each of the five datasets were applied to learn the characteristics directly from molecular diagrams. Optimization was performed to ensure the most optimal values of depth (6), dropout rate (0.05), feedforward neural network hidden size (FFN, 2,000), number of FFN layers (2), and the overall model’s hidden size (2,000) in MPNNs of the ET models (The optimal parameters of the remaining four models are shown in Table S4). To improve the model performance, molecular features including Morgan fingerprint, Morgan count, and RDKIT 2d normalized features were provided as data augmentation strategies. As a result, five structures of MPNNs that could predict the antioxidants through the successive convolution from the digital atoms and bonds with hidden size larger than 1,000 were established (Fig. 3B, Table S5). A ten-fold cross-validation strategy with an 80%-10%-10% split for training, validation, and testing respectively were utilized, combined with 60 assembled models. To deal with the unbalanced training data, binary cross-entropy loss function was employed. Simultaneously, Precision-Recall Area Under the Curve (PRAUC), Area under the Receiver Operating Characteristic Curve (AUROC), F1 score, and Matthew's Correlation Coefficient (MCC) were used as evaluation metrics across five datasets for. In the training datasets, the PRAUC and the AUROC reach 0.970, 0.990, 1.000, 0.990, 0.840, and 0.974, 0.978, 0.998, 0.999, 0.969 for the ET, HAT, Lipid, NRF2, and Others training datasets, respectively (Fig. 3B, Table S5). In the ten-fold validation, due to the more balanced dataset, the ET, HAT and Lipid models performed with high average PRAUC (ET 0.718, HAT 0.907, Lipid 0.844). Particularly, with imbalanced datasets, the NRF2 model and Others model performed well, achieving PRAUC values of 0.539 ± 0.0512, 0.413 ± 0.0992 (Fig. 3B, Table S5), which is obviously higher than the PRAUC (0.364) of antibiotic MPNNs in a study trained by 512 antibiotic-active and 38,80 non-antibiotic-active compounds recently published in Nature27.
To predict antioxidant properties, we sorted out the structures of secondary metabolites as follows: 244 metabolites’ SMILES from our 26,503 BGCs, 134 from nature products database and additional 58 from mapped MIBIG records (detailed in the “exploration of microbial-derived small molecules” methodology). This resulted in a dataset of 436 secondary metabolite SMILES strings that were utilized for antioxidants prediction (Fig. 3C, Table S6). The majority of these 436 molecules exhibited low similarity to known antioxidants (Average morgan fingerprint similarity = 0.0553, Fig. 3C, Table S6). Subsequently, five fine-tuned graph neural networks performed convolutional operations based on the specific atoms and bonds present within each input molecular structure, thereby assessing the antioxidant properties of the 436 metabolites. Totally 245, 121, 266, 2, and 47 potential antioxidants were predicted with ET, HAT, Lipid, NRF2, and others antioxidant properties, respectively (Fig. 3D, Table S7). The interpret structures of potential antioxidants with high prediction score were calculated by conducting a Monte Carlo tree search27. We found that 229 (52.5%) secondary metabolites were predicted to have at least two types of antioxidant properties. Among them, four metabolites possessed 4 types of potential antioxidant properties, including NPA007583, CMNPD14685, BGC0002074 and BGC0000621 (Table S7). Overall, our deep-learning model provides a brand-new approach to predict the antioxidant based on their molecular structure, offering an important avenue for exploring and utilizing secondary metabolites.
The distribution of BGCs and their encoded metabolites across GIT segments
After evaluating the BGCs and secondary metabolites from the MAGs, we focused on the distribution of BGCs and their secondary metabolites in different GIT segments (Fig. 4A). Similarly, most of the BGCs were identified in the rumen (10,836) and rectum (10,049) (Fig. 4A, Table S8). Interestingly, we found that 896 BGCs commonly occurred in the forestomach and the small intestine, much higher than the small-large intestines (169 common BGCs) and the forestomach-large intestines (108 common BGCs) (Fig. 4A, Table S8), suggesting a closer relationship between the secondary metabolic potential of the forestomach and the small intestine than the small and large intestines. In other word, an independent microbial metabolic community may reside in the large intestine. When considering the distribution of novelty BGCs, BGCs in the small intestine showed a significant higher novelty (Mean dist = 0.396) compared to those in the forestomach (Mean dist = 0.358, P.adjust < 0.001) and large intestine (Mean dist = 0.297, P.adjust < 0.001, Fig. 4B, Tables S2, 8). Particularly, novelty of rumen BGCs remained consistently high and presented no significant difference with small intestines (compare to duodenum, jejunum, and ileum, the P.adjust were 0.168, 0.478, and 1.000, respectively. Figure 4B, Table S8). Moreover, the study found that BGCs encoding NRPS and RiPP were the most common types in all segments of the GIT, while forestomach and small intestinal had a higher proportion of NRPS (25.9%, 27.9%) compared to the large intestinal tract (9.4%), suggesting a more enriched diversity of multi-domain protein complexes of the forestomach and small intestinal secondary metabolites.
Furthermore, we focused on the distribution of potential antioxidants and their sourced microbial species in the GIT (Fig. 4D, Table S9). Totally 95 potential antioxidants with at least two antioxidative properties and predicted scores higher than 0.7 were selected for GIT distribution (Fig. 4E, Table S9). Among them, 33 antioxidants found to be specific to the rumen, 4 to the rectum, and 1 to the Ileum and cecum. A particular molecular with conjugated double bonds from the 4 unique potential antioxidants of rectum were identified, achieving ET and HAT prediction scores of 0.747 and 0.848. It was encoded by the FixCowSGB-3326 species of the CAG-100 genus within the Firmicutes. Its conjugated double bond structure along with the carboxyl group were calculated to contribute 79.7% of antioxidant prediction through the interpretability analysis of the Monte Carlo tree-based model (Fig. 4E, Table S9). The distinct potential antioxidants in the rumen were originated from 18 species, including six from Methanobrevibacter_A and Methanobrevibacter_B, nine from seven genera such as Ruminococcus, UBA1213, and Thermobifida. Many of the 95 potential antioxidants showed peptide structures. Among them, Fuscachelin C, encoded by orphan NRPS BGCs from Thermobifida fusca, a degrader of plant cell walls30, has been predicted to possess antioxidant properties in ET (0.928), Lipid (0.941), and Others (0.710) categories, suggesting the antioxidant potential of some cellulose-degrading microbiota in the rumen. Interpretability analysis of ET prediction suggests that 98.2% of contribution derived from the dipeptide backbone carbon chain structures in Fig. 4E (Table S9). The interpretability analysis of the oligopeptide, Cys-Cys-Cys, suggests a contribution score of 0.993 from the thiol group-containing structure, which served as the primary antioxidant group in the glutathione (GSH) (Fig. 4E, Table S9). In addition, we have also discovered peptide structures containing multiple cysteines with thiol group-containing structures in the specific metabolic products from the rumen, omasum, abomasum to the small intestine from 8 species in the Firmicutes phylum (Fig. 4E). The secretion of these potentially potent antioxidants suggests a strong potential for antioxidant compound secretion within the bovine GIT microbiota, which may regulate the host’s antioxidant status as previously reported31,32.
The oxidative stress signaling patterns of the cattle single-cell atlas
The oxygen signal is essential for life activities33, yet the regulatory gene expression of oxygen signals at the single-cell level is currently unknown. Using our cattle single-cell atlas, we compared the mRNA expression levels of antioxidant, ROS generation, and ROS signal response in 1,803,004 cells belonged to 126 cell types and 1,006 clusters across 51 tissues (Han et al., 2024, in preparation; Shi et al., 2024, in preparation). After scoring the 16 ROS generation, the 19 antioxidant, and the 22 ROS response pathways from Gene Ontology (GO) database, we have identified the top 50 cell clusters in three aspects of oxygen signal, among which 19 clusters were from the digestive system, 13 from the immune system, and 7 from the reproductive system (Fig. 5A, Table S10). Neutrophils were found to be the major cell types in the top 50, which align with the understanding that neutrophils produce ROS to eliminate pathogens34. As for ROS responding pathways, the average scores of gene sets were highly expressed in the digestive system, reproductive system, and immune system (Fig. 5A, Table S10), with enterocytes of the ileum and duodenum sorted into the top 10 clusters (Fig. 5A, Table S10). In terms of the antioxidant pathways, a total of 31 out of the top 50 clusters belonged to the digestive system, most of which were identified as epithelial cells in the esophagus and forestomach. Spinous cells of the reticulum, esophagus, rumen, and omasum were ordered at the top 4, 7, 8, and 9 among 1,006 clusters, highlighting the oxidative stress environment of GIT epithelium cells (Figure S4, Fig. 5A, Table S10).
We next focused on the function of GIT cells with the highest scores in the antioxidant, ROS generation, and ROS signal response pathways. Interestingly, spinous cells in the forestomach and enterocytes in the small intestine especially in the ileum showed the highest score in the GIT cell types. Particularly, spinous cells in the reticulum and enterocytes in the ileum, showed the most significant enrichment and the largest number of genes in pathways related to mitochondria, electron transport, respiration, antioxidant and ROS generation (Figs. 5B & C). Additionally, the genes with high expression in neutrophils located in the rectum were enriched in pathways related to response to other organisms. Similar results were also found in the hindgut neutrophils. These results indicate that the GIT epithelial cells act as a selective barrier between tissue and the gut environment, undergoes challenge from lumen microbiota. To elucidate the regulatory mechanisms behind the gene expression of these cell types, we ascertain the specific correspondence of regulons within each cell type in the whole GIT segments using SCENIC35 (Han et al., 2024, in preparation). In the spinous cells of the forestomach, we found that ESRRA consistently maintains a top 5 RSS score in the rumen, reticulum and omasum. Among the top 20 genes regulated by it, there are a large number of genes related to energy metabolism and antioxidation, such as HSD17B13, SLC25A34, ATP5ME, TST MGST3, and GLRX. For the first time, we analyzed oxygen signal gene expression in the to-date largest cattle single-cell atlas and discovered that cell types in the digestive system exhibited a higher response to ROS signals than other tissues, especially the epithelial cells in the forestomach.
Cell metabolism of high antioxidant cell types in the forestomach
Cellular antioxidant enhancement often signifies that the cells have been subjected to oxidative stress36. To further explore the clues behind the occurrence of a strong antioxidant state in the GIT, we selected spinous and basal mitotic cells using 3 major pathways of antioxidants and average antioxidant scores as primary reference metrics (Fig. 6A). Apart from spinous and basal mitotic cells in the forestomach, smooth muscle cells in the rectum, and principal cells in the epididymis were in the top 50 for the three major antioxidant pathway scores and the average antioxidant scores (Fig. 6A), indicating strong oxidative stress in these cells. Afterward, through the High Dimension Weighted Gene Co-expression Network Analysis (hdWGCNA), we found that spinous and basal cells simutaniously exhibited higher correlations with module1, 2, 4, 5 (Fig. 6B, module1 R = 0.164, P < 0.001, R = 0.152, P < 0.001; module2 R = 0.113, P < 0.001, R = 0.156, P < 0.001; module4 R = 0.204, P < 0.001, R = 0.170, P < 0.001; module5 R = 0.117, P < 0.001, R = 0.170, P < 0.001) than other modules and specific expression in gene modules 1, 2, and 4 in the rumen (Fig. 6B, Figure S5, Table S11), which were enriched in the functions of aerobic electron transport chain (P < 0.001, Enrichment score = 111942.7) and fatty acid beta-oxidation (P < 0.001, Enrichment score = 623.48). Moreover, these modules are also strongly correlated with important antioxidant pathways such as antioxidant activity (R = 0.535, P < 0.001), peroxidase activity (R = 0.493, P < 0.001), and peroxiredoxin activity (R = 0.576, P < 0.001). Similarly, we identified highly enrichment of modules related to fatty acid metabolism and energy metabolism in both modules in Reticulum (module1, 3) and Omasum (module7) (Figure S5, Table S11). Since spinous and mitotic basal cells utilized fatty acids as the major energy source, these results demonstrated the co-expression of the antioxidant genes and fatty acid metabolism genes, suggesting that antioxidant genes could serve as protection for robust fatty acid metabolism in these cells.
As the rumen is the primary site for fermentation and production of SCFAs in the forestomach. Next, we analyzed the cellular antioxidant metabolism and fatty acid metabolism in the rumen using the flux balance analysis tool. It was discovered that the metabolic activity of spinous and basal cells in the forestomach was significantly higher (P < 0.05) than other cell types. Specifically, among all the results of the flux balance analysis, a total of 5,696 up-regulated reactions (P < 0.05) in spinous and mitotic basal cells and 1,012 down-regulated reactions in other cell types were found (P < 0.05), highlighting the strong metabolic status of the spinous and mitotic basal cells in the forestomach. Pathways related to energy metabolism, antioxidant and respiration-related processes were found in a high activity, including fatty acid oxidation (631 upregulated reactions in spinous and basal cells, compared to 3 in other cell types), glutathione metabolism (21 upregulated reactions in spinous and basal cells, no in other cell types), ROS detoxification (7 upregulated reactions in spinous and basal cells, no in others cell types) (Fig. 6C, Table S12). We focused on glutathione metabolism and fatty acid oxidation, where the reduction of GSH to GSSG in glutathione metabolism and the elongation of short fatty acids into longer fatty acids showed the highest Cohen's value (Fig. 6D, Table S12). These two reactions not only demonstrate how the cell regulates its oxidative stress by consuming GSH but also illustrate the process that synthesizes fatty acids from short fatty acids (Fig. 6D).
To further assess the subclusters and that regulate the fatty acid and antioxidant process, the two cell types were re-clustered into 6 subclusters (Fig. 6E, Table S13). Interestingly, we observed subclusters 3 and 4 that could be an intermediate cell types in the differentiating process from basal cell to spinous cell with high expression of KRT6A, KRT14, and KRT5. Through pseudo-time analysis, we have observed a cell differentiation trajectory from subcluster 4, to subcluster 3, and finally to subcluster 1 (Fig. 6F). Hallmark pathways scoring results showed that subcluster 3 exhibited significantly higher expression levels of fatty acid metabolism (u = 0.21, P < 0.01), while subcluster 4 exhibited strong cell division potential (u = 0.13, P < 0.01) when compared to other subclusters (Fig. 6G). These results suggest that a specialized phase of fatty acid metabolism and antioxidant defense stages exist during the known differentiation process from basal cells to spinous cells (Fig. 6G). Moreover, two TFs, ESRRA and PPARG, along with the genes they regulate related to fatty acid metabolism and antioxidation, are highly expressed in cluster 3 (Fig. 6H). Sum up, we found that increased antioxidant metabolism acted as a defense against intense fatty acid metabolism in a specific subtype of cells dedicated to robust fatty acid metabolism.
Deriving interactions between potential antioxidants and high GIT oxidative stress cells
One of the most relevant types of interactions that modulate host metabolism status are the metabolite-mediated interactions among microbes, metabolites, and host cell proteins37. Here, we employed a structures based virtual screening method,transformerCPI 2.038, to score the interaction between 436 secondary metabolites and 14,976 marker proteins from genes (LogFC > 1) of all cell types in GIT. A total of 6,484,608 interactive scores were obtained (Fig. 7A).
Furthermore, we examined the interaction between distinct cell types in the rumen and the potential antioxidant properties of Cys-Cys-Cys, an oligopeptide which exhibited predictive scores of 0.770 and 0.822 for its effectiveness in ET and lipid antioxidation, respectively. This unique compound is specifically prevalent in the rumen, omasum, and rectum. (Fig. 4E), which possesses the same antioxidant moiety, thiol (-SH), as GSH. First, we calculated its absorption properties using the ADMET2.0 (https://admetmesh.scbdd.com/) tool and found that it has acceptable HIA, Lipinski Rule, and MDCK Permeability, indicating the potential to pass through the GIT epithelium. Moreover, in the rumen spinous cells, 151 proteins were predicted to interact with Cys-Cys-Cys (score > 0.5) among the 263 marker protein sequences (Fig. 7B). Interestingly, these 151 proteins were enriched into more than 10 pathways, with half of which were associated with mitochondria and antioxidant metabolism (P.adjust < 0.05, Count > 10, Fig. 7C). Several pathways were found to co-occur in both the marker gene enrichment results and the interactive gene enrichment results, such as mitochondrial membrane, mitochondrial envelope, organelle envelope, mitochondrial inner membrane and cellular detoxification pathways (Fig. 7C). In the cellular detoxification pathway (P.adjust < 0.001 Count = 8), 6 out of 8 marker genes have the potential to interact with Cys-Cys-Cys (P.adjust = 0.002,Average Interaction = 0.700), while in the mitochondrial membrane pathway, 13 out of 26 genes have the potential for interaction (P.adjust = 0.003,Average Interaction = 0.703) (Figs. 7D, E, F). Moreover, the 19 genes from cellular detoxification and mitochondrial membrane pathways are found highly expressed in spinous and mitotic basal cells (cluster 6,7, P < 0.001) in the rumen (Table S13) (Fig. 7D). Similar results were identified in the 310 marker proteins from spinous cells (cluster5) of omasum (Figure S6). This indicates a potential mechanism for microbial interaction with the host through the secretion of antioxidant substances.
Subsequently, to determine the potential functions of the spinous cells, we correlated the cell types in the rumen to dairy cow phenotypes data from GWAS. The result revealed a strong association between the high fatty acid metabolism subcluster (subcluster 3 of spinous and mitotic basal cells) in the forestomach and milk protein (bp value = 0.0101, tp value = 2.32), milk fat (bp value = 0.0729, tp value = 1.45) traits in cows (Fig. 7G, Table S14). This finding suggests that subcluster3 of spinous and mitotic basal cell not only show high fatty acid metabolism and cellular antioxidant activities, but also potentially significantly influences the production traits of dairy cows.