Quality controlling and metagenome assembling
Shotgun sequencing produced 29,623,109,936 bp and 24,284,563,524 bp of unassembled raw bases from the soil (s05-02) and water (s05-03) samples (Table 1), respectively. In total, 194,202,138 (98.99%) and 159,527,496 (99.19%) reads passed the quality control in the soil and water sample dataset, respectively (Table 1).
After assembling, a total of 160,212 (soil) and 135,994 (water) contigs were generated (Table 1). For the two samples, the longest contigs were 833,193 bp and 286,311 bp, GC content were 60.18% and 51.33%, and N50 length were 2,289 bp and 3,841 bp, respectively.
After Prokka [30] annotation, 351,242 genes were obtained from the soil sample and 390,943 genes from the water sample (Table 2).
Metagenome-assembled genomes (MAGs)
The CheckM [54] analysis showed that 75 MAGs were reconstructed with the completeness > 70% and contamination < 10%, including 18 from the soil sample and 57 from the water sample (Fig. 2a and Fig. 2b). A total of 20 MAGs met the high-quality standard with the completeness > 95% and contamination < 5%, accounting for 26.67% of all the MAGs. Table S3 lists the statistics of the 75 bins including the completeness, contamination, GC value, N50, genomic size, and GTDB [58, 59] species classification. The original reconstructed MAGs have been reassembled by using reassemble_bins module in MetaWRAP [28] and their genomic integrities have been significantly improved (Fig. 2c and Fig. 2d).
Microbial community composition analysis
The microbial community composition of the soil sample was visualized by Krona [34] (Fig. 3). The results showed that bacteria accounting for 99.5% were dominant, with archaea accounting for 0.3% and viruses accounting for 0.08%. The three most abundant phyla in bacteria were Proteobacteria (54.0%), Actinobacteria (18.6%) and Firmicutes (9.8%), and another 29 phyla were identified (Table S4a), such as Bacteroidetes (8.9%), Gemmatimonadetes (3.3%), Planctomycetes (1.3%), and Acidobacteria (1.1%). For the most abundant phylum, Proteobacteria, 112 families (Table S4b) were identified, in which Sphingomonadaceae (34.9%) was the most abundant, followed by Bradyrhizobiaceae (7.1%), Methylobacteriaceae (6.3%), etc. At the genus level of the Proteobacteria, 408 genera were identified (Table S4c), in which Sphingomonas (22.9%) had the highest abundance, followed by Sphingobium (4.6%), Methylobacterium (4.4%), etc. Among the Actinobacteria (Table S4d), we identified 46 families, in which Streptomycetaceae (22.7%) was the most abundant, and at the genus level of the Actinobacteria, 141 genera were identified, in which Streptomyces (21.5%), Conexibacter (7.5%) and Micromonospora (5.6%) were the three most abundant genera (Table S4e). A total of 2,400 bacterial species were identified (Table S5), including Bacillus cereus (5.1%), Hymenobacter sedentarius (4.5%), Gemmatirosa kalamazoonesis (3.0%), Sphingomonas indica (2.3%), Lactococcus piscium (1.6%), etc.
In the soil sample, Euryarchaeota (81.9%) accounted for the largest proportion in the archaea (Table S6a), and Thaumarchaeota (14.7%) was the second most dominant phylum, followed by Crenarchaeota (1.7%). Table S6b, c, d, e, and f list the detailed information for the archaea at the class, order, family, genus, and species level, respectively.
Eighteen species of viruses were identified in the soil sample (Table S7). Surprisingly, Pandoravirus salinus (15.3%), a giant virus, was the most abundant, followed by Bacillus virus 250 (11.5%), Mycobacterium phage Jobu08 (7.7%), etc.
In the water sample (Fig. 4), the bacteria accounted for the majority (99.1%). Only a small portion was archaea (0.4%), and the rest was viruses (0.5%). At the phylum level, Proteobacteria (54.4%), Bacteroidetes (17.9%) and Actinobacteria (15.1%) were the most abundant phyla, and another 30 phyla were identified (Table S8a). Among the Proteobacteria, 124 families (Table S8b) were identified, including three most abundant families: Rhodobacteraceae (42.6%), Comamonadaceae (14.8%) and Pseudomonadaceae (3.8%). Among the 464 genera identified in the Proteobacteria (Table S8c), Yoonia (22.4%) was the most abundant, followed by Limnohabitans (6.8%) and Pseudomonas (3.7%). For the Actinobacteria, 46 families (Table S8d) were identified, of which Microbacteriaceae (23.3%) was the most abundant, and 150 genera (Table S8e) were identified, of which Streptomyces (13.9%), Corynebacterium (5.6%) and Pontimonas (5.1%) accounted for the largest proportion. A total of 2,714 bacterial species (Table S9) were identified, in which Yoonia vestfoldensis (12.2%), Belliella baltica (5.3%), Limnohabitans sp. 63ED37-2 (3.6%), Aquiflexum balticum (1.8%) and Cyanobium sp. NIES-981 (1.2%) were dominant.
Among the archaea in the water sample, Euryarchaeota (98.8%) accounted for the largest proportion (Table S10). Table S10b, c, d, e, and f list the detailed information for the archaea at the class, order, family, genus, and species level, respectively.
In the water sample, a total of 52 species of viruses were identified (Table S11), in which the most abundant was Chrysochromulina ericina virus (12.7%), followed by Phaeocystis globosa virus (8.8%) and Synechococcus phage S-PM2 (5.9%).
Phylogenetic analysis
A phylogenetic analysis for the 75 MAGs was performed. Initially, using CheckM [54] to classify the MAGs, 38 of the 75 MAGs (50.67% of the datasets) were not assigned to a specific taxonomic name below a domain. Subsequently, we further refined the classification of all the MAGs by using GTDB-tk [57], and bin5 from the soil sample with the 98.06% average nucleotide identities (ANIs) was even successfully classified to the species level. Of the 75 MAGs, 28 (37.33%) lacked a genus-level match, and only one (bin13 from the water sample) lacked a family-level match (Table S3).
The phylogenetic tree constructed with the MAGs (Fig. 5) showed that the 75 MAGs could be taxonomically assigned to 10 phyla, including main members: Proteobacteria (16 MAGs, five from the soil sample and 11 from the water sample), Bacteroidota (22 MAGs, two from the soil sample and the remaining 20 from the water sample), Actinobacteriota (17 MAGs, three from the soil sample and 14 from the water sample), etc. The microbial community composition of the soil and water sample was roughly similar, which may be due to the fact that sampling points of the two samples were very close. However, at the phylum level and below, the species composition of the two samples gradually showed differences.
Metagenomic functional annotation
For the two samples, a total of 232,679 and 294,919 genes were annotated against nr database, 104,704 and 117,467 genes against KEGG [42] database, and 45,638 and 59,921 genes against GO [46] database, respectively. More annotation results are shown in Table 2.
The COG [44] functional annotation results were visualized using ggplot2 (Fig. 6a). Except for function unknown (12.8%), the three most abundant categories in the soil sample were replication, recombination and repair (4.7%), translation, ribosomal structure and biogenesis (3.9%), and transcription (3.9%). In the water sample, except for function unknown (11.0%), translation, ribosomal structure and biogenesis (4.3%) was the most abundant, followed by replication, recombination and repair (3.6%), amino acid transport and metabolism (3.5%), etc.
WEGO [50] was used to show the three main categories (cellular component, molecular function and biological process) and the first two levels of the GO terms, with bar charts showing the number of the genes annotated. In the soil sample (Fig. 6b), the dominant category was biological process (35.5%), followed by molecular function (33.0%) and cellular component (31.5%). We found that among the biological processes, the top three GO terms were cellular process (33,656), metabolic process (32,600) and response to stimulus (8,678); among the molecular functions, the top three were catalytic activity (27,005), binding (17,692) and structural molecule activity (3,089); and among the cellular components, the top three were cell (33,716), cell part (33,716) and membrane (14,342). Whereas in the water sample (Fig. 6b), biological process (35.4%) was the dominant category, followed by molecular function (32.7%) and cellular component (31.9%). The top three GO terms were as follows: cellular process (45,113), metabolic process (43,200) and response to stimulus (12,611) among the biological processes; catalytic activity (35,860), binding (24,853) and structural molecule activity (4,247) among the molecular functions; and cell (45,101), cell part (45,101) and membrane (19,524) among the cellular components.
ggplot2 was used to display the annotation information at the first two levels of KEGG [42], with the bar graphs representing the number of the annotated genes for each metabolic pathway. For the soil sample (Fig. 6c), the genes related to metabolism (about 62.9% of the datasets) were the most abundant, and the rest were genetic information processing (9.7%), environmental information processing (9.4%), human diseases (7.3%), cellular processes (6.8%), and organismal systems (3.9%). Further analysis on the metabolism showed that the most abundant was carbohydrate metabolism (23.6%), followed by amino acid metabolism (20.5%), metabolism of cofactors and vitamins (10.7%), energy metabolism (10.1%), etc.
Similarly, for the water sample (Fig. 6c), the most abundant metabolic function module was metabolism (58.4%), followed by human diseases (9.7%), environmental information processing (8.5%), cellular processes (6.4%), organismal systems (6.3%), and genetic information processing (6.2%). In metabolism, carbohydrate metabolism (24.5%) was the most abundant, followed by amino acid metabolism (20.7%), metabolism of cofactors and vitamins (10.6%), energy metabolism (9.7%), etc.
Metabolic potential analyses based on the MAGs
Carbon degradation: The 75 MAGs contained abundant hydrolases from the 101 glycoside hydrolases (GH) family (Table S12), among which the MAGs containing GH13 accounted for 77.3% (58 out of 75), and the MAGs containing GH3 and GH23 accounted for 76.0% (57 out of 75), respectively. The enzymes of interest included alpha-amylase, beta-amylase, glucoamylase, alpha-glucosidase, isoamylase and pullulanase (involved in starch degradation); xylanase, beta-mannanase, alpha-L-arabinofuranosidase, alpha-D-glucuronidase and beta-xylosidase (involved in hemicellulose degradation); endo-1,4-glucanase, exo-β-1,4-glucanases and β-1, 4-glucosidases (involved in cellulose degradation); and other carbohydrate degradation enzymes.
Carbon fixation: Carbon fixation capacity was examined for each of the 75 MAGs (Fig. 7). Almost all the MAGs had the genes encoding the enzymes involved in the 3-hydroxypropionate bi-cycle, reductive tricarboxylic acid cycle, and calvin cycle pathways (related enzymes are listed in Table S2). The two MAGs (bin2 and bin18 in the soil sample) possessed more than 50% complete 3-hydroxypropionate bi-cycle pathway. In the results, 27 MAGs (not listed) had one or more genes encoding the key enzymes, 2-oxoglutarate: ferredoxin oxidoreductase, pyruvate: ferredoxin oxidoreductase, and ATP-citrate lyase, of the reductive tricarboxylic acid cycle pathway. Although the calvin cycle pathway was relatively complete based on the KEGG modules, the key enzyme, ribulose-1,5-bisphosphate carboxylase-oxygenase (RuBisCO), was only identified in the two MAGs (bin9 in the soil sample and bin32 in the water sample), and another key enzyme of this pathway, phosphoribulokinase (PRK), was identified in the two MAGs (bin15 and bin44 in the water sample). Further, except for the two MAGs (bin9 in the soil sample and bin15 in the water sample), other MAGs had the genes encoding the enzymes involved in the reductive acetyl-CoA pathway, and most of the MAGs had the key enzyme, acetyl-CoA synthase, of this pathway. Except for bin12 in the water sample, 74 MAGs had the genes related to the dicarboxylate/4-hydroxybutyrate cycle pathway; the genes encoding the enzymes that related to the 3-hydroxypropionate/4-hydroxybutylate cycle pathway existed in 64 MAGs.
Methane metabolism: As we know, microbial methanogenesis and methane oxidation play essential roles in the biogeochemical cycles of the earth system. We found that, except bin54 in the water sample, almost all the MAGs contained genes related to the methanogenesis pathway, however, this pathway was incomplete. It can be seen from Fig. 7 that bin42 in the water sample possessed 17% complete methanogenesis pathway. The key enzyme of the methanogenesis pathway, methyl coenzyme M reductase (MCR), was found in 17 MAGs (not listed). Furthermore, no enzyme related to the methane oxidation pathway was identified in all the MAGs in the two samples (Fig. 7).
Nitrogen metabolism: As shown in Fig. 7, the genes related to the assimilation nitrate reduction pathway were identified in eight MAGs, while 13 MAGs had the genes encoding the enzymes related to the dissimilation nitrate reduction pathway. Only one MAG had the gene encoding the key enzyme, respiratory nitrate reductase, of the dissimilation nitrate reduction pathway. Besides, the genes related to the nitrogen fixation pathway were identified in three MAGs (Fig. 7), while the nitrogenase genes (nif, vnf and anf) were not found. Four MAGs (Fig. 7) had the genes related to the denitrification pathway, while no MAG had the genes related to the nitrification pathway. There were only two MAGs which had the genes encoding the key enzyme, nitrite reductase, of the denitrification pathway.
Sulfur metabolism: For sulfur metabolism, 52 MAGs (Fig. 7) had the genes related to the assimilatory sulfate reduction pathway, while eight MAGs had the genes related to the dissimilatory sulfate reduction pathway and 12 MAGs had the genes related to the thiosulfate oxidation by SOX complex pathway. Among the 12 MAGs, the marker gene, soxB, used to analyze the thiosulfate oxidation by SOX complex pathway was found in the ten MAGs (bin1 in the soil sample and bin1, bin2, bin35, bin41, bin44, bin46, bin51, bin54, bin55 in the water sample).
Stress response and metal resistance analyses
We investigated the subsystems of all the MAGs, with the results showing that more than 19 types of the subsystem [65] categories were identified, including the three most abundant categories: carbohydrates, amino acids and derivatives, and protein metabolism (see Table S13 for more information). All the MAGs contained respiration, stress response, and virulence, disease and defense categories. However, there were only four MAGs (one Planctomycetota MAG and three Proteobacteria MAGs) containing photosynthesis category. All RAST [64] annotations showed that the highest subsystem coverage was 33.0%, and the lowest was 12.0%.
Both the samples contained many stress response subsystems (Table S14), of which the detoxification subsystems were found in all the MAGs, the oxidative stress subsystems were found in the abundant MAGs except for the two Actinobacteriota MAGs, and the osmotic stress subsystems were found in 54 MAGs. Besides, we found that two MAGs (one Bacteroidota MAG and one Proteobacteria MAG) in the water sample had proteins associated with cold shock (CspA family of proteins). These proteins are induced when the temperature drops, allowing cells to adapt to lower temperature [69].
The annotation for heavy metal resistance genes revealed diverse resistance genes that confer resistance to 19 heavy metals such as silver, copper, iron, nickel, zinc, and antimony, including actP, copA, copB, silP, acn, ctpC, ziaA, etc (related genes are listed in the Table S15). The results showed that more than 1,064 heavy metal resistance genes were identified in all the MAGs (Fig. 8), among which the heavy metal resistance genes conferring resistance to multi-metals were abundant. Furthermore, the genes conferring resistance to antimony, arsenic, molybdenum and tungsten have been identified in all MAGs.
Biosynthetic gene cluster (BGC) identification
By using Anti-SMASH [67], a total of 270 putative BGCs clusters of secondary metabolites were identified in the 75 MAGs, revealing that the two samples contained 25 types of BGCs such as terpene, T3PKS, NRPS, lanthipeptide, and bacteriocin (Table S16); there were 96 terpenes (35.56%) categories, 40 NRPS (14.81%) categories, 35 T3PKS (12.97%) categories, and 27 bacteriocin (10.00%) categories. We found that most NRPSs existed in the Acidobacteriota MAGs, for example, 25 of 40 NRPSs existed in the Acidobacteriota MAGs. Furthermore, terpenes were found in 61 of all the MAGs. Similarly, T3PKSs were found in many MAGs in the two samples, NRPSs were found in the Acidobacteriota MAGs and the Bacteroidota MAGs, and PKSs were found in the Bacteroidota MAGs and the Actinobacteria MAGs. Eleven bacteriocins were found in the Cyanobacteria MAG. Interestingly, some of the BGCs identified in this study were highly similar to known gene clusters, while the others were not closely related to any known gene clusters.
Antibiotics resistance gene (ARG) annotation
ARGs were found in 15 of the 75 MAGs, including adeF, rpsL (mycobacterium tuberculosis rpsL mutations conferring resistance to streptomycin), etc. Among these 15 MAGs, adeF existed in all the soil sample MAGs, and ampS existed in bin10. In the water sample, besides adeF and ampS, rpsL accounted for the majority (5 out of 8). The detailed ARGs annotation information is shown in Fig. 9.