SAR324 population-genome binning
We generated 14 deltaproteobacterial SAGs from seven depths at Station ALOHA and retrieved draft SAR324 genomes (28 MAGs and 6 SAGs, [17, 25–31]) from public databases, screened for similarity to Station ALOHA populations by metagenomic read mapping. Average nucleotide identity (ANI) was used to assess the relationship between the set of selected draft genomes, and a threshold of 95% was set as a boundary for population delineation [32, 33]. Genes from the genomes within a population were catalogued treated as a “population genome”. From this, a dataset of 48 draft SAGs and MAGs comprising 98,323 predicted genes was reduced to 18 population genomes, yielding 51,089 predicted genes. This approach reduced sample complexity by decreasing the redundancy, simultaneously improving genome completeness (Supp. Table 1).
ANI clustering and classification of SAR324 genomes
Using ANI as a distance metric (Fig. 1 CD), the 18 population genomes were classified into one NB1-j outgroup (SLc_064) and five subclades that could be affiliated to SAR324 (sensu Wright et al 1997 [5]) as identified by their 16S rRNA genes (Supp. Figure 4). These comprised: five population genomes (CLc_004, CLc_007, SLc_201, CLc_018, CLc_019) grouped into subclade A, three population genomes (SLc_090, CLc_010, SLc_189) into subclade B, a single SAG from this study (SLc_081) composed the subclade C, three population genomes (SLc_063, CLc_017, CLc_003) grouped into subclade D and five population genomes (CLc_008, CLc_001, CLc_012, CLc_002, SLc_002) into subclade E. Genomic GC percentages (Supp. Table 1) were different between outliers (66.6%), and SAR324 genomes (43.03% ± 3 (SD)). GC content was conserved within subclades, supporting their close phylogenetic relationships: subclade A (45.64% ± 1.21), subclade B (46.59% ± 1.28), subclade D (41.98% ± 2.08) and subclade E (41.53% ± 0.83). Genomic GC percentages were slightly different between the population genomes (containing only coding sequences) and individual genomes that composed them (1.01 ± 0.02%), supporting the congruence of genomic features between the genomes pooled together into population genomes.
Subclade definition by ANI clustering (with more than 80% ANI between subclades) was congruent with the one obtained using classical phylogenetic marker genes like the 16S rRNA gene (Supp. Figure 5). Placing the 16S rRNA genes from our SAR324 genomes by parsimony into the SILVA reference phylogenetic tree (Supp. Figure 4), produced clusters similar, but less resolutive, to those based on ANI of genomes. Using ANI distance to address phylogenetic relationships allowed us to finely cluster partial genomic fragments even if they did not harbor a satisfying set of phylogenetic markers (16S rRNA gene [34] or universal single-copy marker genes [35]).
SAR324 abundance in metagenomes and ecotypes delineation
Metagenomic reads, from samples SAGs were isolated in 2015/2016, were mapped onto the SAR324 genes extracted from population genomes. Average coverages over the different population genomes were used to assess their depth distributions (Fig. 1E). Distributions of the population genomes were depth-specific, with internal subclade consistency, supporting a phylogenetically coherent ecotype distribution. Subclade A was only present in surface waters but in lower abundance than the subclade B that was distributed above the DCM (maximal between 75 m and 125 m but not present below 200 m deep). Subclade C and D were present at the DCM and immediately below (125–200 m). Subclade E was abundant below 500 m (500, 750, 1000, 4000 m) reaching the highest abundance for the entire clade at 1000 m deep (almost 10-fold higher than surface abundance). Hence, 4 ecotypes were defined, based on genomic similarity and depth distribution: surface ecotype (subclade A), above-DCM ecotype (subclade B), below-DCM ecotype (subclades C and D), and deep ecotype (subclade E).
Meta-pangenome of SAR324
Predicted proteins from population genomes were aligned and clustered into orthogroups (groups of orthologous proteins) representing the pangenome of SAR324 (Fig. 1A). Orthogroups were classified and displayed according to their frequencies in population genomes (Fig. 1B). A small number of orthogroups (225 orthogroups, Supp. Figure 1A) were shared among all genomes (ie ‘core genome’, in red, Fig. 1B). A higher proportion of orthogroups was shared exclusively among subclades (‘core’ genome of subclade A in yellow, subclade B in green, subclades D and E in blue, Fig. 1B). 2674 orthogroups were shared within the subclade A, among the total 4533 specific orthogroups set of this subclade (Supp. Figure 1A), 1064 among the total 2530 for the subclade B, 159 among the total 1079 for the subclade D, and 994 among the total 3392 for the subclade E.
Genomes from subclades D and E were closely related to one another, sharing many orthogroups in common (976 orthogroups exclusively shared between subclades D and E), that had high ANI values. More shared orthogroups (612) were obtained when genomes were grouped by ecotypes (Supp. Figure 1B) rather than by subclades. Surface ecotypes and deep ecotypes encompassed most of the unique orthogroups (4533 and 3392, respectively). The highest number of shared orthogroups (1303) were found between the below-DCM and deep ecotypes.
The SAR324 core genome was distributed evenly across all depths, while accessory genome showed depth-specific distributions (Fig. 1F). Core-genome accounted for about 12% (9.5–14.5%) of the orthogroups of each ecotype (Supp Table 3) while singletons (i.e. orthogroups found in a single population genome only) accounted for about 35% (28.5–39.1%). About the half of the orthogroups (53%) was shared by genomes between ecotypes or within a single ecotype (eco-genome i.e orthogroups unique to an ecotype). Interestingly, eco-genome was in a higher proportion at surface (44%) than deeper.
Time and depth distribution of SAR324 ecotypes at Station ALOHA in 2011
To determine SAR324 ecotype depth distributions across time, average coverages were assessed by mapping them with metagenomic time-series reads from Station ALOHA (Fig. 2). The general depth distributions averaged across all dates (Fig. 2A) displayed the clear depth zonation of ecotypes. Depth distributions (Fig. 2B) showed consistent patterns of ecotype zonation with time, but with divergent temporal variations in abundance for each ecotype. Surface ecotypes were more abundant at 25 m and 75 m, with a deeper distribution in January and December. Above-DCM ecotypes had higher abundance at 125 m in May and early November, and high abundance at 75 m in January. The below-DCM ecotype had a high abundance at 125 m in November but was consistently present at 200 m and deeper. The deep ecotype was mostly restricted to water masses below 500 m, with higher abundances at the end of November at 500 m and 750 m.
The deep chlorophyll maximum (DCM) zone (fluctuating around 125 m deep) and the nutricline in the disphotic zone (steep change in nutrients concentration and water masses characteristics between 200 m and 500 m deep) appeared to be two biological relevant boundaries for the ecotype distribution of SAR324. Environmental parameters fluctuated more over time in the euphotic zone (Supp. Figure 2), and the depth of the DCM varied slightly with time at Station ALOHA which could explain the higher variability in time and depth of distribution maxima for shallower ecotypes compared to those of the deeper ones.
Metabolic pathway reconstruction and distribution
To reconstruct metabolic pathways of each ecotypes, protein orthogroups from the pangenome were annotated and explored using the KEGG database. Coverage within each relevant metabolic pathway across depth and time was averaged and clustered according depth profile (Fig. 3). Metabolic pathways grouped into distinguishable clusters with defined depth and temporal distributions. Similar to ecotype depth distributions, the DCM and nutricline coincided with the boundaries separating metabolic pathway distributions. The majority of pathways were abundant at 500 m and 750 m (cluster Ia) with diverse pathways involved in carbon fixation, utilization of C1 (THF) and C2 (Glyoxylate) compounds, phosphonates, sulfonates and sulfur. All of those pathways followed the same temporal patterns with a maximal abundance in November. A group of pathways (cluster Ib), comprising degradation of nitrotoluene and benzoate, displayed tight distribution patterns with higher abundance from 125 m to 1000 m and a maximum at 500 m. Cluster Ib was abundant when the abundance of SAR324 members peaked, in January, May, early and late November. Cluster Ic, including fructose, mannose and selenocompound metabolic pathways, had less defined distributions and were absent at 200 m depth. The pentose phosphate pathway, taurine metabolism, and ion transporters were grouped into cluster II, that reached a maximum around 200 m, with highest abundance in January and May. Clusters IIIa and b were distributed in surface waters with maxima at 75–125 m and 25 m, respectively. The cluster IIIb was mainly present in January, and IIIa in May. Remarkably, carotenoid biosynthesis pathway, starch and sucrose metabolism, and butanoate metabolism were present only at 25 m (IIIb). Nitrogen metabolism and vitamins metabolisms (B6 and B8), as well as structures involved in cell motility were mostly found above 200 m (cluster IIIa).
Key enzyme and protein distributions
The presence of enzymes or proteins involved in key reactions of the metabolism (central carbon metabolism, source of electron, nutrients use) were identified among population genomes and ecotypes by KEGG gene annotations (Fig. 4). To verify that gene distributions reflected genome distributions (and were not artificially due to genome incompleteness), the coverage for each gene in metagenomes associated with SAGs was computed and scaled from 0 to 1 to smooth the effect of gene copies (right panel in Fig. 4). The presence and distribution of key enzymes were structured according to depth, in a similar pattern as what was seen for metabolic pathways and ecotypes, with the DCM and nutricline corresponding also to metabolic boundaries. Central metabolic characteristics, including inferred sources of carbon and energy, coincided with the vertical zonation of ecotypes within the SAR324 clade.
Carbon metabolism
All SAR324 population genomes and ecotypes harbored the full TCA and pentose phosphate cycles, as well as complete glycolytic pathways (Supplementary maps ko00010, ko00020, ko00030). SAR324 genomes from all depths also encoded for C1-compounds use pathways, through the folate and tetrahydrofolate (THF) cycles (Supplementary maps ko00670, ko00790). In addition, the glyoxylate shunt of the TCA cycle, especially the key isocitrate lyase, was encoded in SAR324 genomes from ecotypes below the DCM and preferentially distributed in metagenomes below 200 m (Fig. 4).
SAR324 genomes from different ecotypes encoded different carbon fixation pathways. Genomes from below 200 m harbored the full Calvin-Benson-Basham (CBB) cycle, including the key enzyme ribulose-bisphosphate carboxylase (RuBisCO) (Fig. 4). Below-200 m genomes harbored type Ia RuBisCO, able to perform carboxylation reaction, while the two population genomes from above-200 m displayed type IV, a RuBisCO-like protein (RLP) involved in other reactions, possibly the biosynthesis of low-molecular-weight thiols essential for oxidation of thiosulfate and elemental sulfur (Supp. Figure 10). Type Ia RuBisCO from SAR324 showed active transcription in metatranscriptomes from 4000 m (data not shown [36]). The aerobic carbon monoxide dehydrogenase (CoxSLM), that oxidizes carbon monoxide to carbon dioxide under aerobic conditions, was present in almost all SAR324 population genomes analyzed, but was preferentially distributed below 200 m.
The fumarate hydratase and 2-oxoglutarate:ferredoxin oxidoreductase coding genes (Fig. 4, Supplementary map ko00720), allowing the TCA cycle to function in reverse (rTCA) and fix carbon, were present in genomes along with the canonical TCA enzymes (Supplementary map ko00020). Genes coding for enzymes involved in TCA cycle were mostly found in below-200 m metagenomes, while the fumarate hydratase gene was also detected in above-200 m metagenomes. Genomes from the surface ecotype possessed the gene coding for the phosphoenolpyruvate (PEP) carboxylase, which provides oxaloacetate to TCA/rTCA by fixing bicarbonate to PEP. The gene encoding the carbonic anhydrase, which converts carbon dioxide in bicarbonate used by PEP carboxylase was also only present above 200 m.
The complete C1-based tetrahydrofolic acid cycle (THF) was present in SAR324 genomes and distributed below 200 m (Supplementary map ko00670). In addition to the THF cycle, the presence in deep genomes and metagenomes of multiple genes could support a putative carbon dioxide fixation through methanogenesis (Supplementary map ko00680). Notably, were present genes coding for the methane/ammonia monooxygenase (AmoA), for oxidoreductases involved in trimethylamine metabolism, such as the trimethylamine monooxygenase (Tmm), the heterodisulfide reductase (HdrA) as well as the hydrogenase (HyaAB), and for the F420 co-enzyme biosynthesis pathway.
SAR324 genomes from divergent ecotypes harbored different ABC transporters (Supplementary map ko02010), including for potential carbon sources, with marked depth distribution (Supp Fig. 3). For example, fructose and urea transporters were preferentially distributed above the DCM while glucose/mannose between 125m and 200m, and phospholipids and amino-acids, below 200m.
Sulfur metabolism
Multiple enzymes involved in sulfur metabolism were encoded in SAR324 genomes (Fig. 4). Genes coding for SOX system (or TOMES, Thiosulfate Oxidation MultiEnzymes System), oxidizing thiosulfates to provide reducing power for cell, were present in population genomes from all ecotypes, but the system was not complete in any ecotype. The SoxYZ complex (sulfur chelating and binding proteins) pivotal proteins of TOMES that covalently bound sulfur substrate, were found in all ecotypes. Sulfohydrolase SoxB, that hydrolyzes SoxY-cysteine-S-sulfate liberating sulfate, was present and distributed throughout the entire water column except for 200 m. At the 200 m boundary, only the heterodimeric cytochrome c SoxXA complex, that oxidatively couples the sulfane sulfur of thiosulfate to a SoxY-cysteine-sulfhydryl group of the SoxYZ complex, was detected.
In addition to SoxXA, SoxYZ and SoxB, the gene coding for SoxC (sulfane dehydrogenase), forming the SoxCD complex, was present in SAR324 genomes from surface populations and detected into surface metagenomes (Fig. 4). This distribution of Sox complexes suggests that SAR324 performed thiosulfate oxidation through the canonical Kelly-Friedrich (Sox) pathway at the surface. A RuBisCO-like protein (RLP, type IV) that may be involved in biosynthesis of low-molecular-weight thiol essential for oxidation of thiosulfate and elemental sulfur, was encoded in one of the surface population genomes (CLc019) as well as in the outgroup SLc064 (Supp Fig. 10).
SoxYZ complex and SoxB were the only part of the SOX system found in below-200 m genomes and metagenomes, respectively. However, genes coding for other enzymes related to sulfur utilization were found and preferentially distributed below 200 m such as those coding for the adenosine 5'-phosphosulfate (APS) reductase (AprAB), for the membrane QmoABC complex (Quinone-interacting membrane-bound oxidoreductase, the electron acceptor for AprAB), and for the ATP-sulfurylase (sat, a homolog of QmoABC generating ATP from APS). Genes for assimilatory sulfate reduction (notably ferredoxin-based and PAPS (CysH) reductases), were also detected in genomes below 200 m and distributed preferentially in deep metagenomes. Dissimilatory-type sulfite reductase (DsrAB) was present in both above 200 m and deeper populations. The heterohexameric DsrEFH and DsrC proteins, suspected to transfer sulfur from a persulfurated carrier molecule to the dissimilatory sulfite reductase DsrAB, were present in three population genomes (CLc010, SLc002 and SLc081) from above-DCM and deep ecotypes. The simultaneous presence of the SoxYZ-B complex, DsrAB, the DsrEFH-C complex, the AprAB-QmoABC complex, and ATP-sulfurylase (sat) below 200 m suggested that SAR324 from the deep are able to use the branched pathway for thiosulfate oxidation through reverse dissimilatory sulfate reduction (rDSR) and APS processing.
SAR324 genomes also harbored two ways to retrieve sulfite from the environment: from alkanesulfonate with the transporters (SsuACB) and the monooxygenase (SsuD) in genomes from above 200 m and from taurine with transporters (TauACB) and dioxygenase (TauD) in genomes from below 200 m.
The dimethylpropiothetin dethiomethylase (DddL), a carbon-sulfur lyase producing acrylate from DMSP (and producing DMS as byproduct), was present in genomes above 200 m. Acrylate may be converted into acryloyl-CoA by the propionyl-CoA synthetase (PrpE) then transformed into 3-hydroxypropionate-CoA by the enoyl-CoA hydratase (PaaF), feeding the TCA/rTCA cycle. In deeper populations, genes coding for the dimethylsulfide dehydrogenase (DdhABC) were found suggesting that DMS might also be oxidized into DMSO.
Nitrogen metabolism
Analyzed SAR324 genomes did not display nitrate / nitrite transporters nor enzymes to reduce, by any pathways, nitrate into nitrite. However, the majority of them possessed nitronate monooxygenase (Ncd2, NMO), oxidizing nitroalkane into nitrite and a formamidase, hydrolyzing formamide into formate and ammonium (Fig. 4). Above-200 m genomes harbored the ferredoxin-nitrite reductase (NirA) to reduce nitrite into ammonium as well as a nitrilase to hydrolyze nitrile into ammonium. No genomic evidence was found to support the use of nitrate to oxidize sulfur compounds.
Surface genomes possessed cynS and cynT genes coding for cyanate lyase and carbonic anhydrase, respectively. From cyanate and bicarbonate, cyanate lyase produces carbamate, that is spontaneously transformed into ammonium and carbon dioxide which is then recycled into bicarbonate.
Without associated pathways detected in this dataset, the nosZ gene coding for the nitrous oxide reductase was detected in one above-DCM genome, the nitric oxide-forming reductase (NirK) and the methane/ammonia monooxygenase (PmoA / AmoA) were present in some deep genomes.
Most analyzed genomes also coded for urease, degrading urea into ammonia and carbon dioxide, and its associated transporters. Considering all these findings, the ability to access and assimilate nitrogen contained in organic compounds (urea, nitroalkane, formamide, cyanate, possibly isocyanates, and related compounds) may be an important capability of the SAR324 lineage.
Phosphorus metabolism
Transporters for phosphates (PstABCS) were only detected in above-200 m genomes and were mostly distributed in metagenomes from above the DCM while transporters for phosphonates (PhnCDE) were present in most of genomes and preferentially distributed below 200 m. Most genomes possessed phosphonoacetate hydrolase (PhnA), catalyzing the hydrolytic cleavage of phosphonoacetate to inorganic phosphate (Fig. 4). Phosphonoacetaldehyde hydrolase (PhnX) was present and distributed in genomes and metagenomes from the deep. Interestingly, the key C-P lyase (PhnJ), that cleaves PRPn to retrieve phosphorus from organic compounds and release methane, was only present in surface genomes and only distributed in 25 m metagenomes.
Rhodopsin-based phototrophy
Two orthologous groups of rhodopsins were harbored by SAR324 genomes, both from above 200 m (Fig. 4). Protein sequence homology with close orthologs from Station ALOHA metagenomes (Supp Fig. 7) and phylogenetic placement into a type II-rhodopsin reference tree (Supp Fig. 6, [37]) revealed that one group was homologous to proteorhodopsin (PR) (environmental cluster 4, super-cluster I) and the other was distant from known rhodopsins and was most similar to xanthorhodopsins (XR). Proteorhodopsins are transmembrane, retinal-binding proteins that utilize light energy to transport protons outside the cell, generating a proton motive force across the membrane [38, 39]. All PR-like sequences from SAR324 population genomes had the DTE-Q motif of key amino-acid residues in transmembrane helix 3, with a key lysine that binds retinal on transmembrane helix 7, suggesting that they function as blue-tuned proton pumps. XR-like SAR324 sequences also encoded the key lysine residue on the seventh helix and had a DTE-L motif in the third helix. XR-like type amino-acid sequences were identical to a full length rhodopsin isolated from Station ALOHA metagenomes (HOT226_1_0075m_c26173_1), that when expressed in E. coli, bound retinal and catalyzed light-driven proton pumping [40]. The leucine in position 105 suggests that XR-like rhodopsins are tuned to absorb in green light. Tertiary structure prediction from amino-acid sequences confirmed the 7 transmembrane helix tunnel structure for both PR and XR-like types (Supp Fig. 7).
PR and XR were both present in genomes of both surface ecotypes, but XR was absent in the above-DCM populations (Fig. 4). Gene distributions in metagenomes showed congruent patterns with PR type distributed in 25 m and 125 m while XR-like type only at 25 m. Enzymes (lycopene beta-cyclase, phytoene desaturase and beta-carotene oxygenase) involved in the biosynthesis of retinal were all present in surface genomes but absent from above-DCM ones. Genes encoding them were only distributed in surface metagenomes. Only one of the two oxygenases, the beta-carotene 15,15'-dioxygenase (Blh), performing the last step of β-carotene cleavage, was present in one of the Above-DCM population genomes and distributed down to 125 m.
Synteny maps for XR-like (Supp Fig. 8) and PR type (Supp Fig. 9) opsins were evaluated by BlastN searches. Genes flanking PR type were not conserved among SAR324 genome variants, except between two genomes from subclade A. Only a contig from the SLc 189 population genome had a flanking gene involved into the retinal biosynthesis (blh gene coding for the beta-carotene 15,15'- dioxygenase). In contrast, genomic regions flanking XR-like encoding genes were highly conserved among all the SAR324 genomes. XR-like genes were directly adjacent to a complete set of genes encoding enzymes for retinal biosynthesis.
Distributions in time of PR and XR-like types were explored by metagenomic read mapping, which revealed little temporal variability among the different ecotypes bearing them (data not shown). Diel variability in SAR324 opsin gene abundance and transcription was explored via metatranscriptomic and metagenomic read mapping (Fig. 5). PR gene transcription was below the limit of detection, while XR-like opsins exhibited diel transcript abundances, with a XR transcript maxima occurring early in the morning each day.