Study site and sample collection
Samples were collected from 15 m depth in Marguerite Bay at the site of the Rothera Time-Series73 (RaTS, latitude 67.572°S; longitude 68.231°W). Sampling conditions in the Antarctic are highly variable and not always favorable for sampling, still we were able to collect samples on 24 days over two consecutive productive seasons (7 months period, from February until April 2018 and from November 2018 until February 2019). The water column properties were characterized using a SeaBird SBE 19 + CTD that measured conductivity, temperature, depth, salinity, chlorophyll autofluorescence, and photosynthetically available radiation. Mix layer depth (MLD) was defined as the depth with a water density difference of 0.05 kg m− 3 relative to the surface19. Water samples were collected by performing sequential casts with Niskin bottles. These were emptied by gentle siphoning (using clean PC tube) into 20 L pre-rinsed polycarbonate carboys. All carboys and tubing used for sampling were decontaminated by rinsing with Contrad 70™ 2% v/v, followed by MilliQ ultrapure water (UP). Prior to sample collection, carboys were rinsed thrice with sample water. The carboys were protected from direct sunlight using opaque bags. Sampling was performed in the morning between 9:30 and 11:00 AM. Samples were further processed within 4 h from the start of sampling. The lab was kept at in situ temperature (with a minimum of 0.1°C), and if needed,samples were kept on ice.
For dissolved inorganic nutrient subsampling, seawater was filtered through a 0.2 µm pore-size Supor membrane Acrodisc® filter (Pall, Port Washington, NY 11050, USA) and collected in 3 times sample-rinsed high-density polyethylene pony vials (PerkinElmer, USA). The subsamples for dissolved inorganic nitrogen and phosphorus were stored at -20°C until analysis, and samples for dissolved inorganic silicate were stored in the dark at 4°C. Samples for chlorophyll-a (Chl-a) were collected by filtering 1 to 8.5 L of unfiltered (total Chl-a) and 20 µm reverse-sieved (< 20 µm fraction) onto GF/F filters (Whatman, UK), after which the filters were each wrapped in aluminum-foil, placed in a plastic bag and flash-frozen in liquid nitrogen and stored at − 80°C until further analysis. For microbial abundances, 1 mL of seawater was fixed with glutaraldehyde to 0.5% v/v final concentration (25%, EM-grade, Sigma Aldrich, USA) for 30 min at 1°C (in the dark), after which it was flash-frozen in liquid nitrogen and stored at − 80°C until analysis in the NIOZ laboratory (The Netherlands). Phytoplankton cells were counted on fresh samples (kept on ice) directly upon arrival in the Rothera lab.
Prior to handling the genomic samples, all bench surfaces were cleaned with Contrad 70™ 2% v/v (Decon Labs, East Sussex, UK) and rinsed with UP water. The cellular fraction metagenomic samples were collected by filtering 3 to 15 L (1h filtering time at in situ temperature) of whole water through a 0.22 µm Sterivex™ (MilliporeSigma, Massachusetts, US). The filters were sealed with parafilm and stored at -80°C until extraction. The viral fraction metagenomic samples were collected from 20 L seawater using the iron flocculation method according to Hurwitz et al. 2013 (ref74). Briefly, particles bigger than 0.22 µm were removed by two subsequential filtrations using a A/E glass fiber filter (142 mm, Whatman, GE Healthcare, UK) followed by a 0.22 µm PES filter (142 mm, MilliporeSigma, Massachusetts, US). The viruses were flocculated with 1 mg L-1 of FeCl3 (1h incubation) and collected onto a 1 µm PC filter (142 mm, Whatman, GE Healthcare, UK). Filters were stored at 4°C in the dark until further analysis.
Microbial abundances, chlorophyll-a, and nutrient concentrations
Viruses and bacteria were enumerated according to Brussaard et al. (2010) with modification to the TE-buffer (pH 8.2, according to ref76). Upon dilution in TE-buffer and staining with SYBRGreen-I (Sigma Aldrich, USA), the samples were enumerated using a FACSCalibur flow cytometer (BD Biosciences, USA) equipped with an air-cooled Argon laser with an excitation wavelength of 488 nm with the trigger on green fluorescence. Total phytoplankton abundances were obtained using a benchtop FACS Celesta flow cytometer (BD Biosciences, USA) with the trigger on red chlorophyll autofluorescence77.
Chlorophyll-a concentration (Chl-a) was determined using HPLC (according to ref18). Briefly, filters were freeze dried (48 h) and pigments were extracted using 90% acetone (48 h at 4°C in the dark)78. Pigments were separated by high performance liquid chromatography (HPLC; Waters model 2690), equipped with a 3.5 µm particle size Zorbax Eclipse XDB-C8 column with a cooled autosampler (4°C) according to Van Heukelem and Thomas (2001). Pigment detection and quantification was based on comparison with DHI LAB standards at 346 nm (Waters 996 HPLC Photodiode Array Detector).
The concentrations of dissolved inorganic macro-nutrients were obtained using a TRAACS 800 autoanalyzer (Bran + Luebbe, Germany), following Grasshoff et al. (1983) for nitrate and nitrite, of Murphy and Riley (1962) for phosphate, and of Strickland and Parsons (1968) for silicate (refs80–82). The calibration standards were diluted in low nutrient seawater (LNSW) with salinity levels similar to the samples (35‰). The detection limit was 0.007 µM for nitrate + nitrite, 0.001 µM for nitrite, 0.007 µM for phosphate, and 0.03 µM for silicate.
Viral and cellular fraction metagenomes
The DNA of the cellular fraction (Sterivex filters) was extracted using the Qiagen DNeasy Powersoil kit (Qiagen, Carlsbad, USA), according to the manufacturer’s instructions. The filters were opened and cut into approximately 3 x 3 mm pieces prior to 2x 10 sec bead beating at 3.55 m s-1 with a 30 sec dwell on a Bead Ruptor Elite (Omni Internationals, USA). Viruses from the viral fraction (iron flocculation method) were resuspended at 4°C overnight in a EDTA (0.1 M) -MgCl2 (0. 2M) - ascorbate (0.2 M) buffer with pH of 6.0 (1 mL per liter of sampled seawater). The resuspended sample was gently filtered through a 0.45 µm SFCA syringe filter (Sartorius, Göttingen, Germany) and treated with 1 U µL-1 of DNase I (Roche, Basel, Switzerland) for 18h at 4°C. DNase I was deactivated by adding EDTA and EGTA (0.1 M final concentration). The viruses were concentrated to a volume of 0.8 to 1 mL using a 100 kDa Amicon (MilliporeSigma, Massachusetts, US). DNA was extracted using Wizard® PCR Preps DNA Purification kit (Promega, Wisconsin, United States) according to manufacturer’s instructions. The DNA was eluted twice in 50 µL of 80°C molecular grade low TE buffer (pH 8.0, 10 mM Tris and 0.1 mM EDTA).
The libraries were prepared using the TruSeq Nano DNA Kit (Illumina, San Diego, CA) and sequenced with the Illumina platform NovaSeq6000 S2 (2x 150 bp), according to the manufacturer’s instructions, on average 80M reads per sample (58 to 123M). The read quality was assessed before and after quality trimming using MultiQC v1.883. PhiX reads were removed using bbduk v38.07 (https://sourceforge.net/projects/bbmap/), low complexity (dust filter threshold of 20) and PCR duplicates were removed with prinseq-lite v0.20.484. Illumina adaptors were clipped with trimmomatic v0.3685. To assess the degree of enrichment of our viromes, the abundance of microbial genetic markers was assessed using ViromeQC v1.0.1 in all metagenomes86.
PhyloFlash v3.487 was used to characterize microbial communities by target assembly and classification of the 16/18S rRNA gene. Reads that mapped to plastid 16S rRNA genes were removed.
Virus genome assembly, taxonomy, and host prediction
All reads from the viral fraction metagenomes were co-assembled into scaffolds using metaSpades v3.14.188. Scaffolds of at least 1500 bp long were checked for viral signals using PPR-Meta v1.189, which was found to be the most sensitive virus prediction tool with relatively few false positives in a benchmark work that included some of the samples from this study90. Sequences identified as viral but with no CheckV v0.7.0 viral hallmark genes and cellular hallmark genes31 and classified as non-viral by Cenote-Taker291 were removed. The remaining co-assembled scaffolds are equivalent to viral species-level operational taxonomic units (vOTUS) as clustering at 95% identity over 80% of the sequences8 resulted in the clustering of only 0.039% of the sequences.
We quantified the overlap between Rothera viral sequences and the Global Ocean Virome8 and Southern Ocean Virome29 datasets by performing an all versus all blastn which were then clustered at 95% average nucleotide identity over 80% of the sequence using the CheckV virus clustering.
For viral sequences longer than 5,000 bp or with > = 70% completeness (CheckV completeness), coding regions were predicted using prodigal92 (v2.6.3 -p meta). The viral sequences were clustered based on their coding regions using vContact293 using the RefSeq prokaryotic viruses (--db ProkaryoticViralRefSeq94-Merged) together with a costume collection of protist DNA viruses (Supplementary Table 3). The clusters were visualized using Cytoscape v3.8.2 with a perfuse force-directed layout94. Cenote-taker2 taxonomy was overlayed. The Caudoviricetes (the single class in the Uroviricota Phylum), Nucleocytoviricota, Preplasmiviricota, and archaeal virus groups were manually defined (Supplementary Fig. 9) which always corresponded with the taxonomy of reference genomes within those groups and largely with the Cenote-taker2 taxonomy for the metagenomic viral sequences. For the unknown sequences within the above defined groups were attributed the group phylum rank taxonomy. Scaffolds annotated as Caudoviricetes by Cenote-taker2 and all sequences within the Caudoviricetes mega cluster were further classified into family rank taxonomy using PhaGCN232. For Crassvirales, diverse and abundant, Maveriviricetes (virophages) and Polintoviricetes (polinton-like viruses) sequences, we performed a targeted identification and classification (see sections below).
The host genus was predicted for high and mid quality bacteriophage sequences (longer than 10kb or at least 70% complete) using iPHoP v0.938 with the standard cut-off for genus rank prediction and a 0.2 false discovery rate cut-off for class and phylum rank prediction (using All_combined_scores.csv). Conflicting host predictions were removed. The standard genus level output resulted in 744 predictions while the lower prediction threshold resulted in 1,668 predictions at the class level.
All viral sequences, excluding those classified as eukaryotic viruses, were inspected for the presence of the lysogeny genes integrase, excisionase, and bacteriophage repressor proteins genes. The coding regions were predicted using prodigal92 (v2.6.3 -p meta) which were matched by hmmsearch (e-value 1e-5, HMMER v3.1b2) to HMM profiles from the PFAM (34.0), KO (97.0), ncbi COGs (COGS2020) and by diamond search (e-value 1e-10, v0.9.22) to the nrNCBI (downloaded in March 2022). For integrase were considered matches to K03733, K14059, K04763, K21039, COG0582, COG1518, PF00589, PF13495, PF00665, PF13683, PF13102, PF18697, PF09003, PF02899, PF12835, PF02022, PF14659, PF18644, PF13976, PF13333, PF14882, PF12834, PF17921 and those containing the string “integrase” in the diamond top hit excluding those containing the string “CRISPR” and “Transposase”. For excisionase were considered matches to PF06806 and PF0782, and those containing the string “excisionase” in the diamond top hit. For the phage repressor protein were considered matched to PF00717, K22300, K07727, K18830 and COG2932.
Reads from the viral and cellular metagenomes were mapped to all scaffolds using bwa v0.7.17 mem95 Reads mapped with less than 95% identity over 80% of the read length were excluded with CoverM v0.6.196, and the number of reads mapped to each scaffold was calculated with samtools coverage v1.11 97. When scaffolds had less than 70% horizontal coverage, their abundance was set to 0. The read relative abundances were calculated by dividing the read counts by the total number of viral reads in each sample. The depth corrected scaffold vertical coverage was calculated by dividing the relative read abundance by the total scaffold length multiplied by the average number of viral read bases.
Antarctic Crassvirales bacteriophages
To identify Crassvirales scaffolds, we used a targeted approach that does not depend on protein predictions as Crassvirales are known to include alternative genetic codes44. First, we used transeq98 to obtain the six-frames translations of the DNA scaffolds (options: -frame 6 -table 11 -clean). HMM profiles of the Crassvirales phylogenetic marker genes (see ref44), including the major capsid protein (MCP), the terminase large subunit (TerL), and the portal protein (portal), were used to search for homologs via hmmscan (HMMER v3.1b2). Hmmscan results were then filtered (e-value < 0.05, coverage > 0.3). Scaffolds over 10 kb in length and containing matches to any of these three proteins were put forward for further inspection as candidate Crassvirales (n = 466).
To place the candidate Crassvirales scaffolds within the larger context of Caudoviricetes families, we used the Virus Metadata Resource from the International Committee on Taxonomy of Viruses (VMR_21-221122_MSL37). We selected at random ten isolates (or all if < 10 available) for each of the 43 Caudoviricetes families excluding: i) members of the Crassvirales order (Feb 2023), and ii) isolates from the Helgolandviridae (n = 1), Graaviviridae (n = 2), and Duneviridae (n = 6) from which no sequences could be extracted using our approach. This resulted in 245 species/isolates for which we obtained genomes, encoded proteins, and gene annotations from the NCBI Nucleotide database.
To reconstruct the phylogeny, we retrieved the TerL, MCP, and portal proteins encoded in the reference genomes by extracting the proteins containing in their headers "terminase", excluding "terminase small", "major head", "major capsid", and "portal", using grep and seqtk subseq (https://github.com/lh3/seqtk). We were able to extract the phylogenetic markers from most of the genomes (243 TerL (99.2%), 234 MCP (95.5%) and 185 portal (75.5%) proteins from the 245 genomes). In addition, we collected 673 Crassvirales genomes and TerL, MCP and portal protein sequences (from refs43,44). These sequences were used with those from our candidate Crassvirales to build separate phylogenetic trees of each protein. For the multiple sequence alignment, we used MAFFT99 (v7.475, option: –auto), modeltest-ng100 to determine the amino acids substitution model (option: -d aa), and IQ-TREE2 v2.1.1101 to compute the phylogeny (option: -m LG + G4 + F -B 1000). Candidate Crassvirales scaffolds for which all three proteins (TerL, MCP, and Portal) that fell within the clade formed by the ICTV-approved Crassvirales families were deemed novel Crassvirales sequences.
The ORFs of candidate scaffolds were detected with Prodigal v2.6.392 using three modes to consider stop codon reassignments (options: standard, -TGA W, and -TAG Q). These were functionally annotated using the marker protein HMMs from Yutin et al. 2021 (ref44). The trees and their annotation were created using the software suite ggtree v3.6.2102 and iTOL v5103. The genomic maps were built via gggenomes v0.9.5.9104.
Scripts used to obtain the results of the analysis of Crassvirales members can be found on github (https://github.com/MVFofanov/rothera_crassvirales).
Nucleocytoviricota viruses
To identify NCVs, we first binned all assembled scaffolds using vRhyme v1.1.0 for the viral fraction co-assembly and using CONCOCT v1.1.0 for the corresponding cellular fraction assemblies. Scaffolds were examined for Nucleocytoviricota markers by performing a hmmsearch on prodigal predicted proteins92 (v2.6.3 -p meta) against the nine giant virus orthologous groups HMM profiles using ncldv_markersearch46. All bins containing hits to phylogenetic marker genes were further investigated. Bins were decontaminated from likely contaminating cellular scaffolds using viralrecall105 (as per ref106). The bins were then screened for the seven genetic markers46. Only bins containing at least four out of seven markers were kept, except for those classified as Mirusviricota48 and those closely related with medusavirus (Mamonoviridae47) where all bins with at least two markers were kept. This is due to Mirusviricota not possessing all NCV marker genes48 and because only three markers were detected in the medusa virus isolate despite reported previously51. Bins with more than one duplication of the genes SFII, VLTF3, A32, and PolB were removed to avoid bins with high strain heterogeneity (as per ref106). We removed redundant bins by clustering bins at 95% average nucleotide identity or 100% marker gene amino acid identity and selected the best representative (most marker genes or highest N50). Eukaryotic contamination was identified by predicting (using Pfam reference) and classifying (using MMETSP_zenodo_3247846_uniclust90_2018) ORFs using MetaEuk v6.a5d39d9107.
A reference set containing NCV isolates and metagenomic bins were created based on previous metagenomic surveys46,48,106,108 and additional RefSeq NCV isolates as of September 2022 (Supplementary Table 4).
The marker genes were detected in the reference and bins protein sequences and alignments constructed and concatenated using ncldv_markersearch. The alignment was trimmed of regions where > 20% of the sequences have a gap and with an entropy score bellow 0.55 using BMGE v1.12109. The trimmed alignment was used to build a maximum likelihood tree using the best model finder option (-m MFP) with ultrafast bootstrap of 1,000 replicates. The best fit model was LG + F + R10. The tree was visualized using FigTree v1.4.4110.
Maveriviricetes and Polintonviricetes viruses
All scaffolds were inspected for virophages (class Maveriviricetes) and polinton-like viruses (PLVs, class Polintonviricetes) major capsid protein (MCP) using a collection HMM built using global, and clade restricted protein alignments. Rothera scaffolds ORFs were predicted and translated to amino acid using prodigal92 (v2.6.3 -p meta).
For the Maveriviricetes, we used a collection of non-redundant isolate and metagenomic MCP sequences52. The global and family rank MCP alignments were built with MAFFT99 (v7.407, -auto), and the HMM profile built with hmmbuild111 (HMMER v3.1b2). We used these profiles to identify putative virophage MCPs in the Rothera scaffolds’ amino acid sequences (> 200 aa) using hmmsearch (HMMER v3.1b2). The new putative virophage MCP were joined with the reference sequences and the Maveriviricetes phylogeny was built, and the sequences classified according to the newly proposed taxonomy (ref52).
For PLVs, HMM profiles were constructed using the MCP genes from a collection of curated PLV sequences34,56. PLV MCP identified in Yutin et al., 2015 (ref34) were used as queries to search the non-redundant NCBI protein database. The hits were collected, clustered with UCLUST112 with a similarity threshold of 0.5 and were iteratively aligned and clustered using the approach described by Wolf et al., 2018 (ref113). From the alignment of the cluster containing the query proteins, a tree was built using FastTree114. From the tree topology eight clades were manually inferred for which separate HMM profiles were built. Translated Rothera scaffolds were searched with these HMM profiles and the profiles provided by Bellas and Sommaruga, 2021 (ref56). The hits were mixed with reference PLV MCPs (proteins of the HMM profiles); the resulting protein set was iteratively aligned and clustered using the abovementioned approach; the alignment of the protein cluster containing reference PLV MCPs was used to build a tree using FastTree. Tree clades were named according to the clades as defined in Yutin et al., 2015 and Bellas and Sommaruga, 2021 (refs34,56).
For the functional annotation of the other virophage genes, all predicted proteins from putative virophage scaffolds were pooled with proteins predicted from virophage genomes available from Roux et al. 2023 (ref52). Proteins were clustered using mmseqs easy-cluster115, and for each cluster, alignments and profiles were generated by MAFFT99 (v7.505, - einsi), hhconsensus116 (v3.3.0) and finally hhmake116 (v3.3.0). All profiles were then searched against the Pfam (http://pfam.xfam.org/) and PDB (https://www.rcsb.org/) databases using hhsearch116 (v3.3.0, -E 1e-4). Matches with at least 90% probability, e-values below 1e-4, clear functional annotation as well as a score that was at least 90% of the best score per cluster were used for annotation. Annotation of PLV scaffolds was performed similarly to the virophages, with all predicted proteins from putative PLV scaffolds being pooled with proteins predicted from available PLV genomes34,56. We additionally searched the profiles against a database consisting of the profiles of clusters from Bellas and Sommaruga 2021 (ref56) and transferred annotations. The annotated genomes were visualized using gggenomes 104 in combination with ggtree102.
For all NCV bins, predicted proteins were searched against the GVDB database using HMMER v3.3.2. Upstream regions (50bp) of genes homologous to Cafeteria roenbergensis virus ‘late’ genes were extracted from the genomes and searched for enriched motifs using the streme tool from the meme-suite v5.5.0 and the protein-coding regions as negative controls. Candidate motifs were filtered for non-AT consensus sites and scores below 1e-3. Similarly, upstream regions of all virophage genes were extracted and searched for enriched occurrences of these candidate motifs using the sea tool (--thresh 1) from the meme-suite, again using the protein-coding regions as negative controls. PLV scaffolds above 5,000 bps were analyzed similarly for occurrences of a consensus ‘early’ NCV promoter motif (WWWWWTGWWWWW).
Viral diversity and correlation with environmental parameters
We performed a multivariate analysis on viral sequences of high and mid quality (longer than 10 kb or at least 70% complete) to avoid highly fragmented genomes. Viral read abundance was normalized by sequence length and centered-log ratio transformation using microbiome R package v1.20.0 (https://github.com/microbiome/microbiome). We explored the seasonal dynamics of alpha diversity by calculating the species richness and the Shannon and inverse Simpson diversity indexes for each day using hillR v0.5.1117, which are expressed in effective number of species. Simpson evenness was calculated by dividing the inverse Simpson index by the species richness. Additionally, we tracked the number and abundance of dominant viruses, defined as viral scaffolds representing over 0.5% of the viral reads at any given time (n = 203).
We performed a hierarchical cluster analysis based of the Aitchison distance using hclust118 (method centroid), and the normalized abundances visualized in heatmap using the stats package in R.
Microbial abundances and Chl-a concentrations were log-transformed and other environmental metadata was visually checked for normality and z-scored or log transformed to achieve a distribution closest to normal. Pairwise Pearson correlation was computed and from strongly covarying environmental metadata pairs, one was removed. To find which combination of variables better correlated with changes in the viral community, we fitted a general multivariate regression model to our data using Adonis2 (vegan v2.6-4)119. Variables were sequentially permuted so that the variables were input to the model by order of highest sum of squares, non-significant variables were removed from the model.