2.1 Oyster sampling
The oyster samples collected in this study span five years, from June 2014 to July 2019. We divided the samples into nine time batches according to the chronological order. In addition, all the samples were divided into four other groups: four amplification groups based on the amplification methods (whole genome amplification (WGA), whole transcriptome amplification (WTA), reverse transcription and WGA (RT-WGA), and double-stranded DNA (dsDNA)); two tissue groups based on tissue origin (mixed tissues and hemolymph of adults); two status groups based on health status (diseased and moribund), and seven Site groups based on sampling sites (BH, HD, LJ, SZ, TS, YJ, and ZH) (Fig. 1E). In total, we constructed 54 sequencing libraries with 35 samples. For more information, see Table S1.
For eight of the nine time batches, the tissues without the gonad from three oysters were mixed into one sample; the seventh batch was the exception. The first batch, dCh, contained samples of dying adult C. hongkongensis collected from an oyster farming area in Beihai (BH) of Guangxi Province in June 2014. The second batch had two groups, YJd and YJr, and contained samples of healthy adult C. hongkongensis collected in September 2015 from an oyster farming area in Yangjiang (YJ), Guangdong. The downstream amplification method for YJd was WGA (to detect mainly DNA virus genomes) and for YJr it was WTA (to detect mainly RNA virus genomes). The third batch had eight groups, LJd, LJr, QZd, QZr, TWd, TWr, ZHd, and ZHr, and contained healthy adult C. hongkongensis collected from oyster farming areas in the Qinzhou area (QZ) of Beihai (BH), Tanwei area (TW) of Huidong (HD), Zhuhai (ZH), and Lianjiang (LJ) of Guangdong Province in November 2015. The fourth batch had two groups, SZd and SZr, and contained healthy adult C. hongkongensis collected from the Shenzhen (SZ) oyster farming area in Guangdong in April 2016. (The letters “d” and “r” indicate WGA and WTA, respectively, in the third and fourth batches.) The fifth batch, ML, contained healthy adult C. hongkongensis collected from SZ in May 2016. The downstream amplification method for ML was RT-WGA (to detect both DNA and RNA virus genomes). The sixth batch, BH, contained moribund adult C. hongkongensis collected from BH in July 2016. The seventh batch had nine groups. GX, K1ZY, K2ZY, T2S, T4S, T5S, T6S, T8S, and ZH, and contained adult C. hongkongensis that were separately collected from BH in Guangxi Province, Kaozhouyang (K#ZY) of Huidong (HD), Taishan (T#S), and Zhuhai (ZH) in Guangdong Province in May 2017. K1ZY, K2ZY, and T8S contained healthy adult C. hongkongensis, and the other groups contained moribund adult C. hongkongensis. The method of sampling in this batch was different from the method used for all the other batches. A 1-mL syringe was used to draw hemolymph from the pericardial cavity of oysters, and samples from 5–8 oysters were mixed into one sample. The eighth batch, os, contained adult C. gigas collected in July 2018. The samples in these eight batches were collected and preserved by the South China Sea Fisheries Research Institute (Guangdong, China). The ninth batch had two groups, HSd and HSr, and contained healthy adult C. hongkongensis purchased from the Huangsha Aquatic Product Market in Guangzhou (GZ), Guangdong, in July 2019; their original farming location was ZH. The samples in this batch were collected and preserved by Guangdong Magigene Technology Co., Ltd (Guangzhou, China). All the samples were quickly frozen in liquid nitrogen, temporarily stored during transportation, and placed in an ultra-low temperature freezer at −80°C for long-term storage.
2.2 VLP enrichment
All 35 samples were processed to enrich for VLPs as described by Wei et al. (2017) and using the online protocols (dx.doi.org/10.17504/protocols.io.m4yc8xw). First, 500 mg of mixed tissue, or 14–34 mg spat mixture, was dissected and ground to powder in liquid nitrogen. The powder was further homogenized in approximately 2–5 volumes of sterile SB buffer (0.2 M NaCl, 50 mM Tris-HCl, 5 mM CaCl2, 5 mM MgCl2, pH 7.5). After three rounds of freezing and thawing, the pellets were resuspended entirely in 10 volumes of pre-cooled SB buffer. For the hemolymph sample, 10 mL hemolymph was mixed with an equal volume of 2×SB buffer, then directly subjected to three rounds of freezing and thawing. The following steps were the same for the tissue, spat, and hemolymph samples. All the samples were centrifuged at 1,000, 3,000, 5,000, 8,000, 10,000, and 12,000 × g for 5 min each at 4°C using a 3K30 centrifuge (Sigma, Osterode am Harz, Germany), and the supernatants were retained. Cell debris, organelles, and bacterial cells were further removed using a Millex-HV 0.22 μm filter. The filtrates were transferred to ultracentrifuge tubes containing 28% (w/w) sucrose using a syringe. The tubes were transferred to an ice bath for 10 min before centrifugation in a Himac CP 100WX ultracentrifuge (Hitachi, Tokyo, Japan) at 300,000 × g for 2 hr. Supernatants were discarded and the precipitates were fully resuspended in 720 μl of water, 90 μl 10 × DNase I Buffer, 90 μl DNase I (1 U/μl), and incubated at 37°C with shaking for 60 min, followed by storage overnight at 4°C, and transfer to 2-ml centrifuge tubes.
2.3 Viral nucleic acid extraction and amplification
Total nucleic acid was extracted from the VLPs using an HP Viral DNA/RNA Kit (R6873; Omega Bio-Tek, Norcross, USA); carrier RNA was not used, to avoid potential interference with sequencing results. A Qubit™ dsDNA HS Assay Kit (Q32851) and Qubit™ RNA HS Assay Kit (Q32855) (Thermo Fisher Scientific, Waltham, USA) were used to quantify the concentrations of dsDNA and RNA separately.
Virome studies are highly reliant on amplification because the viral biomass in natural samples is very low (Polson et al., 2011; Bar-On et al., 2018). Because most amplification methods introduce bias, it is challenging to study viromic data quantitatively at present (Parras-Moltó et al., 2018; Fan et al., 2021). Here, a REPLI-g Cell WGA and WTA Kit (150052, Qiagen, Hilden, Germany), which is based on the multiple displacement amplification (MDA) method, was used to uniformly amplify the genomes (WGA) and transcriptomes (WTA) (Hosono et al., 2003; Pan et al., 2013; Picher et al., 2016). MDA has many significant advantages over other amplification methods, such as replicating up to 70 kb, more even coverage, and 1000-fold higher fidelity than Taq polymerase amplification (Hosono et al., 2003; Stepanauskas et al., 2017), which make MDA widely used in virome studies.
To better compare the RNA and DNA virus communities, we used WGA and WTA methods to construct libraries in four batches of mixed tissues, which accounted for 70% (38/54) of all libraries (Table S1). RT-WGA is a modified protocol that simultaneously amplifies DNA and RNA (Wei et al., 2018b; Li et al., 2019). In this study, 14 libraries were constructed based on RT-WGA, including hemolymph and mixed tissue samples (Table S1). The steps for the WGA, WTA, and RT-WGA were according to the online protocols (dx.doi.org/10.17504/protocols.io.m5vc866). For WTA, there is a “DNA wipeout” step before reverse transcription that aims to remove DNA altogether, but this step is not part of the WGA and RT-WGA protocols. Compared with WTA and RT-WGA, the WGA protocol skips the reverse transcription reaction to avoid amplifying RNA in the downstream reaction. In addition, two other samples were directly subjected to random shotgun library preparation using a Nextera XT DNA Library Preparation Kit (Illumina) following the standard manufacturer’s protocol. Because of the limited data quality and sample number, these two libraries were not included in the following diversity analysis.
2.4 Library construction and sequencing
Amplified DNA was quantified by gel electrophoresis and Nanodrop 2000 spectrophotometer (Thermo Fisher Scientific) and randomly sheared by ultrasound sonication (Covaris M220) to produce fragments ≤800-bp long. The sticky ends were repaired, and adapters were added using T4 DNA polymerase (M4211, Promega, USA), Klenow DNA Polymerase (KP810250, Epicentre), and T4 polynucleotide kinase (EK0031, Thermo Fisher Scientific, USA). Fragments of 300–800 bp were collected after electrophoresis. After amplification, libraries were pooled and subjected to 150 bp, 250 bp, or 300 bp paired-end sequencing on Novaseq 6000, HiSeq X ten, and Miseq platforms (Illumina, USA). Considering the RT-WGA libraries were likely to have higher virus diversity than the WGA and WTA libraries (Wei et al., 2018a), they were sequenced with higher depth and also produced better assembly results (Table S1).
2.5 Virus detection and quantification based on reference viral genomes
Instead of using the traditional read alignment tools such as BLAST, BWA, and Bowtie2, we used FastViromeExplorer (Tithi et al., 2018), which was developed for fast and accurate virus detection and quantification in metagenomics data. FastViromeExplorer filters the alignment results based on minimal coverage criteria and the minimal number of mapped reads and accurately reports virus types and relative abundances. The Kallisto (version 0.43.1) method, integrated with FastViromeExplorer, was used with the default settings to map clean reads against three reference databases: the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database, Global Ocean Virome database (GOV; Roux et al., 2016), and the Integrated Microbial Genome/Virus (IMG/VR) system, separately, to generate a reference abundance table. The RefSeq database contained 14,042 viral genomes or genome segments (update till 30 May 2019), GOV (Roux et al., 2016) includes 298,383 epipelagic and mesopelagic viral contigs, and IMG/VR contains 125,842 metagenomic viral contigs of the set of sequences collected from the Joint Genome Institute “earth virome” study (Paez-Espino et al., 2016).
2.6 Virus detection and quantification based on de novo assembly (vOTU annotation)
High-quality clean reads were generated using Fastp (version 0.20.0) (Chen et al., 2018), (options: --correction, --trim_poly_g, --trim_poly_x, --overrepresentation_analysis, --trim_front1=16, --trim_tail1=2, and --length_required=50) and reads that matched the Illumina sequencing adapters were removed (option: --detect_adapter_for_pe). The clean reads in libraries that were in the same assembly group were pooled and assembled using Megahit (version 1.2.9) (Li et al., 2015) with the default settings. Only contigs longer than 800 bp were kept. To detect low abundant contigs, clean reads that did not map back to the first round of assembled contigs were reassembled for two additional rounds, then all remaining reads were pooled and assembled together. Contigs from all four assembly rounds were pooled, and clustered at 97% global average nucleotide identity with at least 90% overlap of the shorter contig using cd-hit-est (version 4.8.1) (options: -aS 0.9 -c 0.97 -G 1 -M 0 -T 0 -g 1), resulting in 3,347,421 nonredundant contigs (Fig. 1A).
Diamond (Buchfink et al., 2015) is a state-of-the-art method that can annotate sequences with high precision and speed. Compared with BLAST searches against virus only databases, BLAST searches against the NCBI nr database of nonredundant protein sequences can significantly lower the number of false positive results (Nouri et al., 2018; Yao et al., 2020). However, BLAST has relatively low accuracy for short fragments (Jiang et al., 2011) and it cannot be used for sequences that have no similarity. Although, it is possible to dig deeper into the “dark matter” by combining multiple virus mining tools, such as CheckV (Nayfach et al., 2021) and VirSorter2 (Guo et al., 2021), identifying and classifying suspected viral sequences is challenging because there is a lack of adequate credible annotations (Handley et al., 2019). Given these challenges, the nonredundant contigs were annotated only using Diamond (version 0.9.24.125, options: -e 1e-10, --max-target-seqs 50) against the NCBI nr database (as of 11 July 2019). Among them, 728,784 (21.77%) of the total contigs were annotated as viral origin (i.e., vOTUs). Among them, 7.68% were Eukaryota, 0.34% were Archeae, 21.59% were bacteria, and 0.82% were unclassified cellular organisms, and 47.89% unknown origin (Fig. 1A). FastViromeExplorer was used with default settings to map the clean reads against the vOTU contigs to obtain the vOTUs abundance table.
2.7 Viral genome integrity, taxonomy, and auxiliary metabolic genes analysis
The viral genome completeness of assigned contigs was tested using CheckV (version 0.7.0) and its associated database (Nayfach et al., 2021). After removing false positive contigs that matched more host genes than viral genes, 3,473 nearly complete viral genomes were obtained.
Three methods (Diamond, vContact2, and PhaGCN) were used to determine the taxonomy of the viral contigs at the family level. Diamond annotations were further processed using two scripts (daa2rma and rma2info) in MEGAN6 (Huson et al., 2016) with default parameters, and parsed to taxonomy annotations. The advantage of Diamond is that there is no minimum length requirement for query sequences; however, it has three drawbacks: low accuracy, low annotation rates, and inaccurate taxonomy of NCBI. PhaGCN is a novel semi-supervised learning model that combines the strengths of a BLAST-based model and learning-based model using a knowledge graph (Shang et al., 2017). For comparison purposes, only vOTUs that were longer than 10 kb were analyzed using PhaGCN and vContact2 with default parameters.
To mine the auxiliary metabolic genes (AMGs) from DOV, Vibrant v1.2.1 (Kieft and Anantharaman, 2020) was used. Salmon v1.5.2 (Patro et al., 2017) was used with default settings to map clean reads against the AMG dataset to obtain the AMGs abundance table.
2.8 Viral contamination assessment
The experimental preparation for viromic sequencing involves the use of various reagents, many of which have been proved to carry contaminated viral sequences of unknown origin (Holmes, 2019). The extent of viral contamination in common laboratory components, especially viruses with small single-stranded DNA (ssDNA) genomes, has been reported previously (Asplund et al., 2019; Ashleigh et al., 2021).
To assess the viral contaminant level in this study, all the 3,347,421 nonredundant contigs (≥800 bp; not only viral contigs) in DOV were used as queries in a BLASTN search (with the parameters set as 95% identity and 95% query coverage) against the approximately 500 contaminant viral sequences reported by Asplund et al. (2019) and Ashleigh et al. (2021). We found very little evidence of viral contamination; no sequences matched with 100% identity, no expected circoviruses or RNA viruses were detected, and most of the alignments were with dsDNA phages (Additional file 1). The 3,473 near-complete viral genomes were used as queries in the same BLASTN search, but no matches were found. We also used Salmon (v1.5.2) to map all the clean reads in the DOV libraries to the contaminant viral sequences. The mapping rates for most of these libraries were <0.01% (Additional file 2), which is consistent with the BLASTN results.
2.9 Viral community and statistical analysis
In this study, the FPKM (Fragments Per Kilobase per Million) value was used to represent the relative abundance of the reference viral genomes, vOTUs, and AMGs. On the basis of the FPKM-transformed abundance table, R and Excel were used to analyze the corresponding viral diversity and community structures. The vegan and ggplots R packages were used to calculate α-diversity indexes and plot the nonmetric multidimensional scaling (NMDS). Analysis of variance (ANOVA) and TukeyHSD were used to test the differences between groups with the significance level set at 0.05. For the procrustes analysis, the characteristic axis coordinates of NMDS were extracted as the input of the procrustes function, and the protest function was used to perform the substitution test to evaluate the significance of the results.