Discovery of abundant baculoviral homologs in spider genomes using a combined strategy of structure and sequence comparisons
We developed a comprehensive approach (Fig. 1a) to identify the uncovered NALDVs nrEVEs. First, by querying the predicted baculoviral protein structures in the AlphaFold Protein Structure Database cluster (AFDB50), we identified a distinct spider species, Araneus ventricosus, harboring multiple baculoviral homologs dispersed across 283 scaffolds of the genome (Table S1). Structural alignment revealed significant similarities between the proteins identified in A. ventricosus and their baculoviral homologs. For example, three proteins showed a homology probability over 0.95, with the conserved baculovirus core proteins (PIF-1, PIF-4, and LEF-4) of Autographa californica multiple nucleopolyhedrovirus (AcMNPV), while their amino acid sequence identities were below 20% (Fig. 1b). Subsequently, the open reading frames (ORFs) of selected scaffolds (Table S2) with large DNA virus-coding strategies (e.g., higher ORF density) were extracted, and their structures were predicted. After structural comparison and sequence alignment of these proteins with the predicted baculoviral protein structures, we identified over 30 homologs of baculoviruses, including 22 baculoviral core proteins (conserved among all baculoviruses) (Fig. 1a). These core proteins included those involved in DNA replication and transcription (e.g., helicase, dnap, p47, lef-4, lef-5, lef-8, and lef-9), as well as virion structure and other functions (e.g., pif-0, pif-1, pif-2, pif-3, pif-4, pf-5, pif-6, pif-8, vp39, 38k, 49k, odv-ec43, odv-ec27, p33, and p48) (Table S2).
To comprehensively investigate the occurrence and distribution of baculoviral homologs in spider genomes, we screened 22 baculoviral core homologs of A. ventricosus against all available arachnid genome assemblies in the National Center for Biotechnology Information (NCBI) Whole Genome Shotgun (WGS) database, encompassing 92 species from diverse orders, including Araneae, Trombidiformes, Sarcoptiformes, Ixodida, Mesostigmata, Scorpiones, Opiliones, and Pseudoscorpiones (Table S3). Our analysis revealed 10641 baculoviral homologs across 89 species (Table S4). Among the 32 assembled spider genomes, 10091 baculoviral homologs were identified in 1742 scaffolds, spanning 13 distinct families (Fig. 1c), with 12 families from the infraorder Araneomorphae and one family (Theraphosidae) from more ancient infraorder Mygalomorphae. Most baculoviral homologs were identified in the order Araneae, whereas a few baculoviral homologs with lower identity levels were identified in other orders (Fig. 1d).
Broad distribution of baculovirus-related nrEVEs in spider genomes
Spider scaffolds harboring baculoviral homologs (Table S5) displayed a wide spectrum, ranging from several kilobase pairs to hundreds of millions of base pairs (Fig. 2a). Several species containing hundreds of baculoviral homologs were further analyzed (Fig. 2b). All chromosomes of T. clavata from SpiderDB27 (Table S6), and selected chromosomes of Hylyphantes graminicola, Latrodectus elegans, Dolomedes plantarius, Meta bourneti, Metellina segmentata, and Parasteatoda lunata from the WGS database contain clustered distribution of baculoviral homologs. Chromosomal regions with baculoviral homologs usually contained direct repeats, inverted repeats, large fragments of repetitive sequences, and host genes, confirming that these baculoviral homologs were endogenous in the spider genome (Fig. 2c). Compared to the adjacent regions, the baculoviral homology rich regions displayed a notable increase in the percentage of ORFs coverage, suggesting sequential integration of large viral DNA fragments (Fig. 2c). Collectively, the distribution and localization of baculoviral homologs confirmed the widespread existence of baculovirus-related nrEVEs in spider genomes.
Spider nrEVEs-related mRNAs, piRNAs and miRNAs exist in spiders
We further conducted an extensive survey of the baculoviral homologs in the NCBI Transcriptome Shotgun Assembly (TSA) database. In total, 1689 TSA datasets covered 1138 different spider species from 76 families (Table S7). Remarkably, mRNAs with baculoviral homologs were widespread in Araneae (Table S8). In addition, the number of baculoviral homologs in different spider families was strongly correlated with the number of datasets available for each family (Fig. 3a). However, the detection rate of core baculoviral homologs varied among the different families, and no obvious correlation was identified between the detection ratio and spider classification (Fig. S1a).
We further examined the transcriptomic data of T. clavata available in SpiderDB, which contains the most detailed data for different spider tissues. Nearly full-length mRNA of certain baculoviral homologs were recovered (Fig. S1b). Although the expression levels of EVE-related mRNAs were relatively limited (FPKM < 3), they displayed a broad expression pattern in the transcriptome data from different spider tissues, with a higher detection ratio in different parts of the major silk gland (Fig. S1c).
Additionally, we surveyed the small RNA (sRNA) sequencing data of the major silk glands of T. clavata, revealing highly abundant sRNAs with many baculoviral homologs. Notably, the piRNA levels of odv-ec43, p48, and p47 were significantly higher in the major silk glands, and the sRNA patterns of pif-4, lef-9, and 38k showed a typical miRNA pattern (Fig. 3b). In addition, piRNAs appeared to be mainly present in the ducts of major silk glands, whereas miRNAs were mainly present in the tail (Fig. 3b).
Identification of spider nrEVEs and related mRNAs in different spider individuals
To further confirm the prevalence of nrEVEs in individual spiders, we purified genomic DNA from the collected A. ventricosus spiders and subjected them to high-throughput sequencing. Sequencing quality was confirmed by mapping the reads to A. ventricosus reference assembly scaffolds (accession number: BGPR01), with an overall alignment rate of 82%. The coverage of the top nine scaffolds (AvEVE1-9), which contained the highest number of types of baculoviral homologs, were mapped with a coverage ranging from 42.62–98.24%, and mismatch rates ranging from 1–3.2% (Table 1). To explore how the nrEVEs were selected in the spider genome, we analyzed the ratio (Ka/Ks) of the non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) between our AvEVE sequences and reference assemblies, revealing that the baculoviral homologs in the AvEVEs underwent strong purifying selection pressure (Fig. 3c). Moreover, we performed PCR assays to identify the presence of baculoviral homologs (lef-8, p33, pif-0, and pif-1 of AvEVE2 and lef-9, pif-5, vp39, and 38k of AvEVE3) in the five A. ventricosus individuals. Remarkably, these genes were consistently detected in all the tested spiders, regardless of sex, indicating the existence of AvEVEs in A. ventricosus genome (Fig. S2).
Table 1
Alignment detail of AvEVE1-9
nrEVEs | Accession | No. of type of BHs | Length (bp) | No. of Reads* | Mismatch# (%) | Coverage (%) |
AvEVE1 | BGPR01003736.1 | 20 | 128845 | 3926 | 3.1 | 42.62 |
AvEVE2 | BGPR01006520.1 | 20 | 79615 | 18629 | 1 | 98.24 |
AvEVE3 | BGPR01005754.1 | 18 | 89329 | 226558 | 1.1 | 87.76 |
AvEVE4 | BGPR01001437.1 | 7 | 257312 | 28483 | 3.2 | 56.45 |
AvEVE5 | BGPR01014530.1 | 7 | 39014 | 11448 | 1.3 | 96.90 |
AvEVE6 | BGPR01017730.1 | 7 | 32001 | 18304 | 1.5 | 77.52 |
AvEVE7 | BGPR01014537.1 | 6 | 39004 | 68159 | 1.6 | 95.13 |
AvEVE8 | BGPR01022151.1 | 6 | 25695 | 1327 | 2.6 | 72.74 |
AvEVE9 | BGPR01031596.1 | 5 | 17954 | 37204 | 1.7 | 70.70 |
*The read number was gained from the alignment result of sequencing data mapping to the full assemble scaffolds of BGPR01. |
#The mismatch percent represents the mismatch rate of all reads. |
In addition, samples of T. clavata were used to evaluate the expression levels of nrEVEs genes in diverse tissues via quantitative reverse transcription PCR (qRT-PCR). The results demonstrated a markedly higher expression level of baculoviral homologs in the silk gland than in other tissues (Fig. 3d).
Spider nrEVEs likely originated from an unidentified ancestor NALDV
To understand the origin of spider nrEVEs, sequences of the most conserved homologs in Naldaviricetes, PIF-0, were subjected to maximum likelihood phylogenetic analysis. The resulting phylogenetic trees of PIF-0 showed that spider nrEVEs formed a monophyletic group comprising three distinct clades (Clades I, II, and III), that were evolutionarily distant from other NALDVs (Fig. 4a). Phylogenetic analyses were conducted using concatenated sequences of PIFs or LEFs from the same scaffolds. The phylogenetic tree based on concatenated sequences of PIFs and LEFs also showed that spider nrEVEs formed a monophyletic group with high confidence and that members of this group were evolutionarily distant from other NALDVs (Fig. 4b and 4c). Notably, phylogenetic trees based on concatenated sequences of PIFs or LEFs confirmed the separation of Clade I and Clade II of spider nrEVEs but lacked Clade III (Fig. 4b and 4c). This is because in contrast with Clade I and Clade II nrEVEs which contain closely distributed PIFs and LEFs, Clade III typically contains isolated nrEVE sequences.
To understand the origin of spider nrEVEs, we selected nine representative scaffolds harboring conserved PIFs and LEFs spanning lengths within the range of 80 and 500 kb for further analysis (Table S9). Among these, three belonged to Clade I, whereas the remaining six were classified as Clade II. All proteins from these nine scaffolds and representative NALDVs were analyzed via structural comparison (Table S10). Based on our structural comparison results and previously studies11,28,29, we systematically summarized the conserved proteins of spider nrEVEs and NALDVs (Table S11). Notably, these scaffolds collectively harbored all conserved NALDV and lefaviral proteins, except AC81. In addition, they contained the most conserved baculoviral and nudiviral proteins, except for VLF-1 and P6.9, as well as some baculovirus-conserved proteins (49K, ODV-EC43, ODV-EC27, and P48) (Fig. 4d and Table S11). Five proteins conserved in spider nrEVEs (helicase-2, flap endonuclease, DNA-binding protein, nudix hydrolases, and AC106/107) were found in other NALDVs. Additionally, 11 predicted proteins were conserved in spider nrEVEs that did not exist in NALDVs (Fig. 4e), including homologs of protein kinases and innexins (Fig. 4f). Collectively, our data suggests that spider nrEVEs evolved from an unclassified ancient NALDV belonging to the order Lefavirales, of the class Naldaviricetes.
nrEVEs and spiders have engaged in long-term coevolution
To understand the relationship between spider nrEVEs and their host spiders, we systematically conducted a co-phylogenetic analysis of the PIF-0 proteins with the classification of spiders. The results showed that the evolutionary route of spider nrEVEs was highly consistent with the classification and evolutionary trajectories of spiders (Fig. 5a and Fig. S3). Specifically, the nrEVEs and transcripts from Clades I and II exclusively exist in the Entelegynae lineage, including the retrolateral tibial apophysis (RTA) clade, Orbiculariae clade, and other lineages (e.g., Palpimanoidea and Eresoidea). The nrEVEs from Clade III exist in more ancient spider lineages, such as Mesothelae, Mygalomorphae, and Haplogynae, and in a few species from the RTA and Orbiculariae clades.
Table 2
Age estimates and their 95% HPD intervals of spider nrEVEs and NALDVs
Node | Time, Mya | 95% HPD, Mya | Prior | Classification |
1 | 398.12 | 300.65-511.22 | | NALDVs |
2 | 357.31 | 273.63–455.00 | | Lefavirus |
3 | 353.97 | 265.34-458.87 | | |
4 | 332.86 | 254.63-427.89 | | |
5 | 329.86 | 248.89-428.24 | | |
6 | 270.17 | 202.96-349.58 | 387.00 ± 81.1531 | Mesothelae Spider nrEVEs |
7 | 217.36 | 159.03-286.51 | | |
8 | 207.03 | 162.18-259.81 | | Nudivirus |
9 | 203.80 | 152.41-263.95 | | |
10 | 198.47 | 145.99-261.23 | | Baculovirus |
11 | 179.97 | 136.62-230.08 | | |
12 | 179.42 | 143.04-223.59 | | |
13 | 179.32 | 126.52–242.00 | | Nimavirus |
14 | 166.92 | 114.47-226.42 | | Filamentovirus |
15 | 161.41 | 129.20-201.25 | | |
16 | 157.98 | 114.53-209.03 | | |
17 | 154.96 | 117.88-197.96 | 201.00 ± 43.1531 | Entelegynae Spider nrEVEs Clade I&II |
18 | 137.65 | 103.80-179.87 | | |
19 | 131.09 | 91.38-176.67 | | |
20 | 119.21 | 89.67-154.19 | | Spider nrEVEs Clade I |
21 | 117.44 | 83.28-159.22 | | |
22 | 107.33 | 79.75–138.70 | | |
23 | 104.01 | 95.50-112.38 | 103.38 ± 4.4130 | Bracovirus |
24 | 102.53 | 68.71-142.16 | | Hytrosavirus |
25 | 101.78 | 73.02-136.77 | | Spider nrEVEs Clade II |
26 | 95.14 | 63.96-129.92 | | |
27 | 86.33 | 61.37–116.10 | | |
28 | 81.98 | 53.56-118.43 | | |
29 | 81.48 | 60.50-106.94 | | |
30 | 81.44 | 56.33-111.03 | | |
31 | 72.99 | 53.05–95.63 | | |
32 | 63.96 | 42.56–90.28 | | |
33 | 62.14 | 41.18–86.96 | | |
34 | 53.79 | 36.96–72.92 | | |
35 | 51.30 | 34.18–71.95 | | |
We performed molecular dating analyses to elucidate the evolutionary events and time scales of nrEVEs in spiders, and their evolutionary history with NALDVs. By using the time to the most recent common ancestor (TMRCAs) of Chelonus inanitus bracovirus (CiBV) and Cotesia congregata bracovirus (CcBV)30, and the diversion times of spiders31 as a calibration point, we estimated the age of spider nrEVEs from three subclades and other NALDVs using the concatenated sequences PIF-0 and LEF-8 (Fig. 5b and Table 2). Our analysis indicated that the emergence of the TMRCAs of NALDVs was ~ 398 million years ago (Mya); the TMRCAs of all lefaviruses were ~ 357 Mya; the TMRCAs of spider nrEVEs and nimaviruses were estimated to be ~ 354 Mya; the divergence times of nudivirus and baculovirus was ~ 333 Mya, which is consistent with previous studies30; and the TMRCAs of hytrosavirus, filamentovirus, and AmFV were ~ 330 Mya. Moreover, our molecular dating analysis revealed that the TMRCAs of all spider nrEVEs were ~ 270 Mya, and the divergence time of the spider nrEVEs was highly consistent with those of the species to which they belonged.