Two families of glue genes in D. melanogaster
Alignments of the amino acid sequences encoded by the eight glue genes of D. melanogaster and their annotated orthologs from various Drosophila species [26] revealed that Drosophila glue genes form two distinct gene families and that there is no sequence similarity between them apart from the signal peptide (Fig. 1, Fig. S1, Files S1-2). The first gene family comprises Sgs1, Sgs3, Sgs7, Sgs8 and Eig71Ee (Fig. 1, File S2) whereas the second gene family contains Sgs4, Sgs5 and Sgs5bis (Fig. S1, File S2). Genes of the first gene family are characterized by an IRXC[L/V]C motif in the encoded C-terminal domain and by a phase 1 intron that interrupts the signal peptide sequence at a position corresponding to amino acid 10 (Fig. 1A). Proteins of the second family display a PCXXXXK motif in the C-terminal region (Fig. S1A).
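As an illustration of how the two diagnostic motifs could be used to sort candidate proteins, the following Python sketch encodes them as regular expressions, with X read as any amino acid. The function name, the 60-residue C-terminal window and the toy sequences are assumptions made for this example only; they are not taken from the annotation procedure described here.

```python
import re

# Motifs as written in the text; X is interpreted as "any amino acid".
FAMILY1_MOTIF = re.compile(r"IR.C[LV]C")  # IRXC[L/V]C
FAMILY2_MOTIF = re.compile(r"PC.{4}K")    # PCXXXXK


def classify_glue_protein(protein, cterm_window=60):
    """Assign a protein to a glue gene family from its C-terminal motif.

    cterm_window is a hypothetical cutoff defining the C-terminal region;
    the text does not specify one.
    """
    cterm = protein.upper()[-cterm_window:]
    hit1 = FAMILY1_MOTIF.search(cterm) is not None
    hit2 = FAMILY2_MOTIF.search(cterm) is not None
    if hit1 and hit2:
        return "ambiguous"
    if hit1:
        return "family 1"
    if hit2:
        return "family 2"
    return "unassigned"


# Toy sequences invented for the example (not real Sgs proteins).
toy_family1 = "MKLTIVLLAAATTTTPKPTTTKAIRGCLCGG"
toy_family2 = "MFKLLVALSSSGGPCAAAAKDD"

print(classify_glue_protein(toy_family1))
print(classify_glue_protein(toy_family2))
```

A motif scan of this kind can only complement alignments and synteny, since short motifs may occur by chance in unrelated proteins.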
In a previous study [26], we found that for the group of Sgs1, Sgs3, Sgs7 and Sgs8 genes, the rate of gene gains and losses was significantly higher than for average genes. To examine further the evolutionary dynamics of gene copies in this glue gene family and the factors influencing their rate of evolution, we took advantage of high-quality genome assemblies that became available in 2021 [24]. We chose to focus on closely related Drosophila species that diverged relatively recently, so that we were unlikely to interpret as gene copy stasis situations that resulted from rapid duplications followed by the elimination of one of the duplicated copies. In the present study, we did not analyze Eig71Ee, as it has a supplementary role in immune defense and is thus probably subject to additional functional constraints compared to the other glue genes. Overall, we examined the evolutionary dynamics of four glue genes (Sgs1, Sgs3, Sgs7 and Sgs8) across 24 Drosophila species.
Existing genome annotations are often incomplete for Sgs genes
Using BLAST [29], we identified and annotated all copies of the Sgs genes that are orthologs of Sgs1, Sgs3, Sgs7 and Sgs8 in high-quality genome reference sequences of D. melanogaster and 23 other Drosophila species (Tables S1-3, File S1). Compared to previous studies of Sgs genes in diverse Drosophila species [26, 30], we analyzed here the genome sequences of 6 additional Drosophila species: D. teissieri, D. triauraria, D. rufa, D. jambulina, D. obscura and D. subobscura. In contrast to the previous study by Da Lage et al. [26], which used only protein sequences from D. melanogaster as queries for BLAST searches, we used Sgs sequences from all species as BLAST queries and compared large genomic syntenic blocks between species. We thus identified 13 additional Sgs genes in the species examined by Da Lage et al. and annotated 13 new Sgs genes in genome sequences from four other species (Table S3). Furthermore, we corrected gene annotations for five Sgs genes in five species, where introns were absent or mislabeled (Table S3, File S1).
Da Lage et al. [26] annotated four Sgs7 genes in D. suzukii based on a low-quality genome assembly [31]. Using a more recent PacBio-based genome assembly [32] of the same strain, we found only one copy of Sgs7, located at the same position as in the closely related species D. biarmipes. This illustrates that determining the number of gene copies depends heavily on high-quality genome assemblies [20, 21]. In the present study we relied on PacBio- and Nanopore-based genome assemblies for all species, except for D. eugracilis and D. takahashii, for which only Illumina-based genome sequences were available (Table S1).
A new nomenclature for Sgs3 genes
While D. melanogaster harbors a single Sgs3 gene, multiple copies of this gene were previously found in several Drosophila species and were distinguished with letters a, b, c according to the number of copies per species and to the order of their discovery in each species [26]. Here, as we found even more Sgs3 copies, we decided to change the gene nomenclature for better comparison between species. We define Sgs3x as the Sgs3 ortholog that is deleted in the melanogaster subgroup and that is flanked in other species by the Parg (CG2864) and Mnt (CG13316) genes in a large genomic syntenic block, which corresponds to position 3E2 on the X chromosome in D. melanogaster. All the other Sgs3 copies are in a large genomic syntenic block corresponding to region 68C10-11 on chromosome 3L in D. melanogaster. We labeled them from ‘b’ to ‘g’ from 5' (near the Chrb gene) to 3' (near the CG33489 gene) according to their respective positions within this genomic locus. We note that for serendipitous reasons there is no Sgs3a gene in this new nomenclature. Sgs3 genes located at the same corresponding position in the genome of diverse species were labeled with the same letter.
Several Sgs genes incorrectly contained premature stop codons
The coding regions of Sgs1 and Sgs3 contain long internal repeats encoding motifs rich in proline, serine and threonine [25]. Premature stop codons were found in genome sequence assemblies within the repeated region of Sgs1 in four species (D. takahashii, D. rhopaloa, D. triauraria and D. ficusphila) and of Sgs3x in D. biarmipes. Using a D. takahashii strain different from the genome sequence line, we PCR-amplified the region containing the presumptive premature stop codon and found an extra A nucleotide compared to the reference sequence of Sgs1, making up a stretch of 8 adenines instead of 7. The addition of this adenine removed the premature stop codon and gave a full-length Sgs1 coding region. In D. triauraria we found 6 premature stop codons dispersed throughout the 4,212-bp repeated region of Sgs1, with frameshifts adjacent to each stop codon. The presence of repeats prevented us from amplifying the region by PCR, so we do not know whether these are genuine stop codons or sequence assembly artifacts. Analysis of raw reads from full genome sequencing projects suggests that the D. rhopaloa Sgs1 reference sequence may be corrected by adding an extra 'A' (supported by 21 reads compared to 42 reads harboring a deletion), that the D. ficusphila Sgs1 reference sequence should be corrected by removing a 'C' from a 6-bp stretch of C (supported by 45 reads harboring a deletion versus 10 reads harboring an extra C) and that the D. biarmipes Sgs3x reference sequence should be corrected by adding an extra 'C' (supported by 13 reads compared to 4 reads harboring a deletion) (Fig. S2, File S3). We therefore used the corrected sequences for these three species in our subsequent analyses.
In summary, we detected premature stop codons in five Sgs genes. Four of them likely correspond to sequence assembly errors. For D. triauraria Sgs1, it is not clear whether the 6 premature stop codons are real or artifactual.
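The way a single missing nucleotide in a homopolymer run can create an apparent premature stop codon may be illustrated with a toy example. The sketch below is a minimal illustration using the three stop codons of the standard genetic code; the two sequences are invented and only mimic the 7-versus-8 adenine situation described for D. takahashii Sgs1, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def first_stop_codon(cds):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def has_premature_stop(cds):
    """True if an in-frame stop occurs before the last complete codon."""
    stop = first_stop_codon(cds)
    last_codon = len(cds) // 3 - 1
    return stop is not None and stop < last_codon


# Invented toy sequences: the "reference" carries 7 adenines, the
# "corrected" one carries 8, which restores the reading frame.
reference = "ATGCC" + "A" * 7 + "TGACCTCGTAA"
corrected = "ATGCC" + "A" * 8 + "TGACCTCGTAA"

print(has_premature_stop(reference), has_premature_stop(corrected))
```

In the toy reference the frameshift brings a TGA triplet into frame, whereas the sequence with 8 adenines is read through to its terminal stop codon.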
The Sgs1, Sgs3, Sgs7 and Sgs8 genes form four subfamilies
The four genes Sgs1, Sgs3, Sgs7 and Sgs8 encode proteins with a signal peptide and conserved amino acid motif patterns in the N-terminal and C-terminal regions (Fig. 1A, Files S4-5). They harbor two coding exons and a short phase 1 intron interrupting the signal peptide. They can be grouped into four subfamilies based on their genomic location and synteny: Sgs1, Sgs3 (which includes Sgs3b-g genes but not Sgs3x), Sgs3x and Sgs7-8 (see below for a description of each subfamily). Sgs coding sequence length varies greatly between genes and species, with Sgs1 being the longest gene (longer than 1.7 kb in all species) and Sgs7-8 the shortest ones (between 222 and 240 bp in all species) (Fig. 2, Files S5-6). The genes Sgs7 and Sgs8 are closely related to Sgs3; they can be distinguished from Sgs3 by the length of their coding sequence (Fig. 2) and by their different genomic locations (see below).
Sgs1 did not duplicate and was lost at least twice via gene deletions
In all the Drosophila species studied, Sgs1 is composed of a first coding exon that is always 28 bp long, a short phase 1 intron whose size varies between 50 bp and 71 bp, and a second exon that harbors a long repeat region and whose size varies from 1,758 bp in D. takahashii to 5,861 bp in D. rufa (Table S4). The synteny of Sgs1 and its neighboring genes is conserved across all species (Fig. 3–5, Table S3). BLAST searches did not detect Sgs1 in D. erecta or D. kikkawai. The loss of Sgs1 in D. erecta and in D. kikkawai is associated with a 4-kb and a 3-kb deletion, respectively (relative to the D. teissieri and D. jambulina sequences, respectively), which removed the full Sgs1 coding region while preserving the two neighboring coding genes hoe2 and CG14044 (Fig. 4–5, File S7). We conclude that two recent Sgs1 gene losses occurred, in association with gene-wide deletions.
In the outgroup species D. pseudoobscura, D. obscura and D. subobscura, and in more distantly related species, no Sgs1 gene was found at the syntenic location (Fig. 5) or elsewhere in the genome via BLAST. This suggests that the Sgs1 gene appeared after the lineage leading to D. melanogaster diverged from the lineage leading to these species, i.e. about 30 million years ago [33]. Our analysis reveals that since its appearance within the Drosophila genus, the Sgs1 gene has maintained the same neighboring genes throughout all the Drosophila species we examined and that it did not duplicate.
Sgs3x did not duplicate and was lost at least three times via gene deletion
As for Sgs1, the first coding exon of Sgs3x is 28 bp in all the studied species and the second exon harbors repeats and varies in size, from 581 bp for D. elegans to 4,148 bp for D. bipectinata. In all species featuring an Sgs3x gene, the gene is located at the same corresponding genomic location, between genes Parg (CG2864) and Mnt (CG13316) (Fig. 3).
The most parsimonious scenario is that Sgs3x was already present in one copy in the ancestor of the species studied here. Based on our phylogenetic analysis and parsimony, we infer that Sgs3x has been lost three times: before the most recent common ancestor of D. melanogaster and D. erecta (melanogaster subgroup) (Fig. 6, via a 1-kb deletion when compared with D. eugracilis), in the ancestor of D. triauraria, D. rufa, D. jambulina and D. kikkawai (montium group) (Fig. 7, via a 2-kb deletion compared to D. bipectinata) and in the ancestor of D. ficusphila (Fig. 7, via a 1-kb deletion compared to D. elegans). Overall, Sgs3x exhibits an evolutionary history similar to that of Sgs1: it did not change neighboring genes, did not duplicate and experienced deletions of its full coding sequence in a few species.
Two Sgs3 copies lost their internal repeats in the lineage leading to D. obscura
We define Sgs3, Sgs7 and Sgs8 as copies of the Sgs1-Sgs3-Sgs7-Sgs8 gene family that are present within a large genomic syntenic block corresponding to region 68C10-11 on chromosome 3L in D. melanogaster. The Sgs3 genes are distinguished from Sgs7 and Sgs8 by the presence of repeats and by longer coding regions (Fig. 2). However, in D. obscura, at the loci occupied by Sgs3b and Sgs3d in D. subobscura, we detected two Sgs3 genes that are shorter (both 270 bp) than typical Sgs3 genes (Fig. 2) and lack internal repeats, but share similar N-terminal and C-terminal regions with their corresponding Sgs3 copies in D. subobscura (Fig. 8). Dot plots suggest that the repeated sequences of Sgs3b and Sgs3d were lost in the lineage leading to D. obscura (Fig. 8–9). We named the resulting genes in D. obscura Sgs3bshort and Sgs3dshort. The coding sequences of these two genes are extremely similar (Fig. 1B), suggesting that they originate from a recent gene conversion event in the lineage leading to D. obscura (Fig. S3-4). In addition to Sgs3bshort and Sgs3dshort, D. obscura possesses a copy of Sgs3e harboring internal repeats (Fig. 8–9). Complete losses of internal repeats were not observed in Sgs1 or in Sgs3x (Table 1).
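To illustrate how a dot plot can reveal the loss of an internal repeat region, the sketch below marks every position where a short word is shared between two sequences. It is a minimal version assuming exact word matches; the word size, the function name and the toy protein-like sequences (a long copy with repeats and a short copy without) are invented for the example and do not correspond to the dot plot software or the sequences used for Fig. 8.

```python
def dotplot_matches(seq_a, seq_b, word=4):
    """Return the set of (i, j) such that seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    matches = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            matches.add((i, j))
    return matches


# Invented toy sequences: shared N- and C-terminal parts, with or
# without an internal repeat region.
n_term = "MKLAVCLLG"
repeats = "PTTTKPTTTK" * 3
c_term = "IRGCLCGQ"
long_copy = n_term + repeats + c_term
short_copy = n_term + c_term

matches = dotplot_matches(long_copy, short_copy)
rows_with_match = sorted({i for i, _ in matches})
print(rows_with_match)
```

Positions of the long copy that fall inside the repeat region have no match in the short copy, which appears in a dot plot as a gap between the two diagonals formed by the N-terminal and C-terminal regions.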
Sgs3 underwent several duplications, deletions, inversions and gene conversions
In contrast to Sgs1 and Sgs3x, the first exon of Sgs3 varies slightly in size, from 19 bp to 28 bp (Table S4). The second exon length varies from 356 bp in D. jambulina Sgs3b to 1,967 bp in D. bipectinata Sgs3e (Table S4). The beginning of the second exon of Sgs3 encodes a relatively conserved amino acid sequence, ASILLI (Fig. 1A). Two Sgs3 copies are found in most of the studied species: Sgs3b (which is located between genes CG33272 and CG7512) and Sgs3e (which is located within an intron of the gene Mob2) (Fig. 9, S4). Parsimony suggests that both genes were present in the most recent common ancestor of all studied species (Table 1). Comparison of protein sequences (File S8) shows that Sgs3c, Sgs3d, Sgs3f and Sgs3g are duplicates of Sgs3b and that Sgs3e did not duplicate in the lineages studied here. The high similarity between the two Sgs3 copies present in D. pseudoobscura is also indicative of gene conversion. Parsimony indicates that across the 24 studied species, Sgs3e underwent 2 gene losses and no duplications whereas Sgs3b experienced 2 gene losses and 4 gene duplications, all within the same syntenic block (Fig. 9, Table 1). Furthermore, inversions of the entire Sgs3 coding sequence, together with adjacent regions, occurred in two instances (crosses in Fig. 9, S5). Such inversions were not observed for Sgs1 or for Sgs3x (Table 1).
Table 1
Summary of the sequence changes observed for the different Sgs gene subfamilies in the 24 studied species. Numbers indicate the number of genetic events inferred for each Sgs gene.
Feature | Sgs1 | Sgs3x | Sgs3e | Sgs3b | Sgs7-Sgs8 |
inferred number of copies in the common ancestor of all studied species | 0 (appeared after the D. melanogaster/D. pseudoobscura divergence) | 1 | 1 | 1 | 0 (appeared after the D. melanogaster/D. pseudoobscura divergence) |
position and orientation relative to neighboring genes | constant | constant | constant | variable | variable |
first coding exon size | constant (28 bp) | constant (28 bp) | variable (19-28 bp) | variable (25-31 bp) | constant (28 bp) |
internal repeats | present | present | typically present | typically present | typically absent |
loss of all the internal repeats | 0 | 0 | 0 | 2 | not applicable |
gene deletion | 2 | 3 | 2 | 2 | 4 |
gene duplication | 0 | 0 | 0 | 4 | ≥ 3 |
gene inversion | 0 | 0 | 0 | 2 | ≥ 1 |
gene conversion | 0 | 0 | 0 | 2 | ≥ 3 |
Sgs7 and Sgs8 underwent several duplications, gene losses and gene conversion
D. melanogaster possesses two glue genes near Sgs3b that are devoid of internal repeats, Sgs7 and Sgs8. In the other 23 Drosophila species, we annotated in the corresponding syntenic region 0, 1, 2 or 3 Sgs genes with no repeats (Fig. 9). For all these Sgs7 and Sgs8 orthologs, the size of the first coding exon is 28 bp and the second coding exon size varies between 194 bp in D. ananassae Sgs7 and 212 bp in D. bipectinata Sgs7b.
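The exon sizes given above can be checked against the phase 1 intron and the 222-240 bp coding sequence lengths reported for Sgs7-8. The short sketch below does this arithmetic, assuming that the first coding exon starts at the first nucleotide of the start codon; the function names are chosen for the example.

```python
def intron_phase(upstream_coding_len):
    """Phase of an intron that follows upstream_coding_len coding nucleotides.

    Phase 0: between two codons; phase 1: after the first nucleotide of a
    codon; phase 2: after the second nucleotide.
    """
    return upstream_coding_len % 3


def interrupted_codon(upstream_coding_len):
    """1-based index of the codon split by the intron (or following it, for phase 0)."""
    return upstream_coding_len // 3 + 1


FIRST_EXON = 28                 # bp, constant in Sgs7 and Sgs8 orthologs
SECOND_EXON_RANGE = (194, 212)  # bp, shortest and longest second exon

cds_lengths = [FIRST_EXON + exon2 for exon2 in SECOND_EXON_RANGE]

print(intron_phase(FIRST_EXON), interrupted_codon(FIRST_EXON), cds_lengths)
```

A 28-bp first exon thus places the intron after the first nucleotide of codon 10, and the two extreme second exon sizes give coding sequences of 222 and 240 bp, both multiples of three.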
The two Sgs8 copies in D. eugracilis exhibit very similar sequences (Fig. S6), suggesting that they originated from a recent duplication or from gene conversion in the branch leading to D. eugracilis (Fig. 9). Similarly, another recent duplication or gene conversion event seems to have occurred in the branch leading to D. takahashii (Fig. 9–10). In certain cases, it was impossible to determine with absolute confidence whether the different copies correspond to Sgs7 or Sgs8, due to their short coding sequences, their rapid divergence and signs of gene conversion. For example, D. erecta and D. teissieri harbor Sgs genes at the exact genomic positions corresponding to the D. melanogaster Sgs7 and Sgs8 genes (Fig. 10). However, the coding region found at the Sgs7 position in D. teissieri is closer to Sgs8 than to Sgs7, and conversely the coding region at the Sgs8 position is closer to Sgs7 (Fig. 1B). Dot plot analysis (Fig. S7) suggests that gene conversion occurred between Sgs7 and Sgs8 in the lineage leading to D. teissieri. Overall, our distinctions between the Sgs7 and Sgs8 genes should thus be treated with caution.
In addition, synteny comparisons suggest that an inversion distinguishes the group of D. santomea, D. yakuba, D. teissieri and D. erecta from the melanogaster complex (D. melanogaster, D. simulans, D. sechellia and D. mauritiana); it inverted a pair of Sgs7 and Sgs8 genes together with their adjacent genes (Fig. 9–10, S8). Further gene conversion events blurred the relationships between Sgs7 and Sgs8 in these four species (Fig. 9–10, S8).
In summary, a single copy of Sgs7-8 was probably present in the common ancestor of D. kikkawai and D. melanogaster. It underwent at least 4 deletions, 3 duplications, one inversion and several gene conversion events (Table 1).
Genomic instability is associated with the presence of short "new glue" genes
Our analysis reveals two types of gene dynamics. A first group of genes, comprising Sgs1, Sgs3x and Sgs3e, experienced several gene losses but no duplication, no local inversion and no gene conversion across the 24 Drosophila species studied here. In contrast, the second category, involving Sgs3b, Sgs7 and Sgs8, underwent multiple events of duplication, local inversion and gene conversion (Table 1, Fig. 9).
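The grouping described above follows directly from the event counts of Table 1, as the following sketch shows. The counts are copied from Table 1, with the "≥" entries for Sgs7-Sgs8 entered as their lower bounds; the dictionary layout, the function name and the group labels are choices made for this example.

```python
# Event counts from Table 1 (lower bounds where the table gives ">=").
EVENTS = {
    "Sgs1":      {"deletion": 2, "duplication": 0, "inversion": 0, "conversion": 0},
    "Sgs3x":     {"deletion": 3, "duplication": 0, "inversion": 0, "conversion": 0},
    "Sgs3e":     {"deletion": 2, "duplication": 0, "inversion": 0, "conversion": 0},
    "Sgs3b":     {"deletion": 2, "duplication": 4, "inversion": 2, "conversion": 2},
    "Sgs7-Sgs8": {"deletion": 4, "duplication": 3, "inversion": 1, "conversion": 3},
}


def dynamics_group(counts):
    """Label a subfamily by whether it shows any event other than deletion."""
    rearrangements = (counts["duplication"] + counts["inversion"]
                      + counts["conversion"])
    return "dynamic" if rearrangements > 0 else "losses only"


groups = {gene: dynamics_group(counts) for gene, counts in EVENTS.items()}
for gene, group in groups.items():
    print(gene, group)
```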
To test the potential involvement of repetitive elements, we looked for the presence of repeated sequences across 129-kb regions encompassing each Sgs gene in several Drosophila species (Fig. S9). We found that in D. melanogaster repeats are more frequent near the Sgs3b/Sgs7/Sgs8 genes than around the Sgs1 and Sgs3x genes. Furthermore, the recently duplicated genes Sgs3c and Sgs3d in D. subobscura and Sgs3f and Sgs3g in D. teissieri are located within regions dense in repeats. Interestingly, multiple genomic changes (duplications, inversions) were found at the Sgs7-8-3b and Sgs3f-g loci, and similar stretches of sequences were detected at both loci (Fig. S10). These sequences contain short (243–426 bp), intronless genes encoding threonine-rich proteins with predicted signal peptides. These genes resemble four genes adjacent to Sgs4 that were previously annotated in D. melanogaster as "nested genes" or "new glue genes", even though their putative role in glue production is unclear [35, 36] (Fig. S11). We thus decided to name the new sequences we identified new glue (ng) genes.
In total, we annotated 154 such ng genes in the Sgs3-7-8 genomic region of the 24 studied Drosophila species (Table 2, S3). We define ng genes as genes encoding proteins with the following characteristics: (1) a protein shorter than 180 amino acids, (2) a signal peptide, (3) an internal region rich in alanines and containing stretches of at least three consecutive threonines, and (4) a C-terminal region rich in arginines and lysines (Fig. S11). The previously annotated ng4 gene from D. melanogaster does not exhibit characteristics (2) to (4). The threonine stretch can reach up to 17 consecutive threonines, as in D. ananassae LOC6500299. Notably, almost all the Sgs7 and Sgs8 genes are adjacent and tail-to-tail to an ng gene, with approximately 130–200 bp separating the stop codons of both genes (beige arrows in Fig. 9). Sgs3f and Sgs3g lie approximately 400 bp away from their tail-to-tail adjacent ng gene. Most duplication and inversion events appear to preserve the contiguity and distance between the Sgs gene and its adjacent ng gene (Fig. S12-S14).
We used BLAST to search for ng genes in other parts of the genome and we identified three additional loci, containing ng genes but no Sgs genes, in several of the 24 studied species (Table 2). In D. melanogaster, two of these three loci (87A1 and 88C3-4) are separated from each other by approximately 2 Mb. No ng gene was found at the Sgs1 and Sgs3x loci. Furthermore, no ng genes were detected by BLAST in the full genomes of D. virilis and D. hydei. This suggests that ng genes appeared after the divergence of D. virilis and D. melanogaster.
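The four criteria used to define ng genes can be expressed as a simple filter on a candidate protein sequence, sketched below. Several elements are assumptions introduced for this illustration and are not part of the definition: the signal peptide is supplied as a boolean obtained from an external predictor, the internal and C-terminal regions are delimited by fixed 20-residue windows, and "rich in" is translated into arbitrary fraction thresholds. The toy sequence is invented.

```python
import re


def longest_thr_run(protein):
    """Length of the longest stretch of consecutive threonines."""
    runs = re.findall(r"T+", protein.upper())
    return max((len(run) for run in runs), default=0)


def looks_like_ng(protein, has_signal_peptide, n_skip=20, c_window=20,
                  min_ala_fraction=0.15, min_basic_fraction=0.30):
    """Check criteria (1)-(4); all numeric defaults except 180 are hypothetical."""
    protein = protein.upper()
    if len(protein) >= 180:          # (1) shorter than 180 amino acids
        return False
    if not has_signal_peptide:       # (2) from an external predictor
        return False
    internal = protein[n_skip:len(protein) - c_window]
    c_term = protein[-c_window:]
    if not internal:
        return False
    ala_fraction = internal.count("A") / len(internal)
    basic_fraction = (c_term.count("R") + c_term.count("K")) / len(c_term)
    return (longest_thr_run(internal) >= 3          # (3) threonine stretch
            and ala_fraction >= min_ala_fraction    # (3) alanine-rich
            and basic_fraction >= min_basic_fraction)  # (4) R/K-rich C-terminus


# Invented toy protein: signal-like start, Ala/Thr-rich core, basic tail.
toy_ng = ("MKFLIVLAALLGLASVSANA"
          "AATTTTAPAATTTTTAAPAA"
          "GRKRKNRKKRLRKGKRRKLK")

print(looks_like_ng(toy_ng, has_signal_peptide=True))
```

Because the thresholds are arbitrary, a filter of this kind would at most pre-screen candidates before manual inspection of alignments and genomic context.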
In summary, a family of new genes called "new glue" genes was detected near Sgs genes in highly dynamic regions (Sgs7-8-3b and Sgs3f-g), but not in less dynamic regions (Sgs1 and Sgs3x).
Table 2
Number of ng genes identified in 7 representative species (D. melanogaster, D. ananassae, D. obscura, D. subobscura, D. willistoni, D. virilis and D. hydei). Each column corresponds to a genomic region. Note that the 87A1 locus is located 5 Mb away from Sgs5 and that the 3C11-12 locus is 500 kb away from Sgs1 in D. melanogaster. No ng gene was found near Sgs1, Sgs3e or Sgs3x.
Species | 3C11-12 (near Sgs4, Notch and dnc) | 68C11 (near Sgs3b, Sgs7, Sgs8) | 68C13 (near Sgs3f, Sgs3g) | 28E6-28E7 (near mon2, Bsg and CG8673) | 87A1 (near cad87A, CG6959 and sad) | 88C3-4 (near Cystatin-like, Phosphodiesterase 6 and stumps) |
D. melanogaster | 4 | 2 | 4 | none | none | 4 |
D. ananassae | none | 8 | 4 | none | 10 | 4 |
D. obscura | 6 | none | 2 | none | none | 1 |
D. subobscura | 5 | none | none | none | none | 3 |
D. willistoni | none | none | none | 2 | none | 2 |
D. virilis | none | none | none | none | none | none |
D. hydei | none | none | none | none | none | none |
A recent gene duplication and an inversion were probably mediated by new glue genes
To investigate whether these new glue genes may have played a role in the evolutionary dynamics of genomic regions, we examined whether they were present at the boundaries of three relatively recent genomic rearrangements. First, we found that the duplication leading to Sgs3d in D. subobscura (which likely occurred approximately 15 million years ago [33]) (Fig. 9) included 5' and 3' non-coding regions surrounding the Sgs3b gene, and that there were no ng genes in the region (Fig. S15). Second, for the inversion of the Sgs7-Sgs8 region which occurred just before the divergence of D. teissieri and D. santomea (around 2–11 million years ago [33]) (Fig. 9), we noticed that one of the breakpoints corresponds precisely to the coding region of an ng gene (Fig. 11). Third, for the recent duplication leading to Sgs3g in D. teissieri (which occurred about 0–2 million years ago [33]), both breakpoints corresponded to ng genes (Fig. 11). The older the event, the more likely it is that sequences at the breakpoints have been lost or modified. Here, we found that two breakpoints of a recent gene duplication and one breakpoint of an older inversion match the coding regions of ng genes. Given that ng genes are found in multiple copies across the genome, we suggest that they may facilitate large-scale genomic modifications such as gene inversions, gene duplications and gene losses.
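Testing whether a rearrangement breakpoint falls within an annotated coding region amounts to an interval lookup, as in the minimal sketch below. The gene names and coordinates are hypothetical and serve only to show the logic; they are not taken from the annotations of Fig. 11.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_at_breakpoint(breakpoint, genes, tolerance=0):
    """Names of genes whose coding interval contains the breakpoint.

    tolerance (in bp) optionally widens each interval to allow for
    imprecisely mapped breakpoints.
    """
    return [gene.name for gene in genes
            if gene.start - tolerance <= breakpoint <= gene.end + tolerance]


# Hypothetical annotations on a toy scaffold.
annotations = [
    Interval("ng_toy_a", 1000, 1300),
    Interval("Sgs_toy", 1500, 1740),
    Interval("ng_toy_b", 5000, 5400),
]

print(genes_at_breakpoint(1100, annotations))
print(genes_at_breakpoint(3000, annotations))
```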