Hifi sequencing with ultra-low DNA input workflow
The ultra-low DNA input protocol includes a PCR amplification step to generate sufficient material for sequencing. This was a critical consideration when selecting this workflow to generate high-quality genomic information from a single C. stellifer specimen. PCR products ranged from 5-8 kb. These values suggest that the gDNA had some degree of fragmentation and that short fragments were preferentially amplified. Sequencing output resulted in 191,906 PacBio Hi-Fi reads with an average read length of ~ 13,000 bp. The genome size was estimated to be approximately 104 Mb, with a heterozygosity of 2.88% and 11.4% of repeat sequences (Figure 1).
Figure 1: Genome properties based on raw data exploration. (A) GenomeScope results in linear coordinates on the PacBio Hifi sequencing dataset for one individual of Culicoides stellifer. The genome size (len) is predicted to be around 104 Mb, and 88.6% of the 19-mers are unique (aa), suggesting that the genome has around 11% repetitive content. Heterozygosity (ab), mean k-mer coverage for heterozygous bases (kcov), read error rate (err), the average rate of read duplications (dup), k-mer size used in the run (k:), and ploidy (p:) is also reported. The sequencing errors are identified by low-coverage K-mers. (B) Frequency histogram of the read length for the PacBio Hifi sequencing dataset for one individual of Culicoides stellifer. The dashed lines represent the mean value.
Mitogenome Assembly
Long-read sequencing technologies for mitochondrial genome assembly in Culicoides haven’t been explored before. We started by using the MitoHifi toolkit for mitochondrial assembly from Hifi data. When using raw data and assembled contigs with Hifiasm, the pipeline failed to correctly assemble the mitochondrial genome, as it generated a molecule much larger than expected (~ 50,000 bp). It is likely that the misassembly might be related to shorter reads, insufficient coverage, or the presence of nuclear-mitochondrial DNA (NUMTs). We selected 128 reads that mapped to a reference mitogenome (C. arakawae) and generated a de-novo assembly for C. stellifer’s mitochondrial genome using IPA assembler. This resulted in a 16,607 bp mitochondrial genome, which is within the range of mitogenome lengths previously reported for other species of the genus [9, 54].
Figure 2: Mitochondrial genome annotation for Culicoides stellifer. Protein-coding genes, rRNA, and tRNA are represented in green, brown and orange, respectively. The control region (D-loop) is marked in blue, and the intergenic spacers are marked in red. The annotation was completed using MITOS2 v2.1.8, and the figure was generated using Geneious prime v11:08.
The annotation using MITOS2 identified 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNA), and two ribosomal RNAs (rRNA). The assembly was circularized and overall it showed the same gene arrangement previously described for other species in the genus. PCGs sizes ranged from 152 (ATP8) to 1,732 bp (NAD5). Transfer RNA sizes ranged from 48 [tRNA F(gaa) to 72 bp [tRNA V(tac)], while rRNA lengths were 781 (rRNA S) and 1,284 bp (rRNA L). The estimated control region size was 1,842 bp. Additionally, 17 spacers were identified, ranging in size from 1 to 18 bp.
Genome assembly
We compared two long-read assembly tools (IPA and Hifiasm) and various levels of duplicate purging in Hifiasm (-s parameter). Overall, Hifiasm produced the best assemblies. All four initial assemblies resulted in primary genomes with a size >146 Mb and almost 30% of complete-duplicated genes reported by BUSCO. Different values of the similarity threshold for duplicate haplotigs (- s parameter) in the Hifiasm assemblies resulted in a slight decrease in the total number of contigs and an increase in the N50, producing overall similar values (Table 1).
Purging duplicated regions with purge_dups reduced was successful in reducing the number of contigs by half, increased contiguity and significantly decreased the number of duplicated genes reported by BUSCO (Table 1). The 119 Mb size assembled genome is closer to the estimate of 105 Mb obtained from the k-mer analysis. In comparison to the other two available Culicoides genomes, our assembly displays good quality in terms of contiguity (N50 and L50) and completeness (Figure 3). The BUSCO scores of our assemblies (89.8 % complete (C) BUSCOs (including 2.0 % duplicated [D]), 1.5 % fragmented (F), and 8.6 % missing (M)) are very similar to those of the genome of C. brevitarsis, whose assembly includes three chromosomes and unplaced scaffolds. Our genome also is of significantly higher quality than that of C. sonorensis, as the latter involved pooling many individuals and was generated using short-read sequencing.
As our final assembly, we selected the one with the highest N50 and the lowest number of duplicated BUSCOs without significantly decreasing the complete BUSCO score. The genome assembly (referred to as purged_s030) comprises 450 contigs, totalling 119,322,097 bp, contig N50 of 479,264 bp and L50 of 81 (Figure 3). We estimated a high base accuracy (QV=53.3) and 90% completeness based on the k-mer comparison between the assembly and those found in the PacBio raw reads.
Table 1: Summary statistics of the Culicoides stellifer primary genome assemblies using HiFiasm with three levels of purge duplication and IPA. The only two other genomes of the Genus available in NCBI are included for comparison purposes.
Genome Assembly
|
C.stellifer
HiFiasm
-s 0.35
|
C.stellifer
HiFiasm
-s 0.35
purge_dups
|
C.stellifer
HiFiasm
-s 0.55
purge_dups
|
C.stellifer
HiFiasm
-s 0.75
purge_dups
|
C.stellifer
IPA
|
C.stellifer
IPA
purge_dups
|
C.sonorensis
Velvet
(GCA_900258525.3)
|
C.brevitaris
Raven; Polca (Masurca); Racon.
|
Sequencing technology
|
PacBio Hifi
|
PacBio Hifi
|
PacBio Hifi
|
PacBio Hifi
|
PacBio Hifi
|
PacBio Hifi
|
Illumina HiSeq
|
Oxford Nanopore PromethION; Illumina NovaSeq
|
Genome statistics
|
Total length (Mb)
|
158
|
119
|
119
|
120
|
146
|
117
|
155.9
|
129.5
|
Number of contigs
|
810
|
450
|
451
|
463
|
730
|
600
|
3858
|
223
|
Number of scaffolds
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
149
|
Longest contig or scaffold (bp)
|
1,731,460
|
1,731,461
|
1,731,460
|
1,731,460
|
1,156,022
|
1,156,022
|
763,582
|
46,604,242
|
Mean contig or scaffold length (bp)
|
194,770
|
265,155
|
264,809
|
259,459
|
199,980
|
195,305
|
40,420
|
863,398
|
N50
|
356,049
|
479,265
|
459,312
|
458,458
|
289,321
|
306,443
|
109,184
|
3.5 Mb
|
N90
|
84,987
|
132,711
|
132,524
|
113,333
|
104,260
|
98,560
|
NA
|
NA
|
L50
|
126
|
81
|
82
|
83
|
153
|
114
|
395
|
NA
|
L90
|
476
|
261
|
265
|
269
|
479
|
368
|
NA
|
NA
|
GC content
|
30.8%
|
30.8%
|
30.8%
|
30.8%
|
30.8%
|
30.8%
|
28.3%
|
27.9%
|
Total BUSCO for the genome assembly
|
Complete BUSCO
|
2970
(90.4%)
|
2953
(89.9%)
|
2942
(89.8%)
|
2950
(89.8%)
|
2955
(89.9%)
|
2942
(89.6%)
|
89.5%
|
91.9%
|
Complete single copy
|
2113
(64.3%)
|
2882
(87.7%)
|
2870
(87.4%)
|
2875
(87.5%)
|
1962
(59.7%)
|
2881
(87.7%)
|
63.0%
|
89.3%
|
Complete duplicated
|
857
(26.1%)
|
71
(2.2%)
|
79
(2.4%)
|
75
(2.3%)
|
993
(30.2%)
|
61
(1.9%)
|
26.5%
|
2.6%
|
Fragmented
|
50
(1.5%)
|
51
(1.6%)
|
53
(1.6%)
|
54
(1.6%)
|
54
(1.6%)
|
55
(1.7%)
|
4.4%
|
0.6%
|
Missing
|
265
(8.1%)
|
281
(8.5%)
|
283
(8.6%)
|
281
(8.6%)
|
276
(8.5%)
|
288
(8.7%)
|
6.1%
|
7.5%
|
*Table 1*
Figure 3: Contig-level assembly of Culicoides stellifer. (A) Snail plot showing lengths of all contigs. The longest contig is represented in red, N50 in dark orange, and N90 in light orange. The outer ring shows the GC content of the genome. (B) Visualization of assembly contiguity showing contig sizes on the Y-axis for which x percent of the assembly consists of contigs of at least that size. The three assemblies of Culicoides stellifer with various levels of similarity purging are compared to the assembly of Culiocides sonorensis.
Genome annotation
Overall, the degree of repetitive content in the genome assembly of Culicoides stellifer was approximately 15 Mb of repetitive elements, representing 11 % of the genome assembly (Table 2). Initially, nearly half of all repeats were classified as unknown. Due to the small size of this reference library, it was decided to manually investigate the largest and most abundant consensus sequences as described in the Methods. Many of the unknown repeats were determined to be non-autonomous TIR DNA transposons, and in general, all DNA transposons were characterized by a lack of substantial coding regions for transposases. In an attempt to find autonomous elements, the repeat library output from a larger version of the assembly with less purged duplicates (HiFiasm -s 0.35) was inspected for novel consensus sequences, and these were added to the existing repeat library and the genome was re-annotated. In this new library, a total of 4 consensus sequences of DNA transposons (TcMar-Tc1, TcMar-Tigger, TcMar-ISRm11, hAT-Tip100) had partial coding regions, but none of these appear to be functional.
Comparison and selective melding of the two libraries added new consensus sequences for four LTR retrotransposons with coding regions and well-resolved termini, as well as several LINE elements including an R2 consensus sequence. Retrotransposons make up a smaller fraction of the genome than DNA transposons, which stands in contrast to the pattern seen in the C. sonorensis genome [7]. Caution should be taken when comparing the repeats in these two genomes, as the methods differed and the repeat annotation in the C. sonorensis assembly was not as thorough as was done for C. stellifer. In general, the C. stellifer assembly has a lower repeat content than C. sonorensis ( ~11% vs 29.7%); however, when that repeat content is positively correlated with genome or assembly size [55], this is not surprising.
Table 2: Summary of repeat elements annotated in the Culicoides stellifer assembly. The numbers of consensus sequences in parentheses represent those generated by RepeatModeler2.
Repeat
|
Superfamily
|
Base pairs
|
Consensus Sequences
|
DNA transposon
|
|
|
|
TIR
|
Non-Autonomous
|
2,563,172
|
66 (59)
|
hAT
|
496,695
|
9 (8)
|
Tc1/Mariner
|
103,081
|
9 (3)
|
piggyBac
|
55,206
|
1 (1)
|
Other TIR
|
4,132
|
12
|
Total DNA
|
3,222,286
|
97 (71)
|
Retrotransposon
|
|
|
|
LTR
|
Bel-Pao
|
202,125
|
41 (5)
|
Ty1/Copia
|
100,586
|
20 (3)
|
Ty3-like
|
87,488
|
54 (2)
|
Unclassified LTR
|
48,094
|
2 (2)
|
Total LTR
|
423,038
|
117 (12)
|
LINE
|
I
|
239,478
|
30 (5)
|
Unclassified LINE
|
121,740
|
4 (4)
|
CR1
|
91,715
|
26 (6)
|
R2
|
48,480
|
1 (1)
|
RTE
|
27,718
|
4 (3)
|
Total LINE
|
529,131
|
65 (19)
|
Total Retrotransposon
|
952,169
|
182 (31)
|
|
Total TE
|
4,174,455
|
279 (102)
|
Other Repeats
|
|
|
|
Satellite/Simple/Low complexity
|
|
6,107,679
|
2976 (55)
|
Unknown
|
|
5,434,496
|
216 (216)
|
|
Total Repeats
|
15,716,630
|
3443 (373)
|
*Table2*
A breakdown of the contribution of different components of Earl Grey to the resultant repeat library is useful when considering repeat annotation in novel genomes (Table 3.). Dfam is a growing, open-source database of repeats, and its current subset of Dipteran repeats stems from species distantly related to C. stellifer, hence the limited contribution to the annotation. Rather than being an indictment of Dfam, this stresses the value of submitting consensus sequences to Dfam to increase its taxonomic scope and useability for new genomes.
Table 3. Comparative statistics of repeat sequences detected by various sources and their annotation in the assembly.
Repeat Source
|
Consensus Sequences
|
Mean Coverage/Consensus (bp)
|
Total Coverage (bp)
|
Dfam Diptera
|
181
|
1610
|
291,385
|
RepeatMasker
|
2891
|
966
|
2,792,913
|
RepeatModeler2
|
373
|
33,644
|
12,549,446
|
BRAKER2 predicted 18,895 proteins in the nuclear genome with 18,662 unique sequences. We annotated 10,524 proteins (55.7%) by searching against the Swiss-Prot protein sequence database. 7,283 genes were mapped to KEGG pathways using BlastKOALA (Table 4). Collectively, 7812 proteins were functionally annotated by InterProScan, of which 4057 were assigned a GO term. This resource provides complementary levels of protein annotation, including curated InterPro entries annotated with a unique name and GO terms. The following analyses were included in the output file: PANTHER, CATH-Gene3D, PROSITE Profiles, Pfam, SUPERFAMILY, SMART, FunFam, Conserved Domains Database (CDD), PRINTS, Hamap, PIRSF, NCBIfam and the Structure-Function Linkage Database (SFLD). These represent protein signature databases included in InterPro [56] that were scanned in an integrated way to predict protein functions and for which a match was found. Some of the results of these analyses are included in Table 4.
We annotated more than 3,000 additional protein-coding genes for either the C. sonorensis (15,612) or the C. brevitarsis (11,137) genome, respectively. This indicates that our workflow recovered a more complete set of genes for this group. We ran BUSCO in protein mode on the predicted proteins using the diptera_odb10 lineage dataset, which resulted in 91.5% complete BUSCO, including 8.3% duplicated, 1.0% fragmented and 7.5% missing. These values are similar to the report of C. brevitarsis (GCF_036172545.1-RS_2024_03) except for the complete and duplicated genes for which we report a slightly higher value (2.6% for C. brevitarsis). This difference is explained by the larger number of proteins predicted by BRAKER2 in our assembly compared to the annotation of C. brevitarsis using the NCBI Eukaryotic Genome Annotation Pipeline.
Table 4: Functional annotation of Culicoides stellifer proteins.
Genome annotation
|
Number of elements
|
Percentage
|
Predicted protein-coding genes (BRAKER2)
|
18,895
|
|
Swiss Prot
|
10,524
|
55.7
|
KEGG (BlastKOALA)
|
7,342
|
38.9
|
Pfam
|
6,209
|
32.9
|
InterPro
|
6,807
|
36.0
|
GO
|
6,026
|
31.9
|
Non-retroviral integrated RNA virus fragment identification
The genome query for integrated viral fragments yielded 38 hits, ranging from 44 bp (74.5% identity) to 322 bp (53.2% identity). Fourteen hits greater than 100 bp were queried against the non-redundant protein database in GenBank using blastx. While most of these returned no similar hits or only to RNA-binding domains of genes, a 322 bp fragment in the C. stellifer raw reads was found to be similar to VSV. Using blastn we confirmed the presence of this VSV-like fragment in the C. stellifer assembly (Figure 4) and in conjunction with the gene annotation data, showed that a full 1319 bp coding region for a nucleocapsid was present. A blastx search using this nucleocapsid sequence as a query returned many significant hits (93-98% query coverage, 28.33-38.23% amino acid identity, scores of 161-303, hit length of 1233-1377 bp) to rhabdovirus nucleocapsid proteins in GenBank.
Figure 4: Representation of the non-retroviral endogenous viral element (nr-EVE) sequence found in the Culicoides stellifer assembly and the surrounding structural elements in that section of the genome. The sequence is shown aligned to other Rhabdoviruses sequences.