Chip design
SNP discovery and probe design
The 819 WGS samples from 42 populations collected worldwide produced 195,287,002 shared sites across the genome with an average 10X read coverage across populations. The population-based genotype calls with ANGSD resulted in 34,502,245 polymorphic sites. After filtering with Plink for SNP and individual missingness (10%) and minor allele frequency (10%), we obtained 2,727,727 high-quality bi-allelic polymorphic sites for probe design (File S23).
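As a point of reference, the missingness and minor allele frequency thresholds described above can be expressed as a short filter. The sketch below uses a toy 0/1/2 genotype matrix and is illustrative only, not the actual Plink command used in the study.

```python
# Minimal sketch (not the study's pipeline) of the Plink-style filters described
# above: drop SNPs with >10% missing genotypes or minor allele frequency <10%.
# Genotypes are coded 0/1/2 with -1 for missing.
import numpy as np

def filter_snps(geno, max_missing=0.10, min_maf=0.10):
    """geno: samples x SNPs matrix of 0/1/2 allele counts, -1 = missing."""
    geno = np.asarray(geno)
    missing = (geno == -1)
    miss_rate = missing.mean(axis=0)                     # per-SNP missingness
    called = np.where(missing, 0, geno)
    n_called = (~missing).sum(axis=0)
    alt_freq = called.sum(axis=0) / np.maximum(2 * n_called, 1)
    maf = np.minimum(alt_freq, 1 - alt_freq)             # minor allele frequency
    keep = (miss_rate <= max_missing) & (maf >= min_maf)
    return geno[:, keep], keep

if __name__ == "__main__":
    toy = np.array([[0, 2, -1, 1],
                    [0, 1,  2, 1],
                    [0, 2,  2, 0]])
    filtered, kept = filter_snps(toy)
    print(kept)   # first SNP fails MAF, third fails missingness on this toy matrix
```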
Affymetrix identified 1,287,052 SNPs suitable for the chip design. From these we selected the top 250,000 SNPs based on their genomic location and functional annotation, ensuring that the SNPs were at least 200 bp apart and aiming for at least one SNP per exon. From this SNP set, Affymetrix tiled 175,396 SNP probes on the chip (56,503 SNPs in exons, 72,305 in introns, and 46,588 in intergenic regions). The list of SNPs, their genomic locations, and probe sequences are in File S18. The p-convert metric estimates the probability of successful SNP genotyping based on probe thermodynamics and genomic alignment metrics. The mean p-convert for the probes on the chip was 0.71, suggesting the chip is likely to perform well.
Mapping probe sequences to genome assemblies
We selected the probe sequences identified by both BWA MEM and BWA ALN as having unique alignments in the AalbF3 genome assembly, with mapping quality > 20 and no secondary alignments (Figure S3), and compared this result with the alignment results from the other assemblies 21,23,24. A total of 175,396 probes mapped with unique alignments to the AalbF3 assembly. Of these, ~96,000 probes mapped with unique alignments to the previous genome assembly (AalbF2), while 30 to 40% of the probes that aligned uniquely in AalbF3 had secondary alignments in the AalbF2 assembly. The genome assembly from the cell line 23 (AalbCell) had the fewest unique alignments, and this number increased with the more recent assemblies (Table S12 and Table S13).
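The alignment-based probe selection can be illustrated with a short pysam sketch. The BAM file names and the filtering code are assumptions for illustration, not the authors' scripts.

```python
# Illustrative sketch (assumed inputs: BAM files of probe alignments from BWA MEM
# and BWA ALN). Keeps probes with a single primary alignment, mapping quality > 20,
# and no secondary alignments.
import pysam
from collections import defaultdict

def uniquely_mapped_probes(bam_path, min_mapq=20):
    primary, secondary = {}, defaultdict(int)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:
                continue
            if read.is_secondary or read.is_supplementary:
                secondary[read.query_name] += 1
            else:
                primary[read.query_name] = read.mapping_quality
    return {name for name, mapq in primary.items()
            if mapq > min_mapq and secondary[name] == 0}

# Probes considered unique overall are those unique in both aligners, e.g.:
# unique = uniquely_mapped_probes("probes_mem.bam") & uniquely_mapped_probes("probes_aln.bam")
```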
Chip Validation
Segregation analysis
All 152 samples from the six crosses passed our quality control (File S19). A total of 123,964 SNPs were recommended by the “Best Practice Workflow” (70.68% of the SNPs on the chip). Of these, 101,376 SNPs were heterozygous in at least one parent in each family, allowing us to test ~50 thousand SNPs per family (Table S14) and 5,249 SNPs across all families (Figure S4). After adjusting for multiple tests using the Holm correction, 2,047 SNPs failed the segregation test, representing 2.03% of the tested SNPs (File S19).
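The segregation test can be illustrated with a hedged sketch of a per-SNP chi-square goodness-of-fit test followed by the Holm correction; the offspring counts are toy values, and the exact test used in the study may differ.

```python
# Hedged sketch of a per-SNP Mendelian segregation test with Holm correction
# (illustrative only; observed counts and expected ratios are hypothetical).
import numpy as np
from scipy.stats import chisquare
from statsmodels.stats.multitest import multipletests

def segregation_pvalue(observed, expected_ratio):
    """Chi-square goodness of fit of offspring genotype counts to a Mendelian ratio."""
    observed = np.asarray(observed, dtype=float)
    ratio = np.asarray(expected_ratio, dtype=float)
    expected = ratio / ratio.sum() * observed.sum()
    return chisquare(observed, expected).pvalue

# Example: two SNPs from an Aa x Aa family (expected 1:2:1) and one from
# an Aa x aa family (expected 1:1).
pvals = [
    segregation_pvalue([6, 13, 6], [1, 2, 1]),
    segregation_pvalue([2, 20, 3], [1, 2, 1]),
    segregation_pvalue([14, 11], [1, 1]),
]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(reject, p_adj)   # SNPs flagged True fail the segregation test
```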
Comparing technical replicates
SNP calls were highly reproducible. Genotypic concordance within the four sets of technical replicates was high (99.32%), while non-replicate samples shared only 52.74% of genotypes (Table 3). Pairwise comparisons of the genotypes of each technical replicate using custom code (Table S15, File S5) confirmed the high genotypic concordance, with error estimates ranging from 0.33% to 1.02% (Table 3).
Table 3
Pairwise concordance analysis for technical replicates genotyped using the SNP chip. The first two columns list the two replicates being compared. The third column shows the percentage of genotypes shared between the two replicates or the average error rate per replicate. “Random samples” refers to the percentage of identical genotypes shared by two randomly picked individuals that are not technical replicates. The names of the genotyping files are described in Table S5.
Technical replicate 1 | Technical replicate 2 | Concordance (%) |
1a | 1b | 99.52 |
1a | 1c | 99.31 |
1b | 1c | 99.67 |
2a | 2b | 98.46 |
2a | 2c | 99.33 |
2b | 2c | 98.40 |
3a | 3b | 99.59 |
3a | 3c | 99.59 |
3b | 3c | 99.53 |
4a | 4b | 99.61 |
4a | 4c | 99.52 |
4b | 4c | 99.41 |
Mean | | 99.32 |
| Random samples | |
1a | 2b | 52.33 |
3c | 2a | 50.86 |
1a | 4a | 54.43 |
3b | 1a | 53.33 |
Mean | | 52.74 |
Replicate | | Mean error rate per replicate (%) |
1 | | 0.50 |
2 | | 1.27 |
3 | | 0.43 |
4 | | 0.49 |
Mean | | 0.67 |
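The concordance values in Table 3 can be reproduced conceptually with a short script along the following lines; this is a minimal sketch with toy genotype strings, not the custom code in File S5.

```python
# Conceptual sketch of the pairwise concordance calculation behind Table 3
# (not the custom code from File S5). Genotypes are strings such as "AA",
# "AT", "TT", with "NN" marking a missing call.
def concordance(geno_a, geno_b, missing="NN"):
    """Percent of SNPs with identical genotype calls in two samples."""
    pairs = [(a, b) for a, b in zip(geno_a, geno_b)
             if a != missing and b != missing]
    if not pairs:
        return float("nan")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

rep_1a = ["AA", "AT", "TT", "AA", "NN"]
rep_1b = ["AA", "AT", "TT", "AT", "AA"]
print(f"{concordance(rep_1a, rep_1b):.2f}%")   # 75.00% on this toy example
# The error rate per replicate is simply 100 minus the mean concordance.
```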
Comparison of WGS and chip genotypes
The analysis of genotype data from the WGS and SNP chip platforms across three data set sizes (labeled a, b, c for SNP chips and y, x, w for WGS) is shown in Figure S2; genotypes were called separately for each data set. Figure 2 shows that, for both methods, the genotype error rate decreases as the sample size increases. Specifically, for the SNP chip data sets, error rates declined from 1.16% to 0.41% (Table 1). In contrast, error rates for the WGS data sets ranged from 1.35% to 3.09% (Table 1). On both platforms, the SAI samples from Saint Augustine, Trinidad and Tobago, displayed slightly elevated error rates compared with other samples, ranging from 0.46% to 1.42% for SNP chips and from 3.36% to 9.25% for WGS.
The average mismatch rate between the SNP chip and WGS was 8.11% when analyzing the same 18 individuals (ay, Table 1). This rate decreased to 6.70% when the number of samples in the genotype calls was increased (Table 1, Figures S6 and S7). Within-population comparisons showed that the error rate varied with the genetic make-up of each population and the sample size. For example, the mismatch rate for KAT decreased from 5.61% to 4.63% as we increased the sample size (ay and cw, Table 1). We observed a similar pattern for SAI, although the mismatch rates were higher, decreasing from 9.35% to 7.73%. Some samples showed error rates as high as 10.90% (Table 1, column ay, sample SAI 5). When we examined how often a SNP genotype did not match between the two technologies, most SNPs showed genotype errors in only a few samples (Table 2). For example, in KAT, which had six samples, up to 18.54% of the mismatches appeared only once, while in SAI up to 53.31% of the errors appeared only once, indicating that most errors are random.
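The per-SNP mismatch tallies described above can be sketched as follows, using hypothetical 0/1/2 genotype matrices for illustration rather than the study data or scripts.

```python
# Sketch (hypothetical data layout) of counting, per SNP, in how many samples
# the chip and WGS genotype calls disagree (the tallies summarized in Table 2).
import numpy as np
from collections import Counter

chip = np.array([[0, 1, 2],    # rows = samples, columns = SNPs
                 [0, 1, 2],
                 [0, 2, 2],
                 [1, 1, 2]])
wgs  = np.array([[0, 1, 2],
                 [0, 1, 1],
                 [0, 2, 2],
                 [0, 1, 2]])

mismatches_per_snp = (chip != wgs).sum(axis=0)     # [1, 0, 1] for this toy input
spectrum = Counter(mismatches_per_snp.tolist())
print(mismatches_per_snp, spectrum)
# A spectrum dominated by SNPs mismatching in only one sample points to
# random errors rather than systematic probe failures.
```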
Within each population subset, larger sample sizes correlated with more sites exhibiting mismatches, since most errors appear in only a few samples (Table 1). For instance, the SAI population, comprising twelve individuals, showed a higher mismatch rate (1.42%) than KAT (0.95%), which contains only six individuals (Figure S7). Evaluations of the larger data sets (labeled "bc" for chips and "xw" for WGS, as illustrated in Figure S2) showed mismatch rates of 0.41% for chips and 3.09% for WGS (Table 1). Regardless of the comparison, when we examined the percentage of SNPs with errors in six samples, SAI consistently had higher percentages of mismatches, suggesting that population genome architecture (for example, genome size variation across the globe) or DNA quality may influence the performance of both technologies. For KAT the values varied from 0.03% to 1.25%, while for SAI they varied from 0.08% to 4.12% (Table 2).
On a broader scale, the SNP chip genotype data stayed consistently below an error rate of 1.30%, irrespective of sample size. The WGS data, in contrast, varied, with error rates between 1.35% and 3.09% depending on the data set size. The average error rate rose to between 6.70% and 8.11% when cross-comparing genotype calls from the two platforms (Figure S5, Table 1). A pattern emerged in which lower read depths in the WGS data correlated with higher mismatch rates between methods (Figure S8). Metrics such as the Fisher Linear Discriminant (FLD) and call rate (CR) emerged as influential variables affecting these mismatch rates (Figures S9 and S10).
A PCA compared the WGS and SNP chip data for the 18-sample data set after filtering SNPs on FLD, CR, and a read depth greater than 20, the factors found to correlate with mismatch rates (Fig. 4, Figures S9 and S10). The PCA showed substantial overlap between WGS and chip samples (Fig. 4). Therefore, while sample size exerts minimal influence on error rates at a sequencing depth of about 20X, depth combined with chip-specific metrics (FLD and CR) largely determines genotype consistency.
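As an illustration of this overlap check, the following sketch projects chip and WGS calls for the same individuals into a shared PCA space using scikit-learn and simulated genotypes; the study's actual filtered data and plotting are not reproduced here.

```python
# Hedged sketch of the overlap check: project chip and WGS genotype calls for
# the same individuals into a shared PCA space (toy matrices, not the study data).
# Rows are samples, columns are the filtered SNPs common to both platforms.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
true_geno = rng.integers(0, 3, size=(18, 500)).astype(float)
chip = true_geno.copy()
wgs = true_geno.copy()
noise = rng.random(wgs.shape) < 0.03          # ~3% simulated call mismatches
wgs[noise] = rng.integers(0, 3, size=noise.sum())

combined = np.vstack([chip, wgs])
coords = PCA(n_components=2).fit_transform(combined - combined.mean(axis=0))
chip_pc, wgs_pc = coords[:18], coords[18:]
# If the two platforms agree, each sample's chip and WGS points should sit
# close together, e.g. measured by the mean per-sample distance:
print(np.linalg.norm(chip_pc - wgs_pc, axis=1).mean())
```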
Variant functional annotation and SNP density
When we performed the variant annotation using the 92,693 SNPs that genotyped well in the wild samples for the chip data (Tables S3 and S16) and the ~2.7 million SNPs used in the chip design for the WGS data, the AalbF3 annotation produced the lowest number of errors with either canonical or non-canonical gene models.
We confirmed that the 404,514 probes used in the chip target polymorphic sites in 17,461 protein-coding genes and other genes (Table S17), with 915 SNPs in genes associated with diapause or immunity (Table S18). Lists of SNPs found in diapause and immunity genes are also provided (Tables S19 and S20).
In the variant functional annotation, we categorized the genetic variants into six types: intron, intergenic, 5'UTR, 3'UTR, and coding exon variants (synonymous and nonsynonymous). The chip bias is especially pronounced as an overrepresentation of variants in functionally significant regions such as coding exons and untranslated regions (UTRs). To correct for this, we used the WGS data as a gold standard for unbiased variant representation. We performed a comprehensive functional annotation of both the WGS and SNP chip data sets, categorized each SNP by its genomic position or functional annotation category as described above, and then compared the proportions of SNPs across these categories between the two data sets.
For intronic and intergenic variants, the chip generally underperforms WGS, as evidenced by negative bias percentages of -5.32% and -12.47%, respectively. The chip outperforms WGS in detecting the targeted synonymous, 5'UTR, and 3'UTR variants, with positive bias percentages of 9.75%, 3.53%, and 3.72% (Table 4); this comparison considers only SNPs shared between the two technologies, as WGS captures millions of variants that we could not use here. The difference is marginal for nonsynonymous variants, with a minimal positive bias of 0.79%. These data suggest that the chip is more sensitive to specific SNP functional annotations, while WGS provides a more balanced detection across all categories. Given this asymmetry, we calculated the proportion of chip SNPs that could be used to replicate the proportions observed in the WGS data and found that using ~65% of the SNPs reproduces the variant distribution of the WGS data (Table 4).
Table 4
Variant functional annotation of SNPs using SnpEff for the WGS and SNP chip data, with results of the bias correction. The number of SNPs and their percentage in each category for the two methods are listed in columns 2 to 5. Column 6 shows the chip bias (chip percentage minus WGS percentage). Column 7 shows the number of SNPs used to match the proportions in the WGS data (Chip Possible). The final column shows the new percentage for each category of the chip data after bias correction (Chip corrected).
Variant type | WGS (N) | WGS (%) | Chip (N) | Chip (%) | Chip bias (%) | Chip Possible (N) | Chip corrected (%) |
Intron | 1,281,482 | 44.81 | 24,387 | 39.49 | -5.32 | 17,741 | 44.81 |
Intergenic | 993,796 | 34.75 | 13,758 | 22.28 | -12.47 | 13,758 | 34.75 |
Synonymous | 384,482 | 13.45 | 14,323 | 23.20 | 9.75 | 5,325 | 13.45 |
Nonsynonymous | 67,217 | 2.35 | 1,942 | 3.14 | 0.79 | 930 | 2.35 |
5UTR | 67,744 | 2.37 | 3,642 | 5.90 | 3.53 | 938 | 2.37 |
3UTR | 64,912 | 2.27 | 3,697 | 5.99 | 3.72 | 899 | 2.27 |
Total | 2,859,633 | 100.00 | 61,749 | 100.00 | 0.00 | 39,591 | 100.00 |
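For reference, the arithmetic that appears to underlie the “Chip Possible” column can be reconstructed from the counts in Table 4: the most under-represented chip category (intergenic) limits the size of the largest chip subset whose category proportions match the WGS proportions. The sketch below is our reconstruction under that assumption, not the authors' exact procedure.

```python
# Sketch of the arithmetic behind the "Chip Possible" column of Table 4
# (our reconstruction of the logic, using the table's own counts): find the
# largest chip subset whose category proportions match the WGS proportions.
wgs  = {"Intron": 1281482, "Intergenic": 993796, "Synonymous": 384482,
        "Nonsynonymous": 67217, "5UTR": 67744, "3UTR": 64912}
chip = {"Intron": 24387, "Intergenic": 13758, "Synonymous": 14323,
        "Nonsynonymous": 1942, "5UTR": 3642, "3UTR": 3697}

wgs_total = sum(wgs.values())
wgs_prop = {k: v / wgs_total for k, v in wgs.items()}

# The category that runs out first limits the total size of the corrected set.
max_total = min(chip[k] / wgs_prop[k] for k in chip)
chip_possible = {k: round(wgs_prop[k] * max_total) for k in chip}
print(int(max_total), chip_possible)
# ~39,600 SNPs in total, i.e. roughly the ~65% of the 61,749 annotated chip SNPs
# reported above, with intergenic SNPs as the bottleneck category.
```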
Initially, the SNP chip included ~102 SNPs per 1 Mb window across the Ae. albopictus genome, with SNPs on 549 of the 574 scaffolds of the AalbF3 assembly (no sites were found in the remaining 25 small scaffolds) (Table S7). Once we genotyped the wild samples and performed quality control, the SNP data set was reduced to 82,731 SNPs, averaging 57 SNPs per 1 Mb window (Table S21), with a relatively even distribution across the genome (Figure S12).
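A minimal sketch of the per-window density calculation, using hypothetical scaffold positions and the 1 Mb window size described above:

```python
# Minimal sketch (hypothetical positions) of counting SNPs in non-overlapping
# 1 Mb windows per scaffold.
from collections import defaultdict

WINDOW = 1_000_000

def snp_density(snp_positions):
    """snp_positions: iterable of (scaffold, position) tuples."""
    counts = defaultdict(int)
    for scaffold, pos in snp_positions:
        counts[(scaffold, pos // WINDOW)] += 1
    return counts

toy = [("chr1", 120_000), ("chr1", 950_000), ("chr1", 1_100_000), ("chr2", 10_000)]
density = snp_density(toy)
print(density)                                   # counts per (scaffold, window)
print(sum(density.values()) / len(density))      # mean SNPs per occupied window
```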
Quality control for wild samples
A summary of the analysis produced by the Axiom Analysis Suite software, with all the thresholds used for QC and genotyping, is included in File S20. A total of 243 samples passed quality control, with 115,346 SNPs “recommended” by the “Best Practices Workflow”, representing 65.76% of all the variants on the chip. From this set, Plink removed 9,725 SNPs missing in more than 20% of the individuals and 11,314 SNPs with a minor allele frequency below 10%. All SNPs passed the HWE test. We removed two mosquitoes because their mean heterozygosity deviated more than 4 SD from the overall mean heterozygosity and two other mosquitoes because of high relatedness (> 0.354, File S7). Thus, 237 samples passed quality control and were used in subsequent analyses.
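The sample-level heterozygosity filter described above (mean heterozygosity deviating more than 4 SD from the mean) can be sketched as follows, assuming a 0/1/2 genotype matrix; this is illustrative and not the QC pipeline itself.

```python
# Hedged sketch (toy data) of the sample-level heterozygosity filter: remove
# individuals whose mean heterozygosity deviates more than 4 SD from the mean.
import numpy as np

def heterozygosity_outliers(geno, n_sd=4):
    """geno: samples x SNPs matrix, 0/1/2 allele counts, -1 = missing."""
    geno = np.asarray(geno, dtype=float)
    het = np.where(geno == 1, 1.0, 0.0)
    called = (geno != -1)
    het_rate = het.sum(axis=1) / called.sum(axis=1)     # per-sample heterozygosity
    z = (het_rate - het_rate.mean()) / het_rate.std()
    return np.where(np.abs(z) > n_sd)[0]                # indices of outlier samples

rng = np.random.default_rng(1)
toy = rng.integers(0, 3, size=(20, 1000))
toy[0] = 1                                              # one artificially inflated sample
print(heterozygosity_outliers(toy))                     # likely flags sample 0
```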
Twenty-three samples did not pass the DQC threshold of 0.82. Twelve of these, all from Ho Chi Minh (HOC), Vietnam, failed DQC, indicating possible species misidentification prior to DNA extraction; this was confirmed by PCR, as they failed to amplify with the Ae. albopictus-specific cytochrome oxidase I primers (Table S6, File S17). These samples were therefore not included in Table 2 or subsequent analyses. Additional samples from Ho Chi Minh from a different collection time, confirmed to be Ae. albopictus, were subsequently genotyped (Table S3) and added to the data set used for population genomic analyses. All the other samples from this collection were confirmed to be Ae. albopictus by PCR.
Linkage disequilibrium analysis
We compared the distance at which r2 decays to half of its maximum value for populations with at least six mosquitoes (Figure S13, Tables S9 and S22) after estimating LD decay with the package PopLDdecay 38. The LD half-life estimates for Ae. albopictus within its native range reveal varying degrees of LD across chromosomes. Chromosomes 1 and 3 show a rapid decay in LD compared with chromosome 2, highlighting differences in their evolutionary histories or recombination events (Fig. 5), or potential misplacement of scaffolds in the genome assembly. We also observed a correlation between sample size and the LD estimates (Figure S14). For chromosome 2, the coefficient of determination (R2) was 0.64 with a negative slope, indicating that the LD half-life estimates decreased as the sample size increased. The slope was also negative for the other two chromosomes, but the R2 was 0.23 and 0.11 for chromosomes 1 and 3, respectively. When we summarized the LD decay by country, it became more evident that there is rapid decay for chromosome 1, with relatively low variability among populations from the same country. The LD estimates for chromosomes 2 and 3 exhibit greater variability (Fig. 5). Due to the limited sampling density, it is not possible to draw patterns from comparisons among countries or geographical regions.
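A conceptual sketch of how the half-decay distance is obtained from a decay curve, using synthetic r2 values rather than PopLDdecay output:

```python
# Conceptual sketch (synthetic decay curve, not PopLDdecay output) of reading off
# an LD "half-life": the distance at which mean r2 first drops to half its maximum.
import numpy as np

def ld_half_distance(distances, mean_r2):
    """distances (bp) and mean_r2 must be sorted by distance."""
    half = mean_r2.max() / 2.0
    below = np.where(mean_r2 <= half)[0]
    return distances[below[0]] if below.size else np.nan

dist = np.arange(0, 100_000, 1_000)
r2 = 0.05 + 0.45 * np.exp(-dist / 15_000)       # toy exponential decay from r2 ~ 0.5
print(ld_half_distance(dist, r2))               # 13000 bp on this toy curve
```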
Genetic ancestry, population structure, and differentiation among native populations
All runs of Admixture with each of the three SNP sets (intergenic, LD1, or LD2) indicated a k value of 5 as the number of ancestral clusters (Figs. 6, S16, and S17). The green genetic cluster primarily covers Japan, indicating a prevalent genetic component in this region. The red cluster is predominantly found in Taiwan and on Okinawa Island in southern Japan, with some genetic admixture detected in East China (HUN and HAN populations). The blue cluster is mainly concentrated in Indonesia but is also observed in Nepal (KAT). However, it is important to note that some of the KAT mosquitoes were from a laboratory strain, which may have influenced the genetic composition. The magenta cluster is most notable in the central region of Vietnam, particularly in Quang Nam Province (QNC). The yellow cluster has the widest distribution, spanning northern Malaysia, the northern and southern regions of Vietnam, Sri Lanka, Cambodia, Thailand, the Maldives, Bhutan, India, and western China. Two of the other algorithms (fastStructure and LEA) suggested up to nine ancestral populations (Files S9 and S10). The differences were mostly due to some of the island samples (Taiwan and Okinawa) forming their own clusters with these methods rather than being included in other clusters.
Because of the consistency of the Admixture results across the three SNP data sets, we chose k = 5 to compare all the algorithms, both by examining the structure plots directly (Figure S15) and by interpolating the Q matrices over the range of the sampling localities (Figs. 6 and S16).
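A hedged sketch of the Q-matrix interpolation step, using scipy's griddata on toy coordinates and ancestry proportions; the maps in Fig. 6 may have been produced with different interpolation software.

```python
# Hedged sketch (toy coordinates and ancestry proportions) of interpolating a
# Q matrix over the sampling area, one grid per ancestral cluster.
import numpy as np
from scipy.interpolate import griddata

# Longitude, latitude of sampling sites and their k=5 ancestry proportions.
coords = np.array([[100.5, 13.7], [106.7, 10.8], [121.5, 25.0],
                   [139.7, 35.7], [110.4, -7.0]])
Q = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
              [0.6, 0.2, 0.1, 0.05, 0.05],
              [0.1, 0.7, 0.1, 0.05, 0.05],
              [0.05, 0.1, 0.8, 0.025, 0.025],
              [0.1, 0.05, 0.05, 0.1, 0.7]])

lon = np.linspace(coords[:, 0].min(), coords[:, 0].max(), 50)
lat = np.linspace(coords[:, 1].min(), coords[:, 1].max(), 50)
grid_lon, grid_lat = np.meshgrid(lon, lat)

# One interpolated surface per ancestral cluster; each would be drawn as a map layer.
surfaces = [griddata(coords, Q[:, k], (grid_lon, grid_lat), method="linear")
            for k in range(Q.shape[1])]
print(surfaces[0].shape)      # (50, 50) grid of ancestry proportions
```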
We trained Neural Admixture with nine populations representing the five genetic clusters using the three data sets (intergenic, LD1, and LD2). After training, we plotted the Q matrix for k = 5 for each SNP set and observed that, with the intergenic SNP set, Neural Admixture correctly assigned the populations to each known genetic cluster and detected the known admixture in the OKI population (Figure S15). For the LD1 SNP set, which followed the authors' recommendation for linkage pruning with an r2 of 0.01, Neural Admixture correctly identified the five genetic clusters but did not detect the admixture in the OKI population during training; it did detect it when performing inference with all populations (Figures S16 and S17). For the LD2 SNP set, for which linkage pruning did not follow the authors' recommendations (r2 of 0.1), Neural Admixture did not identify the five genetic clusters, either after training with the same nine populations or when inferring the ancestry of the 28 populations (Figures S15 to S17).
The advantage of Neural Admixture is its speed relative to the other programs. It took only seconds for training and inference using two CPUs (central processing units) and GPUs (graphics processing units) on the HPC (high-performance computing) cluster, whereas the other methods had longer running times (hours to days) using approximately 300 CPUs. For example, fastStructure with the logistic prior took the longest (up to 8 days).
To evaluate the robustness of the groupings retrieved from the above analyses, we also ran PCA and clustering analyses using the three data sets (intergenic, LD1, and LD2). Although the PCA analyses did not show different clustering among data sets (Fig. 7), the variance explained for the intergenic SNP set was lower than that found with the two sets pruned at different r2 values.
The PCAs run in Plink with the same three SNP data sets identified three major clusters rather than the five retrieved by the admixture analyses (Fig. 7), which might be due to the use of only two principal component axes. However, the clusters in the PCA space followed patterns similar to those from the admixture analysis. Some samples from Nepal clustered with the samples from Indonesia (Fig. 6). The samples from East Asia (Table S3, Fig. 2) clustered along PC1, with samples from mainland Japan clustering at the right (LD1 and LD2), followed by samples from Taiwan and then samples from the Japanese island of Okinawa. Next, the Chinese samples clustered with the Vietnamese samples at the left of the East Asia cluster. The admixture analysis indicated that samples from mainland Japan belong to a different ancestral group than those from Okinawa (Fig. 6), while samples from Taiwan clustered with Okinawa along with samples from East China. Therefore, although the East Asia samples clustered along PC1, they followed the clustering patterns observed in the admixture analyses for the “green” and “red” clusters in Fig. 6.
The PCA results using the R package adegenet correctly identified the five genetic clusters (Fig. 8). The clustering using axes 1 and 3 aligned with the Admixture analysis but failed to separate the QNC population. However, scatter plots of the discriminant functions provided a clear delineation among the predefined population groups (Fig. 8). These discriminant functions are specifically tailored to accentuate genetic differences between groups and successfully captured the expected population structure. Given that PCA maximizes total variance without regard to group labels, the overlapping clusters observed in the PCA plots could reflect more continuous genetic variation across populations or the presence of shared genetic polymorphisms to which PCA is sensitive but which are not informative for group differentiation.
We estimated genetic differentiation (Fst) using the three SNP sets and fitted a linear regression to our estimates. As expected, the Fst estimates for the intergenic SNP set were lower, but the overall patterns were similar for the LD1 and LD2 SNP sets: all three data sets had positive slopes (Figure S18). Next, after estimating the geographical distances between the sampling sites (km), we fitted a linear regression for countries with at least three sampling localities (Fst/(1 − Fst) ~ log(distance)) (Figure S19). The coefficient of determination (R2) was higher for China (1.00) and Thailand (0.38), suggesting isolation by distance, although the number of sampling localities was small. When we used all 28 populations, the R2 was 0.00.
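The isolation-by-distance regression described above, Fst/(1 − Fst) ~ log(distance), can be sketched with toy values as follows:

```python
# Sketch of the isolation-by-distance regression, Fst/(1 - Fst) ~ log(distance),
# using scipy's linregress on toy values (not the study's Fst estimates).
import numpy as np
from scipy.stats import linregress

fst = np.array([0.02, 0.05, 0.08, 0.10, 0.12])          # pairwise Fst (toy)
dist_km = np.array([50, 200, 600, 1500, 3000])          # pairwise distances (toy)

y = fst / (1 - fst)                                      # linearized Fst
x = np.log(dist_km)
fit = linregress(x, y)
print(f"slope={fit.slope:.4f}, R2={fit.rvalue**2:.2f}")
# A positive slope with a non-trivial R2 is the signature of isolation by distance.
```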
We created a matrix with the Fst values (upper triangle) and the geographical distances in km (lower triangle) for all sampling sites, sorted the matrix by distance, and observed that the Fst values increased as the distance between sampling sites increased (Table S23). Next, we used the R package adegenet 47,48 to evaluate isolation by distance in populations with at least four mosquitoes (Table S3) using the LD2 SNP set. The Mantel test indicated a correlation of 0.20, which suggests no significant deviation from the random expectation (p = 0.053) (Figure S20). The coefficient of determination (R2) of 0.05 from a fitted linear regression model indicated that approximately 5% of the variation in genetic distance can be explained by geographic distance (Figure S20). The positive slope of the regression line is consistent with isolation by distance, with greater geographic distances correlating with greater genetic distances within the native range of Ae. albopictus; however, only a small proportion of the genetic distance is explained by geographical distance.
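A hedged sketch of a Mantel permutation test, conceptually similar to the test run through adegenet/ade4, using toy symmetric matrices rather than the study data:

```python
# Hedged sketch of a simple Mantel permutation test (toy matrices, not the
# study data; the R analysis used adegenet/ade4 rather than this code).
import numpy as np

def mantel(genetic, geographic, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(genetic, k=1)
    g, d = genetic[iu], geographic[iu]
    obs = np.corrcoef(g, d)[0, 1]
    count = 0
    n = genetic.shape[0]
    for _ in range(n_perm):
        perm = rng.permutation(n)
        g_perm = genetic[np.ix_(perm, perm)][iu]
        if np.corrcoef(g_perm, d)[0, 1] >= obs:
            count += 1
    return obs, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
geo = rng.random((10, 10)); geo = (geo + geo.T) / 2; np.fill_diagonal(geo, 0)
gen = 0.3 * geo + 0.1 * rng.random((10, 10)); gen = (gen + gen.T) / 2; np.fill_diagonal(gen, 0)
r, p = mantel(gen, geo)
print(f"Mantel r = {r:.2f}, p = {p:.3f}")
```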
Among the five genetic groups, the Fst values are higher for the “blue”, “red”, and “pink” clusters, while their admixture proportions are lower (Figure S21 and Fig. 6). The “blue” cluster mainly covers Indonesia, the “red” cluster mainly covers Taiwan, Okinawa Island, and East China, and the “pink” cluster is mainly in Vietnam, specifically in Qui Nhon City (QNC), although populations from South Asia show admixture with this ancestral group. The “yellow” cluster shows the highest admixture and the lowest Fst values and covers South and Southeast Asia (Figs. 2, 6, and S21).
The three Japanese populations above latitude 30°N are more genetically differentiated than those below 30°N (average Fst = 0.12 vs. Fst = 0.09); interestingly, these populations are known to undergo photoperiodic diapause while the others do not (Table S24). The Fst estimates for samples from South Asia (India, the Maldives, Malaysia, and Nepal) were the highest (average Fst = 0.12 for the region), followed by the samples from Southeast and East Asia.
To evaluate the effect of chip bias, we used six SNP data sets. Three (C10, C3, and C5) were corrected by sampling the SNPs based on their functional category to replicate the proportions observed in the WGS data (Table 5). The other three SNP data sets (U2, U8, and U9) included the same number of randomly sampled SNPs. The main difference between these sets was the number of shared SNPs: the three bias-corrected SNP sets (the C set) shared 48.02% of their SNPs, whereas the uncorrected SNP sets (the U set) shared only 8.35% (Figure S22). The sNMF-LEA analyses with the bias-corrected sets (the C set) generated the same overall patterns observed with the other three SNP sets (intergenic, LD1, and LD2; Figures S16 and S17). However, for the SNP sets without bias correction (the U set), the clustering of QNC and TAI differed in two runs, with their placement being swapped (Figures S23 and S24). Therefore, the bias correction improved the performance of LEA in detecting the five genetic clusters correctly, without any population being misassigned.
The clustering of the samples in PCA space was very similar for all data sets using only the first two principal components, but the SNP sets with bias correction were more uniform, since the variance explained by PC1 and PC2 was lower (Figure S25). For the SNP sets without bias correction (the U set), the PC1 variance was approximately 10% higher than for the bias-corrected sets (the C set). Additionally, the direction of the clusters in the PCA was not uniform for the sets without bias correction (Figure S25).