The 21 selected microsatellite markers amplified in most of the species of the C. variipennis complex, with a few exceptions (Table 3). All 21 markers were found to be polymorphic, with the number of alleles ranging from 11 to 37 (Mean ± SD = 26.4 ± 7.4; Table 2). More specifically, allelic diversity ranged from 3 to 15 (Mean ± SD = 8.6 ± 3.4) alleles per marker for C. albertensis, 4 to 12 (8.4 ± 3.2) for C. mullensi, 4 to 16 (8.0 ± 3.9) for C. occidentalis, 4 to 20 (13.0 ± 3.8) for C. sonorensis, and 4 to 14 (8.6 ± 3.7) for C. variipennis. Deviation from HWE was observed for most markers in each species. This deviation stemmed from significantly positive FIS inbreeding coefficients observed for the majority of the markers and most species, with levels of observed heterozygosity lower than expected (Table 3). Note that the positive FIS values may be overestimated because only a few individuals per species were sampled over an expansive range (i.e., the Wahlund effect). Results from the linkage disequilibrium analysis suggest that most genotypes at one locus were independent of genotypes at another locus. The exceptions were markers C45, C728, and C995, which appeared to be linked (P = 0.004, 0.03, and 0.058), and markers C94 and C2085 (P = 0.05) (Supplementary Table S2). Note that only a single marker from each of these two groups was later used in the four- and seven-marker datasets.
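To illustrate how pooling samples across a wide range can inflate FIS, the following minimal sketch computes observed heterozygosity (Ho), expected heterozygosity (He), and FIS = 1 − Ho/He for a single locus. It is not the software used in this study: the function name, the allele sizes, and the genotype counts are all hypothetical, and the simple estimator shown applies no sample-size correction. Each made-up subpopulation is in Hardy-Weinberg proportions on its own, yet the pooled sample shows a heterozygote deficit.

```python
from collections import Counter


def fis_single_locus(genotypes):
    """Return (Ho, He, FIS) for diploid genotypes given as (allele, allele) tuples.

    Uses the simple estimator FIS = 1 - Ho / He without sample-size correction.
    """
    n = len(genotypes)
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Expected heterozygosity: 1 minus the sum of squared allele frequencies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_alleles = 2 * n
    he = 1.0 - sum((c / total_alleles) ** 2 for c in counts.values())
    fis = 1.0 - ho / he if he > 0 else float("nan")
    return ho, he, fis


# Hypothetical data: two subpopulations, each in Hardy-Weinberg proportions,
# but with opposite allele frequencies (0.75 / 0.25 versus 0.25 / 0.75).
pop_a = [(150, 150)] * 9 + [(150, 154)] * 6 + [(154, 154)] * 1
pop_b = [(150, 150)] * 1 + [(150, 154)] * 6 + [(154, 154)] * 9

fis_a = fis_single_locus(pop_a)[2]
fis_b = fis_single_locus(pop_b)[2]
fis_pooled = fis_single_locus(pop_a + pop_b)[2]

print(f"FIS pop A: {fis_a:.2f}, pop B: {fis_b:.2f}, pooled: {fis_pooled:.2f}")
```

In this toy example, FIS is zero within each subpopulation but positive once the two are pooled, which is the pattern expected under a Wahlund effect.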
The overall dataset of 21 markers was successful in the species-level differentiation of all specimens, though the clustering of individuals using a PCA revealed that two species, C. albertensis and C. variipennis, slightly overlap (Figure 2a). The clustering of individuals using a STRUCTURE analysis suggested the presence of five distinct clusters in the dataset (best K = 5; Figure 1d; individual assignments for other values of K are provided in Supplementary Figure S1). This clustering using microsatellite markers corresponds to five different species, as it closely mirrors the results of the SNP dataset with the same samples from Shults et al. (2021) (Figure 1c). Importantly, individuals mostly belonged (>85% [mean = 98%] assignment probability) to a single genetic cluster when using the overall dataset of 21 microsatellite markers (i.e., unambiguous assignment to the correct species). RF analysis on the overall dataset also suggested that markers C226, C728, C838, and C1450 had the highest influence in differentiating species, followed by markers C589, C2085, and C1241 (Figure S2). When using most of the microsatellite markers (18-marker dataset), the OOB error rate was 1.3%. The confusion matrix indicated that a low rate of misidentification might occur with C. sonorensis and C. variipennis samples, while no misidentification occurred among samples from the three other species (Figure S3).
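As a reading aid for the RF results, the sketch below shows how an overall error rate and per-class error rates can be derived from a confusion matrix of out-of-bag predictions. The matrix, the generic class labels, and the function name are hypothetical and do not reproduce the values in Figure S3; rows are assumed to hold the true class and columns the predicted class.

```python
def oob_summary(confusion, labels):
    """Return (overall_error, per_class_error) from a square confusion matrix.

    Assumes confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    overall_error = (total - correct) / total
    per_class_error = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        # Fraction of this class's samples that were assigned to another class.
        per_class_error[label] = (
            (row_total - confusion[i][i]) / row_total if row_total else float("nan")
        )
    return overall_error, per_class_error


# Hypothetical out-of-bag confusion matrix for five classes (made-up counts).
labels = ["species_A", "species_B", "species_C", "species_D", "species_E"]
confusion = [
    [10, 0, 0, 0, 0],
    [0, 8, 0, 0, 0],
    [0, 0, 12, 0, 0],
    [0, 0, 0, 18, 2],
    [0, 0, 0, 1, 9],
]

overall_error, per_class_error = oob_summary(confusion, labels)
print(f"Overall error rate: {overall_error:.1%}")
for label, err in per_class_error.items():
    print(f"{label}: {err:.1%}")
```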
When the seven-marker dataset was analyzed (i.e., C226, C728, C838, C1450, C589, C2085, and C1241), almost no overlap was found between individuals from distinct species on the PCA (Figure 2b). Similarly, STRUCTURE analysis revealed confident segregation of the individuals into the different species (Figure 1e), as most individuals (N = 75) were unambiguously assigned to the correct species (>85% [mean = 95%] assignment probability). Only four samples had assignment probabilities lower than 85% to the correct species cluster: one sample of C. occidentalis (63%), one sample of C. albertensis (71%), and two samples of C. variipennis (80 and 83%). Additionally, the clustering closely mirrored the results from both the full 21-marker microsatellite dataset and the SNP dataset. This finding suggests robust segregation of samples into the different species using seven microsatellite markers. The RF analysis provides further support for species delineation using these markers, revealing an OOB error rate of 1.9% (Figure S3). The confusion matrix identified a misclassified sample of C. mullensi, C. sonorensis, and C. variipennis.
When individuals were plotted on a PCA using the four-marker dataset (i.e., C226, C728, C838, and C1450), individuals within the same species mostly clustered together, despite some overlap (Figure 2c). Similarly, the STRUCTURE analysis revealed that individuals mostly clustered into their respective species (Figure 1f), with most individuals (N = 65) being correctly assigned (>85% [mean = 87%] assignment probability). However, 14 individuals had a mixed assignment (<85% assignment probability), with four of them having less than 50% assignment to their correct species, hampering full confidence in identifying species using only four markers. This finding was confirmed by an RF analysis that revealed a small but non-negligible OOB error rate of 6.3% (Figure S3). The confusion matrix revealed multiple misclassified samples belonging to C. mullensi, C. sonorensis, and C. variipennis.
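The 85% and 50% cut-offs used above can be expressed as a simple tally over per-individual membership coefficients. The following is a minimal sketch with made-up values, not the script used in this study; the function name, record layout, and cluster labels are hypothetical, and treating a value of exactly 85% as mixed is an assumption.

```python
def summarize_assignments(records, threshold=0.85):
    """Tally individuals by membership in their expected cluster.

    records: list of (expected_cluster, {cluster: membership}) pairs.
    Returns counts of confident (> threshold), mixed (<= threshold),
    and, among the mixed, those below 0.5 for the expected cluster.
    """
    confident = mixed = below_half = 0
    for expected, memberships in records:
        q = memberships.get(expected, 0.0)
        if q > threshold:
            confident += 1
        else:
            mixed += 1
            if q < 0.5:
                below_half += 1
    return {"confident": confident, "mixed": mixed, "below_half": below_half}


# Hypothetical membership coefficients for five individuals (made-up values).
records = [
    ("cluster_1", {"cluster_1": 0.97, "cluster_2": 0.03}),
    ("cluster_1", {"cluster_1": 0.91, "cluster_3": 0.09}),
    ("cluster_2", {"cluster_2": 0.80, "cluster_1": 0.20}),
    ("cluster_3", {"cluster_3": 0.63, "cluster_2": 0.37}),
    ("cluster_3", {"cluster_3": 0.42, "cluster_1": 0.58}),
]

summary = summarize_assignments(records)
print(summary)
```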
Lastly, the microsatellite locus C508 was found to amplify only in C. sonorensis (Figure 3), the only proven vector species within the C. variipennis complex. In total, 79 individuals spanning 14 geographic locations were tested at this marker: C. albertensis from four populations, C. mullensi from one population, C. occidentalis from three populations, C. sonorensis from nine populations, and C. variipennis from four populations (Figure 1a and Table S1). Many more samples and populations need to be tested to confirm this species-specific amplification; however, in the samples tested here, there does not appear to be any geographical bias in amplification. Individuals of C. albertensis, C. occidentalis, and C. variipennis collected from the same location as individuals of C. sonorensis showed no amplification at this marker. Note also that this marker was not included in the RF analyses above because of the substantial amount of missing data (i.e., non-amplification in the four sibling species) (Table 3).