Study region and species
The study domain lies along the South African coastline, which is one of the most biodiverse marine systems in the world [79]. This region has also been identified as hotspot for ocean warming as it is experiencing environmental change at a faster rate than predicted [80]. In South Africa, the coastline is characterised by SST increasing with longitude, from the cool-temperate Benguela region on the West Coast to the sup-tropical Dalgoa region on the East Coast (Fig. 1).
The study species were selected as their distributions span several bioregions and the natural environmental gradients of southern Africa (Fig. S1, Additional File 1), and can represent the high- (C. punctatus), mid- (S. granularis), and low- (P. angulosus) rocky shore ecotypes [57]. They also differ in life history traits with C. punctatus being a brooder, and S. granularis and P. angulosus being broadcast spawners, with PLDs varying from ~ 5-15 days (S. granularis and C. punctatus) to potentially up to 50 days (P. angulosus; [34,35,81]). These species are each ecologically important; either as dominant grazers or scavengers, as substrates for other species to either live on, or as shelter for juvenile abalone [57].
A total of 14 sites, spanning ~ 2 200 km of the South African coastline, were sampled for S. granularis and P. angulosus, and 13 sites spanning ~1 800 km were sampled for C. punctatus (Fig. 1). These sites incorporate the natural environmental (e.g. SST, salinity, air temperature) gradients in the region, as well as the distributional range per study species (Additional File 1; 56].
Laboratory protocols and bioinformatics
Genomic data consisted of pooled ezRAD-seq samples, as it is a cost-effective approach to obtain precise allele frequency data [41]. Dorant et al. (2019; [77]) found that Pool-seq inflated FST values relative to individual-based sequencing approaches, but still gave highly similar allele frequency outputs and patterns of population structure. Thus, while the absolute magnitude of FST values may be upwardly biased relative to sequencing individuals, for a fraction of the cost Pool-seq data still allow us to infer relative patterns of population structure with confidence [82].
Genomic RAD-seq data was previously obtained for the study species from 11 of the 20 sample sites ([83]; L. Mertens pers. comm.). Additional sampling was conducted at the remaining sites during July 2018, with 30-40 individuals collected from each site (Tables S1-S3, Additional File 1). Individuals were preserved in 100% ethanol, from which < 25mg tissue (gonad from P. angulosus, foot from S. granularis, and muscle from C. punctatus) was taken for DNA extractions. Extractions were performed with the Qiagen DNeasy Blood & Tissue kit following the manufacturer’s protocols. The quality of the DNA extractions was assessed on 1% agarose gels and quantity was determined using the Qubit Quant iT dsDNA HS Assay system at the Central Analytical Facility at Stellenbosch University (CAF-SU). All extractions passing quality and quantity checks were stored at -200C. For each species, equimolar amounts of DNA from each individual were pooled per sample site, flash frozen and sent to the Hawaii Institute of Marine Biology (HIMB) for library preparation following [84] (further outlined in [83]). Equimolar pooled ezRAD libraries [37] were sequenced (V3, 2x300PE) on the Illumina Mi-Seq platform at University of California, Riverside.
The quality of raw FASTA reads were viewed with FastQC [85], and then uploaded onto the CAF-SU high performance cluster (HPC) for further analyses (see Table S4, Additional File 1 for outline of analyses). Bases with low quality scores (Q<20), overrepresented sequences and adapter sequences were removed using TrimGalore! [86].
As mitochondrial DNA (mtDNA) markers have different evolutionary characteristics than nuclear markers [46,87], we chose to filter mtDNA-mapped reads from the datasets [52]. In order to separate mtDNA from nuclear sequences, the quality-trimmed reads were first mapped onto mitogenome references of closely related species (Purple mottled shore crab, Cyclograpsus granulosus, NC_025571.1; Rea sea urchin, Loxechinus albus, JX888466.1; Fingered limpet, Lottia digitalis, DQ238599.1) using BWA-MEM ([88]; Table S5, Additional File 1). The mapped reads were converted to BAM files, sorted and filtered using SAMtools v.1.3 [89], and then merged using BAMtools [90]. The merged BAM files were converted back to SAM and used to filter the quality-trimmed reads, removing putative mtDNA markers before mapping, using the ‘filterbyname’ command in BBMap [91].
Given that there are no reference genomes for these or closely-related species, de novo assemblies were created, using quality-trimmed reads that were normalized to a coverage of 100X with BBMap ‘bbnorm’, and using k-mer value ranges identified with K-mer Genie [92]. The reads were assembled with three different programs: ABySS [93], MEGAHIT [94] and SPAdes [95]. Because SPAdes can only handle nine input samples at a time, we assembled half of each species’ samples at a time, and then merged the two SPAdes assemblies using GARM [96]. The outputs of the three assemblers were compared using QUAST v.4.1.1 [97] and the NCBI BLASTN v.2.4.0+ algorithm [98]. Metrics such as N50 and L50 values, and number of BLAST hits, were used to select a de novo assembly for further analysis.
We also mapped mtDNA-filtered reads to available reference genomes of the Purple urchin (GCA_000002235.4; 990.915 Mb), Owl limpet (GCA_000327385.1; 359.506 Mb), and Chinese mitten crab (GCA_003336515.1; 258.8 Gb) for comparison. Because these species are distantly related to our focal taxa, we had to relax SNP calling parameters (mapping quality >10, minimum pool coverage=10), but found that overall patterns of population structure were consistent between both approaches, mirroring the findings of [99]. As de novo assemblies have been shown to lead to more robust inferences than mapping onto loosely related genomes [100], we present only the more stringent de novo assembly approach here.
The mtDNA-free, but not normalized, reads were mapped onto the de novo assemblies with BWA-MEM. The subsequent SAM files were converted into BAM files, sorted, indexed and filtered with SAMtools. To control for sequencing biases, we down-sampled SAM files to the median number of reads across all pools with SAMtools. A synchronized multiple pileup file was created for each species with SAMtools ‘mpileup’, followed by the Popoolation2 ‘mpileup2sync.jar’ commands [101]. Final SNP calling was performed with the ‘popsync2pooldata’ function of the poolfstat v.0.0.1 R package [102]. To avoid potential biases associated with unequal sequencing of individuals within the pool, and since fewer SNPs at higher coverage has been shown to be more effective than a greater number of SNPs at lower coverage [103], we chose a stringent coverage cut-off of > 20X per pool [102,104], and a minor allele frequency (MAF) of > 0.01 in each pool, following [77,104]. To account for the possibility of loci being physically linked (linkage disequilibrium: LD) we further used custom R scripts to randomly select one SNP per 1 000bp per contig.
Assessing gene flow and potential drivers of population structuring
Characterising genomic differentiation
To assess genomic population structuring, pairwise Weir and Cockerham’s FST values from the LD-pruned SNP dataset were calculated using the ‘computeFST’ function of poolfstat, the confidence interval (CI) values of which were computed with a custom bash script from [77] using 1 000 bootstrap iterations. Nei’s genetic distances matrices were generated with the ‘stamppNeisD’ function of the R package StAMPP, and visualized in Principal Coordinates Analyses (PCoAs) generated with the ‘pco’ function in the ecodist R package [105].
Additionally, the allele frequencies of all SNPs per species were input into the core model of BayPass v.2.1 [45] to estimate scaled covariance (Ω) matrices. BayPass is specifically designed to handle Pool-seq data, and uses allele-frequencies to create an Ω matrix, which can be interpreted as pairwise estimates of differentiation and population structure. BayPass was run under default conditions to create the Ω matrices, which were then converted into a correlation matrices using the ‘cov2cor’ function in R stats package, and visualized as similarity matrix heatmaps.
Seascape features
The various seascape genomic analyses included a standard set of environmental features as predictor variables. A total of 20 environmental features were considered (Additional File 3), including air temperature and precipitation of the coldest month, warmest month, the range between coldest and warmest months, as well as annual mean between 1970 and 2000, which were downloaded from the WorldClim database at a resolution of ~1 km [106]. Annual mean, coldest ice-free month, and warmest ice-free month, and the range in SST between 2002 and 2010 and annual mean, monthly minimum and maximum, and range in sea surface salinity between 1955 and 2006 were downloaded from the MARSPEC database at a resolution of ~1 km [107]. Mean surface dissolved oxygen, diffuse attenuation coefficient, pH, and chlorophyll concentration between 2000 and 2014 were downloaded from the BIO-ORACLE database at a resolution of ~9.2 km [108]. Environmental features were downloaded for each sample site with the ‘load_layers’ function of the sdmpredictors R package [109]. We tested collinearity between predictor variables using pairwise Spearman’s correlation coefficients and Benjamini‐Hochberg (BH) corrected p-values (p < 0.05; [110]). We removed variables that were significantly correlated (r>0.65), and those with a variance inflation factor (VIF) >10.
Isolation-by-distance (IBD) versus isolation-by-environment (IBE)
Isolation-by-distance (IBD) and isolation-by-environment (IBE) were tested using Mantel tests. Mantel tests are widely used in landscape genetics to test which spatial features are significant drivers of genetic differentiation [111]. IBD was assessed with a standard Mantel test, which evaluates the relationship between two matrices (i.e. geographic versus genetic distances) and IBE was tested with Partial Mantel tests, which compare the relationship between two matrices while taking into account the effect of a third (i.e. temperature versus genetic distance, accounting for geographic distance; [111]).
IBD analyses consisted of Slatkin’s linearized pairwise FST (FST = [FST /(1-FST )]; [112]), and log-transformed geographic distances along the coastline calculated with the roadmap tool in QGIS [113], starting from the western-most site for each species. IBE analyses additionally included pairwise Euclidean climatic distances. Partial Mantel tests were performed for each climatic variable separately, with geographic distance as a conditioning variable. Individual Mantel test significance was assessed in ecodist, using 1 000 permutations. To account for multiple tests, p- were converted to q-values and significance was assessed using a False Discovery Rate of 0.05 (FDR) based on BH criteria with the qvalue R package [114].
A multi-model approach to identifying environmental associations with SNPs
To investigate possible associations between SNPs and environmental variables, we used seven different outlier detection methods, using the same seascape features as stated above as predictor variables. As GEAA methods have been shown to vary in the type and number of outliers detected [23,30], the multi-model approach used here allows for more robust inferences. The protocol pertaining to each outlier detection method are outlined below.
BayPass Bayesian hierarchical models
For an FST-like outlier detection approach, the core model of BayPass was run, which uses a hierarchical Bayesian model to create per-locus XTX values, which can be interpreted as an FST values corrected for the scaled covariance (Ω) of population allele frequencies [45]. BayPass v.2.1 was run under default conditions to create XTX values. As described in [45], a pseudo-observed dataset (POD) was created to estimate the posterior predictive distribution of XTX values, and candidate SNPs were selected if they fell within the 99.9% quantile of the POD XTX distribution.
For a GEAA-like approach, the auxiliary model in BayPass was run to identify candidate SNPs due to associations with environmental variables. The auxiliary covariate model includes a binary auxiliary variable to classify the association and compute a Bayes factor (BF) for each locus while accounting for multiple testing [45]. After running the model under default conditions, we followed the general rule derived from [115], which identifies outliers as those having a log10 Bayes factor (db) >20 [45].
Latent factor mixed models (LFMM)
Latent factor mixed models (LFMM) use mixed linear models to test for correlation between allele frequencies and an environmental predictor variable while correcting for population structure with latent factors [116]. As such, these models require a priori knowledge of the number of genetic clusters (K). K was inferred from previous mtDNA clustering analyses (K=2 for each species; [33–35]), as it is recommended to estimate K from independent genetic datasets [116]. LFMMs were run separately for each environmental variable using the R package LEA [117] with 10 000 cycles of the Gibbs sampling algorithm, 5 000 burn-in cycles, and 10 replicate runs. For all runs per predictor variable, z-scores were combined, genomic inflation factor was calculated, and candidate loci were selected following using R scripts available from: http://membres-timc.imag.fr/Olivier.Francois/LEA.
Moran spectral outlier detection (MSOD) & Moran spectral randomization (MSR)
Moran spectral outlier detection (MSOD) uses Moran’s eigenvector maps (MEMs) to create power spectrums for each individual SNP, by taking the squared correlation coefficient of allele frequencies with MEM eigenvectors [118]. Candidate SNPs are then identified as having power spectra outside of the average spectrum across all SNPs. Moran spectral randomization (MSR), is then used to identify candidate SNPs that show a strong correlation to environmental variables by building the observed spatial structure into the null model, while accounting for spatial autocorrelation [118].
MEM axes were first created from geographic coordinates using the spdep R package [119], then power spectra corresponding MAFs and MEMs at each site were calculated. Z-scores were calculated for each locus based on the deviation from the average power spectrum following R code from: https://popgen.nescent.org/2016-12-13_MEM_outlier.html. The outlier loci identified by MSOD were then subjected to the MSR randomization approach, which tests the correlation between outlier MAFs and environmental variables, given the power spectra of each SNP. Using the adespatial R package, the MSR was run individually for each environmental variable, with 1 000 permutations. We followed the suggested cut-offs of [118] of 0.01 and 0.05 for MSOD and MSR candidates, respectively.
Redundancy analysis (RDA)
Redundancy analyses (RDAs) are an extension of linear regressions that compare a matrix of dependent variables with multiple independent predictor variables. Linear regressions are calculated between allele frequencies and the climate variables at each site, while the fitted values are simultaneously constrained using a PCA. Environmental variables were centred and scaled, and allele frequencies were Hellinger transformed [120]. All RDAs were performed with the ‘rda’ function of the vegan R package [121]. Significance was assessed from the adjusted R2 value and with an ANOVA following 1 000 permutations. Candidate loci were those that had loading scores ±3 Standard Deviations (SD) of the mean loading for each of the first two constrained axes [28,30].
Distance-based RDAs (dbRDAs) were also run to account for autocorrelation between environmental and geographic distance. Distance-based Moran’s eigenvector maps (dbMEMS), which decompose Euclidean distances into a set of spatial variables [122], were created with the R package adespatial [123]. Significant dbMEMs were selected by first running an RDA solely using the dbMEMs as predictor variables, then using the adjusted R2 value from that RDA as the threshold for the forward selection procedure with the ‘forward.sel’ function in the packfor R package [124].
Outlier variation and functional annotation
Loci that were selected by two or more detection methods (2X outliers) were used to create a statistical ‘outlier dataset’, and these loci were removed from the total SNP dataset to create a ‘putatively neutral dataset’. Intraspecific outlier and putatively neutral variation was compared by running PCA ordinations on the MAFs of each dataset with the vegan package, and plotting the ordinations with the ggplot2 package in Rstudio [125].
Furthermore, we investigated the potential functional roles of outlier SNPs selected by two or more detection methods (2X outliers). The contigs containing the 2X outliers were BLASTed against NCBI non-redundant protein sequence database for crustaceans (for C. punctatus), molluscs (for S. granularis), and sea urchins (for P. angulosus) using Blast2GO [126]. Search results were filtered to only include those which had an E-value less than 10-4, and a minimal alignment length of 20 bp. Gene Ontology (GO) mapping and annotation was conducted on BLAST searches passing quality filters, using default parameters in Blast2GO.