Host and parasite data
Cophylogenetic analyses were conducted on two dataset pairs: 1) PHMITOS - the mitogenome dataset for nine parasites and nine hosts; 2) P18SHMITO − 18S rRNA + ITS for Gyrodactylus parasites (103 species) and mitogenomes of 100 host species. Seventeen Gyrodactylus mitogenomes were available in the GenBank database (last accessed 9th Jun 2022). After removing duplicates (leaving one mitogenome per species) and two mitogenomes for which we failed to find the host information, seven mitogenomes were left in the dataset; to this we added two newly sequenced and annotated mitogenomes, resulting in nine mitogenomes in total. Similarly, 378 Gyrodactylus 18S + ITS gene sequences were downloaded from GenBank (last accessed 9th Jun 2022), and 103 sequences were left after the same filtering procedure. The host information for the above two parasite datasets was obtained from the GyroDb database [20]. We also built two additional P18SHMITO subdatasets for additional analyses. Among the 100 hosts in our study, only 17 were parasitised by more than three gyrodactylids. As Page (1993) argued that host species infected by multiple parasite species could cause incongruence between the host and parasite phylogenies, to assess the impact of these hosts on host-parasite co-evolution analyses, we tested an additional P18SHMITO subdataset, from which these 17 hosts were removed (“17 hosts removed” dataset). Also, to test whether gyrodactylids with low host-specificity interfered with the analyses, we removed gyrodactylids with the top 15% SPSi values from the P18SHMITO dataset (see the definition of SPSi in the “Host specificity” section). Although different species were removed to generate the “removed top 15% SPSi” dataset, the number of gyrodactylids (88) and hosts (85) in this dataset was almost identical to the corresponding numbers in the “17 hosts removed” dataset (82 gyrodactylids and 83 hosts).
Phylogenetic analyses
PhyloSuite v.1.2.3 [21] was used to retrieve the data from GenBank, extract and modify the default taxonomic data, extract mitogenomic data, parse and extract the mitogenome annotations recorded in Word documents, create GenBank submission files, create organization tables for mitogenomes, make genomic statistics of the mitogenomes of gyrodactylids, generate annotation files for the architecture visualisation in iTOL (I Letunic and P Bork [22], and conduct phylogenetic analyses using two plug-in software programs: IQ-TREE (maximum likelihood analysis – ML) [23] and MrBayes (Bayesian Inference - BI) [24]. Two datasets were used: 1. nucleotide alignment of all PCGs + 2rRNAs of the PHMITO dataset (PHMITOS henceforth); 2. nucleotide alignments of 18S + ITS (gyrodactylid) and 13PCGs + 2rRNAs (hosts) of the P18SHMITO dataset. All preparatory steps were also conducted using PhyloSuite and its plug-in programs: sequences were aligned in MAFFT [25] and PCG alignments further refined using MACSE (Ranwez et al.,2011); alignments were concatenated by PhyloSuite; optimal evolutionary model and partitioning strategy were inferred using ModelFinder (for 18S + ITS data) [26] and PartitionFinder2 (for mitogenome data) [27].
Host specificity
For each gyrodactylid, we calculated its basic host specificity, phylogenetic host specificity, as well as geographic host specificity. Basic host specificity was calculated with the number of recorded host species or ‘host species range’. Phylogenetic host specificity was represented by the standardized effect size of Faith’s phylogenetic diversity (PD) of hosts exploited by a parasite, which is calculated using the R function ses.pd in the Picante package [28]. The standardized effect size of phylogenetic diversity (SPSi) is the difference between the observed phylogenetic diversity between hosts of each gyrodactylid and the mean phylogenetic diversity obtained with the same number of hosts generated using a random choice of host species from the tree (null data), divided by the standard deviation of phylogenetic diversities in the null data, inferred by performing 999 runs and 1,000 iterations for PSi, as recommended by Poulin et al. (2011).
$$SP{S}_{i}=\frac{\left(P{S}_{i}-\stackrel{-}{P{S}_{sim}}\right)}{SD\left(P{S}_{sim}\right)}$$
where PSi is the observed phylospecificity of parasite i, \(\stackrel{-}{\text{P}{\text{S}}_{\text{s}\text{i}\text{m}}}\) is the mean phylospecificity of all random host subsets, and SD(PSsim) is the standard deviation of all randomized phylospecificity values. As there is no phylogenetic distance linking the host species when a gyrodactylid has only one host, gyrodactylids with only one host are expected to have the lowest phylogenetic specificity, we manually set the SPSi value of parasites with only one host to correspond to the lowest SPSi value among our results (-4.7).
Geographic host specificity (BSi) was calculated to estimate the dissimilarity in host species identities across localities. The geographic distribution range of hosts was obtained from FishBase (Froese and Pauly, 2022). For parasites, we did not have the geographic distribution range data, so for the purpose of this study, we assumed that the parasite’s geographic distribution perfectly matches that of its host according to NC Harris and RR Dunn [29]. We used an extension of the Sørensen dissimilarity index for multiple sites (BSi) to measure the geographic host specificity:
$$B{S}_{i}=1-\frac{T}{T-1}\left(1-\frac{{S}_{T}}{{\sum }_{t} {S}_{t}}\right)$$
where T is the number of samples or localities, St is the number of host species used by the parasite on the locality t, and ST is the total number of host species used by the parasite species i across all T localities (i.e. the regional host pool). If parasite i exploits the same host species across all localities, then St = ST and BSi = 0. If parasite i uses completely different host species at different localities, then ST = \({\sum }_{t} {S}_{t}\) and BSi = 1.
For the phylobeta host specificity, we used an extension of the Sørensen index to branches instead of species following the principle underlying the construction of the Phylosor index.
$$PB{S}_{i}=1-\frac{T}{T-1}\left(1-\frac{P{D}_{T}}{{\sum }_{t} P{D}_{t}}\right)$$
where T is the number of samples or localities, PDt is the phylogenetic diversity of host species used by the parasite at locality t, and PDT is the phylogenetic diversity of all host species used by the parasite species i across all T localities. If a parasite i exploits the same host species across all localities, we obtain PDt = PDT and PBSi = 0. If parasite i exploits the same host species over all localities, we obtain PDt = PDT and PBSi = 0. If a parasite i uses different host species at different localities, then the PBSi value is inversely correlated to the phylogenetic relatedness of hosts (the less phylogenetically related the hosts are, the higher the PBSi value).
Testing for cospeciation
We used four methods to test the co-speciation for gyrodactylids and their hosts.
1. TreeMap 3 [30] was used to test for statistically significant topological congruence between two given phylogenies, which in turn tests for the hypothesis of a history of codivergence between mimetic populations.
2. Jane 4 [31] is an algorithm for cophylogeny reconstruction. The input file for Jane 4 is also a pair of phylogenetic trees (the host and parasite trees) and a map recording the associations between parasites and hosts. Jane 4 reconciles the host and the parasite tree by introducing five types of events: cospeciation, duplication, duplication & host switching, loss, and failure to diverge events. Each of these five event types has an associated cost that may be specified by the user, and the Jane 4 algorithm seeks to find a mapping with the minimal total cost. The algorithm was run with a population size of 500 for 20 generations using the default costs of three types of events: cospeciation = 0, duplication = 1, and failure to diverge = 1. Based on the characteristics of gyrodactylids (see Introduction) and referring to Hamerlinck et al. (2016), we slightly adjusted the costs of two types of events: duplication & host switch = 1 (default = 2), and loss = 2 (default = 1). In more detail, as gyrodactylids can easily detach themselves from hosts and spread between hosts via contact transmission, we estimated that the cost of host switch should be lesser than that of duplication and the cost of loss should be the highest. The significance of the cophylogenetic signal was estimated by re-running the algorithm on 100 randomly permuted host-parasite associations with the same settings and comparing the resulting cost distribution with the observed cost.
3 and 4. TreeMap 3 and Jane 4 algorithms are topology-based methods, so their precision can be negatively affected by the presence of phylogenetic artefacts in host or parasite topologies. To assess the reliability of the results of these two algorithms, we used additional two algorithms, which make use of raw or patristic distances, thus overcoming the limitations stemming from incompletely resolved topologies: ParaFit [32] and PACo [33] programs. They statistically assess the fit between the host and parasite phylogenetic distance matrices mediated by the matrix of host-parasite links. As input data for the two software programs, we inferred genetic distances from the host and parasite phylogenetic trees using the cophenetic.phylo function in the ape package in R (Paradis and Schliep,2019) and then converted them to matrices.
Phylogenetic patterns in host switch
We compiled all pairs of fish species connected by a gyrodactylid (e.g. if a gyrodactylid had four host species, six fish pairs were recorded). Then, we calculated the phylogenetic distance between each fish species pair using pairwise patristic distance analysis implemented in the TreeSuite function in PhyloSuite and arranged the phylogenetic distances of fish pairs into 10 distance bins. Finally, we calculated the proportion of fish pairs falling into each of the 10 phylogenetic distance bins. In this way, we calculated the probability that a fish pair sharing a gyrodactylid has a given phylogenetic distance.
Evaluation of the divergence time of gyrodactylids
To gain a more comprehensive understanding of the evolutionary history of gyrodactylids and the evolutionary trends of their host specificity, we evaluate the divergence times of gyrodactylids. The topologies obtained using IQ-TREE and P18SHMITO datasets were used for a subsequent dating analysis performed using the MCMCTREE tool of the PAML 4.9 package (Yang,2007). We used two calibration points for the inference of divergence time in gyrodactylids. The differentiation time of the most common ancestor of Gyrodactylus_vimbi and Gyrodactylus teuchis was set between 0.8 and 2.2 Mya (Hahn et al.,2015). The differentiation time of the most common ancestor of Gyrodactylus pannonicus, Gyrodactylus albolacustris, Gyrodactylus danastriae and Gyrodactylus botnicus was set between 1.3 and 1.8 Mya (Lumme et al.,2017). Burnin was set to 400,000 (400 K), sampfreq to 10, and nsample to 100 K. The convergence of MCMCTREE runs was checked using the program Tracer (Rambaut et al.,2018).
Correlation analysis
To compare the relationships between different host specificities, as well as the variation of host specificity throughout evolutionary history, and the contribution of host specificity to the consistency of co-speciation events, we conducted correlation analyses for four kinds of host specificity, divergence time and p values of host-parasite individual links in Parafit. We performed correlation analysis using the Pearson correlation coefficient to assess the relationships between the variables mentioned earlier.
Network analysis
To map the host use by gyrodactylids, we used the community detection analysis, borrowed from network theory, to study network topological communities or modules [34]. First, we created a bipartite interaction network based on the known interactions between gyrodactylids parasites and their hosts. Bipartite networks are composed of two subsets of nodes, or levels (for instance hosts and parasites), where links occur only between nodes of different levels. In contrast to the typical applications of interaction networks that focus on studying co-occurring species within a community, our approach involves constructing a global network that incorporates the documented interactions between mite parasites and their plant hosts, irrespective of their co-occurrence in specific locations. Community detection approaches have also been applied to explore the compartmentalization of occurrence networks [35, 36]. We also created a spatial co-occurrence network (geographic network), which also had a bipartite structure: gyrodactylids and the regions where they occur constituted two different subsets of nodes, establishing a link based on the presence of a given species at a given locality.
When applied to a bipartite network community detection analysis informs which groups of nodes from both levels are more densely connected. In the interaction network, this refers to groups of hosts and their associated gyrodactylids, whereas in the geographic network, this refers to groups of gyrodactylids occurring in the same regions. To analyze the modular structure of both networks, we used the modularity index proposed by Barber [37] for bipartite networks. This index was optimized using the Louvain algorithm [38, 39] and was implemented in the function “Gen Louvain” [40] in MATLAB. Finally, to test whether our networks were more modular than expected for a random network, we compared the observed values of modularity against the distributions of 100 random networks. We then calculated the proportion of random networks that were equally or more modular than the observed network. To generate the null models we used the independent swap algorithm implemented in the R package picante [41]. This algorithm keeps a constant node degree (i.e. row and columns marginal total) as well as network size and connectivity.
Phylogenetic structure of gyrodactylids-host modules
To investigate the phylogenetic structure within modules, we calculated the extent to which taxa belonging to a given module were, on average, more closely related within them than with taxa from other modules. In other words, we conducted calculations for the phylogenetic mean pairwise distances (MPD) between the taxon i and all other taxa within the same module (MPDi intracommunity), and subtracted the mean pairwise distances of the taxon i and all other taxa from different modules (MPDi intercommunity). Subsequently, we determined the relative phylogenetic distinctiveness (RPD) by taking the reciprocal of the average value among all taxa within a module, resulting in higher values when the phylogenetic distinctiveness was greater:
$$RPD=-1\text{*}\frac{{\sum }_{i=1}^{N} MP{D}_{i\text{ intracommunity }}-MP{D}_{i\text{ intercommunity }}}{N}$$
N is the number of species in a given module. To test whether gyrodactylids belonging to a module were more phylogenetically distinct than a random array of lineages, we recalculated this index 9999 times by randomizing the tip labels of the gyrodactylid phylogeny. The probability (P) of being phylogenetically distinctive was then calculated as the proportion of these 9999 null cases being more or equally distinctive than the observed phylogeny. We calculated the proportion of significant (P < = 0.01) cases for each module.
Effects of geographical co-occurrence on the phylogenetic structure of host use
We also explored to what extent the geographic distribution of gyrodactylids accounts for the phylogenetic structure of interaction modules. To do so, we calculated the degree to which gyrodactylids belonging to the same host module were, on average, more closely related among themselves than with the species occurring in the same geographical module. By modifying the RPD index, we calculated MPDintracommunity for each gyrodactylid of a given interaction module in the same way as previously explained. However, when calculating the MPDintercommunity, we only took into account the gyrodactylids that cooccur in the same geographical module. The remaining calculations were done as explained above for the RPD index. Finally, significance was assessed following the above-explained randomization procedure.
The relative weight of phylogeny and geography in shaping host use in gyrodactylids
To measure both host use and geographic dissimilarities we used the Simpson dissimilarity index [42] as implemented in the R package betapart [41]. We then fitted a generalized linear regression to host use dissimilarities as a response to geographical dissimilarities and phylogenetic distances. Further, to identify significant relationships, we randomized the matrix 500 times using the independent swap algorithm as implemented in picante [28] to get 500 null cases according to the method of J Calatayud, JL Hórreo, J Madrigal-González, A Migeon, MÁ Rodríguez, S Magalhães and J Hortal [43]. We calculated host use dissimilarities for these 500 null cases and interpreted coefficients larger than 95% of the null cases as significant.