Study species and sample collection
The Seychelles warbler is a small insectivorous passerine, endemic to the Seychelles archipelago. They are facultative cooperative breeders, defending strict year-round territories [43]. The population on Cousin Island (4°20’S,55°40’E; 0.29km2) has been extensively monitored since 1985, increasing in intensity from 1997 onwards [43, 44]. The Cousin Island population is small, with a carrying capacity of around 320 adult individuals, existing in 115 territories [43, 55, 56], and there is virtually no migration to or from Cousin [57]. Individuals can live for a maximum of 19 years, with a median post-fledging lifespan of 5.5 years [58]. A comprehensive population census is conducted bi-yearly during the major breeding season (June–September) in the south-east monsoon, and the minor breeding season (January–March) in the north-west monsoon [59]. Territory quality varies quantifiably within and between years [60]; thus, it is possible to separate the influence of environmental factors from host-intrinsic factors when investigating GM variation at the individual level.
Individuals are either caught as chicks in the nest, or by mist net. At first catch, each bird is given a metal British Trust for Ornithology (BTO) ring and a unique combination of three colour rings, allowing it to be individually identifiable. Birds are aged based on hatch date, behaviour or eye colour; grey eyes indicate an age < 5 months, light brown eyes are characteristic of sub-adults (5–12 months), and adults (> 12 months) have red-brown eyes [43, 61]. Blood samples (25 µl) are collected by brachial venipuncture and stored in 0.8 ml of absolute ethanol at either room temperature or 4°C.
Captured birds were placed into a clean bag immediately following capture. In the first major breeding season of our study (2017) this was a laundered cotton bag, however, for all following seasons, birds were placed into a single use plastic-lined, flat-bottomed paper bag containing a plastic tray covered by a metal grate, according to an established protocol [62]. The metal grate and tray were sterilised with a 10% bleach solution between use. Individuals were removed from the bag once they had defaecated, or after a maximum of 30 minutes. A sterile flocked swab was used to transfer faecal material into a sterile microcentrifuge tube containing 1 ml of absolute ethanol. If the bird defaecated outside of the bag or tray then a sample was collected. Control samples from possible sources of contamination such as the bag, grate, tray, and fieldworkers’ hands, were taken throughout each sampling season, using sterile flocked swabs. Faecal samples were stored in the field at 4°C, before being transported to the lab, where they were stored at -80°C prior to extraction.
Molecular methods
Genomic DNA (gDNA) was extracted from blood using the DNeasy blood and tissue kit (Qiagen) Individuals were genotyped at 30 polymorphic microsatellite loci [47, 56] and standardised individual microsatellite heterozygosity (Hs) was calculated using the R 3.6.1 function Genhet 3.1 [63]. Sex was determined via PCR [56, 64]. Variation at one non-synonymous SNP within the leucine-rich repeat domain of TLR3 exon 4 was determined following [45].
MHC sequencing and bioinformatics
In total, 314 samples were MHC sequenced, including 229 samples from individuals that had microbiome data and 31 samples from individuals previously MHC screened at either MHC-I [65] or MHC-II [49] using older techniques. The remainder included 30 repeated samples, 23 negative controls (making up at least 5% of each plate), and four samples (one per plate) from one great reed warbler (Acrocephalus arundinaceus) individual to serve as a positive control.
MHC-I exon 3 and MHC-II exon 2 were amplified and sequenced using previously validated primer sets [48, 49] (Additional file 1: Table S1), with the addition of Illumina index sites. Additionally, six random hexamers (N) were added to the first round PCR (PCR1) primers to increase diversity and improve cluster separation [66]. The two primer pairs amplifying MHC-I each included a motif-specific primer situated within exon 3, and one general primer situated in intron 3, and so amplified 262 bp of the full exon (274 bp). These primers had been designed to preferentially amplify functional variants, while avoiding known pseudogenes [48, 67]. The primers for MHC-II, situated within the flanking introns 1 and 2 of exon 2, amplify a 291 bp fragment. These sequences were then edited to the 270 bp MHC-II exon 2 [49]. The term ‘allele’ is used to describe the different variants amplified for each class of MHC, consistent with other publications investigating MHC diversity; however, alleles cannot be assigned to specific (duplicated) loci within each MHC class. Previous work suggests that, in the Seychelles warbler, there are a minimum of four duplicated MHC-I loci and six MHC-II loci [48, 49]. Sequencing of the MHC-I and MHC-II exons was carried out using 2x 250-bp paired-end sequencing on an Illumina MiSeq platform (Illumina, San Diego) (see Additional file 1: Supplementary methods for details).
Processing and MHC genotyping of raw Illumina MiSeq data was conducted using the Amplicon Sequencing Assignment (AmpliSAS) tool [68]. First, FastQC was used to check read quality, before merging pair-ended reads together using AmpliMERGE (10,257,832 merged sequences). AmpliCLEAN was then used to remove low-quality reads (Phred score of <30) and any that were missing either primers or barcodes (e.g., from residual PhiX). MHC-I and MHC-II sequences were separated at this stage, resulting in 3,044,897 raw MHC-I sequences, and 6,144,575 raw MHC-II sequences. Following this step, all downstream bioinformatics and analysis were conducted separately for MHC-I and MHC-II. Cutadapt 1.6 [69] was used to remove MHC-specific primers, the six random hexamers and short reads (<100 bp). For MHC-II sequences, remaining intron regions were also removed, leaving a 270-bp fragment spanning the full exon. AmpliCHECK was used for preliminary data exploration, using Illumina-based default settings. Finally, AmpliSAS was used for demultiplexing, clustering, and filtering reads. First, a subset of 30 duplicated samples were used to optimise parameters for MHC-I and MHC-II, testing both minimum dominant frequency settings for the clustering step, and minimum amplicon depth for the filtering step, as recommended in [68]. Based on these results (Additional file 1: Table S2, S3) Illumina default clustering settings were used (1% substitution errors, 0.001% indel errors, 25% minimum dominant frequency) for both MHC classes. For the filtering step, chimaeras and sequences that only appeared in one sample were removed, and the minimum amplicon frequency was set as 1.6% for MHC-I, and 1.8% for MHC-II. This resulted in 1,267,410 raw MHC-I sequences, and 1,385,049 raw MHC-II sequences. Due to computational limitations, the MHC-II dataset was split into two halves and analysed using AmpliSAS separately, before being combined using AmpliCOMBINE in the web version of AmpliSAS.
For MHC-I, the majority of putative alleles were 262 bp, although three sequences that were <262 bp were present in >80% individuals – these were not homologous to any known MHC gene when checked using blastn. These shorter fragments were removed from downstream analysis. The majority of MHC-II putative alleles were the full 270 bp length, although there were also sequences between 267–269 bp, which were similar to MHC genes (see results). All MHC-II sequences <267 bp in length were not similar to any known MHC genes and, as putative sequencing artifacts, were removed from downstream analysis. MHC-I and MHC-II putative alleles were first checked against all known Seychelles warbler alleles. Any unknown putative alleles were then checked against the GenBank (NCBI) nucleotide database (accessed on 25th June 2020) to assess similarity to known MHC alleles from other related species. Additionally, samples of insufficient read depth based on rarefaction curves, which equated to a minimum read depth of 150 per amplicon for MHC-I, and 100 per amplicon for MHC-II, were removed. For 30 individuals sequenced twice to confirm repeatability, the sample with the greatest read depth was retained. After processing, the total number of reads assigned to an allele was 1,071,525 for MHC-I (mean ± SEM = 4391.5 ± 149.3 per sample) and 1,123,211 for MHC-II (mean ± SEM = 4603.3 ± 888.2 per sample) in the Seychelles warbler.
Microbial extraction, sequencing, and bioinformatics
In total, 400 faecal samples were sequenced across three sequencing runs (two plates per sequencing run). This included 343 unique faecal samples (from 235 individuals) and 14 control samples, the latter of which included six extraction negative controls, four positive controls (using a microbial community standard), and four sampling controls. Additionally, 43 faecal samples were sequenced twice (20 within the same run and 23 across different runs) to determine sequencing repeatability.
Faecal samples were centrifuged for 10 minutes at 10,000 rpm and the supernatant was removed. To remove any residual ethanol, the resulting pellet was washed with 100 µl RNase/DNA-free molecular grade water by centrifuging at 10,000 rpm for 10 minutes. The supernatant was then removed, and the washing step repeated a further two times. Microbial DNA was then extracted from 0.05–0.1 g of each sample using the DNeasy PowerSoil Kit (Qiagen), according to an optimised version of the manufacturer’s instructions. Modifications consisted of a heat block step (65°C for 10 minutes) prior to bead-beating, and elution of DNA in a final volume of 60 µl elution buffer. A ZymoBIOMICS microbial community standard (D6300) was extracted as a positive control using a ZymoBIOMICS DNA miniprep kit (Zymo Research), according to the manufacturer’s instructions.
Extracted gDNA was quantified using a Qubit 4.0 Fluorometer (Invitrogen) with a Qubit dsDNA HS assay kit (Invitrogen). Aliquots of gDNA were shipped on dry ice to the Centre for Genomic Research, University of Liverpool for library preparation, pooling and sequencing. Bacterial barcoding was performed with a 2-step amplification process using the primers 515F (5'TGCCAGCMGCCGCGGTAA3’) and 806R (5’GGACTACHVGGGTWTCTAAT3’) [70], which amplify the V4 region of the 16S rRNA gene (see Additional file 1: Supplementary methods for details). Paired-end sequencing was carried out using 2x 250-bp paired-end sequencing on an Illumina MiSeq platform (Illumina, San Diego).
For each sequencing run, raw reads were first trimmed using Cutadapt 1.2.1 [69] to remove Illumina adapter sequences. Reads were further trimmed using Sickle 1.200 with a minimum window quality score of 20, resulting in totals of 12,308,047, 9,397,303 and 9,831,508 demultiplexed reads for the three runs, respectively (mean ± SEM per sample: 102,567.1 ± 10,454.8, 67,123.6 ± 6,633.1, 70,225.1 ± 5,423.5). Sequences were then analysed using QIIME2 2019.10 [71]. Based on overall quality scores, the first 10 bases of each read were trimmed, and sequences truncated to 210 bp for both forward and reverse reads. The DADA2 plugin 2019.10.0 was used to join paired-end reads, denoise, remove chimeras and residual PhiX reads, dereplicate and call amplicon sequence variants (ASVs) [72, 73]. Following this, results from the three separate runs were merged, resulting in a total of 22,997,693 reads (mean ± SEM per sample: 57,494.3 ± 3424.8) with 36,182 ASVs. A mid-point rooted phylogeny was then constructed using the masked alignment MAFFT [74] and the Fast Tree approach [75]. Taxonomic assignment of ASVs was performed by training a naïve-Bayes classifier on the SILVA 132 16S dataset using 99% sequence similarity [76, 77]. Plastid-like and archaeal sequences were removed, as well as singletons which likely represent sequencing errors. Additionally, two ASVs – one from the genus Defltia (relative abundance of 90.5% in a negative extraction control from the first run) and one from the genus Limnobacter (relative abundance of 99.9% in a negative extraction control from the third run) were removed as probable contaminants. This resulted in a total of 21,904,965 reads (mean ± SEM per sample: 54,899.7 ± 3,429.5) with 35,428 ASVs. The final sample metadata, ASV and taxonomy tables were all exported from QIIME2 into R 3.6.1 where they were processed using phyloseq 1.28.0 [78]. Sample completeness and rarefaction curves were generated using iNEXT 2.0.20 [79]; completeness plateaued at approximately 10,000 reads and 34 samples (including all six negative extraction controls) with fewer reads were excluded from downstream analyses. Overall, 320 unique samples (93%) were retained from 224 individuals.
The repeatability of GM sequencing was tested by comparing the 37 samples that were sequenced multiply within and across sequencing runs. Euclidean dissimilarity between pairs of samples was compared using one metric of alpha diversity (the Shannon diversity index), and two metrics of beta diversity (unweighted and weighted UniFrac of between and within duplicated samples) using Kruskal–Wallis tests.
Statistical analyses
Unless otherwise stated, all analyses were conducted in R 3.6.1.
To characterise the Seychelles warbler GM, samples sequenced twice for repeatability analysis (sample duplicates) were filtered such that only the sample with the greatest read-depth was retained for downstream analysis. Samples collected from the same bird during the same catch session (catch duplicates) were also filtered to retain the single sample with the smallest potential exposure to external contamination, i.e., samples collected from cleaned trays were prioritised over those collected from other substrates, then the sample with the highest read depth was prioritised. The removal of sample and catch duplicates resulted in 281 samples (from 224 individuals). For all alpha diversity, beta diversity and differential abundance analyses, microbiome samples taken from chicks were excluded due to a small sample size (n = 11). Individuals with incomplete MHC genotype data (n = 25) were also excluded. Lastly, to prevent pseudo-replication, where an individual had multiple samples taken at different capture events, a single sample was selected at random, giving an overall sample size of 195 samples from 195 individuals from Cousin Island. This resulted in a total of 10,998,587 reads (mean ± SEM per sample: 56,403 ± 4181.0) with 35,428 ASVs in the non-rarefied dataset.
Alpha diversity
All 195 samples were rarefied to a depth of 10,000 reads, based on sample completeness curves, leaving a total of 1,950,000 reads and 27,861 ASVs across samples in the rarefied dataset. Analyses were run using both rarefied and non-rarefied data, however, as results were comparable between datasets and library size was highly variable across samples, only the outcome of models using the rarefied dataset are presented here. Three metrics of alpha diversity were calculated: Chao1 [80] (a measure of microbial species richness), Shannon diversity index [81] (a measure of species richness, taking into account sample evenness), and Faith’s phylogenetic diversity index [82] (a measure of the phylogenetic diversity of a sample). Chao1 and Shannon diversity indices were calculated using phyloseq 1.28.0 [78], and Faith’s phylogenetic diversity was calculated using btools 0.0.1 [83]. Both Chao1 and Faith’s phylogenetic diversity were log-transformed in models to improve residual fit.
Linear models with a Gaussian distribution were constructed using glmmTMB 0.2.3 [84] to determine whether the alpha diversity of the GM differed with: (1) the presence or absence of individual immune genes, and (2) immune gene diversity. The first set of models included the presence or absence of all MHC-I and MHC-II alleles that were present in at least 15% of sampled individuals, and that were the correct length (see above). This included the following alleles: Ase-dab3, Ase-dab4, Ase-dab5, Ase-ua1/10, Ase-ua3, Ase-ua4, Ase-ua5, Ase-ua6, Ase-ua7, Ase-ua8, Ase-ua9 and Ase-ua11 (Ase-ua1 and Ase-ua10 were perfectly correlated, so only Ase-ua1 was included in analyses). The second set of models contained MHC-I diversity, MHC-II diversity and the squares of each of these terms, since optimal, rather than a greater diversity, of MHC alleles could be more beneficial [85]. Both sets of models also included individual heterozygosity (Hs), and TLR3 genotype (TLR3AA, TLR3AC or TLR3CC). The field period sampled (major 2017, major 2018 or minor 2018), sex (male, female), and age (fledgling, old fledging, sub-adult or adult) were also included in models, as these factors have been shown to influence GM variation in other studies. Alpha diversity (Shannon, log Chao1, or log Faith) was entered as the response variable. In all models, continuous factors were standardised (scaled and centred) using arm 1.10-1 [86]. All biologically relevant interactions were initially included in models but were removed prior to model averaging to enable interpretation of first-order effects, as all were non-significant (P < 0.1). Field period and territory quality were correlated (linear model; F2,185 = 117.2, P < 0.001), so only field period was included in the models. Collinearity between independent variables was tested using variance inflation factors ensuring an upper limit of three. Collinearity between the presence and absence of immune alleles was further assessed using GGally 2.0.0 [87]. DHARMA 0.2.4 [88] was used to confirm that there was no over or under dispersion, or residual spatial or temporal autocorrelation in the models. Model averaging – an information-theoretic approach using the dredge function in MUMIn 1.43.6 [89] – was used to select plausible models. All models within seven AICc of the top model were included in the averaged model, to obtain the final conditional model [90].
Beta diversity
The unrarefied dataset was further filtered to remove ASVs that appeared in fewer than five samples, and that had a total read count of <50 across samples, based on an assessment of overall ASV prevalence and abundance (Additional file 1: Fig S1). Overall, 1,944 out of 35,428 ASVs were retained following filtering. To account for uneven sequencing depth across samples, reads were normalised using the cumulative sum scaling function [91] in metagenomeSeq 1.26.3 [92]. Two beta diversity metrics that incorporate phylogenetic distance were then calculated using phyloseq 1.28.0 [93]; these were unweighted UniFrac distance (based on the presence/absence of microbial taxa) and weighted UniFrac distance (a quantitative measure which also accounts for differences in the abundances of microbial taxa) [94]. Marginal Permutational Analysis of Variance tests (PERMANOVAs) were used to assess whether GM community composition differed with immune gene characteristics, using the adonis2 function in Vegan 2.5.6 [95] with 10,000 permutations. As with alpha diversity models, two sets of PERMANOVA tests were constructed for each beta diversity metric, with the first set of models including the presence/absence of MHC alleles, and the second set of models including MHC diversity; other variables were included as in alpha diversity models. To clarify whether significant differences detected in PERMANOVA tests were caused by differences in mean values, rather than variation in dispersion across groups [96], homogeneity of group dispersions was tested using the betadisper function in Vegan 2.5.6 [95]. Principle coordinate analysis (PCoA), based on weighted and unweighted UniFrac distances, was used to visualise the differences in composition between groups.
Differential abundance analysis
To assess whether particular ASVs were differentially abundant between groups of individuals with different immune gene characteristics, DESeq2 1.24.0 [97] was used. For this analysis, unrarefied reads were filtered but untransformed, as DeSeq2 uses its own variance stabilising transformation to account for variation in library size across samples. DeSeq2 estimates the log2-fold change in microbial abundance between sample groups using a negative binomial distribution to model ASV counts. Only variables that were associated with significant compositional shifts in PERMANOVA tests (see above) were included in this analysis to avoid over-parametrization (Additional file 1: Table S4). To account for the large number of zero counts for individual ASVs, the “poscounts” estimator was included when estimating size factors. Differential ASV abundance was assessed using negative binomial Wald tests and P values were adjusted using the Benjamini and Hochberg false-discovery rate correction, with a significance cut-off of P < 0.01. Two ASVs did not converge due to a high number of zero counts across samples; these were removed from the analysis.