CNV discovery in the genome build ARS-UCD1.2
We performed CNV detection on bovine autosomes using the PennCNV software [27], based on Illumina BovineHD BeadChip (Illumina, San Diego, CA, USA) genotypes of animals (n=475) from two distinct dairy breeds (Holstein Friesian – HOL (n=331), Jersey – JER (n=115)) and their crossbreds (n=29). The SNPs were aligned to genome assembly ARS-UCD1.2.
We discovered 14,272 CNV calls from 451 individuals that passed the quality control criteria (31.6 calls/individual). Deletion calls were 1.8 times more frequent but 40% shorter (n=9,171, mean length=44.2 kb) than duplication calls (n=5,101, mean length=74.6 kb; Table S1 and Figure S1). The mean probe density (number of supporting SNPs per Mb CNV) was 403 SNPs/Mb. The 14,272 CNV calls were aggregated into 1,755 CNV regions (CNVRs), based on at least 1 bp overlap, following Redon et al. [28]. These CNVRs cover 2.8% of the autosomal genome sequence (69.6/2,489.4 Mb; Figure 1; A full list of CNVR is in Table S2.). These CNVRs consist of 1,125 deletion CNVRs (mean length=29.2 kb), 513 duplication CNVRs (mean length=36.8 kb), and 117 complex CNVRs (mean length=152.7 kb). The distribution of CNVR length is exponential, where the majority CNV are short to medium length (<100 kb, 93%), while only a few observations were made for long CNVRs (>100 kb, 7%). The CNVRs are non-randomly distributed over the chromosomes: chromosome-wide CNVR coverage varies from 0.6% on BTA12 to 4.9% on BTA24 (Table S3). BTA12 is most densely covered with CNVR in terms of bp (4.2 Mb), and especially enriched for complex type CNVRs (2.2 Mb). Allele frequency of CNVRs ranges between 0.001 and 0.21.
Since most cattle CNV studies used genome assembly UMD3.1, we also repeated the CNV detection procedures, using UMD3.1. Subsequently, we used these calls to assess our CNV discovery results with other cattle CNV papers. From the 447 individuals that passed the QC criteria, 24,264 CNVs were called (54.3 calls/individual) and the mean probe density was 326 SNPs/Mb. These CNVs were aggregated into 1,866 CNVRs (1,130 deletions, 593 duplications, and 143 complex CNVRs). The mean length of deletion, duplication, and complex CNVRs was 29, 36, and 193 kb, respectively (Table S1). These CNVRs together cover 82 Mb (3.3%) of bovine autosomes. The chromosome-wide coverage varies between 1% on BTA24 and 10% on BTA12 (Table S4 and Figure S2). Compared to other cattle CNV studies conducted using the same SNP array and the genome assembly UMD3.1 [22,24,29–32], our CNV discovery results are in the same range as previously detected CNV studies (Table S5).
When we compared to our CNVs discovered based on UMD3.1 and ARS-UCD1.2, we observed several differences. Firstly, the number of CNVs called per individual based on ARS-UCD1.2 is 42% lower than what was obtained using UMD3.1. Also, the mean probe density increased from 326 SNPs/Mb in UMD3.1 to 404 SNPs/Mb in ARS-UCD1.2, indicating that with ARS-UCD1.2, CNVs are supported by more SNPs. Lastly, the mean length of complex CNVRs decreased by 40kb, from 193 kb in UMD3.1 to 152.7 kb in ARS-UCD1.2. We further inspected BTA12:70-77 MB region where a large change between UMD3.1 and ARS-UCD1.2 was observed. This region was reported to have a large number deletion and duplication calls by other cattle CNV studies based on UMD3.1, regardless of the studied breeds [24,29–33]. In our CNV discovery, we identified 7 CNVRs (total length of ~6.2 Mb) in this region based on UMD3.1, whereas ARS-UCD1.2 based results revealed 9 CNVRs that covered ~1 Mb. We compared the positions of BovineHD SNPs in UMD3.1 and ARS-UCD1.2 to see whether the changes in genome assemblies caused this discrepancy. The results showed that 43% of the SNPs located in BTA12:70-77Mb based on UMD3.1 were either moved to unmapped contigs or reference and alternative SNPs were undefined. The genome-wide ratio of SNPs that were moved to different chromosomes or contigs was much lower (2.3%) than 43%. This indeed indicates that the two genome assemblies differ in this regions, and thus led to different CNV discovery results.
Functional impact of CNVRs
The expression of genes can be altered by CNVs. Deletions and duplications of a part of and/or complete gene can disrupt the gene expression and can potentially lead to changes in various phenotypes [34]. Therefore, identification CNVRs that coincide with genes can be a primary step to assess their functional impact. To achieve this, we explored CNVRs found based on ARS-UCD1.2 further. The overlap of CNVRs with Ensembl annotated genes were analysed, and among the 1,755 CNVRs, 912 (52%) were genic and 843 (48%) were intergenic. Genic CNVRs overlap with 1,739 genes out of 27,570 Ensembl annotated genes (6.3%) and 2,936 out of 43,949 gene transcripts (6.7%). Among the 1,739 genes that overlap with CNVRs, 957 (55%) are completely within the CNVRs and the rest (45%) are partially affected (genic features were inside the CNVRs). The following functional impact categories were assigned to each CNVR depending on types of overlap between CNVRs and genes (numbers in the brackets indicate number of CNVRs and genes respectively for each category; see materials and methods for detailed explanation for the classification): 1) intergenic (843 CNVRs; 0 genes), 2) intronic (214 CNVRs; 234 genes), 3) whole gene (253 CNVRs; 957 genes), 4) stop codon (147 CNVRs; 203 genes), 5) promoter regions (124 CNVRs; 187 genes), and 6) exonic (174 CNVRs; 165 genes). Then, these functional categories were intersected with other features of CNVRs such as types (deletion, duplication, complex), MAF (common, intermediate, and rare; see methods for detailed explanation), and the populations (HOL and JER; Figure 2). The functional consequences of CNVRs differ depending on the type of CNVRs: deletion and duplication CNVRs are biased away from genic regions (51-52% were genic), whereas complex CNVRs were biased towards genic regions (68% were genic). Also, we observed that MAF have impact on different types of overlap between genes and CNVRs. Rare CNVRs tend to be genic more often (60%), whereas common CNVRs have less overlap compared to it (48%). However, when seen it separately for deletion CNVRs and duplication CNVRs, we saw a different pattern. Common deletion CNVRs were more often intergenic (61%), yet the common duplication CNVRs were often genic (68%). When CNVRs between HOL and JER are compared, common JER CNVRs are more often genic (51%), than common HOL CNVRs (44%). Subsequently, we performed permutation tests on overlaps between CNVRs and autosomal genes, to test whether the overlap is significantly higher than expected under a neutral scenario. The results show that CNVRs overlap with autosomal genes more often than what is expected from permutation tests with random genomic regions (P<0.001). Subsequently, gene ontology analyses were performed to understand the functions of the genes that overlap with CNVRs. Genes overlapping deletions, duplications, and complex CNVRs were tested for GO enrichment as separate classes (Table 1). Among the findings, genes overlapping with the complex CNVRs (n=407) show a pronounced enrichment in response to stimulus (GO:0050896; FDR=1.8 X 10-6), immune response (GO:0006955; FDR=1.9 X 10-3), and detection of stimulus involved in sensory perception (GO:0050906; FDR=1.1 X 10-2). These findings are similar to the findings from earlier cattle CNV studies [30,33].
Population genetics of CNVRs
Population genetics analyses provide a framework to understand genetic variation seen in specific (cattle) populations. Understanding general properties of genetic variants is important, but further characterization of specific variants of interest can bring insights in recent adaptation and genome biology [35]. Although SNPs have been extensively used in characterizing various cattle populations [36], we explored the population genetic properties of CNVRs.
We focused our analyses on HOL (n=315) and JER (n=107) animals, derived from distinct origins and with a different breed formation history [37]. First, we coded the genotypes of our bi-allelic CNVRs (n=1,154 for HOL; n=700 for JER) as “+/+”, “+/-”, and “-/-”. The CNVR allele frequency was classified as rare (MAF<0.01), intermediate (0.01≤MAF<0.05) and common (0.05≤MAF). In HOL, the allele frequency ranged from 0.002 to 0.29, and 5%, 13%, and 82% of the 1,154 CNVRs were categorized as common, intermediate, and rare CNVRs, respectively. For the JER population, allele frequency ranged from 0.005 to 0.37, and 11%, 20%, and 69% of the 700 CNVRs were categorized as common, intermediate, and rare CNVRs, respectively.
We constructed site frequency spectra of CNVRs for HOL and JER separately (Figure 3). For both populations, we observed that deletions and duplications have slightly different spectra, where deletions were more skewed towards rare CNVs, whereas duplications were observed relatively more frequent than deletions in each MAF class. We further explored the allele frequencies by applying Wright’s fixation index (Fst) [38] to characterize population structure [39] and detect loci that underwent selection [40], as done in Yali Xue et al. [41]. Given that HOL and JER have distinctive origins and breed formation history [37], we hypothesized that Fst on their CNVRs can reveal regions that underwent recent population differentiation. We identified 32 highly diverged CNVRs (Fst > mean + 3 S.D.) of which 15 are genic and 17 are intergenic (Figure 4 and Table S6). Among the 17 intergenic CNVRs with high population differentiation (Fst=0.12-0.44), 7 CNVRs had regulatory elements such as lncRNA and snoRNA within ~300 kb from the CNVRs. Among the genic CNVRs, CNVR 380 (Fst = 0.21; duplication), which is more frequent in JER (MAF = 0.24) than in HOL (MAF = 0.04), contains three genes, CLEC5A [42], TAR2R38 [43], and MGAM. The known functions of these genes include abnormal eating behaviour, bitter taste perception, and the synthesis of maltase glucoamylase, a starch digestive enzyme. Furthermore, CNVR 826, 1312, and 1458 overlap with genes that are known to regulate body size: LRRC49 [44], CA5A [45], and ADAMTS17 [46–48], respectively. Interestingly, these CNVRs are duplications and have a high allele frequency in JER (MAF=0.08–0.37), and a low allele frequency in HOL (MAF=0–0.06).
Linkage disequilibrium of CNVRs
There has been a large number of genome-wide associations (GWAS) performed using SNPs in livestock species, aiming to unravel genomic regions related to phenotypes of interest [49]. This approach exploits a large number of tagging SNPs that are in sufficient LD with causal variants. Under this framework, genetic variation caused by the causal variants is captured by the tagging SNPs, without knowing the exact causal variants. Thus, the genome-wide level of LD between SNP markers and causal variants is an important foundation of GWAS [50]. We showed that CNVRs overlap with genes more often than would be expected by chance, and that CNVs are thus likely to have an influence on phenotypes. The important follow-up question is whether the variations from CNVs are already captured by SNPs typed on commercial arrays, which are commonly used in livestock breeding programmes. We, therefore investigated pairwise LD between bi-allelic CNVRs and neighbouring SNPs on the BovineHD SNP chip. We observed generally low r2, close to zero, regardless of the distance between CNVRs and SNPs (results not shown). Subsequently, we categorized CNVRs by their allele frequency and type to investigate whether these factors influence the degree of LD. Common CNVRs have markedly higher LD (r2=~0.1 for deletion CNVRs at ~10 kb distance), compared to other CNVR categories (Figure S3). As common CNVRs had higher LD than the rest, we compared the LD of common CNVRs with the LD of SNPs in the same MAF range (0.05 ≤MAF <0.29 for HOL and 0.05 ≤ MAF < 0.37 for JER). We observed distinct LD decay patterns between the CNVR-SNP pairs and SNP-SNP pairs (Figure 5A and 5B). SNP-SNP LD follows a typical LD decay pattern where strong LD is observed with SNPs in vicinity and gradual decline as the distance increases, whereas CNVR-SNP LD does not follow this pattern. Also, compared to the CNVR-SNP LD (r2=~0.1 at ~10 kb distance), the frequency matching SNP-SNP LD was stronger (r2=~0.5 at ~10 kb distance). Afterwards, we used another metric to assess LD: taggability. Taggability is the maximum r2 among the r2 values that are obtained from a variant of interest and SNP pairs. We calculated taggability for SNP-SNP pairs and CNVR-SNP pairs. For the CNVR-SNP pairs, we considered common deletion CNVRs only, as they showed the highest LD in the previous analyses. Then, mean taggability for each MAF class (bin size=0.05) was plotted (Figure 5C and 5D).The mean taggability of common deletion CNVRs is low (<0.1) when MAF is below 0.05, and it increases as MAF increases. The SNP mean taggability follows the same pattern as shown in common deletion CNVRs. However, in spite of the similar pattern, common deletion CNVRs taggability is below the level of the SNP taggability. This shows that there is a gap in SNP taggability and CNVR taggability.
Interesting CNVR
A large number of QTLs has been identified from various GWAS on a wide range of traits. As most GWAS have been done using SNP markers, chances are that genetic variation caused by a CNV could have been captured by QTLs that are in a high-to-perfect LD (r2 =~1) with the CNV. Hence, inspecting CNVRs that are in high LD with QTLs is a preliminary step to identify potentially causal CNVs. To identify candidate causal CNVs, we subset the CNVR-QTL pairs, from the total CNVR-SNP pairs, based on the QTL information from the animal QTLdb [51]. We then subset the CNVR-QTL pairs further based on r2, and kept high LD CNVR-QTL pairs only.
In total ~100,000 bovine QTLs for various traits have been reported in the animal QTL database, and we identified 2,519 QTLs to be paired with 679 CNVRs within a distance of 100 kb in the HOL population. Among these, CNVR 547 (BTA6:84,395,081-84,428,819, deletion, MAF=0.24) had the highest LD with 13 QTLs (average r2 =0.59; max r2 = 0.74). The 13 QTLs were associated with casein proteins, which constitute four out of six bovine milk proteins. The four genes coding for the casein proteins are located in the so called casein cluster, which is ~1 Mb distant region from CNVR 547 (BTA6:85.4-85.6 Mb). Given the degree of LD for CNVR 547 and the QTLs that is lower than perfect linkage, it is unlikely that the CNVR 547 is the causal variant for the casein protein traits. Nevertheless, CNVR 547 was an interesting variant as it was only shown in HOL population with high MAF (0.24), and was close to the casein cluster that are highly relevant for dairy production.
Assuming that CNVR 547 is not the causal variant for the casein traits, a possible explanation for the high MAF can be selective sweeps. Selective sweeps increase allele frequencies of neutral variants that are in LD with the selection target variant, which in this case probably is the casein cluster. Two studies of Holstein populations support this hypothesis. Firstly, one selective sweep study in a German Holstein population revealed an extended range of LD in haplotypes that contain the casein cluster [52]. Secondly, GWA study on casein traits in a Danish Holstein population identified a broad GWAS peak (BTA6:60-100 Mb) that contains the casein cluster [53]. The broad GWAS peak (~20 Mb) also indicate high LD in this regions, that matched with the findings from Qanbari et al. [52]
Another explanation for the high MAF of CNVR 547 might be the direct selection on the variant itself. For instance, CNVR 547 overlaps with the UGT2B4 gene, which is involved in the detoxification pathway of exogenous compounds [54]. To see whether CNVR 547 overlaps with regulatory elements, besides overlapping with the upstream region of the UGT2B4 gene directly, we called promoters and enhancers from ChipSeq data from Villar et al. [55]. CNVR 547 overlaps not only with the upstream (a start codon and the first two exons), but also with the enhancer of UGT2B4 (BTA6: 84,413,246-84,413,740), and is thus likely to disrupt the function of the UGT2B4 gene. To summarize, our analyses imply that a high MAF of CNVR 547 might be due the selective sweep in the casein cluster or the consequence of direct selection on CNVR 547 itself due to the functional impact of the overlap with UGT2B4 and its enhancer. Nonetheless, we cannot exclude drift as a possible driver changing the allele frequency of the CNVR 547.