Dataset
The human reference genome GRCh38 was downloaded from the National Center for Biotechnology Information database (36). Polymorphisms, including CNVs, were obtained within the frame of the 3rd phase of 1000 Genomes Project and are available from the European Bioinformatics Institute (https://www.ebi.ac.uk) under the ID: estd214. In 1 000 Genomes Project a variety of methods was used to identify polymorphisms. Primary data resulted from oligonucleotide genotyping, low-coverage whole genome sequencing as well as exome sequencing, with complete genomics and high-coverage PCR-free sequencing of selected samples serving for validation. As much as nine programs were used for structural variant calling and the final set of SV was the merged output of all the software. SVs were then validated using various methods, including microarrays, PCR-free whole genome sequencing and PacBio sequencing, as well as PCR. The estimated false discovery rate for CNVs was below 5% (2).
In our study, from all available genetic polymorphisms, only CNVs considered as duplications or deletions were extracted. Overlapping CNVs where considered independently, resulting in 5 867 duplications and 33 181 deletions. Length of duplications ranged between 3 006 bp and 988 090 bp, with median of 37 036 bp and mean of 66 527±91 091 bp. Length of deletions ranged between 204 bp and 2 258 238 bp, with median of 3 774 bp and mean of 12 143±34 749 bp (Fig. 4).
Reference genome sequence features
The Samtools software (37) was used to extract regions covered by CNVs from the GRCh38 reference genome. Moreover, reference sequences flanking CNVs coordinates (100 nucleotides up- and downstream of each CNV) were extracted. These regions were considered in the context of unknown nucleotides (denoted as “N”), Guanine-Cytosine pairs, sequence complexity and functional annotation. Unknown nucleotides and Guanine-Cytosine pairs content were calculated in all considered sequences ( i.e. CNVs and their flanking regions). In order to compare regions covered by CNVs with regions of the genome not affected by CNVs, random sequences were chosen. First, random coordinates were generated using Python the “random” module, then regions defined by those coordinates were extracted from the reference genome using the Samtools software. In total four sets of random sequences were generated: (i) Set 1 contained 5 859 sequences of length equal to the median length of duplications (37 036 bp), (ii) Set 2 contained 33 175 sequences of length equal to the median length of deletions (3 774 bp) , (iii) Set 3 contained 5 867 sequences of 100 bp length and was used for comparisons with sequences up- and downstream of duplications, (iv) Set 4 contained 33 181 sequences of 100 bp length and was used for comparison with sequences up- and downstream of deletions. All sequences containing unknown nucleotides were excluded and re-randomized. The number of random sequences in each Set matched the number of sequences in the corresponding CNV and flanking groups, as defined above. The percent of GC pairs was calculated for all sets of sequences using Python (38). The distributions of GC-content were tested for normality using the Kolmogorov test. The H0 stating that the distributions of GC-content follow the normal distribution with mean and variance given by the considered data sets. The test statistics, which is defined as the supremum of difference between theoretical and empirical distribution, has the same distribution as the classical Kolmogorov statistics. Furthermore, the distributions of GC-pairs content of true CNV-related sequences were compared with the corresponding randomised Sets using the Wilcoxon–Mann–Whitney test, with H0 stating that the distributions of GC-content are equal. The normalised test statistic is given by:
where , denotes ranks corresponding to the GC-pairs percentage classes in the random sequences, n is a count of deletion/duplication/flanking CNVs regions and m is a count of sets with random sequences.
Sequence complexity
Sequence complexity of the entire reference genome was estimated using the sDust software (39). The overlap between low-complexity regions defined by sDust and CNV-related regions was determined by using the bedtools software (40) for true CNVs and flanking regions, as well as for the random Sets. The distributions of low-complexity sequence contents in CNV and flanking regions as well as in randomised data were compared using the Wilcoxon-Mann-Whitney test. Testing and visualisation were created using the R package (41).
Functional annotation
The Variant Effect Predictor (VEP) software (42) was used for the functional annotation of CNVs. Gene Ontology enrichment (35) was tested using the Fisher's Exact test with the False Discovery Rate (FDR). Moreover, significantly enriched signalling pathways from the Panther (35) database, was identified using the KOBAS tool (43) applying the Fisher's Exact test with FDR.