Detection of candidate NRS
An assembly-versus-assembly approach was used for each assembly to identify unaligned sequences (<90% identity) to the human reference genome (GRCh38.p12, hg38) (Fig. 1, Methods). Apart from hg38, our study included 31 de novo assemblies: 17 from LRS (PacBio or nanopore sequencing technology), 13 generated with next generation sequencing and one from Sanger sequencing (Additional file 1). We further examined the BUSCO completeness score, length distribution of structural variants as assessed by Assemblytics and composition of TE elements to ensure that these assemblies are of high quality for downstream analysis. All the assemblies present comparable completeness scores (94.4±0.81,mean±SD) except for ASM101398v1 (BUSCO score 90.3) and the percent of TE elements in all assemblies were all at similar levels (0.51±0.01,mean±SD) (Additional file 1). All the assemblies shows similar pattern of length distributions of structural variants (Additional file 2).
The final unaligned sequences spanned, on average, 12.9 Mb for LRS assemblies, which were much larger than the other assemblies (2.3 Mb on average) (P<0.01, unpaired two-tailed Student's t-test) (Fig. 2). For each assembly, 70-80% of the preliminary unaligned sequence contents belonged to simple sequence repeats or low complexity sequences and were removed to obtain the final unaligned sequences for each assembly. After removing redundant sequences and sequences shorter than 400 bp, we obtained the unaligned sequences from each assembly, which were then merged into a unified non-reference call set of 15,055 sequences (hereafter referred to as NRS) adding up to 129.1 Mb with a median length of 2,848 bp (N50=1,066 bp). Furthermore, 78.6 Mb of the NRS had no alignment with hg38 using the criteria of >80% identity and 50% coverage as adopted by [5]. Nevertheless, the call set of 129.1 Mb were used in downstream analysis. In order to compare with four previous studies (methods), we aligned our NRS to each of the four datasets in a reciprocal manner with BLAST to determine the intersection with each of the studies. In total, 13.8 Mb NRS intersected with previous reports whereas the rest (115.3 Mb) has not been identified before (Additional file 3).
We aligned all the NRS to genomes of four great apes and found that 1,211 sequences (4.32 Mb) were present in at least one of the four great apes (identity ≥95% and coverage ≥80%). The presence of each NRS was also examined in each of the human de novo assemblies and those which were present in at least two of the 31 de novo assemblies and great ape assemblies were determined as non-private sequences. Taken together, 28.2% (4,248) of the NRS spanning 29.3 Mb were non-private sequences, indicating that they are of high confidence (Methods and Additional file 4).
Placing candidate NRS to the reference genome
We next determined the genomic locations of the NRS by aligning their flanking sequences to hg38 (Methods), by which we resolved the precise location of 6,112 NRS (40.5% of the total NRS) adding up to 12.8 Mb. Notably, 13 sequences reside in the remaining gaps of the reference genome (Additional file 5). Another 2,711 NRS were anchored to chromosomes without a precise location due to sequence gaps. The remaining sequences cannot be placed on hg38 due to a lack of flanking sequences or conflict anchoring information from the two sides.
For the precisely placed sequences, we can determine whether they belong to insertions or alternate alleles as described in next section. For the unplaced NRS and those without precise locations, 2,855 were non-private sequences adding up to 25.8 Mb, indicating that they are of high confidence (Additional file 6). Although we could not determine the precise locations of these sequences, their discovery will greatly expand the repertoire of sequence diversity in the human genome.
Insertions within NRS
We first determined the insertion events for the precisely placed NRS. A total of 1,571 (3.2 Mb) were found to be insertions including 769 non-private sequences (Fig. 3a). Furthermore, 246 were present in more than half of the assemblies, indicating that they could represent major alleles. Notably, 56.8% (881) of the insertions, including 158 non-private ones, were firstly described in our study. Principal component analysis (PCA) of all the insertions based on their presence in the 16 LRS de novo assemblies of Pacbio sequencing showed a population-specific pattern (Fig. 3e). PC1 clusters African samples away from other populations, while PC2 further separates the East Asians from the Europeans, Americans and South Asians.
Alternate alleles within NRS
We further found that many NRS represent an alternate allele to their counterparts in the reference instead of insertions. To identify alternate alleles, the NRS should share less than 90% identity (or none) and have comparable length with reference alleles (Methods). In this way, 3,041 NRS (6.38 Mb) were identified as candidate alternate alleles. The remaining 1,500 precisely placed NRS did not meet our criteria of insertions or alternate alleles and thus were classified as ambiguous sequences. Unlike insertions, the alternate alleles represent allelic sequences (Fig. 4a and 4d). Notably, a long alternate sequence of 24,676 bp was anchored to chr6: 29,955,749-29,986,299, which belongs to the class I major histocompatibility complex (MHC gene) (Fig. 4b) and potentially harbors a novel HLA-B gene (Additional file 7). This allele was present in two African assemblies (YRI, NA19240), in one American assembly (ASM311317v1) and in the gorilla genome whereas absent in other assemblies. Moreover, the reference allele was found in chimpanzees, indicating the presence of ancestral polymorphism at this locus. Furthermore, this alternate allelic sequence was not reported in the NCBI nr/nt database or in the human HLA database (IPD-IMGT/HLA database), suggesting that this alternate sequence represents a putative novel MHC allele.
Among the alternate alleles, 1,348 intersected with the genic region of 1,143 protein-coding genes. The genomic distribution of the alternate alleles was dispersed throughout the genome, and those belonging to non-private sequences are shown in Fig. 5. A total of 59 non-private alternate alleles intersected with the genic region, and five of them further intersected with the CDS region of eight genes: HLA-W, MICD, HCG9, DDX39BP2, LOC107985837, ZNF880, PLIN4 and LOC728715.
Only 2.6% (80 out of 3,041) of the alternate alleles have been identified before but were miss-classified as insertions. Therefore, most of the identified alternate alleles described in our study are newly reported. The discovery frequency of alternate alleles in human assemblies appears to be lower than that of insertions (Fig. 3a and 3b). Most alternate alleles were present in less than half of the 31 assemblies, indicating that many of them represent minor alleles in the human genome. Nevertheless, 144 alternate alleles were non-private (Additional file 8), with the longest one found in 17 assemblies and spanning 19,330 bp (genomic placement position: chr7, 62408641-62451864). The ambiguous sequences also included a number of putative insertions and alternate alleles (Fig. 3c) and deserve further efforts for verification. The length distributions of the alternate alleles and insertions did not differ (p=0.09, Kolmogorov-Smirnow test) (Fig. 3d).
Similar to the insertions, PCA also showed that PC1 clusters African samples away from other populations, while PC2 separates the East Asians from the Europeans, South Asians and Americans (Fig. 3f). We also explored the potentially transcribed sequences that either have mapped RNA-seq reads (≥10 reads in at least two samples) or hits to the human expressed sequence tag database (dbEST) (e-value <1e-5). We totally identified 74 transcribed alternate alleles from RNA-seq data and 238 with hits to the human dbEST, resulting in a total of 244 potentially transcribed sequences (Additional file 9). One alternate allele was found to be expressed in a tissue-specific manner (Fig. 4b and 4c), which is potentially a long non-coding gene since we couldn’t annotate it to any known gene by searching NCBI nr/nt database with BLAST. The putative novel MHC allele was also found to be expressed (Fig. 4e and 4f).
To explore the origin of the alternate alleles, we analyzed associated repeats with NRS. Transposable elements (TEs) compose approximately 45% of the human genome (Lander et al. 2001), and a previous study showed that insertions were associated with TEs [13]. The percent of TEs within alternate alleles (10.0%) was much lower than that of insertions (55.1%) (Fig. 6a). The flanking sequences (5 kb on each side) of the alternate alleles also had less TE content (33.3%) than the insertions (48.0%) (Fig. 6b), suggesting that the alternate alleles are not associated with TEs. We then screened the tandem repeat content among the sequences. The alternate alleles possessed a much higher content of tandem repeats either within the sequences (Fig. 6c) or in the flanking sequences (5 kb on each side, Fig. 6d) compared with the insertions. Notably, the reference allele also be enriched in tandem repeat when the alternate allele have a large content of tandem repeat (R2=0.65, Fig. 6e and Additional file 10), thereby implying that tandem repeats are associated with alternate alleles.