Evaluating simple structural variant detection with HG002. To benchmark the performance of SV detection tools on HG002, we followed the procedure introduced by the Genome in a Bottle (GIAB) consortium and adopted by CuteSV. Briefly, the high-confidence insertion and deletion calls and high-confidence regions published by GIAB were used as ground truth. The HiFi reads were aligned to reference hg19 by pbmm2 (https://github.com/PacificBiosciences/pbmm2, v1.4.0) with the parameter ‘--preset CCS’, while ONT reads were aligned with pbmm2 default settings. 5X and 10X coverage subsets of the HiFi and ONT data were then obtained with the SAMtools21 ‘-s’ option. Sniffles (v1.0.12), CuteSV (v1.0.10), pbsv (v2.2.2), SVision (v1.3.6) and SVIM (v1.4.0) were applied to the pbmm2-aligned files with default parameters. The minimum number of supporting reads was set to 2 and 3 for the 5X and 10X data, respectively, and to 10 for the original coverage.
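As a minimal, hedged sketch of the downsampling step, the snippet below wraps the SAMtools ‘-s’ subsampling in Python; the file names, the 50X original coverage and the random seed are placeholders, not values reported in this study.

```python
import subprocess

def downsample_bam(in_bam, out_bam, original_cov, target_cov, seed=42):
    """Subsample an aligned BAM to a target coverage with 'samtools view -s'.

    For '-s SEED.FRAC' the integer part is the random seed and the fractional
    part is the proportion of reads to keep; file names, the seed and the
    original coverage here are placeholders for this sketch.
    """
    frac = target_cov / original_cov              # e.g. 5X from a 50X BAM -> 0.10
    assert 0 < frac < 1, "target coverage must be below the original coverage"
    subprocess.run(["samtools", "view", "-b", "-s", f"{seed + frac:.4f}",
                    "-o", out_bam, in_bam], check=True)
    subprocess.run(["samtools", "index", out_bam], check=True)

# downsample_bam("HG002_hifi.pbmm2.bam", "HG002_hifi.5x.bam", original_cov=50, target_cov=5)
```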
Simulating complex structural variants. Ten complex structural variant (CSV) types were simulated according to frequently reported CSV types introduced by the 1000 Genomes Project (1KGP)1 and a cohort study of autism spectrum disorder (ASD)2 (Supplementary Table S2, Supplementary Note). For the simulation, a CSV was essentially a combination of breakpoints from simple structural variants (SSVs). Therefore, a four-step simulation process was developed as follows. VISOR22 was first used to simulate and randomly implant five SSV types (deletion, inverted dispersed duplication, inverted tandem duplication, tandem duplication and dispersed duplication) into the reference genome GRCh38. Secondly, we followed the procedure introduced by SURVIVOR23 to simulate CSVs, in which SSVs of the above five types were randomly added adjacent to the existing SSVs. In particular, 3,000 SSVs of the five types were created by VISOR with the parameters ‘-n 3000 -r 20:20:20:20:20 -l 500 -s 150’. Then, we added the extra variants required by the predefined CSV structures to the existing SSVs, following the type order deletion, inverted dispersed duplication, inverted tandem duplication, tandem duplication and dispersed duplication; for instance, we first used deletions as seeds to create all deletion-involved CSV instances before turning to instances of the next type. Finally, the variation genome carrying CSVs was used as input for the VISOR LASeR module to simulate 30X HiFi reads, which were subsequently aligned by ngmlr10 (v0.2.7) with the default setting. Note that VISOR was only used to simulate variants on one haplotype in this study (Supplementary Note).
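To make the CSV construction step concrete, the sketch below illustrates how a seed SSV could be extended into a predefined CSV structure by appending adjacent subcomponents; the dictionary fields, gap range and size range are illustrative assumptions, not VISOR's or SURVIVOR's actual data model.

```python
import random

# Subcomponent types used for seeding and extension, in the order described above.
SSV_TYPES = ["DEL", "invdDUP", "invtDUP", "tDUP", "dDUP"]

def build_csv(seed_ssv, csv_structure, gap_range=(50, 500), size_range=(100, 1000)):
    """Extend one seed SSV into a predefined CSV structure by appending the
    remaining subcomponents immediately downstream of the seed.

    'seed_ssv' is a dict such as {"chrom": "chr1", "start": 100000,
    "end": 100500, "type": "DEL"}; 'csv_structure' is an ordered list of
    subcomponent types, e.g. ["DEL", "INV"].
    """
    assert seed_ssv["type"] == csv_structure[0], "seed must match the first subcomponent"
    components = [dict(seed_ssv)]
    cursor = seed_ssv["end"]
    for sub_type in csv_structure[1:]:
        gap = random.randint(*gap_range)        # short gap between adjacent subcomponents
        size = random.randint(*size_range)
        components.append({"chrom": seed_ssv["chrom"], "start": cursor + gap,
                           "end": cursor + gap + size, "type": sub_type})
        cursor += gap + size
    return components
```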
Evaluating simulated complex structural variant detection. To examine the correctness of detected CSVs, we used breakpoint closeness and size similarity to assess whether two events are identical, following the approach of Truvari (https://github.com/spiralgenetics/truvari/) introduced by GIAB. The closeness threshold bpDist and size similarity threshold sim between a prediction and the benchmark were 500 bp and 0.7, respectively. For example, assume a particular benchmark CSV \([\text{b.start}, \text{b.end}, \text{b.size}]\) and a prediction \([\text{p.start}, \text{p.end}, \text{p.size}]\); then a correct region-match should satisfy the following equations:
$$\max\left(\left|\text{b.start}-\text{p.start}\right|,\left|\text{b.end}-\text{p.end}\right|\right)\le \text{bpDist}$$
$$\text{b.size}\times \text{sim}\le \text{p.size}\le \text{b.size}\times (2-\text{sim})$$
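As an illustration, a minimal Python check of these two region-match criteria could look as follows; the tuple layout is an assumption for this sketch, and the actual evaluation used Truvari-style comparisons.

```python
def region_match(benchmark, prediction, bp_dist=500, sim=0.7):
    """Check the region-match criteria between a benchmark CSV and a prediction.

    Both arguments are (start, end, size) tuples; the thresholds follow the
    values used in this study (bpDist = 500 bp, sim = 0.7).
    """
    b_start, b_end, b_size = benchmark
    p_start, p_end, p_size = prediction
    close_enough = max(abs(b_start - p_start), abs(b_end - p_end)) <= bp_dist
    size_similar = b_size * sim <= p_size <= b_size * (2 - sim)
    return close_enough and size_similar

# Example: a 1,000 bp benchmark CSV and a slightly shifted 900 bp prediction match.
# region_match((10_000, 11_000, 1_000), (10_120, 11_020, 900))  -> True
```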
By comparison, the exact-match required not only a region-match but also the correct detection of all CSV subcomponents, including the subcomponent breakpoint types. Therefore, for a deletion-inversion that contains two subcomponents, i.e., an inversion and a deletion, the exact-match becomes a three-step evaluation:
1) Region-match between the predicted CSV and the benchmark deletion-inversion event.
2) For each subcomponent, examination of the breakpoint closeness and event size as well as the detected type.
3) A correct detection should pass both 1) and 2); a subcomponent match is counted if either the deletion or the inversion is correctly detected in 2).
In this study, we only considered INS, DEL, DUP and INV as subcomponent types in the evaluation. Benchmark CSVs without a matched prediction were counted as false negatives (FN), and predicted CSVs without a matched benchmark event were counted as false positives (FP). Based on the numbers of true positives (TP), FP and FN, we computed the precision, recall and F-score with the following equations, respectively.
$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
$$\text{F-score}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
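For completeness, these three metrics map directly onto a small helper; the zero-denominator guards are an addition for this sketch, while the formulas themselves are as above.

```python
def precision_recall_fscore(tp, fp, fn):
    """Compute precision, recall and F-score from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```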
Each caller was run with different minimum numbers of variant-supporting reads (i.e., 1, 3, 5 and 10), and the performance of detecting simulated CSVs was assessed accordingly (Supplementary Table S3, Supplementary Note).
Examining complex structural variant detection in NA12878. We used complex structural variants (CSVs) from NA12878 to assess the performance of SVision. The CSV set of NA12878 was obtained from supplementary tables 12 and 15 of a study conducted by the 1000 Genomes Project (1KGP)1, containing 62 deletion-associated and 251 inversion-associated CSV sites in hg19 coordinates. We aligned the HiFi reads of NA12878 released by the Human Genome Structural Variants Consortium (HGSVC)7 using ngmlr (v0.2.7) with the default setting for SVision SV detection. SAMtools was used to extract HiFi reads spanning the CSV loci, and Gepard24 was used to create dotplots between the HiFi reads and their corresponding reference sequences. We then manually inspected all dotplots associated with the reported CSV loci (Supplementary File 1).
Three-channel coding of the feature sequence. SVision takes a sequence alignment file (BAM) and a reference file as input. The encoder consists of two major steps, i.e., variant feature sequence selection and sequence coding. Variant feature sequences are directly identified from aberrant long-read alignments containing SV signatures, namely intra-read and inter-read alignments. Intra-read alignments are derived from reads spanning the entire SV locus, while inter-read alignments are obtained from reads covering larger SV events, which result in supplementary alignments. SVision identifies additional SV signatures by applying a k-mer based realignment approach to unmapped segments in the feature sequence, such as ‘I’ operations in the CIGAR string and gap sequences obtained from inter-read alignments. Then, the sequence differences and similarities derived from matched and unmatched segments between the variant feature sequence (VAR) and its corresponding segment on the reference genome (REF) are coded as an image (Supplementary Note). The image contains three channels, where (0, 0, 255), (0, 255, 0) and (255, 0, 0) code the matched, duplicated and inverted segments, respectively. Given this three-channel coding, SVision first creates the REF-to-REF image through k-mer realignment. For the VAR-to-REF image, matched segments obtained from the CIGAR string and supplementary alignments, originating from the aligner’s output, are directly used for image coding to reduce computational cost, and realignment results are then added to complete the image coding. The denoised image is obtained by subtracting the REF-to-REF image from the VAR-to-REF image. Because the background originates from the reference sequence context, the encoder subtracts the segments of the two images based on REF coordinates. Specifically, if segments from the two images overlap on the reference dimension and their difference is larger than 50 bp (the minimum SV report size), the encoder keeps the non-overlapping part of the segment in the similarity image, with coordinates determined by the VAR-to-REF image. Finally, the denoised image of each variant feature sequence is created and saved as a matrix, along with segment information tables, for further processing (Supplementary Note).
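The following numpy sketch illustrates the colour coding and the background subtraction at the pixel level; SVision's encoder actually operates on segment coordinate tables rather than literal pixel rectangles, so this is a simplified illustration with assumed data structures.

```python
import numpy as np

# Channel colours from the text: matched, duplicated and inverted segments.
MATCH, DUP, INV = (0, 0, 255), (0, 255, 0), (255, 0, 0)

def paint_segments(shape, segments):
    """Paint segments into a blank three-channel image.

    Each segment is (read_start, read_end, ref_start, ref_end, colour) and is
    drawn as a filled rectangle purely for illustration.
    """
    img = np.zeros((*shape, 3), dtype=np.uint8)
    for read_s, read_e, ref_s, ref_e, colour in segments:
        img[read_s:read_e, ref_s:ref_e] = colour
    return img

def denoise(var_to_ref, ref_to_ref):
    """Subtract the REF-to-REF background from the VAR-to-REF image.

    Pixels carrying the same colour in both images (reference sequence
    context) are cleared, keeping only VAR-specific signal; the 50 bp
    segment-level threshold described in the text is not modelled here.
    """
    same = np.all(var_to_ref == ref_to_ref, axis=-1, keepdims=True)
    return np.where(same, 0, var_to_ref).astype(np.uint8)
```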
Detecting CSVs from denoised images via tMOR. In principle, for each denoised image, the regions where VAR and REF are identical form a straight line, whereas SVs introduce discontinuous segments. These discontinuous segments, which indicate putative variants and their breakpoints in the denoised image, are surrounded by segment signatures; each such region is considered a breakpoint object and is further defined as a Segment-of-Interest (SOI). Since long reads are likely to span more than one variant in a denoised image, tMOR contains a two-step image segmentation process for SOI recognition. Specifically, tMOR first obtains so-called one-variant images from the denoised image based on the following steps (Supplementary Note).
1) Sorting and tagging. We sort all segments in the denoised image by their positions on the read in ascending order. Major segments are then defined according to the matched segments derived from CIGAR operations, while a minor segment should meet one of the following conditions:
Condition 1: the segment is derived from the hash-table based realignment.
Condition 2: the segment is inverted compared to the reference genome.
Condition 3: the segment is totally covered by another one.
2) Creating one-variant images. SVision partitions the denoised image into several one-variant images via sequential combination of the major segments. Specifically, each major segment, its neighboring major segment and any minor segments between them are used to create a one-variant image.
Afterwards, SVision clusters similar one-variant images by measuring the distance between their segment signatures, so that the one-variant images in a cluster support the same variant; the size of a cluster is termed the number of supporting images for that variant (Supplementary Note). Next, SVision collects SOIs from each one-variant image. Unlike traditional multi-object recognition, which uses complex algorithms to select regions of interest, the segment signatures in the one-variant image enable efficient SOI identification by sequentially combining the major and minor segments. The SOIs are then used as input for CNN prediction, and the interpreted SV types are given by the labels in the training set, including deletion (DEL), inversion (INV), insertion (INS), duplication (DUP) and tandem duplication (tDUP). The CNN assigns a probability score to assess the existence of variant subcomponents in the one-variant image (Supplementary Note).
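A minimal sketch of the partition step, assuming segments are already sorted and tagged as major or minor as described above (the dictionary layout is illustrative, not SVision's internal structure):

```python
def one_variant_images(segments):
    """Partition sorted, tagged segments of a denoised image into one-variant images.

    'segments' is a list of dicts sorted by read position, each carrying a
    'role' key ('major' or 'minor'); every pair of neighbouring major segments,
    together with any minor segments lying between them, forms one
    one-variant image.
    """
    majors = [i for i, seg in enumerate(segments) if seg["role"] == "major"]
    return [segments[left:right + 1] for left, right in zip(majors, majors[1:])]
```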
Creating CSV graphs from one-variant images. SVision uses a graph to unify the definition of different CSV types and provides a computational method to compare different CSV graph structures. To create a CSV graph \(G=(V,E)\), SVision first collects the node set \(V=V_{S}\cup V_{I}\cup V_{D}\) of \(G\). Specifically, \(V_{S}=\{S_{1},S_{2},\dots ,S_{n}\}\), \(V_{I}=\{I_{1},I_{2},\dots ,I_{m}\}\) and \(V_{D}=\{D_{1},D_{2},\dots ,D_{k}\}\), where \(n\), \(m\) and \(k\) are the numbers of skeleton nodes, insertion nodes and duplication nodes in the graph, respectively. Skeleton nodes are derived from the major segments in a one-variant image and the sequences between discontinuous major segments on REF (i.e., concordant segments between VAR and REF). Insertion nodes consist of the minor segments in a one-variant image, while insertion nodes with known origins are defined as duplication nodes, representing duplicated segments in the one-variant image. Moreover, each node \(v_{i}\in V\) is represented as a tuple \(v_{i}=(\text{Seq},\text{Pos},\text{Strand})\), which describes a segment in the one-variant image: \(\text{Seq}\) is the segment sequence, \(\text{Pos}\) is the position of the segment on VAR and \(\text{Strand}\) is the forward or reverse strand of the segment. The edge set of \(G\) is \(E=E_{ad}\cup E_{dp}\), where \(E_{ad}\) is a set of adjacency edges \(e_{ad}^{k}=(v_{k},v_{k+1})\) connecting two adjacent nodes \(v_{k}\) and \(v_{k+1}\), and \(E_{dp}\) is a set of duplication edges \(e_{dp}\) connecting a duplicated node with its known origin.
Given a graph \(G\), a CSV can be interpreted by visiting each node through the \(E_{ad}\) edges. For example (Extended Data Fig. 10a), the CSV path is interpreted as “S1+S3-S3-S4+”, where ‘+’ or ‘-’ indicates the direction in which a specific node is visited, i.e., the node \(\text{Strand}\). Specifically, nodes S1 and S4 are visited in the forward direction (+), while S3 is visited in the reverse direction (-), so that the full path would be “S1+S1+S3-S3-S4+S4+”; for simplicity, only intermediate nodes such as S3 are kept twice, whereas the start node (S1) and the end node (S4) appear once in the path. Comparing two graphs \(G_{1}=(V_{1},E_{1})\) and \(G_{2}=(V_{2},E_{2})\) is in general an NP-hard problem, but ordering the nodes along the reference simplifies it. SVision therefore first compares the numbers of edges and nodes between \(G_{1}\) and \(G_{2}\), and the two graphs are considered different if either number differs. If \(G_{1}\) and \(G_{2}\) have topologically identical paths in addition to the same numbers of nodes and edges, they are termed isomorphic CSV graphs, i.e., \(G_{1}=G_{2}\). If \(G_{1}\) and \(G_{2}\) have the same numbers of nodes and edges but differ in their paths, we further examine whether \(G_{1}\) and \(G_{2}\) share a symmetric topology (Extended Data Fig. 10b), since a variant might be identified on either the forward or the reverse strand, i.e., read from 5’ to 3’ or from 3’ to 5’. In particular, we create a mirror graph \(G_{1}^{\prime}\) of the original graph \(G_{1}\) and obtain a new path from \(G_{1}^{\prime}\); similarly, we create \(G_{2}^{\prime}\) from \(G_{2}\). We then cross-compare whether the paths of \(G_{1}^{\prime}\) and \(G_{2}\), as well as of \(G_{2}^{\prime}\) and \(G_{1}\), are topologically identical, and consider \(G_{1}\) and \(G_{2}\) isomorphic if both comparisons are equal. SVision keeps isomorphic graphs and symmetric graphs in two separate files, enabling searches for CSV events with the same structure. For each variant call, SVision reports all breakpoints in the ‘BKPS’ entry of the INFO field together with a type (‘SVTYPE’). For CSVs in particular, the breakpoints are reported with both their coordinates and the associated graph structure in the ‘BKPS’ and ‘GraphID’ entries, respectively; the ‘GraphID’ can be used to search for events of a specific graph structure in the isomorphic and symmetric graph output files. Moreover, SVision reports the graph breakpoints derived from the CSV Reference Graphical Fragment Assembly (rGFA) file in the ‘GraphBRPKS’ entry. Note that the ‘GraphID’ and ‘GraphBRPKS’ entries are only reported when the parameters ‘--graph’ and ‘--qnames’ are activated.
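As a sketch of the path comparison idea (not SVision's exact routine), paths can be compared after renaming nodes by order of appearance and mirrored to test for symmetric topology; the (node, strand) tuple representation below is an assumption for illustration.

```python
def mirror(path):
    """Mirror a CSV path: reverse the visiting order and flip every strand."""
    flip = {"+": "-", "-": "+"}
    return [(node, flip[strand]) for node, strand in reversed(path)]

def canonical(path):
    """Rename nodes by order of first appearance so that only topology matters."""
    names, canon = {}, []
    for node, strand in path:
        names.setdefault(node, f"N{len(names) + 1}")
        canon.append((names[node], strand))
    return canon

def isomorphic(path1, path2):
    """Treat two CSV paths as isomorphic if they are topologically identical,
    either directly or after mirroring (symmetric topology)."""
    if canonical(path1) == canonical(path2):
        return True
    return (canonical(mirror(path1)) == canonical(path2)
            and canonical(mirror(path2)) == canonical(path1))

# Example: the path "S1+S3-S3-S4+" as a list of (node, strand) tuples.
# isomorphic([("S1", "+"), ("S3", "-"), ("S3", "-"), ("S4", "+")],
#            [("S1", "+"), ("S3", "-"), ("S3", "-"), ("S4", "+")])  -> True
```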
Training data. The CNN model in SVision is trained with both real and simulated simple SVs of the types DEL, INV, INS, DUP and tDUP, to avoid the typically unbalanced numbers of SV types in real data. We obtained real SVs from NA19240 (4,282) and HG00514 (3,682) (Supplementary Note) by selecting calls supported by both PacBio CLR reads and Illumina reads17. In this integrated real SV set, we labeled SVs with the five rearrangement types mentioned above. We further used VISOR to simulate SV events with the parameters ‘-n 4000 -r 20:20:20:20:20 -l 1000 -s 500’ and simulated PacBio CLR reads. For all training SVs, one-variant images and SOIs were created as described in the sections above, leading to 75,000 SOIs (15,000 per type) in total, of which 50% were from real events. These SOIs were shuffled for subsequent CNN model training.
CNN model training. SVision adopts AlexNet, a widely used CNN model, to recognize sequence differences in similarity images. The AlexNet architecture consists of five convolutional layers and three fully connected layers. Specifically, the first convolutional layer loads images of size \(224\times 224\times 3\) and uses an \(11\times 11\times 3\) convolution kernel with stride 4. The last three layers are fully connected, taking inputs from the five preceding convolutional layers, and end with a five-class SoftMax layer. In the end, each input SOI is classified as INS, DEL, INV, DUP, tDUP or a mixed type for CSVs. We apply transfer learning to train the CNN with the 75,000 SOIs. First, the parameters of all layers in the CNN are initialized to the best parameter set achieved on the ImageNet competition. Afterwards, we fine-tune the parameters of the last three fully connected layers on our data using back-propagation and gradient descent optimization with a learning rate of 0.001. The loss function is defined as the cross-entropy between the predicted probabilities and the true class labels. Moreover, SVision’s CNN architecture is lightweight and has far fewer layers than complex CNN models such as ResNet and Inception V3, which makes fine-tuning highly efficient with a large batch size (default: 128) even on a CPU machine. To evaluate the trained CNN model, we apply ten-fold cross-validation, and the trained model at each round is applied to an independent test set of 7,500 SOIs derived from simulated SVs. Finally, SVision selects the model with the best performance (Supplementary Note).
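A minimal PyTorch sketch of this transfer-learning setup, assuming a standard torchvision AlexNet and a data loader yielding 224x224x3 SOI images in batches of 128; the loader and training schedule are assumptions of this sketch, not SVision's actual training code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load AlexNet with ImageNet weights, freeze the convolutional layers and
# retrain only the fully connected layers for the five SV classes
# (DEL, INV, INS, DUP, tDUP).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False                      # keep ImageNet conv filters fixed

model.classifier[6] = nn.Linear(4096, 5)             # five-class output layer

criterion = nn.CrossEntropyLoss()                    # cross-entropy vs. true labels
optimizer = torch.optim.SGD(model.classifier.parameters(), lr=0.001)

def train_epoch(loader):
    """One epoch of fine-tuning; 'loader' is assumed to yield (images, labels)
    batches of 128 SOI images."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```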
Quality score of discoveries. SVision uses a score function to measure the quality of each discovery based on consistency and prediction reliability derived from one-variant image clusters.
1) One-variant image consistency. Intuitively, the non-linear segments in a given one-variant image indicate potential differences between REF and VAR. We thus first compute a non-linear score for all images supporting each event, i.e., the one-variant images originating from a variant feature sequence cluster. The non-linear score of a one-variant image is calculated from its segment coordinates and lengths as follows:
$$\text{Nonlinear score}_{i}=\frac{\sum_{k}\left|k.\text{ref}_{\text{mid}}-k.\text{read}_{\text{mid}}\right|\times k.\text{length}}{\text{RefSpan}}$$
where the summation is over all segments \(k\) in image \(i\), and \(k.\text{ref}_{\text{mid}}\) and \(k.\text{read}_{\text{mid}}\) are the centers of segment \(k\) on the reference and the read, respectively. The summation is normalized by \(\text{RefSpan}\), the distance between the leftmost and rightmost reference coordinates of the similarity image. Finally, for an SV with \(M\) supporting images, we calculate the consistency score with the following equation:
$$\text{Consistency}=\frac{\text{Std}\left(\left\{\text{Nonlinear score}_{1},\dots ,\text{Nonlinear score}_{M}\right\}\right)}{M}$$
Accordingly, we expect a smaller consistency value for high-quality SV predictions.
2) Prediction reliability. This part evaluates the quality of the deep-learning prediction. The last layer in the CNN architecture is a SoftMax layer, which outputs the probability of the predicted class. Therefore, we use the average probability over all SOIs as the CNN reliability:
$$\text{Reliability}=\frac{\sum_{s}s.\text{softmax}\times 100}{\#\text{SOIs}}$$
where the summation is over all SOIs in a one-variant image. The reliability ranges from 0 to 100 because the SoftMax probabilities always range from 0 to 1. We expect higher reliability values for accurate SVs.
Finally, we combine the two features into a per-image quality and normalize the result to the range 0 to 100:
$$\text{qual}=\text{Consistency}+(1-\text{Reliability})$$
$$\text{Normalized score}=\left(1-\frac{\text{sum}\left(\text{Scores}\right)-\text{min}\left(\text{Scores}\right)}{\text{max}\left(\text{Scores}\right)-\text{min}\left(\text{Scores}\right)}\right)\times 100$$
where \(\text{Scores}=\{\text{qual}_{1},\dots ,\text{qual}_{M}\}\) and \(M\) is the total number of images supporting this variant.
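A sketch of how these scores could be combined in code, following the equations as written; it assumes reliability has been rescaled to [0, 1] before entering the qual formula (the text reports it on a 0-100 scale) and that per-image reliabilities and segment midpoints are available.

```python
from statistics import pstdev

def nonlinear_score(segments, ref_span):
    """Non-linear score of one supporting image; each segment carries its
    midpoint on the reference, its midpoint on the read and its length."""
    total = sum(abs(s["ref_mid"] - s["read_mid"]) * s["length"] for s in segments)
    return total / ref_span

def quality(images, reliabilities):
    """Combine consistency and per-image CNN reliability into a 0-100 score.

    'images' is a list of (segments, ref_span) pairs for the M supporting
    images and 'reliabilities' the matching SoftMax-derived values, assumed
    here to lie in [0, 1]; the normalization follows the 'Normalized score'
    equation as written above.
    """
    scores = [nonlinear_score(segs, span) for segs, span in images]
    consistency = pstdev(scores) / len(scores)        # Std({scores}) / M
    quals = [consistency + (1.0 - r) for r in reliabilities]
    lo, hi = min(quals), max(quals)
    if hi == lo:                                      # guard for identical quals
        return 100.0
    return (1 - (sum(quals) - lo) / (hi - lo)) * 100
```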
Analysis of CSVs detected from HG00733. SVision was run with default settings except for the parameters ‘-s 5 --graph --qname’. The HiFi reads of HG00733 were aligned to reference GRCh38 by ngmlr (v0.2.7) with the default setting. First, events detected by SVision in low-mapping-quality regions, centromeres, genome gap regions, etc. were excluded from the analysis. These regions were obtained from https://github.com/mills-lab/svelter/tree/master/Support/GRCh38 and the UCSC genome browser centromere annotation for GRCh38. We then applied the following steps to filter CSVs from the raw callset: 1) removing CSVs longer than 100 kbp; 2) removing CSVs without a complete graph representation, i.e., where the path ends with a node type other than ‘S’; and 3) for multiple CSVs at one site, keeping only the one with the largest number of supporting reads. SVision revealed two special complex structures, i.e., a structure consisting of nodes ‘S:2,I:2,D:1’ with path ‘S1+I1+I1+I2+I2+S2+’ and another consisting of nodes ‘S:2,I:1,D:1’ with path ‘S1+I1+I1+S2+’, which were visually confirmed as local targeted site duplication (Extended Data Fig. 10c) and tandem duplication (Extended Data Fig. 10d), respectively. Events with these two structures were also filtered out because they are considered simple events from a biological perspective. Afterwards, we used RepeatMasker and Tandem Repeats Finder (TRF) annotation files from the UCSC genome browser to annotate the CSVs that passed the filters, using the BEDtools25 intersect option. A repeat type was assigned if the CSV region overlapped a repeat element, without requiring a minimum size or percentage of overlap. For CSVs with multiple repeat types, the type with the largest overlap with the CSV was chosen. A CSV was annotated as STR if the repeat unit length was < 7 bp and as VNTR otherwise. Finally, we termed all CSVs outside VNTR/STR regions high-quality CSVs, which were further validated and used for downstream analysis. The CSV loci matched by PAV and the short-read callset were obtained with BEDtools without requiring a minimum overlap size. For the short-read data, a matched CSV locus was considered completely reconstructed if both the breakpoint positions and types matched the SVision report, and partially reconstructed if either the breakpoints or the types agreed with SVision’s prediction.
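A hedged sketch of the three filtering rules applied to the raw callset; the call dictionary fields (‘chrom’, ‘start’, ‘end’, ‘path’, ‘support’) are illustrative, not SVision's actual VCF keys.

```python
import re

def filter_csvs(csv_calls, max_len=100_000):
    """Apply the three CSV filters described above to a raw callset."""
    kept = {}
    for call in csv_calls:
        if call["end"] - call["start"] > max_len:                 # 1) drop CSVs longer than 100 kbp
            continue
        nodes = re.findall(r"([A-Za-z]+)\d+[+-]", call["path"])   # node types along the path
        if not nodes or nodes[-1] != "S":                         # 2) path must end with an 'S' node
            continue
        site = (call["chrom"], call["start"], call["end"])
        best = kept.get(site)
        if best is None or call["support"] > best["support"]:     # 3) keep the best-supported call per site
            kept[site] = call
    return list(kept.values())
```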
The PAV merged callset from 35 haplotype-resolved samples was used to explore the frequency of the CSV at CNTN5 (Supplementary Table S11). In addition, RNA-seq data of the precuneus and primary visual cortex from both control and disease samples were obtained from a recent study of Alzheimer’s disease18 to understand the potential functional impact of the CSV at CNTN5 (Supplementary Table S12). The paired-end RNA-seq data were aligned with HISAT2 using default settings, from which the duplicated exon signature could be observed from discordant read-pair alignments, i.e., read pairs aligned in reverse-forward orientation (Extended Data Fig. 8). The insertion-inversion-insertion event at chr9:74,283,222-74,283,473 detected by SVision was reported as an insertion with variant ID chr9-74283228-INS-1797 in a recent study conducted by the HGSVC7 (Supplementary Note). The inserted sequence was extracted from the HiFi assembly and aligned with BLAST against several primate genomes. Moreover, the chimpanzee and gorilla assemblies were mapped to GRCh38 with minimap2 and variants were called with PAV, from which the same insertion event was identified.
Validation of high-quality CSVs detected from HG00733. We validated 80 CSVs detected by SVision in HG00733 via 1) graph-based alignment; 2) contig-based visual confirmation; and 3) PCR and Sanger sequencing.
Graph-based alignment. For each CSV graph in rGFA format, we extracted the reads spanning the CSV locus with SAMtools and aligned them to the corresponding CSV graph via GraphAligner (v1.0.12) with the default setting. A CSV was successfully validated if a single ONT read could be aligned to the variant path specified in the rGFA file. We then counted the number of long reads covering the entire VAR path as the number of reads supporting the CSV event.
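A minimal sketch of this extraction-and-alignment step, assuming BAM input and the GraphAligner ‘vg’ preset; the region, file names and preset choice are placeholders rather than settings reported in this study.

```python
import subprocess

def align_reads_to_csv_graph(bam, region, rgfa, out_gaf):
    """Extract reads spanning a CSV locus and align them to the CSV rGFA graph.

    'samtools view' pulls the region reads, 'samtools fasta' converts them,
    and GraphAligner aligns them to the graph; all paths here are placeholders.
    """
    reads_fa = out_gaf + ".reads.fa"
    with open(reads_fa, "w") as fa:
        view = subprocess.Popen(["samtools", "view", "-b", bam, region],
                                stdout=subprocess.PIPE)
        subprocess.run(["samtools", "fasta", "-"], stdin=view.stdout,
                       stdout=fa, check=True)
        view.stdout.close()
        view.wait()
    subprocess.run(["GraphAligner", "-g", rgfa, "-f", reads_fa,
                    "-a", out_gaf, "-x", "vg"], check=True)

# align_reads_to_csv_graph("HG00733_ont.bam", "chr1:1000000-1005000",
#                          "csv_graph.rgfa", "csv_graph.gaf")
```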
Contig-based visual confirmation. To examine the internal structure of CSVs, the phased assembly specified by PAV (v1.1.2, TIG_REGION column) at the reported variant region was used for further analysis. We first extracted the contig sequence harboring the variant based on the coordinates provided in ‘PAV_TIG_REGION’ (Supplementary Table S8). For example, the variant-containing sequence was extracted from the h1 assembled genome for ‘1|1’ and ‘1|0’ genotypes, and from the h2 assembled genome for ‘0|1’. To validate CSV structures containing complex insertions, we extended the CSV region by 5 kbp both upstream and downstream when extracting the reference sequence via the BEDtools getfasta option, so that the origin of the inserted sequence could be identified. Afterwards, Gepard was used to create a dotplot of the contig sequence (y-axis) against the reference sequence (x-axis) for each CSV locus. Based on each contig dotplot, the manual validation applied two tiers of criteria: 1) whether the reported region contains a variant; and 2) whether the structure reported by SVision is identical to that revealed by the dotplot. A CSV was considered completely reconstructed if both 1) and 2) were satisfied; otherwise it was considered inconclusive.
PCR and Sanger sequencing. We first determined that about half of the 80 CSVs (39/80) were intractable for PCR because of their location within segmental duplications, the size of the amplicon needed to validate the rearrangement, or the simple-repeat nature of the rearrangement. We then randomly selected 20 of the remaining rearrangements and performed BLAT on the local region from the HG00733 assembly data. We next attempted to amplify each of the 20 CSVs by PCR. Briefly, we designed primers flanking the CSV or flanking breakpoints within the CSV for each of the 20 events (Supplementary Table S14). Next, we attempted to amplify each region using Takara LA Taq. We obtained the predicted band size for 12 of the 20 variant loci; the remaining 8 regions did not amplify in 3 separate attempts with altered PCR conditions and template amounts. All PCR products were Sanger sequenced and validated as on target, containing the correct amplicon with the breakpoint from the assembly and the SVision call.