Dual optical map
Figure 1a illustrates a representative DOM image of a DNA fragment, which is composed of two distinct microscopic images. The first, shown in green, is a sequence-specific labelled optical map. We utilized a commercialized kit replete with enzymes and reagents obtained from BioNano Genomics (Fig. 1b). The DLE-1 (direct labelling enzyme) is based on methyltransferase 10, 11, which can prevent DNA breaks induced by processive polymerase reactions during nick-translation 8. Although barcode-like optical mapping has been widely used for genome analysis, it has an inherent limitation: the label-to-label distance is so large that a substantial sequence information between adjacent labels can be easily omitted. For example, the magenta arrow in Fig. 1a corresponds to 93.2 kb, containing 91 genes in the E. coli K-12 MG1655 genome (NC_000913.3) (Supplementary Fig. 1). The distance between closely labeled spots is 3.1 kb (the yellow arrow in Fig. 1a), which still includes five genes in the genomic region (Supplementary Fig. 1). Therefore, such information might be missing when using solely barcode-based optical mapping. In addition, experimental errors like false positives and negatives are also problematic, as this approach utilizes enzymes to cleave or label DNA, which can lead to the bypassing of enzymatic reactions targeting specific sequences. Moreover, technical constraints such as image noise or the bleaching of fluorescent dyes could affect the accurate interpretation of OM sequence data.
To address these limitations, DNA profile mapping technologies has been introduced 11, 15, 18, 23. Profile mapping generates continuous fluorescent intensity signals, which can directly reflect DNA sequence. We have previously obtained sequence information from the large DNA of the E. coli genome employing TAMRA-conjugated pyrrole octamer 18. This approach demonstrates fluorescence intensity that correlates with the A/T base pair ratio. To leverage the improved characteristics of fluorophores, here we employed AP8, ATTO-647N conjugated pyrrole octamer (Fig. 1c) to generate an AT-profiled DNA image 24, 25. The fluorescence-based AT profile offers a distinct advantage in providing more comprehensive sequence information compared to sparsely labeled sequence beacons.
Pyrrole polyamides in AP8 are known to bind weak bases (represented as W, denoting A or T) within the minor groove of the DNA double helix 26. To characterize the binding of AP8 capability, we conducted a fluorescence resonance energy transfer (FRET) experiment using AP8 and fluorescein amidite (FAM)-conjugated DNA oligomers, varying the number of W bases as shown in Fig. 1d. Upon excitation at 498 nm, FAM (acting as the donor) transfers its resonance energy to ATTO-647N (functioning as the acceptor). The graph reveals that the FRET efficiency starts increasing with W4 and reaches a plateau when W is repeated six times (5’-ATATAT-3’). In comparison to the W6 FRET intensity, W4 demonstrates a 5.5-fold reduction in intensity (18.1%), while W5 shows a 2.3-fold decrease (43.9%). These results align with earlier studies: dimeric pyrrole-imidazole trimers select five base pairs 27 and dimeric pyrrole-imidazole tetramers selectively bind to six base pairs 28. Thus, AP8 is expected to bind six base pairs because AP8 comprises two pyrrole-tetramers with a beta-alanine linkage at the center. Subsequently, we assessed the influence of guanine (G) within the hexamer binding sites and observed a reduction in binding affinity in contrast to W6, FRET intensity for 5’-ATAGAT-3’ was reduced by 4.2-fold (24.1%). The introduction of an additional G in the hexamer, as seen in sequences like 5’-ATGTAG-3’ and 5’-ATAGAG-3’, resulted in decreased FRET intensities by 11.9- (8.4%) and 10.2-fold (9.8%), respectively. Note that the sequence 5’-ATGTGC-3’ exhibited mere binding affinity, showing a 48.9-fold (2.0%) difference in discrimination compared to W6. Taken together, these results suggest that W6 predominantly functions as the binding site for AP8, while displaying a lesser binding affinity for other sequences containing W.
AT Frequency Reference Map from the Genome Sequence
To accurately align DOM images with the genome, we first sought to develop an in silico referencing method. Given our findings for the binding characteristics of AP8, we simply calculated the frequency of W6 in the sequence by dividing it into 200 bp segments, considering the pixel resolution of our microscopic system. Figure 2a illustrates the in silico W6 frequency graph of the bacteriophage λ genome (48.5 kb) with an image-like DNA map for W6 and W1 profiles and displays the fluorescent microscope images of λ DNA molecules stained with AP8. Despite our efforts to achieve uniform full-length stretching using a recently described method 25 that involves tethering, elongating, drying, and immobilizing the DNA molecules, there are still variations in fluorescent intensity profiles among the captured DNA images. For example, AP8-stained λ1 DNA exhibits a dim spot in the 6.3 kb region (indicated by the yellow arrow), which is absent in λ2. Additionally, λ5 displays a smoother fluorescent profile at the 30 kb region (marked by the cyan arrow) compared to other images. These results suggest that inherent characteristics of single-molecule experiments persist in optical mapping experiments.
Therefore, we assumed that simple binning of an in-silico map does not convey the exact binding modality of AP8 to large DNA molecules. To this end, we conducted computer simulations for AP8 binding, taking molecular characteristics into account. We hypothesized that when AP8 randomly binds to W6 DNA, it might disturb the adjacent binding due to steric hindrance. We also considered the potential binding occurrence of AP8 to other repetitive Ws, from W1 to W5, for instance, requiring consideration of numerous molecular binding capabilities. Given the assumption that AP8 putatively occupies around 10 base pairs with W6 at the center, we simulated AP8 binding events to the in silico λ DNA sequence using the data of Fig. 1d. We generated individual image-like DNA maps from the random binding events simulation (denoted as SIM01 ~ SIM10) and applied a 2D Gaussian point spread function to the map. As a result, we obtained each simulation outcome, which brings resemblance to one another yet shows subtle differences (Fig. 2b). Further, we averaged the results of ten individual simulated images to construct a reference consensus map for AP8-stained DNA molecules (SIM Consensus in Fig. 2b).
Next, we sought to investigate whether our simulation-based consensus map is more accurate than the sequence frequency-based in silico map. So, we computed Pearson cross-correlation (cc) values between the AP8 intensity profile and the AT profile map to score their resemblance. As shown in Fig. 2c, the cc values for 133 molecule images of λ DNA were calculated compared to W6 (black), W1 (red), individual simulation results (SIMs), and the SIM consensus map (green). The average cc value obtained using λ DNA image clips and W1 from the 200-bp map was 0.880 ± 0.00258, while with W6, it was 0.822 ± 0.00283. Notably, the cc value between DNA images and our SIM consensus map increased to 0.900 ± 0.00288 (Fig. 2c). When we compared individual experimental data, the cc values obtained using SIM consensus map were higher for 88% of molecules (117/133) compared to those using the W1 reference map, which was used in a previous study18. Moreover, there was an improvement for 97.7% of molecules (130/133) when compared to the values from the W6 reference map (Supplementary Fig. 2). These results suggest that we could obtain a more accurate AT frequency reference map through molecular binding simulations.
Encouraged by these results, we expanded our analysis target to unknown DNA fragments from E. coli genome. It is essential to possess an accurate sequence for the cell used in this experiment, so we conducted Illumina sequencing on our lab strain of E. coli via a sequencing company (DNA Link, Seoul, Korea). Compare to the standard sequence of MG1655 (GenBank U00096.3: 4,641,652 bp, https://www.ncbi.nlm.nih.gov/nuccore/545778205), the genome of our lab strain is 4,629,724 bp long, exhibiting a 99.4% alignment to the standard with some variations. In this study, all experimental data was aligned with the simulation map derived from our sequencing result. Complete sequence information for the lab strain can be found in the Supplementary Information (MG1655_Lab.fasta). Utilizing this sequence, we repeated the simulation procedure 10 times, averaging the results to produce an in silico consensus profile map. As illustrated in Fig. 2d, the cc value between the experimental data example and the SIM consensus map was found to be 0.741. In contrast, the cc value with W6 frequency map yielded only 0.633, and that with W1 frequency map produced 0.656. Note that these scores are considerably lower when compared to the improved SIM comparison approach utilized in this study.
Alignment DOM image to the SIM consensus map
We aimed to align dual-color DNA images with the SIM consensus map of the E. coli genome. To achieve this, we developed a Python program, DOM.py, depicted in Fig. 3. This program calculates the cc_rg value, derived from the product of two cc values: one from the red AT profile (cc_r) and the other from the green barcode-based alignment (cc_g). Figure 3a displays the aligned position on the genome with the reference map using green barcode markers, and Fig. 3b presents the red AT profiles. Because the E. coli genome is circular, the reference map includes a dark-colored region. This is done by appending the front region of the genome to the end of the reference map, preventing misalignment at the terminus (as shown by the dark color in Fig. 3a and 3b). DOM.py offers candidates based on the highest cross-correlation (cc_rg) values, and also calls Maligner 21 and OMBlast 29, the data shown in Fig. 3a. Using green barcode labels, Maligner proposes ten candidates with corresponding scores, and OMBlast indicates a single location with the highest likelihood. In the example given, all three methods pinpointed the same genome location (4.257 Mbp with 194.8 kb).
Figure 3c presents the dual-color DNA image juxtaposed with the in silico dual-color image, and Fig. 3d delineates their associated intensity graphs. The program further provides comprehensive details at the bottom, listing information such as filename, DNA length, alignment location, cc_rg, cc_r, cc_g, stretching scale, ranking, and direction. Given that surface-immobilized DNA molecules manifest varying stretch extents, DOM.py accounts for potential under- or over-stretching, ranging between 88% and 123%. To optimize computational efficiency, especially considering the multiple calculations involved, we implemented multiprocessing for parallel computing. Additionally, we employed the binary compile option (numba.njit) when writing DOM.py. This approach substantially cut down processing time; the software can process 182 images of the E. coli genome in merely 18 minutes, a stark contrast to the over 20 hours needed without parallel computing. In summary, Fig. 3 demonstrates a 194.8 kb DNA molecule alongside its aligned map, detailing a profile comparison where the cc_rg stands at 0.265 at a 99% stretch. The red profile depicted in Fig. 3d aligns closely with the violet in silico profile derived from the SIM consensus map. It's noteworthy that, for the green barcode labels, this DNA still presents one false positive and misses five labels from the anticipated twenty (+ 1/-5/20).
In Fig. 4, our goal was to evaluate the enhanced genome alignment accuracy of DOM.py by contrasting it with existing optical mapping analysis tools like Maligner and OMBlast. Often, barcode-like markers alone are inadequate for pinpointing the exact location. As an illustration, Maligner identified the top match at 0.65 Mbp, labeled as Maligner1 in Fig. 4a. However, this alignment fails to adequately align with the AT frequency, leading to a meager cc_r value of 0.006. The reason for such a low cc_r is depicted in Fig. 4b. Contrarily, DOM.py identifies 2.3 Mbp as the best match based on cc_rg, even though Maligner ranks it as the fifth-best match (Maligner5). Figure 4c displays the molecular images along with the related graphs, elucidating why Maligner placed this molecule fifth. Despite the presence of certain false positives and negatives in Fig. 4c's barcode map, DOM.py successfully maps the DNA segment to a specific genomic region, boasting high values for both cc_r (0.493) and cc_g (0.409). On the other hand, OMBlast couldn't determine the position for this particular molecule. From our set of 182 DNA images, OMBlast could only correctly identify the alignment position for 74 molecules, a success rate of 41%.
Complete DOM of E. coli genome
Finally, we aimed to produce a complete dual optical map of the 4.6 Mbp long E. coli MG1655 genome. Figure 5a depicts 178 out of the 182 DNA image clips we acquired, all successfully aligned to the SIM consensus map. Notably, a subset of just 34 DNA image clips provided full coverage of the optical map (Supplementary Fig. 4). Out of the 178 successful alignments, 177 were based on the cc_rg values; only one DNA image was exclusively aligned using Maligner, as highlighted in Fig. 5b. One primary factor for this misalignment via cc_rg might be the aggregation of the initial 20 kb of DNA, creating a pronounced red profile, thereby yielding a diminished cc_r value of 0.166. A contrasting scenario is presented in Fig. 5c, where the green labels appeared only at eight of the 16 predicted sites (indicated by the magenta box in both Fig. 5a and 5c, 8/16 = 50.0% efficiency). Such a pattern complicates alignment through Maligner or OMBlast. However, the pronounced red profile has a high matching cc_r value of 0.747, leading to a cc_rg value of 0.331 when paired with the SIM consensus map. These two distinct cases possibly elucidate the four DNA molecules that remain unaligned. In instances where DNA is inadequately labeled as well as unproperly stretched, it becomes highly challenging to map the DNA image against the SIM consensus map.
Upon scrutinizing the alignment of the 178 DNA molecules, Maligner aligned with 131 image clips, while OMBlast matched with 74. Of the 131 images identified by Maligner, the majority (115) received the top matching score, leaving 16 ranked second or in subsequent positions, as demonstrated in Fig. 4. Across our entire dataset, we identified 218 (6.9%) false positives and 1066 (33.9%) false negatives. Our observed mean DLE-1 labeling efficiency at predicted sites was 77 ± 1.8% (refer to Supplementary Fig. 3). This efficiency resonates with past research, where methyltransferase efficiencies of 78% and 75% were respectively reported 30, 31.