A novel quantum algorithm for Biological Sequence Alignment using Quantum Accelerated Mapping in Seed-and-Extend Technique.

doi:10.21203/rs.3.rs-4305700/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Abstract

A novel quantum algorithm for biological sequence alignment is presented and analyzed. The large amounts of data extracted from genome sequencing, de novo assembly sequencing, resequencing, and transcriptome sequencing at the DNA or RNA level call for higher computing power as well as more sophisticated alignment methods. Modern, faster sequencing techniques in genomics have led to a reconsideration of current methods for designing and implementing alignment protocols. Novel quantum computing accelerators may provide drastic solutions in this field once they reach the desired levels of gate operation maturity. This paper proposes a computer vision-based approach that uses the power of entanglement in a dot-matrix to address the high demand for fast processing of biological data. A quantum-accelerated protocol is demonstrated and tested using IBM’s Qiskit software framework. Runtime tests support the expectation of a potentially advantageous sequence alignment process in terms of accuracy, completeness, and computing complexity. The performance has been tested under various conditions and promises a viable advantage.

Biological sciences/Computational biology and bioinformatics

Physical sciences/Physics/Quantum physics/Quantum information

Physical sciences/Physics/Quantum physics/Quantum simulation

quantum sequence alignment

circuit model of quantum computation

seed-and-extend

linear pattern recognition

quantum Hamming distance.

1. Introduction

In string metrics and string searching, comparing two strings and estimating their degree of similarity is a task with applications in many scientific areas. In bioinformatics and molecular biology, this process is known as sequence alignment (SA) and underpins several key areas, such as evolutionary relationships, functional annotation, comparative genomics, drug design and development, disease studies, database search, and research in molecular and structural biology. At the same time, genomic data are being generated at a massive rate by faster and more diversified Next Generation Sequencing (NGS) technologies, so that biodata analytics now require more efficient and powerful computing machines than ever before. Current computing capacity does not yet meet this need.

Since the milestone algorithms of Needleman-Wunsch (N.W.) in 1970 and Smith-Waterman (S.W.) in 1981 [1-3], several heuristic strategies have been devised to accelerate the SA procedure, such as the FASTA and BLAST series [4-5]. Dynamic programming provides optimal solutions but remains time-consuming. When NGS machines established the splitting of a genome into smaller fragments, the well-known “reads”, they in effect dictated the form of modern aligners, giving rise to the “short-read aligners”. Most of these adopted the popular seed-and-extend strategy, adjusted to the specifications of NGS. More advanced tools then appeared, such as BWA [6], HISAT [7], and the MUMmer series [8-9]. However, in the era of big data, the need for even stronger alternatives has led to unconventional ideas for hardware systems based on optics [10] and quantum computing technologies [11-12]. Both are still at an embryonic stage.

In this project, a previously reported algorithmic idea [12] is re-examined and transformed into a functional pairwise SA (PSA) system implemented by quantum computing means. The initial idea was to create a quantum dot-plot and apply the quantum Fourier transform (QFT) to detect regions of high similarity. Although the QFT is an exceptional mathematical tool for detecting similarities [13], further experimentation on 4-letter biological data (DNA/RNA) ran into inaccuracies and pattern detection difficulties when multiple diagonals exist in the same dot-plot window. This problem requires extra segmentation and examination of the window in an inefficient brute-force fashion, without any guarantee of detecting the exact pattern, as noted in [12]. The concept is now partially revised to provide more accuracy. The idea of the dot-matrix is retained, but the QFT is replaced by a Hamming distance calculator. A series of XOR operations isolates specific areas of possible similarity, and a counting algorithm validates similarities against a pre-specified threshold that works as a cut-off. Once acceptable similarities are detected, they can be evaluated further to provide more alignment details, if required. This final step may be implemented either classically or by quantum means. When the system is part of a larger classical system, it may be viewed as a hybrid, quantum-accelerated solution.

The proposed system is tested using OpenQASM v3.0 [14] within the Qiskit ecosystem [15], and it is transpiled to a possible hardware implementation that is, in theory, acceptable to the IBM gate-based machines at our disposal. The applicability of the proposed system is discussed for different operational modes and SA conditions.

2. The Proposed Sequence Alignment Protocol

The well-known seed-and-extend strategy is adopted by the overwhelming majority of read-aligners, so this study focuses solely on it. The strategy was initially proposed in the FASTA SA tool. A typical protocol of this type has four phases: data preparation, seeding, filtering, and extending. The proposed modification aims to skip the preparation phase, with its construction of large indexing structures, and focuses on the seeding phase, which runs a quantum computing vision technique. The remaining two phases may be applied classically to refine the result, although alternative QC approaches may replace them as well. The whole hybrid protocol is presented and evaluated through a multifaceted complexity analysis.

2.1 System Architecture

The prototype system architecture of a typical read-aligner is depicted in Figure 1. It includes the read-mapping mechanism QMap, which contains the quantum-accelerated part developed in this research work.

According to the user’s settings, a dataset manager fetches the selected reference genome sequence from a bio-database, together with a list of query reads, and initiates the PSA. The segmentation scheduler then finds the optimal number of segments into which to slice the reference sequence, in accordance with the user’s settings. The sliced segments, also referred to as “sliding windows”, may be equal to or larger than the read length. Since QMap is executed once per sliding window, the scheduler calculates how many QMap runs are required to map a single read completely onto the reference. Scheduling policies are discussed in detail in the Supplementary material.
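The scheduler's run count can be illustrated with a minimal sketch. It assumes fixed-size windows advanced by a constant stride; the function names, the `stride` parameter, and the handling of the last (possibly shorter) window are illustrative assumptions, since the actual scheduling policies are described only in the Supplementary material.

```python
from math import ceil
from typing import List, Optional


def window_starts(ref_len: int, window: int, stride: Optional[int] = None) -> List[int]:
    """Start offsets of the sliding windows that cover a reference of length ref_len.

    stride defaults to the window size (non-overlapping windows); a smaller
    stride produces overlapping windows. Both options are assumptions.
    """
    stride = window if stride is None else stride
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if ref_len <= window:
        return [0]
    return [i * stride for i in range(ceil((ref_len - window) / stride) + 1)]


def slice_reference(reference: str, window: int, stride: Optional[int] = None) -> List[str]:
    """Slice the reference into windows; the last window may be shorter."""
    starts = window_starts(len(reference), window, stride)
    return [reference[s:s + window] for s in starts]


# One QMap run is needed per window, so the run count is the number of windows.
segments = slice_reference("ACGTACGTACGTACGTACGT", window=8)
qmap_runs = len(segments)
```

With non-overlapping windows, a match that straddles a window boundary is split across two windows, which is one reason an overlapping stride might be preferred in practice.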

When scheduling is complete, the scheduler continuously feeds QMap with the formed query datasets. Here, a dataset is a pair of strings: the read and a sliced segment of the reference. QMap is responsible for detecting consecutive similarities of base pairs (bps) within the sliced segments. This quantum routine generates the dot-matrix and tries to detect similarities (dots) on the diagonals. A final refinement may be applied to the stitch regions that bear similarities; these regions are illustrated in a heatmap graph. The result may be improved further with the aid of the S.W. or N.W. algorithms to include indel corrections, which are not addressed by the quantum algorithm.
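As a classical point of reference for what QMap computes, the dot-matrix and its per-diagonal dot counts can be sketched with NumPy. This is not the quantum routine itself; the function names are illustrative assumptions.

```python
from typing import Dict

import numpy as np


def dot_matrix(read: str, segment: str) -> np.ndarray:
    """Binary dot-matrix: cell (i, j) is 1 when read[i] equals segment[j]."""
    r = np.frombuffer(read.encode("ascii"), dtype=np.uint8)
    s = np.frombuffer(segment.encode("ascii"), dtype=np.uint8)
    return (r[:, None] == s[None, :]).astype(np.uint8)


def diagonal_counts(dots: np.ndarray) -> Dict[int, int]:
    """Number of dots on every diagonal, keyed by offset k = j - i."""
    rows, cols = dots.shape
    return {k: int(np.trace(dots, offset=k)) for k in range(-(rows - 1), cols)}


dots = dot_matrix("ACGT", "TACG")
counts = diagonal_counts(dots)
# The read is the segment rotated by one position, so three dots fall on
# the diagonal with offset +1 and one dot on the diagonal with offset -3.
```

A long run of dots on one diagonal corresponds to an ungapped stretch of matching bases, which is the pattern the diagonal scan looks for.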

Important parameters that affect the overall performance include the type of data to be aligned, the input dataset format (e.g. FASTA), the mapping scheme, the size of the queries, the adopted sliding window, and the cut-off thresholds. As proteins are built from twenty amino acids while DNA contains four different bases, protein sequences tend to align better than DNA/RNA sequences, because the 'signal-to-noise ratio' in protein sequences is much better. The format follows FASTA specifications, and the mapping scheme considers a list of queries (reads) to be mapped onto a very large reference sequence. Longer reads are detected more easily than short reads, and larger sliding-window frames accelerate the process at the cost of reduced sensitivity. Moreover, larger thresholds speed up the process because shorter diagonals are skipped.
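The signal-to-noise remark can be made concrete with a back-of-the-envelope estimate. Assuming, purely for illustration, that letters are independent and uniformly distributed, two random positions match with probability 1/k for an alphabet of size k, so the expected number of background dots in a window is its area divided by k. The function name is hypothetical.

```python
def expected_random_dots(rows: int, cols: int, alphabet_size: int) -> float:
    """Expected number of chance matches in a rows x cols dot-matrix.

    Assumes independent, uniformly distributed letters (an idealisation;
    real sequences have biased composition).
    """
    return rows * cols / alphabet_size


dna_noise = expected_random_dots(16, 16, 4)       # four bases
protein_noise = expected_random_dots(16, 16, 20)  # twenty amino acids
```

Under this idealisation a 16x16 DNA window carries five times as many chance dots as a protein window of the same size, so true diagonals stand out less clearly in DNA/RNA data.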

2.2 Circuit Design of Accelerator

A possible design for the QMap quantum routine is the circuit model shown in Figure 2. It consists of two basic sub-circuits. The first generates the dot-matrix for the pairwise alignment, while the second detects regions that bear significant similarity. The first sub-circuit is executed once per sliding window, while the second is executed multiple times per sliding window. The number of repetitions of the second sub-circuit depends on several parameters: (a) mainly the design of the circuit, (b) the similarity threshold as a function of the size of the query read and/or the sliding window, (c) the available resources of the system, which may set some bounds, (d) the adopted fragmentation policy of the reference, (e) the adopted policy for scanning the matrix by single or double diagonal vectors, and (f) the available room for computing parallelism. Reducing the number of repetitions significantly accelerates the whole mapping procedure.

The first sub-circuit takes two sequences as input and generates the dot-matrix plot, which the second sub-circuit checks for long diagonals. Once the user has set the acceptable threshold of dots per single diagonal, the second sub-circuit detects the diagonals that meet it. The upper series of Hamming distance sub-circuits searches the upper triangle of the dot-plot, while the lower series searches the lower triangle. The results are recorded when the threshold condition is satisfied.

3. Evaluation

3.1 Mapping Accuracy and Completeness

Any evaluation of an alignment system should include mapping correctness and completeness. Both measures are applied to estimate the actual performance of the proposed logic. Since the hardware requirements of the proposed SA circuit model cannot yet be fully satisfied, the SA protocol was tested for accuracy and completeness through simulations on realistic data. In addition, a matching threshold was used when employing the classical algorithm for comparison with the quantum implementation. This threshold is the minimum number of matches that a reference string should have when compared with a smaller read string. The results are visualized as a heatmap in which a reference string is matched against multiple read strings. The reference string is divided into segments of the same length as the read string, and each segment is compared in turn to the read string. Figure 3 (a) presents an indicative heatmap for an 8x8 match, where black regions indicate a match. Figure 3 (b) presents a larger heatmap, where a black region indicates that the corresponding segment of the reference string has a matching letter with the read string used. The reference string was provided by bioinformatics experts and represents a COVID-19 genome. The results were validated and deemed satisfactory in terms of accuracy and completeness, as they were consistent with the results of other professional classical alignment tools such as BLAST or QMap.
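The classical comparison described above can be sketched as follows. The sketch assumes a position-by-position comparison of each reference segment with the read, reads of equal length, and a trailing partial segment that is simply dropped; these choices and the function names are illustrative assumptions rather than the authors' exact procedure.

```python
from typing import List

import numpy as np


def segment_match_counts(reference: str, read: str) -> List[int]:
    """Matches between the read and each read-length segment of the reference.

    Segments do not overlap; a trailing partial segment is ignored.
    """
    n = len(read)
    return [
        sum(a == b for a, b in zip(reference[start:start + n], read))
        for start in range(0, len(reference) - n + 1, n)
    ]


def match_heatmap(reference: str, reads: List[str], threshold: int) -> np.ndarray:
    """Rows are reads, columns are reference segments.

    A cell is 1 when the number of matches reaches the threshold.
    All reads are assumed to have the same length.
    """
    return np.array(
        [[int(c >= threshold) for c in segment_match_counts(reference, read)]
         for read in reads],
        dtype=np.uint8,
    )


heatmap = match_heatmap("ACGTACGA", ["ACGT", "ACGA"], threshold=4)
```

Lowering the threshold from 4 to 3 in this toy input would mark every cell, which illustrates how the cut-off trades sensitivity against specificity.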

3.2 Complexity Analysis

3.2.1 Computing Resources Estimations

A detailed analysis of the proposed circuit model was conducted to estimate the overall memory consumption in terms of logical qubits. The analysis is presented in the Supplementary material. The conclusion is that the required memory increases by 13 qubits each time the exponent ν of the 2^ν × 2^ν dot-plot window size increases by one. This linear increment allows exponentially more biodata to be embedded, which is a strong benefit for the width of the model circuit.

A second detailed analysis was conducted for the depth of the circuit; it is also presented in the Supplementary material, but only for logical gates. Multi-controlled gates are not decomposed and, as is usually the case, the prevalent gate is the multi-controlled NOT (MCN) gate. The Novel Enhanced Quantum Representation (NEQR) encoding scheme required 512 logical MCN gates for a window of size 256x256, and this number appears to double for each unit increment in ν.
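The two scaling statements can be tabulated with a short sketch. It uses only the figures quoted above (13 extra qubits per unit of ν, and 512 MCN gates at ν = 8 doubling per unit of ν); extrapolating the doubling beyond the reported window size is an assumption, and the absolute qubit count is not reproduced here because it is given only in the Supplementary material.

```python
def window_cells(nu: int) -> int:
    """Number of cells in a 2^nu x 2^nu dot-plot window."""
    return (2 ** nu) ** 2


def extra_qubits(nu: int, nu_ref: int) -> int:
    """Additional logical qubits relative to a reference window exponent."""
    return 13 * (nu - nu_ref)


def mcn_gates(nu: int) -> int:
    """Logical MCN gates for NEQR encoding, anchored at 512 for nu = 8.

    The doubling per unit of nu is extrapolated (assumption) and is only
    applied for nu >= 8.
    """
    if nu < 8:
        raise ValueError("only extrapolated upward from the reported nu = 8")
    return 512 * 2 ** (nu - 8)


for nu in (8, 9, 10):
    print(nu, window_cells(nu), extra_qubits(nu, 8), mcn_gates(nu))
```

Going from ν = 8 to ν = 10 multiplies the number of cells by 16 while adding 26 qubits, which is the linear-versus-exponential contrast the text refers to.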

Generally, it was found that in architectures with limited qubit connectivity, such as the transmon-based qubits of IBM’s quantum platform, the circuits based mostly on QPLA techniques (i.e., the encoder and the XOR gates) have a comparatively larger depth than the counting circuit, which is composed mostly of Grover operators and a QFT sub-circuit. This occurs even though the counting circuit is significantly larger in the QASM gate set. The explanation is that QPLA circuits usually demand many swaps and rearrangements of qubits when transpiled, whereas the counting circuit behaves more locally, with the Grover operator and the QFT sub-circuit being applied to the same qubits.

3.2.2 Runtime Estimation

A rough calculation of the execution time of the logical QMap circuit is attempted to obtain a theoretical estimate of the time complexity. The QMap circuit execution is considered as a composition of one run of Subcircuit-A and an average number of runs of Subcircuit-B. The average number is taken at a threshold of 50%, with the respective diagonals for each window size. An acceptable runtime simulation requires attributing time units to each gate considered in the model circuit in Figure 4. The time-unit attribution is inspired by the QC systems of IBM’s superconducting platforms. The adopted logic gate-set covers all the operations in the proposed protocol. Due to the general immaturity of QC technology, the estimations were conducted on the basis of logical gates. However, the transpiled versions of both Subcircuits A and B are provided in the Supplementary material. The IBM QASM transpiler was used to generate their optimized versions and to demonstrate how they scale with the size of the circuit.
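The composition described above can be written as a simple cost model. The sketch below assumes gates execute one after another (so it gives an upper bound that ignores parallel layers); the gate names, counts, and time units are placeholders chosen for illustration and are not the values used in Figure 4.

```python
from typing import Dict


def circuit_time(gate_counts: Dict[str, int], gate_time: Dict[str, float]) -> float:
    """Sum of (gate count x time units per gate), assuming sequential execution."""
    return sum(n * gate_time[g] for g, n in gate_counts.items())


def qmap_time(t_a: float, t_b: float, repetitions: int) -> float:
    """One run of Subcircuit-A followed by `repetitions` runs of Subcircuit-B."""
    return t_a + repetitions * t_b


# Placeholder values, hypothetical and for illustration only.
gate_time = {"x": 1.0, "h": 1.0, "mcx": 10.0}
t_a = circuit_time({"h": 4, "mcx": 3}, gate_time)
t_b = circuit_time({"x": 2, "mcx": 1}, gate_time)
total = qmap_time(t_a, t_b, repetitions=8)
```

In this form it is easy to see why reducing the number of Subcircuit-B repetitions has a direct, linear effect on the estimated runtime of one QMap call.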

The optimal window size is determined by the balance between the read-sequence length and the reference-sequence length. The larger the window, the higher the execution time of the QMap routine, due to the diagonal scanning by Subcircuit-B. However, many repetitions of Subcircuit-A, as required by small windows, also reduce performance.

3.2.3 Performance Estimation

Given a short query and a large reference sequence, the actual dot-plot can be pictured as a horizontally long rectangle. Large sliding windows reduce the time needed to scan it, since only a few applications of the circuit are required. The relationship between the sliding-window size and the execution times of the QMap circuit is depicted in Figure 5. A small sliding window is not affordable, as the runtime of QMap increases exponentially and the overhead of generating the dot-plot (Subcircuit-A) is repeated needlessly. On the other hand, large windows increase the number of diagonals to be tested (Subcircuit-B), but this is outweighed by the savings from fewer repetitions of Subcircuit-A. The optimal choice of window size depends on three parameters: the type of data under alignment, the size of the read sequence, and the similarity threshold percentage. Run-tests have shown that large windows perform better, but they can be used only with long reads.

Discussion

To sum up, the analysis of the proposed quantum SA system showed that it is feasible to implement the QMap accelerator and integrate it within a classical SA system. The present circuit model is based on the basis encoding scheme, but encoding may be improved in the near term, which would strengthen the value of our SA tool considerably. Alternative schemes such as amplitude encoding or modern AI methods seem promising. Despite the partial immaturity of current quantum technology, there appear to be considerable advantages in terms of memory and gate consumption. The approach may also be advantageous in terms of speed, but further tests on more advanced technologies are needed. Mapping accuracy can reach very high levels, since on the small sample data the results show completeness and accuracy comparable to those of other classical and modern aligners such as MSA.

The investigation showed that modifications to the second, repetitive sub-circuit, which estimates the Hamming distance, can speed up the whole process considerably, by finding the optimal settings for the sliding window and employing the maximum computing parallelism. Indicative results demonstrate that the runtime and memory consumption can be reduced further. Some adaptations of the protocol for long reads or de novo assembly applications are expected to lead to better performance in terms of time and space while keeping alignment quality excellent. Another advantageous adaptation arises when the alignment is conducted over larger alphabets, such as proteins or even natural languages.

Quantum Computing Techniques

4.1 Quantum Accelerator: Simulation Methods

The model circuit of the QMap routine is composed of two basic logic sub-circuits. The first, sub-circuit A, is the dot-plot generator and is executed once per QMap call. Once the binary dot-matrix structure has been set up, sub-circuit B estimates the Hamming distance. Both constructions are based on well-tested sub-circuit implementations, which have been simulated individually to validate the results of their combination. A universal gate-set is adopted to support all the circuitry. All the simulations are conducted with the aid of the Qiskit framework.

4.1.1 Encoding and Dot-Plot Generator

Embedding classical information into quantum states of the Hilbert space is a critical issue for any algorithmic procedure, as the adopted quantum feature mapping scheme affects the time consumption of the overall algorithm. This issue is discussed in J. Clapis’s work [16] for the exponentially difficult basis encoding scheme, which is admittedly a well-known problem. The most prominent theoretical encoding schemes are the basis [17], the angle [18], and the amplitude [19] encoding schemes. All of them are exponentially difficult, but estimating their circuit depth and run time is still of great utility. Our analysis is based on the basis encoding scheme, which is oriented towards numerical calculations. Similar performance seems to be obtained with the angle encoding scheme. The more compact amplitude encoding may be more advantageous in terms of qubits, but the algorithmic operations are not compatible with this type of encoding. Furthermore, the instantaneous quantum polynomial protocol [20] could be applied, but its restrictive non-universal model is prohibitive in our case. A comparison of these schemes is given in Table 1. Other promising ways to encode and manipulate quantum information, inspired by the QAOA ansatz [21], may be employed in the near term.

Table 1. Comparison in the characteristics of known embedding schemes. The asterisk (*) denotes non-squared dot-plots. Lq is the length of the query sequence and Lr is the length of the reference. Symbol d is the color depth.

Embedding Schemes     Basis              Angle              Amplitude*             Inst. Q. Polynomial*
Qubit resource        d+2Lr              1+2Lr              2Lr                    2Lr
Pixel-value qubit     d                  1                  0                      0
Pixel-value encoding  Basis of qubits    Angle of qubits    Probability amplitude  Feature Mapping
Complexity            O(2^Lq 2^Lr)       ≥O(log2(LqLr))     ≥O(log2(LqLr))         O(2^Lq 2^Lr)

J. Clapis pointed out [16] that in the NEQR encoding scheme (a.k.a. basis encoding), the Espresso algorithm can reduce the number of multi-controlled-X gates by an average of 38.26%, allowing a 72% to 74% reduction in circuit depth.

The dot-matrix generator sub-circuit comprises the Encoder and the Entangler, as depicted in Figure 6. The Encoder takes a classical dataset of two strings and embeds it into qubits. The Entangler prepares the values of the dot-matrix as a superposition, and the measurement applies the XOR operation in a quantum-parallel way. This procedure is explained extensively by Clapis [16], who provides a feasible solution to the earlier problem of the Black-Box QDP considered in [12], despite its limitations due to the NEQR encoding scheme.
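The role of XOR in producing a dot can be illustrated classically. The sketch below assumes a hypothetical 2-bit code per nucleotide and marks a dot where the XOR of the two codes is zero; it is a classical analogue for intuition only, not the Encoder/Entangler circuit of Figure 6 or Clapis's construction.

```python
from typing import List

# Hypothetical 2-bit code per base, chosen for illustration only.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def xor_dot(a: str, b: str) -> int:
    """1 when the two bases are identical, i.e. their codes XOR to zero."""
    return int((ENCODING[a] ^ ENCODING[b]) == 0)


def xor_dot_matrix(query: str, reference: str) -> List[List[int]]:
    """Dot-matrix with rows indexed by the query and columns by the reference."""
    return [[xor_dot(q, r) for r in reference] for q in query]


matrix = xor_dot_matrix("AC", "CA")
```

Classically, every cell is evaluated one at a time; the quantum version described in the text aims to obtain all cell values from a single superposed state.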

Once the dot-matrix is created, the n+m qubits carry the whole information of the generated matrix and feed Subcircuit-B as many times as needed to scan all the acceptable diagonals of the current window. The whole process is then repeated for the next dot-plot window.

4.1.2 Hamming Distance Estimation

Subcircuit-B takes the generated dot-matrix as input and tries to detect diagonal formations of consecutive dots on it. One or more diagonals may be tested in a single run, depending on the adopted scanning policy; these policies are described in Section 4.2. In this model, we consider the single-diagonal scanning policy, so the output is the number of dots detected in the tested diagonal. Subcircuit-B is composed of three basic sub-circuits: the trainable pattern sub-circuit, the AND gate sub-circuit, and the counting sub-circuit.

Before the Hamming distance is estimated for each diagonal, a new matrix of the same dimensions as the generated dot-plot is prepared with a perfect diagonal d[i] at index i, to be ANDed with the dot-matrix generated by Subcircuit-A. This operation returns, as target output, a matrix of the same dimensions in which each cell value is the AND of the corresponding cells of the two input control matrices. In this way, the common 1’s on the selected diagonal are isolated from the rest of the matrix. The number of 1’s (or dots) left on the perfect diagonal is then counted by the well-known quantum counting algorithm. The result reveals the degree of similarity at the indexed diagonal. The second sub-circuit is connected to the first by a large Multi-(Anti)Controlled-Multi-X (MCMX) gate, which is a complex gate.
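The mask-and-count step has a direct classical analogue, sketched below with NumPy. Here an exact sum stands in for the quantum counting algorithm, and the function names are illustrative assumptions; the sketch assumes a square dot-matrix.

```python
import numpy as np


def diagonal_mask(n: int, i: int) -> np.ndarray:
    """n x n matrix with a perfect diagonal of 1's at offset i."""
    return np.eye(n, k=i, dtype=np.uint8)


def count_dots_on_diagonal(dots: np.ndarray, i: int) -> int:
    """AND the dot-matrix with the diagonal mask and count the surviving 1's.

    The exact classical sum replaces the quantum counting step.
    """
    dots = dots.astype(np.uint8)
    masked = dots & diagonal_mask(dots.shape[0], i)
    return int(masked.sum())


# Toy dot-matrix: a full main diagonal plus one stray dot above it.
example = np.eye(4, dtype=np.uint8)
example[0, 1] = 1
main_count = count_dots_on_diagonal(example, 0)
upper_count = count_dots_on_diagonal(example, 1)
```

Comparing each count with the threshold then decides whether the indexed diagonal is reported as a region of similarity.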

Subcircuit-B imposes a significant computing overhead, which must be suppressed as much as possible. As discussed in Section 4.2, the adopted scanning policy may provide significant suppression at the cost of detection sensitivity. The alignment type may allow faster alternative policies. Another point for improvement is the AND gate, which makes use of programmable logic like the basis encoding scheme. Thus, as with the NEQR technique, the Espresso algorithm can reduce the AND sub-circuit implementation to a significant degree. Finally, some interesting quantum algorithms have also been proposed for the estimation of the Hamming distance [22-24], but they are not compatible with the proposed protocol.

4.2 Scanning Policy Modes

The diagonals of a dot-plot can be scanned in more than one way, which can save considerable runtime. Among the possible ways, two scanning protocols are considered here. The first is the single-diagonal scanning policy A, in which each “acceptable” diagonal requires one execution of Subcircuit-B. Thus, Subcircuit-B is repeated for a dot-plot window as many times as there are threshold-allowed diagonals in that window. The second is the parallel-diagonal scanning policy B, in which each pair of threshold-allowed diagonals requires one execution of Subcircuit-B. In this case, a dot-plot window needs half the repetitions of Subcircuit-B compared with Policy A. Alternative policies could scan more than two diagonals at a time, but at the cost of reduced detection sensitivity. An illustrative example of the two suggested policies follows.

In the 16x16 dot-plot example in Figure 8, the query and reference sequences have the same length, generating a square dot-plot, and the similarity threshold is set to 50%. Policy A requires 16 executions of Subcircuit-B in total, while policy B requires 8, a reduction of 50%.
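The run counts in this example follow from a one-line rule, sketched below. The function takes the number of threshold-allowed diagonals as an input rather than deriving it, because the exact convention for which diagonals count as acceptable is not restated here; the function and parameter names are illustrative assumptions.

```python
from math import ceil


def subcircuit_b_runs(acceptable_diagonals: int, diagonals_per_run: int = 1) -> int:
    """Executions of Subcircuit-B needed to scan all acceptable diagonals.

    diagonals_per_run = 1 corresponds to policy A, 2 to policy B.
    """
    if acceptable_diagonals < 0 or diagonals_per_run <= 0:
        raise ValueError("invalid arguments")
    return ceil(acceptable_diagonals / diagonals_per_run)


policy_a = subcircuit_b_runs(16, diagonals_per_run=1)
policy_b = subcircuit_b_runs(16, diagonals_per_run=2)
reduction = 1 - policy_b / policy_a
```

The same rule extends to policies that group more than two diagonals per run, with the loss of sensitivity noted in the text.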

Although the second policy reduces sensitivity in detecting the right number of 1’s, it is assumed that a double diagonal occurrence in the same dot-plot is statistically rare for the pattern type we want to recognize. Even if double diagonals exist or denser patterns occur, the counting results will reveal this abnormality, and extra software may properly interpret the results for different types of patterns.

Moreover, the circuit architecture could be implemented in a parallel way, applying multiple scanning Subcircuits-B at once, but this demands a more expensive implementation in qubits and gates.

4.3 Comparative Method for Complexity Analysis

A straightforward way to evaluate the performance of the proposed QC system in terms of computational complexity is to compare it with other popular equivalents. However, comparing two completely different systems is problematic: even if a global correspondence between computing steps is feasible, incompatible algorithmic objects still cannot be compared directly. Quantum complexity theory has its own rules, and at present there are no quantum analogues to compare against. A comparative study against the exact classical analogue would readily conclude in favor of the quantum protocol, as discussed in Table 3 in [9]. But in the same type of study against the currently best aligner, it is intrinsically hard to determine the “winner” unambiguously, due to the multiparametric settings of the SA protocols and the partial ignorance of the exact hardware of the quantum system.

The proposed grouping in Figure 9 lists operations that belong to the same stage of the seed-and-extend protocol under consideration. Basic operations are listed separately in each stage for the classical and the quantum protocol, in an approximate rather than strict way, for comparative purposes. The classical version is assumed to use the best available algorithmic components, while the quantum version tries to exploit the power of entanglement as much as possible. This comparative examination makes sense only when the same strategy, type of data, sequence format, mapping schemes, and policies are adopted.

Data preparation: Up-to-date read-aligners adopt FM-indexing methods to build indexing structures for the reference genome, usually based on pre-built BWT tables. Despite their fast indexing speed, they impose a fixed and expensive computing overhead for preparation. Dot-matrix techniques need no such overhead, but it is replaced by two non-negligible QC steps: the encoding of the classical data and the generation of the dot-matrix. The second step is accomplished very fast, thanks to quantum parallelism, once the encoding is ready.

Seeding: Using an index value, the array elements of the reference in the FM-index context are accessed in constant time, i.e. with time complexity O(1). In the quantum dot-matrix approach, on the other hand, the diagonals are scanned by XORing them with a specific pattern, again exploiting quantum parallelism, and the degree of similarity is estimated by the quantum counting algorithm.

Filtering: Classical filtering may make use of a cut-off threshold or scoring matrices. In the quantum context, the same scoring process has already been completed, as the degree of similarity has been calculated by counting the dots per diagonal. This stage is addressed by the training patterns, which take into account the minimum acceptable degree of similarity per diagonal.

Extending: As in the majority of read-aligners, when a High-scoring Segment Pair (HSP) passes filtering, it is usually refined optimally by the dynamic programming S.W. or N.W. algorithms. The most important operations in this phase are (a) the accurate detection of the beginning and the end of an HSP and (b) the estimation of indel mutations. Both operations can, in theory, be converted into their quantum analogues, but whether the advantage over the classical counterpart is maintained remains a major challenge to be investigated. A possible implementation of operation (a) is considered in the Supplementary material as an extension of the original circuit.

ACKNOWLEDGMENTS

This research was conducted as a collaboration between Aristotle University of Thessaloniki and Pfizer. Pfizer is the research sponsor.

The authors would like to acknowledge Daniel Katzel, Maria Antonara, and Vassilios Pantazopoulos for their ongoing support in the research.

References

[1] Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
[2] Rognes, T. Faster Smith–Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics 12, 221 (2011).
[3] Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
[4] Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227 (4693), 1435–1441 (1985).
[5] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
[6] Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
[7] Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
[8] Marçais, G., Delcher, A. L., Phillippy, A. M., Coston, R., Salzberg, S. L. & Zimin, A. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14(1), e1005944 (2018).
[9] Maleki, E., Babashah, H., Koohi, S. & Kavehvash, Z. High-speed all-optical DNA local sequence alignment based on a three-dimensional artificial neural network. J. Opt. Soc. Am. A 34(7), 1173–1186 (2017). doi: 10.1364/JOSAA.34.001173. PMID: 29036127.
[10] Maleki, E., Koohi, S., Kavehvash, Z. & Mashaghi, A. OptCAM: An ultra-fast all-optical architecture for DNA variant discovery.
[11] Prousalis, K. & Konofaos, N. Improving the Sequence Alignment Method by Quantum Multi-Pattern Recognition. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence (SETN'18), 50 (ACM, 2018).
[12] Prousalis, K. & Konofaos, N. A Quantum Pattern Recognition Method for Improving Pairwise Sequence Alignment. Sci. Rep. 9, 7226 (2019).
[13] Schützhold, R. Pattern recognition on a quantum computer. Phys. Rev. A 67, 062311 (2002).
[14] Cross, A. W. et al. Open quantum assembly language. arXiv preprint arXiv:1707.03429 (2017).
[15] Qiskit contributors. Qiskit: An Open-source Framework for Quantum Computing (2023). https://doi.org/10.5281/zenodo.2573505
[16] Clapis, J. A Quantum Dot Plot Generation Algorithm for Pairwise Sequence Alignment. arXiv:2107.11346v1.
[17] Zhang, Y., Lu, K., Gao, Y. et al. NEQR: a novel enhanced quantum representation of digital images. Quantum Inf. Process. 12, 2833–2860 (2013).
[18] Le, P. Q., Dong, F. & Hirota, K. A flexible representation of quantum images for polynomial preparation, image compression, and processing operations. Quantum Inf. Process. 10(1), 63–84 (2011).
[19] Yao, X.-W. et al. Quantum Image Processing and Its Application to Edge Detection: Theory and Experiment. Phys. Rev. X 7, 031041 (2017). doi: 10.1103/PhysRevX.7.031041.
[20] Havlíček, V., Córcoles, A. D., Temme, K. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).
[21] Kremenetski, V. et al. Quantum Alternating Operator Ansatz (QAOA) beyond low depth with gradually changing unitaries. arXiv preprint arXiv:2305.04455 (2023).
[22] Xie, Z., Qiu, D. & Cai, G. Quantum algorithms on Walsh transform and Hamming distance for Boolean functions. Quantum Inf. Process. 17, 139 (2018).
[23] Kathuria, K., Ratan, A., McConnell, M. & Bekiranov, S. Implementation of a Hamming distance–like genomic quantum classifier using inner products on ibmqx2 and ibmq_16_melbourne. Quantum Machine Intelligence 2(1), 7 (2020).
[24] Li, J., Lin, S., Yu, K. & Guo, G. Quantum K-nearest neighbor classification algorithm based on Hamming distance. Quantum Inf. Process. 21(1), 18 (2022).

No competing interests reported.

Supplementary.docx

Reviewers invited by journal: 16 Aug, 2024
Editor assigned by journal: 05 Aug, 2024
Editor invited by journal: 03 May, 2024
Submission checks completed at journal: 30 Apr, 2024
First submitted to journal: 22 Apr, 2024
