2.1 Data organization, preprocessing, and augmentation
For training and testing our deep learning models, we used two publicly available nanopore sequencing datasets: (i) the dataset of Jain et al., who produced a human genome assembly using long nanopore sequencing reads30; about 14 million reads were sequenced and aligned to the 1000 Genomes GRCh38 reference genome31. From this dataset, we used 60,000 reads that aligned to the mitochondrial genome and 200,000 random reads that aligned to the rest of the human genome. (ii) The "Cliveome" dataset, which was sequenced by ONT and released to the public in 201632. From this dataset, we used 8,000 reads that mapped to the mitochondrial genome as well as 200,000 random reads that mapped to the rest of the human genome. In each dataset, we randomly separated the reads into training, validation, and test sets containing 80%, 10%, and 10% of the total reads, respectively. Only the first portion of each raw signal was used, to simulate reading the beginning of the molecule with the Read Until feature.
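A minimal sketch of this organization step, assuming each read is keyed by an ID and that the "first portion" corresponds to the 5,000-value prefix used by the preprocessing described below; the seed and dictionary layout are illustrative:

```python
import random

def organize_reads(reads, n_first=5000, seed=0):
    """Truncate each raw signal to its first values and split read IDs
    randomly into 80% train / 10% validation / 10% test."""
    truncated = {rid: sig[:n_first] for rid, sig in reads.items()}
    ids = list(truncated)
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.8 * len(ids)), int(0.1 * len(ids))
    splits = (ids[:n_train],
              ids[n_train:n_train + n_val],
              ids[n_train + n_val:])
    return splits, truncated
```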
Deep learning requires iterating through the training dataset in mini-batches, which makes large datasets manageable and improves training results33. In this research we used the PyTorch34 deep learning framework, which provides a DataLoader class; we customized this class to allow parallel data loading with custom data transformations. Our custom data loader applies four transformations to the signal: the first randomly selects a region of 2,000 values from the total of 5,000 values; the second converts the signal from raw values, which represent the electric current level, to differential values in order to eliminate possible bias between the voltages of different devices and flow cells; the third cuts the signal into a sliding-window array, transforming the long 1D signal into a 2D array of stacked sliding windows; the final transformation adds Gaussian noise to the sample to mimic the background noise of nanopore sequencing. All of these transformations improved the training process and the final accuracy; further details are in the Supplementary Methods.
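A sketch of a PyTorch Dataset applying these four transformations in order; the crop size follows the text, while the window length, stride, and noise scale are assumed placeholders rather than the values used in the paper:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SignalDataset(Dataset):
    """Illustrative dataset applying the four transformations described above."""

    def __init__(self, signals, labels, crop=2000, window=300, stride=10, noise_std=0.05):
        self.signals, self.labels = signals, labels
        self.crop, self.window = crop, window
        self.stride, self.noise_std = stride, noise_std

    def __len__(self):
        return len(self.signals)

    def __getitem__(self, i):
        x = np.asarray(self.signals[i], dtype=np.float32)  # first 5,000 raw values
        # 1) random crop: 2,000 values out of the 5,000 available
        start = np.random.randint(0, len(x) - self.crop + 1)
        x = x[start:start + self.crop]
        # 2) differential values to cancel device/flow-cell voltage offsets
        x = np.diff(x)
        # 3) sliding windows: 1D signal -> 2D array of stacked windows
        starts = np.arange(0, len(x) - self.window + 1, self.stride)
        x = np.stack([x[s:s + self.window] for s in starts])
        # 4) additive Gaussian noise mimicking pore background noise
        x = x + np.random.normal(0.0, self.noise_std, x.shape).astype(np.float32)
        return torch.from_numpy(x), self.labels[i]
```

Wrapping this dataset in a torch.utils.data.DataLoader with num_workers > 0 provides the parallel loading described above.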
2.2 Model architecture, training, and testing
We tested five neural network varieties for our deep learning model architecture: a regular CNN35, a very deep CNN (VDCNN)36, a regular LSTM37, an LSTM with recurrent batch normalization38, and a regular GRU39; further model details and the justification for their selection are presented in the Supplementary Methods. Each model was tested at three sizes, corresponding to the number of hidden parameters: small, medium, and large. All models were tested extensively with different configurations, as explained in the Supplementary Methods.
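A minimal sketch of one such architecture (the regular GRU), sized by its hidden dimension; the hidden size, layer count, and window length are assumptions, not the paper's values:

```python
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Sketch of one tested architecture; varying `hidden` yields the
    small/medium/large variants."""

    def __init__(self, window=300, hidden=128, layers=2):
        super().__init__()
        self.rnn = nn.GRU(input_size=window, hidden_size=hidden,
                          num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # mitochondrial vs. genomic

    def forward(self, x):             # x: (batch, n_windows, window)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])  # classify from the last time step
```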
We also combined a CNN model with an RNN model; a schematic overview is shown in Supplementary Figure 1. In theory, a CNN is good at representing proximal features while an RNN can find long-distance dependencies; by combining the two, our model could exploit both the short- and long-distance information hidden in the raw signal40. We combined the VDCNN with a regular GRU, and the VDCNN with an LSTM with recurrent batch normalization, and tested multiple configurations of these models as well, as described in the Supplementary Methods.
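A sketch of the combination's shape, with convolutions extracting local features that a GRU then aggregates over time; here the strided convolutions play the sliding-window role, and all layer sizes are assumptions:

```python
import torch.nn as nn

class ConvGRUHybrid(nn.Module):
    """Sketch of the CNN+RNN combination: convolutional layers capture
    proximal features, a GRU captures long-distance dependencies."""

    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=3), nn.ReLU(),
        )
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):              # x: (batch, signal_length)
        f = self.conv(x.unsqueeze(1))  # (batch, 64, reduced_length)
        out, _ = self.rnn(f.transpose(1, 2))
        return self.head(out[:, -1])
```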
To eliminate differences in model accuracy caused by differences in the training process, the same Python script was used to train all models. We used the training dataset during training, the validation dataset for hyperparameter tuning, and the test dataset exclusively at the final stage to measure the accuracy of each model. Accuracy was measured separately for genomic reads and for mitochondrial reads; total accuracy was calculated by averaging the accuracy on the mitochondrial reads and the accuracy on the genomic reads. The Adam optimizer41 was used; the learning rate and other optimizer parameters were determined by a manual search. All models were trained for 300 epochs, and the learning curve of each model was assessed to determine whether the loss curve plateaued and whether overfitting became an issue. Supplementary Figure 2 illustrates the learning curve of the LSTM model with recurrent batch normalization as an example of a successfully trained model.
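A sketch of the training procedure and the class-averaged accuracy measure; the learning rate is an assumed placeholder (the paper tuned it by manual search), and class label 1 is taken here to mean mitochondrial:

```python
import torch

def balanced_accuracy(model, loader, device="cpu"):
    """Average of the per-class accuracies (mitochondrial and genomic)."""
    correct, total = torch.zeros(2), torch.zeros(2)
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            for c in (0, 1):
                mask = y == c
                correct[c] += (pred[mask] == c).sum()
                total[c] += mask.sum()
    return (correct / total).mean().item()

def train(model, train_loader, epochs=300, lr=1e-3, device="cpu"):
    """Training loop sketch shared by all models."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
```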
After training all models on the primary dataset, a second dataset (Cliveome) was used to test the models for generalization. First, the accuracy of the models was tested on the second dataset's test set without any additional training. Then, all models were trained for 30 epochs on the second dataset's training data in order to improve their accuracy specifically on that dataset (fine-tuning). After this additional training, all models were tested again on the second dataset and their accuracies were recorded.
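Reusing the training-loop sketch above, fine-tuning reduces to reloading the saved weights and continuing briefly on the new data; the checkpoint name, loader name, and reduced learning rate are hypothetical:

```python
# Continue training a model already trained on the primary dataset
# ("best_model.pt", cliveome_train_loader, and lr are illustrative names/values).
model.load_state_dict(torch.load("best_model.pt"))
train(model, cliveome_train_loader, epochs=30, lr=1e-4)
```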
2.3 DNA extraction, library preparation, and MinION sequencing
Monolayer-adherent HEK-293T cells (transformed human embryonic kidney cells, ATCC, USA) were grown in Dulbecco's modified Eagle's medium (DMEM) (Thermo Fisher Scientific, USA) supplemented with 10% (vol/vol) fetal bovine serum (FBS) (Thermo Fisher Scientific, USA), 0.3 g/liter L-glutamine, 100 units/ml penicillin, and 100 units/ml streptomycin (Biological Industries, Israel). Cells were incubated at 37°C in a 5% CO2 atmosphere. Before use, cells were confirmed to be free of mycoplasma contamination using the EZ-PCR Mycoplasma test kit (Biological Industries, Israel). Prior to each experiment, the cells were counted using a Countess automated cell counter (Thermo Fisher Scientific, USA).
Qiagen’s QIAamp DNA mini kit was used to extract DNA from HEK-293T cells. Next, 2.5 × 10⁶ or 1 × 10⁶ cells were centrifuged at 1,400 × g for 5 minutes, and the resulting pellet was resuspended in 200 μl PBS. DNA was then extracted according to the manufacturer’s protocol and eluted in 200 μl H2O. The DNA concentration was measured using the dsDNA High Sensitivity assay on a Qubit fluorometer (Thermo Fisher Scientific, USA). DNA purity was assessed using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Thermo Scientific, USA) to ensure OD 260/280 and OD 260/230 ratios > 1.8.
Approximately 400 ng of purified DNA in a total volume of 7.5 μl in a 0.2 ml PCR tube was used as input for sequencing library preparation using Oxford Nanopore Technologies’ Rapid Sequencing kit (SQK-RAD004, version RSE_9046_v1_revB_17Nov2017) according to the manufacturer’s instructions. For fragmentation and transposase adapter attachment, 2.5 μl FRA was added to the DNA and mixed by inversion. The sample was then incubated at 30°C for 1 minute, followed by 80°C for 1 minute, and finally cooled on ice. Sequencing adapters were then attached by adding 1 μl RAP to the mixture and mixing by inversion. The sample with sequencing adapters was incubated at room temperature for 5 minutes, and then stored on ice until it was ready for sequencing.
MinION sequencing was conducted according to the manufacturer’s instructions using R9.4 and R9.4.1 rev. D flow cells (FLO-MIN106, ONT). After flow cell priming, 4.5 μl nuclease-free water, 34 μl sequencing buffer (SQB), and 25.5 μl mixed loading beads (LB) were added to the library and mixed by gently flicking the tube immediately before loading into the SpotON port.
Three sequencing experiments were performed under these conditions; they are referred to as the "HEK1", "HEK2", and "HEK3" runs. The results of the first two experiments were used for training and testing the model, whereas the third experiment was run in conjunction with the read-until script to perform real-time selective sequencing.
2.4 Sequencing data analysis
After data were acquired from the first two sequencing experiments, HEK1 and HEK2, the reads were translated into nucleotides using ONT Albacore version 2.2.5. Although Albacore is no longer supported, a recent comparison of base-calling software indicated that the differences between Albacore and more modern basecallers42 are minuscule for the purposes of our experiments. The reads were mapped to the GRCh38 human reference genome using minimap2 software43 version 2.11. Reads were separated into mitochondrial reads and genomic reads based on their mapping, and each group was separated into training/validation/test sets containing 80%/10%/10% of the total reads, respectively. Initially, the accuracy of the models trained on the first dataset was tested on the HEK1 data. The models were then trained for 30 epochs on the HEK1 data and their accuracy was tested again (fine-tuned). The best-performing model was determined by the highest accuracy on the HEK2 data and saved for later use with read-until in the HEK3 sequencing experiment.
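A minimal sketch of the read separation step, assuming minimap2 was run from the command line to produce a PAF file (e.g. `minimap2 -x map-ont GRCh38.fa reads.fastq > reads.paf`) and that the mitochondrial contig is named "chrM" in the reference; both are assumptions about the pipeline's details:

```python
# PAF format: column 1 is the read ID, column 6 is the target contig name.
mito_reads, genomic_reads = set(), set()
with open("reads.paf") as paf:
    for line in paf:
        fields = line.split("\t")
        read_id, target = fields[0], fields[5]
        (mito_reads if target == "chrM" else genomic_reads).add(read_id)
```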
To test the performance of read-until, we utilized the developer API provided by ONT and wrote a custom script to perform selective sequencing, based on the "simple.py" file from the Read Until GitHub repository. This script receives the raw signal at the beginning of every DNA molecule; the raw signal is analyzed by the deep learning model; and the script then sends a signal to the MinION device to either keep sequencing the DNA molecule or to stop and eject the unwanted DNA molecule from the pore. Reads classified by the model as mitochondrial were allowed to be fully sequenced, whereas all other reads received a signal terminating their sequencing. To validate the results, we performed the experiments with three technical repeats for three different time spans: 10 minutes, 30 minutes, and 120 minutes. In each time span we performed three regular sequencing experiments without read-until and three sequencing experiments utilizing read-until. To account for the deterioration of the flow cell over time and to reduce technical bias, we alternated between experiments with and without read-until. The reads were translated and mapped to the human reference genome, and alignment statistics were collected for each sequencing experiment. A logistic regression on proportions with a random-effects variable44 was performed to test for differences in the proportion of sequenced mitochondrial nucleotides out of the total sequenced nucleotides. Comparisons were made between pairs of technical repeats as follows: 10min_with_read-until_run1 vs. 10min_without_read-until_run1, 10min_with_read-until_run2 vs. 10min_without_read-until_run2, etc. Additionally, read lengths were collected for each experiment and analyzed using the Fisher-Pitman permutation test45 to check for statistical differences between the read lengths of the different groups.
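A condensed sketch of such a selective-sequencing script, modeled on the structure of the simple.py example; exact constructor arguments and method names may differ between versions of ONT's read_until API, and preprocess() is a hypothetical helper standing in for the same transformations used during training:

```python
import numpy as np
import torch
from read_until import ReadUntilClient  # ONT's developer API

client = ReadUntilClient(filter_strands=True, one_chunk=False)
client.run()

model.eval()
while client.is_running:
    # Receive the first raw-signal chunks of currently sequencing molecules
    for channel, read in client.get_read_chunks(batch_size=512, last=True):
        signal = np.frombuffer(read.raw_data, client.signal_dtype)
        x = preprocess(signal)  # hypothetical: training-time transforms
        with torch.no_grad():
            is_mito = model(x).argmax(dim=1).item() == 1
        if is_mito:
            client.stop_receiving_read(channel, read.number)  # sequence fully
        else:
            client.unblock_read(channel, read.number)  # eject from the pore
```

For the read-length comparison, a Fisher-Pitman-style test can be approximated by Monte-Carlo permutation of the pooled samples; this sketch uses the absolute difference in means as the test statistic and is illustrative rather than the exact procedure of reference 45:

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sample permutation test on the difference in mean read length."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(np.mean(a) - np.mean(b))
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= observed
    return hits / n_perm  # two-sided p-value estimate
```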