Metrics
To evaluate the methods proposed in this challenge, we use several metrics: root mean squared error (RMSE), mean absolute error (MAE), median absolute error (median AE), and maximum and minimum deviations. For each task and submission, we have a set of N observed values \(\:{Y}_{i}\) and a matching set of predicted values \(\:{\widehat{Y}}_{i}\). Each metric is defined as follows:
$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left({Y}_{i}-{\widehat{Y}}_{i}\right)}^{2}}$$
1
$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|{Y}_{i}-{\widehat{Y}}_{i}\right|$$
2
Min and max deviations between observed values \(\:{Y}_{i}\) and predicted values \(\:{\widehat{Y}}_{i}\) at the modified positions were also calculated to give an error range on the predicted modification frequencies.
$$\text{max deviation}=\underset{i}{\max}\left|{Y}_{i}-{\widehat{Y}}_{i}\right|$$
3
$$\text{min deviation}=\underset{i}{\min}\left|{Y}_{i}-{\widehat{Y}}_{i}\right|$$
4
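As a minimal illustration, the regression metrics above can be computed in a few lines of Python (the function name is illustrative; array inputs and edge-case handling are omitted):

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, median AE and max/min deviation for paired
    observed/predicted values, following the definitions above."""
    diffs = [abs(y - yh) for y, yh in zip(y_true, y_pred)]
    n = len(diffs)
    return {
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "mae": sum(diffs) / n,
        "median_ae": statistics.median(diffs),
        "max_deviation": max(diffs),
        "min_deviation": min(diffs),
    }
```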
The F1 score, which combines precision and recall in a single metric, and accuracy were also considered in the evaluation; they are calculated as follows:
$$\:Acc.=\:\frac{TP+TN}{N}$$
5
$$\:F1=\:\frac{2*TP}{2*TP+FP+FN}$$
6
where N is the total number of expected values, TP and TN are the true positives and true negatives, respectively, while FP and FN are the false positives and false negatives, respectively. For each modified position with a given target frequency \(\:Y\), the frequency is considered correctly predicted (TP, otherwise FN) if the prediction \(\:\widehat{Y}\) lies within \(\:Y\pm\:0.6\,Y\) and within ± 1 base position of the expected one. For unmodified positions with a target modification rate of 0, the prediction is correct if it is also 0 (TN, otherwise FP).
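The per-site decision rule and the derived accuracy/F1 can be sketched as follows; the dictionary-of-predictions layout and function names are illustrative, not part of any submission's code:

```python
def classify_site(target_freq, predictions, pos, tolerance=0.6):
    """Classify one reference site given its target frequency and a dict of
    predicted frequencies keyed by position (hypothetical data layout).
    Returns 'TP'/'FN' for modified sites and 'TN'/'FP' for unmodified ones."""
    if target_freq == 0:
        # Unmodified site: correct only if the predicted frequency is also 0.
        return "TN" if predictions.get(pos, 0.0) == 0.0 else "FP"
    lo, hi = target_freq * (1 - tolerance), target_freq * (1 + tolerance)
    # A prediction within +/- 1 base of the expected position counts as a hit.
    for p in (pos - 1, pos, pos + 1):
        if p in predictions and lo <= predictions[p] <= hi:
            return "TP"
    return "FN"

def accuracy_f1(counts):
    """counts: dict with 'TP', 'TN', 'FP', 'FN' tallies."""
    n = sum(counts.values())
    acc = (counts["TP"] + counts["TN"]) / n
    f1 = 2 * counts["TP"] / (2 * counts["TP"] + counts["FP"] + counts["FN"])
    return acc, f1
```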
Pipeline description: Method 1
The FAST5 files were first converted to POD5 format using pod5tool (v0.2.4, pod5 convert fast5), followed by basecalling with Dorado (v0.3.4). The basecalled reads were then aligned using minimap2 (v2.26). The reference kmer table for rna_r9.4_180mv_70bps was downloaded from the ONT GitHub repository (https://github.com/nanoporetech/kmer_models). Finally, bayespore was run with the POD5 and BAM files and the kmer table as inputs, with otherwise default parameters.
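As a sketch, the Method 1 steps can be assembled as argv lists for later subprocess execution; the file names are placeholders, and any flags beyond those quoted above (e.g. the minimap2 preset) are assumptions rather than verified CLI usage:

```python
def method1_commands(fast5_dir="fast5/", pod5_file="reads.pod5",
                     model="rna_model", ref="reference.fa",
                     fastq="reads.fastq"):
    """Build the argv lists for the Method 1 steps described above.
    Paths and the model name are placeholders."""
    return [
        # FAST5 -> POD5 conversion (pod5 convert fast5)
        ["pod5", "convert", "fast5", fast5_dir, "--output", pod5_file],
        # Basecalling with Dorado
        ["dorado", "basecaller", model, pod5_file],
        # Alignment with minimap2 (map-ont preset assumed)
        ["minimap2", "-ax", "map-ont", ref, fastq],
    ]

# Each list could then be run in order, e.g. with
# subprocess.run(cmd, check=True) for cmd in method1_commands().
```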
Pipeline description: Method 2 and method 3
The raw data in fast5 format was basecalled using Guppy (version 6.5.7+ca6d6af) with default parameters and the rna_r9.4.1_70bps_hac basecalling model. Reads passing the quality filter, in fastq format, were aligned to the reference genome (template1) using minimap2 (version 2.24-r1122) with the following parameters: -ax map-ont -uf -t 48 -N 20. Kmer-level signal values for each 5-mer in the reads were generated using the eventalign module of nanopolish (v0.14.0). This signal information was combined with the kmer models generated by CHEUI to pre-process the data for identification and quantification of m6A and m5C modification frequencies. The preprocessed files were then used to predict site-level m6A and m5C modifications on the provided reference genome.
Pipeline description: Method 4
The RNA fast5 files were basecalled using the dorado basecaller version 0.3.2 provided by ONT, with the RNA002 high-accuracy model, and the output was stored as fastq. To achieve an alignment rate of about 80% with minimap2 [59] (k-mer size = 8, -ax map-ont flag), every uridine/thymine nucleotide was substituted with a cytosine in both the fastq files and the fasta reference [60]; without this substitution, the alignment rate was below 5%. After alignment, the substituted cytosines in the fastq and fasta files were reverted to uridine/thymine. The aligned dataset was then resquiggled using Tombo from ONT, which associates specific raw signal with its respective basecalled nucleotide. Next, the processed data were used to train PseudoDeC (https://github.com/mem3nto0/PseudoDeC_RMaP), a neural network that can be trained on Tombo-processed data for modification detection. The network analyses the raw signal together with its sequence to remap the whole sequence, reporting the modification position and type. For the challenge, the network was trained for pseudouridine detection.
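The pyrimidine-masking trick can be sketched in a few lines; the original sequences are kept alongside the masked copies so the substitution can be reversed after alignment:

```python
def mask_tu_as_c(seq):
    """Substitute every uridine/thymine with cytosine, as done before
    alignment; the unmasked original is kept for restoring afterwards."""
    return seq.upper().replace("U", "C").replace("T", "C")

# Reads and reference are masked with the same function, aligned, and the
# original sequences are then re-attached to reverse the substitution.
masked_read = mask_tu_as_c("AUGCUU")   # -> "ACGCCC"
```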
Pipeline Description: Method 5
This pipeline uses the nanoRMS2 methodology [61] with minor modifications, described below. Firstly, the reads were basecalled (Guppy v6.0.6, hac model), storing trace information. Subsequently, reads were aligned (minimap2 v2.26) and resquiggled (tombo v1.5). Then, a set of features (signal intensity, dwell time, trace, modification probability) was stored in a BAM file for every base of every read using the get_features.py script, which is part of nanoRMS2. One Gradient Boosting classifier (as implemented in scikit-learn) was trained for every 3-mer centred at T, using reads from the training set. Finally, the trained classifiers were used to predict the modification status of T positions in reads from the test set. Additional details and code are available at https://github.com/novoalab/RMaP_challenge.
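The per-3-mer partitioning step can be illustrated as follows; the (sequence, per-base feature vector) input layout is a hypothetical simplification of the features extracted by get_features.py:

```python
from collections import defaultdict

def partition_by_3mer(read_features):
    """Group per-position feature vectors by the 3-mer centred at T,
    so that one classifier can be trained per 3-mer."""
    groups = defaultdict(list)
    for seq, feats in read_features:        # feats: one vector per base
        for i in range(1, len(seq) - 1):
            if seq[i] == "T":
                groups[seq[i - 1:i + 2]].append(feats[i])
    return dict(groups)
```

Each group would then be fed to its own classifier (Gradient Boosting in the pipeline above).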
Pipeline description: Method 6
The sequencing reads were basecalled using Guppy version 6.4.2 provided by ONT. Following basecalling, individual fastq files were merged into a single fastq file, and an index was generated using the index module from Nanopolish [62] version 0.14.0. Alignment of the reads to the synthetic reference sequence was performed using Minimap2 [57] version 2.24, with the parameters set to -ax splice -uf -k14. The resulting mapped reads were sorted, indexed, and converted into BAM files using SAMtools [63] version 1.5. To align the nanopore signal squiggles to the reference genome and extract per-site features, the Nanopolish eventalign module was utilized. Due to the large disparity between the unmodified and modified data, 0.05% of the unmodified data was randomly sampled, resulting in approximately 50,000 data points, which were integrated with the modified dataset. In accordance with the challenge guidelines, every 5th position in the modified data was labelled as containing a modification, assigning these positions to class 1, while the unmodified positions were designated as class 0. The signal was vectorized by applying the signature transform from rough path theory, enabling the extraction of key features from the sequential data. These transformed features were used to construct feature vectors, which captured the essential characteristics of the signal. Finally, a Gradient Boosting algorithm from scikit-learn [64] was applied to these feature vectors to predict the locations of modifications in the DRS signal. The model utilized the enriched feature representation from the signature transform to improve predictive accuracy.
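For intuition, a depth-2 signature of a piecewise-linear path (e.g. a (time, current) squiggle) can be computed directly from Chen's relation; this is a didactic sketch, not the implementation used in the pipeline:

```python
def signature_depth2(path):
    """Depth-2 path signature of a piecewise-linear path given as a list of
    points in R^d. Returns level 1 (total increments) and level 2 (the
    iterated integrals S[i][j]), accumulated segment by segment."""
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for a, b in zip(path, path[1:]):
        delta = [bb - aa for aa, bb in zip(a, b)]
        for i in range(d):
            for j in range(d):
                # Chen's relation for a linear segment appended to the path.
                level2[i][j] += level1[i] * delta[j] + delta[i] * delta[j] / 2.0
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2
```

The flattened level-1 and level-2 terms form the feature vector handed to the classifier; in practice deeper truncations and dedicated libraries are typically used.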
Pipeline description: Method 7
The training fast5 files were split into two groups: unmodified and modified reads. Fast5 files from both groups, along with those from the test set, were basecalled separately using Guppy 6.5.7 to generate fastq files. These fastq files were then aligned with minimap2, using the parameters -ax splice --secondary=no -k5, and the resulting sam files were converted to bam format using samtools. Next, the fast5, fastq, and bam files were processed with f5c eventalign, applying the parameters --signal-index and --scale-events for event segmentation. The eventalign.txt files generated by f5c were processed with m6anet dataprep, modified to output NNTNN (equivalent to NNUNN) kmers. After merging the datapreps from the unmodified and modified groups, each site was labeled as either modified or unmodified. The merged dataprep files (data.json and data.info) were used to train m6anet. Finally, the trained model was used for predictions on the test data through m6anet inference.
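The merge-and-label step can be sketched as follows; the (transcript, position) keyed layout is a hypothetical simplification of the m6anet dataprep output:

```python
def label_merged_sites(unmod_sites, mod_sites):
    """Merge per-site records from the unmodified and modified datapreps
    and attach a binary label (0 = unmodified, 1 = modified)."""
    merged = {}
    for key, rec in unmod_sites.items():
        merged[key] = {"features": rec, "label": 0}
    for key, rec in mod_sites.items():
        merged[key] = {"features": rec, "label": 1}
    return merged
```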
Data Format
The bedRMod format is a unified data format for storing RNA modification data, enabling sharing, collaboration and reuse of data and thereby accelerating research. It is based on the standard Browser Extensible Data (BED) format [63], a text file format with tab-delimited rows. Introducing the bedRMod format for storing epitranscriptomic data in the RMaP Challenge creates a basis for comparable results across different methods of detecting RNA modifications. This is especially relevant because a uniform data format for storing epitranscriptomic data does not yet exist. bedRMod provides a new format that is compatible with many established tools and thus easy to adopt into existing workflows.
A bedRMod file consists of two main parts: the header, which contains metadata clarifying where the RNA modification data originates and how it was obtained, and the data section, which stores the site-specific modification data. In the data section, each row contains the site-specific properties of one modification at one position. An example of the structure of a bedRMod file can be seen in Fig. 5. For the complete specification of bedRMod, please refer to: github.com/anmabu/bedRMod/blob/main/bedRModv1.8.pdf. The advantage of using bedRMod over other formats is that it was specifically designed for epitranscriptomic data. Additionally, bedRMod is straightforward to use: it can be viewed with any text editor and, thanks to its extensive header, its contents are easy to interpret. A toolkit for converting existing RNA modification data into bedRMod was implemented in Python 3.10 and can be found at github.com/anmabu/bedRMod. A graphical user interface (GUI) is also available for ease of use.
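As an illustration only (the authoritative column layout and header syntax are given in the linked specification), a bedRMod-style file can be split into header metadata and tab-delimited data rows; the "#key=value" header convention used here is an assumption:

```python
def read_bedrmod(lines):
    """Split a bedRMod-style file into header metadata (assumed
    '#key=value' lines) and tab-delimited data rows."""
    header, rows = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            if "=" in line:
                key, _, value = line[1:].partition("=")
                header[key.strip()] = value.strip()
        elif line:
            rows.append(line.split("\t"))
    return header, rows
```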
Data Handling and Storage
All aspects of data management were provided through a NextCloud installation, the RMaP Challenge Cloud, which was implemented exclusively for this benchmark event by the Dieterich Lab in Heidelberg. Two virtual machines with 4 GB RAM each and 200 GB of shared disk space were dedicated to this purpose. For organisational reasons, we set up a predefined folder structure for handling outgoing data (“challenge data”). Incoming data, i.e. “challenge solutions”, were uploaded to private folders by challenge participants. Use and access privileges were managed through LDAP and implemented to meet the needs of data owners, solution providers and data managers. Instructions, guidelines and specifications were also deposited in the RMaP Challenge Cloud. We briefed the community on all information material via circular emails.